Monday 10 March 2014

Collecting Data With Web Scrapers

There is a large amount of data available only through websites. However, as many people have found out, trying to copy data directly out of a website into a usable database or spreadsheet can be a tedious process. Data entry from internet sources can quickly become cost-prohibitive as the required hours add up. Clearly, an automated method for collating information from HTML-based sites can offer huge savings in management costs.

Web scrapers are programs that aggregate information from the internet. They are capable of navigating the web, assessing the contents of a site, and then pulling out data points and placing them into a structured, working database or spreadsheet. Many companies and services use web scraping programs for purposes such as comparing prices, performing online research, or tracking changes to online content.
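
To make that concrete, here is a minimal sketch of such a scraper in Python; the URL, the CSS selectors and the field names are hypothetical placeholders, and the requests and BeautifulSoup libraries are just one common choice among many:

    # A minimal scraping sketch: fetch a page and extract repeated
    # elements into structured records. All selectors are hypothetical.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("http://example.com/products")
    soup = BeautifulSoup(response.text, "html.parser")

    records = []
    for row in soup.select("div.product"):  # one element per listed product
        records.append({
            "name": row.select_one("span.name").get_text(strip=True),
            "price": row.select_one("span.price").get_text(strip=True),
        })

    print(records)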

Let's take a look at how web scrapers can aid data collection and management for a variety of purposes.

Improving On Manual Entry Methods

Using a computer's copy and paste function, or simply retyping text from a site, is extremely inefficient and costly. Web scrapers are able to navigate through a series of websites, make decisions on what is important data, and then copy the information into a structured database, spreadsheet, or other program. Many software packages include the ability to record macros: a user performs a routine once, and the computer remembers and automates those actions. Every user can effectively act as their own programmer and extend the software's ability to process websites. These applications can also interface with databases in order to automatically manage information as it is pulled from a website.
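
As a hedged example of the "structured spreadsheet" end of that pipeline, scraped records can be written to a CSV file with Python's standard csv module; the records here are invented stand-ins for real scraper output:

    # Write scraped records into a CSV file that opens directly in a
    # spreadsheet. The records are placeholders for real scraper output.
    import csv

    records = [
        {"name": "Widget", "price": "$9.99"},
        {"name": "Gadget", "price": "$19.99"},
    ]

    with open("scraped.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(records)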

Aggregating Information

There are a number of instances where material published on websites can be collected and repurposed. For example, a clothing company looking to bring its line of apparel to retailers can go online for the contact information of retailers in its area and then present that information to sales personnel to generate leads. Many businesses can perform market research on prices and product availability by analyzing online catalogues.

Data Management

Managing figures and numbers is best done through spreadsheets and databases; however, information on a website formatted with HTML is not readily accessible for such purposes. While websites are excellent for displaying facts and figures, they fall short when the data needs to be analyzed, sorted, or otherwise manipulated. Ultimately, web scrapers are able to take output intended for display to a person and turn it into numbers that can be used by a computer. Furthermore, by automating this process with software applications and macros, entry costs are sharply reduced.
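
As a sketch of that display-to-data conversion, the snippet below parses a made-up HTML table into plain numbers that a program can sort or total:

    # Turn an HTML table (meant for human eyes) into numbers a
    # computer can work with. The table contents are invented.
    from bs4 import BeautifulSoup

    html = """
    <table>
      <tr><th>Region</th><th>Sales</th></tr>
      <tr><td>North</td><td>$1,200</td></tr>
      <tr><td>South</td><td>$950</td></tr>
    </table>
    """
    soup = BeautifulSoup(html, "html.parser")

    sales = {}
    for row in soup.find_all("tr")[1:]:  # skip the header row
        region, amount = [td.get_text() for td in row.find_all("td")]
        sales[region] = float(amount.replace("$", "").replace(",", ""))

    print(sorted(sales.items(), key=lambda kv: kv[1], reverse=True))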

This type of data management is also effective at merging different information sources. If a company were to purchase research or statistical information, it could be scraped in order to format the information into a database. This approach is also highly effective for taking a legacy system's contents and incorporating them into today's systems.
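
A hedged sketch of such a merge: once both sources are in structured form, joining them can be as simple as matching records on a shared key (the company names and fields below are invented for illustration):

    # Merge a scraped source with an existing one on a shared key.
    # All names and fields are invented for illustration.
    legacy = {"ACME": {"phone": "555-0100"}, "Globex": {"phone": "555-0199"}}
    scraped = [
        {"company": "ACME", "website": "http://acme.example"},
        {"company": "Globex", "website": "http://globex.example"},
    ]

    merged = []
    for record in scraped:
        combined = dict(record)
        combined.update(legacy.get(record["company"], {}))
        merged.append(combined)

    print(merged)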

Overall, a web scraper is a cost-effective tool for data manipulation and management.

Source: http://ezinearticles.com/?Collecting-Data-With-Web-Scrapers&id=4223877

Tuesday 4 March 2014

Screen scraping: how to stop the internet's invisible data leeches

Data is your business's most valuable asset, so it's never a good idea to let it slip into the hands of competitors.

Sometimes, however, that can be difficult to prevent due to an automated technique known as 'screen scraping' that has for years provided a way of extracting data from website pages to be indexed over time.

This poses two main problems: first, that data could be used to gain a business advantage - from undercutting prices (in the case of a price comparison website, for example) to obtaining information on product availability.

Persistent scraping can also grind down a website's performance, as recently happened to LinkedIn when hackers used automated software to register thousands of fake accounts in a bid to extract and copy data from member profile pages.

Ashley Stephenson, CEO of Corero Network Security, explains the origins of the phenomenon, how it could be affecting your business right now, and how to defend against it.

TechRadar Pro: What is screen scraping? Can you talk us through some of the techniques, and why somebody would do it?

Ashley Stephenson: Screen scraping is a concept that was pioneered by early terminal emulation programs decades ago. It is a programmatic method to extract data from screens that are primarily designed to be viewed by humans.

Basically the screen scraping program pretends to be a human and "reads" the screen, collecting the interesting data into lists that can be processed automatically. The most common format is name:value pairs. For example, information extracted from a travel site reservation screen might look like the following -

Origin: Boston, Destination: Atlanta, Date: 10/12/13, Flight: DL4431, Price: $650
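
A scraper might turn a line like that into a machine-readable record with a few lines of Python; this sketch assumes the simple comma-separated name:value format shown above:

    # Parse a "name: value, name: value" line into a dictionary.
    line = "Origin: Boston, Destination: Atlanta, Date: 10/12/13, Flight: DL4431, Price: $650"

    record = {}
    for pair in line.split(","):
        name, value = pair.split(":", 1)  # split on the first colon only
        record[name.strip()] = value.strip()

    print(record)  # {'Origin': 'Boston', 'Destination': 'Atlanta', ...}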

Screen scraping has evolved significantly over the years. A major historical milestone occurred when the screen scraping concept was applied to the Internet and the web crawler was invented.

Web crawlers originally "read" or screen scraped website pages and indexed the information for future reference (e.g. search). This gave rise to the search engine industry. Today web crawlers are much more sophisticated, and websites include information (tags) dedicated to the crawler and never intended to be read by a human.
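
For instance, a crawler can read a page's meta tags, which are never rendered for a human visitor; a small sketch with made-up tag content:

    # Read crawler-oriented meta tags from a page's HTML head.
    # The tag contents here are made up.
    from bs4 import BeautifulSoup

    html = """<head>
      <meta name="description" content="Example product catalogue">
      <meta name="robots" content="index, follow">
    </head>"""
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all("meta"):
        print(tag.get("name"), "->", tag.get("content"))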

A subsequent milestone in the evolution of screen scraping was the development of e-retail screen scraping, perhaps the most well-known example being the introduction of price comparison websites.

These sites employ screen scraping programs to periodically visit a list of known e-retail sites to obtain the latest price and availability information for a specific set of products or services. This information is then stored in a database and used to provide aggregated comparative views of the e-retail landscape to interested customers.
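
A hedged sketch of that periodic visit-and-store workflow, using Python's built-in sqlite3 module; the sites are placeholders and fetch_price() stands in for a real scraping routine:

    # Periodically record scraped prices in a local database so they
    # can later be compared across sites. fetch_price() is a stub.
    import sqlite3
    import time

    def fetch_price(site):
        return 9.99  # placeholder: a real scraper would fetch and parse the page

    conn = sqlite3.connect("prices.db")
    conn.execute("CREATE TABLE IF NOT EXISTS prices (site TEXT, price REAL, ts REAL)")

    for site in ["http://shop-a.example", "http://shop-b.example"]:
        conn.execute("INSERT INTO prices VALUES (?, ?, ?)",
                     (site, fetch_price(site), time.time()))
    conn.commit()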

In general, the previously described screen scraping techniques have been welcomed by website operators, who want their sites to be indexed by leading search engines such as Google or Bing; similarly, e-retailers typically want their products displayed on the leading comparison shopping sites.

TRP: Have there been any recent developments in competitive screen scraping?

AS: In contrast, developments in competitive screen scraping over the past few years are not necessarily so welcome. Being scraped by a search engine crawler is fine if the crawler's visits are infrequent.

Being the target of a price comparison site scraper is fine if the information obtained is used fairly. However, as the number of specialized search engines continues to increase and the frequency of price check visits skyrockets, these automated page views can rise to levels which impact the intended operation of the target site.

More specifically, if the target site is the victim of competitive scraping, the information obtained can be used to undermine the business of the site owner - for example, undercutting prices, beating odds, aggressively acquiring event tickets, or reserving inventory.

In general, we believe there is a significant increase in the use of automated bots to gather website content to seed other services, fuel competitive intelligence, and aggregate product details like pricing, features and inventory. Increasingly this information is used to get a leg up over the competition, or to increase website hit rates.

For example, in the travel and tourism industry, price scraping is a real issue, as travel sites are constantly looking to beat out the competition by offering the 'best price'. Additionally, inventory scraping is becoming more common: bots are used to purchase volumes of a high-value item for resale, or to drive up online pricing and deter potential buyers.

With the high availability of seemingly legal software bundles and services to facilitate the screen scraping process, and the motives we've just described, it's really a pretty powerful combination.

TRP: How long has screen scraping been going on for and is it becoming more or less of a problem for companies?

AS: Screen scraping has been going on for years but it is only more recently that victims, negatively impacted by this type of behaviour, are beginning to react. Some claim copyright infringement and unfair business practices while in contrast, organizations doing the scraping claim freedom of information.

Many website owners have written usage policies on their sites that prohibit aggressive scraping but have no ability to enforce their policies - the problem doesn't seem to be going away anytime soon.

TRP: How does screen scraping impact negatively on a business's IT systems?

AS: Competitive or abusive screen scraping is just another example of unwanted traffic. Recent studies show that 61% of internet traffic is generated by bots. Bad-bot scrapers consume valuable resources and bandwidth intended to serve genuine website users, which can result in increased latency for real customers due to large numbers of non-human visits to the site. The business impact manifests itself as additional IT investment needed to serve the same number of customers.

TRP: eBay introduced an API years ago to combat screen scraping. Is creating an API to provide access to data a recommended form of defense?

AS: Providing a dedicated API gives "good" scrapers programmatic access to your data under voluntarily observed resource utilization limits; however, it does not stop malicious harvesting of information to be used for competitive advantage.
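
As a sketch of how an API can enforce such limits rather than merely request them, a simple per-client token bucket is one common pattern; this is an illustrative toy, not any particular vendor's implementation:

    # A toy token-bucket rate limiter an API could apply per client.
    import time

    class TokenBucket:
        def __init__(self, rate, capacity):
            self.rate, self.capacity = rate, capacity  # tokens/sec, max burst
            self.tokens, self.last = capacity, time.time()

        def allow(self):
            now = time.time()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # client has exceeded its allowance

    bucket = TokenBucket(rate=1.0, capacity=5)
    print([bucket.allow() for _ in range(10)])  # roughly: first 5 pass, rest fail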

Real defense can be obtained by taking advantage of technology that can identify and block unwanted non-human visitors to your website. This would allow real or 'good' users to access the site for their intended purposes, while blocking the bad crawlers and bots from causing damage.

TRP: How else can an organisation defend itself from screen scraping?

AS: Using techniques such as IP reputation intelligence, geolocation enforcement, spoofed IP source detection, real-time threat-level assessment, request-response behaviour analysis and bi-directional deep packet inspection.
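
As a toy illustration of the request-rate side of that analysis, the sketch below flags clients whose request frequency looks non-human; the log data and threshold are invented, and commercial products combine many more signals:

    # Toy request-rate analysis: flag clients making too many requests
    # per minute to plausibly be human. All data here is invented.
    from collections import defaultdict

    # (client_ip, timestamp) pairs, as might be parsed from an access log
    requests_log = [("10.0.0.1", t) for t in range(60)] + \
                   [("10.0.0.2", 5), ("10.0.0.2", 50)]

    counts = defaultdict(int)
    for ip, _ in requests_log:
        counts[ip] += 1

    THRESHOLD = 30  # requests per minute considered non-human
    suspected_bots = [ip for ip, n in counts.items() if n > THRESHOLD]
    print(suspected_bots)  # ['10.0.0.1']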

Many organizations today rely on Corero's First Line of Defense technology to block unwanted website traffic, including excessive scraping. Corero helps distinguish human visitors from non-human bots (e.g. running scripts) and blocks the unwanted offenders in real time.

TRP: Are there any internet rules governing the use (or misuse) of screen scraping?

AS: Screen scraping has been the topic of some pretty high-profile lawsuits, for example Craigslist vs. PadMapper and, in the travel space, Ryanair vs. Budget Travel.

However, most court cases to date have not been fully resolved to the satisfaction of the victims. The courts often refuse to grant injunctions for such activity, most likely because they have no precedent to work with. This is primarily due to the fact that there are few, if any, internet rules really governing this type of activity.

Source: http://www.techradar.com/news/internet/web/screen-scraping-how-to-stop-the-internet-s-invisible-data-leaches-1214404

Monday 3 March 2014

Internet Marketing - A Beginner's Guide

Every new website owner is faced with the problem of getting visitors to their site, and if the website happens to be a commercial venture it needs to happen fast; even non-commercial sites need visitors to survive. A quick look at some of the online tools for finding expired domains shows just how many websites fail every hour of every day. To help prevent your site from joining the daily list of failures, effective marketing is essential.

Marketing does not necessarily mean spending lots of money; however, if fast results are needed, then some money will have to be spent. As this article is aimed at beginners, I'll only briefly look at paid marketing and concentrate mainly on the free or low-cost options for promoting your site.

Paid Marketing:

In many respects online marketing is not dissimilar to offline marketing, and many of the tactics used to promote an offline business will work equally well for websites. For example, newspaper or magazine adverts, although sometimes expensive, can produce excellent results for websites, and occasionally magazines will supply cover CDs with links back to your site - an excellent way to get more visitors.

For larger campaigns, television or radio ads can work, and with the expansion of satellite TV stations, costs for this type of campaign are coming down all the time. For most new site owners, however, these options will prove too expensive, and specialist online advertising will be the preferred choice. In the online paid sector, directory listings and Pay Per Click (PPC) are probably the most popular. PPC quite simply works as the name describes: you pay a set amount every time a potential customer clicks your link. This has the advantage that you only pay when someone visits your site, and the obvious disadvantage that it's open to click fraud, where others click the links to increase your bill.

Most of the providers of PPC offer protection against click fraud and block multiple clicks from the same IP address, although for the site owner this is difficult to monitor, and the suspicion is always there that the PPC provider has little incentive to police its policies, as enforcement would lead to loss of revenue. From experience, PPC advertising can become very expensive, and although it's quite easy to set up yourself through companies like Google, it can sometimes be worth paying a specialist to run your campaign, as they will know how best to target the clicks to make the best use of your money.

Free Marketing:

In business, as in all walks of life, the general rule is that you get nothing for nothing. With internet marketing this rule somewhat goes out the window, as there are many free resources for promoting your website. In this section I'm going to take a look at a few of the most popular; this is of course only scraping the surface, and as you develop your strategy further you'll find new outlets and tools that will get lots of visitors popping by your website.

Search Engines:

Without question, the most effective free source of visitors is the many search engines like Google, Yahoo and MSN... the list really is almost endless. Most new website owners believe that to get listed in these popular search engines, the best thing to do is submit your site to them; this is not the case. All the main search engines these days use crawlers, which automatically browse the web and store the contents of the sites they visit.

The search engines then use this content (along with hundreds of other criteria) to rank sites for specific search terms. Search engine optimization (SEO) is a whole different subject; in brief, however, what you need is links from other websites, and some of the marketing tips below, along with attracting visitors to your site, will help with your SEO efforts. The truth is that many people use website marketing solely to increase their ranking in search engines; in my experience, however, I've found that if you promote your site for real visitors, the search engines will follow.

There are many SEO companies who specialise in getting your site up the rankings. You do need to be careful, as many will make great and exaggerated claims, some of which are simply not possible. The best advice I can give here is to research thoroughly and look at, and if possible contact, past customers for a reference. Good SEO can be the most cost-effective way to promote your site, but you do need to work hard at it or employ someone else to do it for you.

Forums:

Internet forums can be very useful for getting visitors to your site. The biggest advantage with forums is that you can target forums related to your area of interest, and most will allow members who contribute useful posts and replies to have a link back to their own website in their signature. You should take care not to just post links, as most good forums will consider this spam, and even if the moderators don't delete the link, visitors looking at it will clearly see what's going on and your site's reputation will suffer.

Blogs:

In my opinion every website owner should have a blog, and for SEO purposes it's probably best to have the blog on a different domain. There are lots of free blog sites, like blogger.com, which offer easy-to-use blogs. The biggest advantage of your own blog is that you can write articles and provide links to different areas of your site; this provides different entry points and is also very good SEO practice.

Another use of blogs is comments. A great way of getting visitors back to your site is to search for other blogs relevant to your website and leave comments with a link back to your site. As with forums, care should be taken not to spam the comments, as it's a bad practice and unlikely to help you long term.

Article Submissions:

If you're up to it, writing an article can be a good way to get links back to your website. Most good article distribution sites, like EzineArticles.com, will allow you to have a short bio at the end of the article, which can direct visitors back to your website. If possible the article should be about a subject related to your business, but as you can see from my bio below, it's not essential :-)

Well, that's about it. As I said at the beginning, this is only an introduction for beginners, and as your site and business grow you'll find many new and exciting ways to market your online business.

Source: http://ezinearticles.com/?Internet-Marketing---A-Beginners-Guide&id=1729653