
How to Acquire Content from the Internet for Data Mining

"Cruising the Data Ocean" Blog Series - Part 2 of 6

This blog is a part of our Chief Architect's "Cruising the Data Ocean" series. It offers a deep-dive into some essential data mining tools and techniques for harvesting content from the Internet and turning it into significant business insights.


In the first part of this blog series, I discussed how to identify the sources for your data mining needs. Once you've done that, you will need to fetch the data and download it to your own computers so it can be processed. I'll cover this step here in the second part of the blog series.

RECOMMENDATION: Download raw content and save the files.

  • Saved files can be reprocessed over and over to extract more data as you learn more about the content.
  • Old versions of files can be compared to newer versions to identify changing content. This is often a useful source of notifications.
  • Saved files can be stored cheaply in cloud storage.
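As a minimal illustration of this recommendation (the URL and directory layout below are my own placeholders), a page can be fetched with Python's requests library and its untouched bytes written to a timestamped file, so earlier snapshots can be reprocessed or compared later:

    import hashlib
    import pathlib
    from datetime import datetime, timezone

    import requests

    RAW_DIR = pathlib.Path("raw-content")  # hypothetical local archive directory

    def save_raw_page(url: str) -> pathlib.Path:
        """Download a page and save the untouched bytes for later reprocessing."""
        response = requests.get(url, timeout=30)
        response.raise_for_status()

        # One folder per URL (hashed), one file per fetch time.
        url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        target = RAW_DIR / url_hash / f"{stamp}.html"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(response.content)  # raw bytes, no parsing or cleanup
        return target

Pointing the same function at cloud storage instead of a local folder is a straightforward change, and keeping every timestamped snapshot is what makes the "compare old versions to newer versions" point above possible.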

But how do you acquire content from the Internet? There are, fundamentally, four techniques.


Technique 1: Crawlers

Crawlers are über-scalable machines for downloading lots and lots of pages.

The focus of crawlers is scalability and volume. They follow links from web pages around the Internet (or within a website) and download pages. They can be distributed across many machines to download tens of thousands of web pages.

Good crawlers include:

  • Heritrix – from the Internet Archive
  • Nutch – from Apache
  • Aspider – from Search Technologies. Our Aspider web crawler is an elastically-scalable, distributed crawler with excellent website authentication support (which allows you to crawl more content).
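To make the follow-links-and-download-pages idea concrete, here is a deliberately simplified, single-machine sketch in Python (not how Heritrix, Nutch, or Aspider actually work internally) that crawls breadth-first within one site using the requests and BeautifulSoup libraries:

    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def crawl_site(start_url: str, max_pages: int = 50) -> dict:
        """Breadth-first crawl of a single site; returns {url: raw_html_bytes}."""
        seen, pages = set(), {}
        queue = deque([start_url])
        site = urlparse(start_url).netloc

        while queue and len(pages) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                response = requests.get(url, timeout=15)
                response.raise_for_status()
            except requests.RequestException:
                continue
            pages[url] = response.content  # save the raw content; parse later

            # Queue every same-site link found on this page.
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                if urlparse(link).netloc == site and link not in seen:
                    queue.append(link)
        return pages

A production crawler adds politeness delays, robots.txt handling, retry logic, and distribution across many machines, which is exactly what the tools above provide.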

Technique 2: Scrapers

Scrapers focus on extracting content.

Scrapers are typically less scalable and more hand-tuned than crawlers, and they focus instead on extracting content (such as numeric values and metadata) from the web pages they download. If you need to extract structured data from web pages based on their presentation structure, a scraper may be the best choice.

Some common scrapers include:

  • Scrapy – a Python-based scraper that also has a hosted cloud-based version and a graphical tool, Portia, to help create scrapers
  • Octoparse – an MS-Windows scraper with visual tools to implement scraping
  • Apifier – a cloud-based JavaScript scraper
  • Content Grabber – a screen scraper with scripting, dynamic parameters, and ability to handle SSO cookies and proxies
  • UiPath – a larger “automation framework” of which screen scraping is one component
  • And there are many others

Note that scrapers can only extract structured content where structure already exists on the web page; in other words, they rely on HTML tagging, JSON structures, etc. While they require more work and programming than a crawler (which is simply “point and go”), the output is more structured and immediately useful.
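As a rough idea of what that hand-tuning looks like in practice, here is a minimal Scrapy spider sketch. The listing URL and the CSS selectors are placeholders and would need to be tuned to the target page's actual HTML structure:

    import scrapy

    class ProductSpider(scrapy.Spider):
        """Minimal spider sketch; the selectors below are placeholders."""

        name = "products"
        start_urls = ["https://example.com/products"]  # hypothetical listing page

        def parse(self, response):
            # Pull structured fields out of the page's HTML tagging.
            for row in response.css("div.product"):
                yield {
                    "title": row.css("h2::text").get(),
                    "price": row.css("span.price::text").get(),
                }

            # Follow pagination links, if the page has them.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

Saved as a file, this can be run with "scrapy runspider" and the yielded items exported directly to JSON or CSV, which is the more structured, immediately useful output mentioned above.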

Technique 3: Browser Automation

Browser automation retrieves and renders the page like a web browser.

Browser automation tools actually run the JavaScript pulled from the web pages and render the HTML (and other data structures). They can then be combined with custom scripting to explore the results and download content which might otherwise be inaccessible.

Some common browser automation tools include Selenium, which drives real browsers such as Chrome and Firefox (including in headless mode), and headless browsers such as PhantomJS.
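As a small illustration of the approach, the sketch below uses Selenium with headless Chrome (one common combination); the URL and the element id are placeholders for whatever the target page actually renders:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = Options()
    options.add_argument("--headless=new")  # render pages without opening a window

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/app")  # hypothetical JavaScript-heavy page

        # Wait until the script-generated content actually appears in the DOM.
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.ID, "results"))  # placeholder element id
        )

        rendered_html = driver.page_source  # the HTML *after* JavaScript has run
        print(rendered_html[:500])
    finally:
        driver.quit()

The rendered HTML can then be handed to the same parsing or scraping code you would use on a static page.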

Technique 4: Third-Party APIs

Third-party APIs will be required for third-party content providers.

If you intend to access data from third-party content providers, such as Thomson Reuters, LexisNexis, Bing, Factiva, NewsCred, etc., you will need to use the APIs they provide.

Fortunately, these providers have taken the time and effort to deliver good structured data, and so using these APIs will typically require a lot less time than using a scraper or browser automation tool.
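The general pattern is similar across providers, even though every API differs in its details. The sketch below shows a generic authenticated REST request using Python's requests library; the endpoint, parameters, and response fields are hypothetical stand-ins, not any particular provider's actual API:

    import os

    import requests

    API_KEY = os.environ["CONTENT_API_KEY"]  # hypothetical credential
    ENDPOINT = "https://api.example-provider.com/v1/articles"  # hypothetical endpoint

    def fetch_articles(query: str, page: int = 1) -> list:
        """Query a (hypothetical) third-party content API and return its records."""
        response = requests.get(
            ENDPOINT,
            params={"q": query, "page": page},
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        response.raise_for_status()
        # Providers generally return well-structured JSON, so little cleanup is needed.
        return response.json().get("articles", [])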

General Guidelines

Generally, crawlers are used for low-complexity content extraction at very large volumes while scrapers and browser automation tools are used for lower numbers of pages with more complexity:

[Diagram: web content parsers quadrant – extraction complexity vs. volume of pages]

Of course, the upper right-hand quadrant of this diagram is the holy grail: extracting very large amounts of complex content from a large number of websites, each with a large number of pages. Such an extraction job can be expensive, depending on the number of websites and the variety of access methods.

Handling A Very, Very Large Number of Sources…

If you need to acquire content from a large number of data sources, you will likely need to develop your own data acquisition and ingestion tools.

  • At a minimum, you will need tools to manage the source catalog.
  • If your sources are fairly similar in structure, AND the content is not easily or accurately acquired by simple crawling, then custom acquisition tools may be the most effective.
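What managing the source catalog means in practice varies, but at its simplest it is a structured list of sources, each with its access method and fetch schedule. A minimal, hypothetical sketch:

    import csv
    from dataclasses import dataclass

    @dataclass
    class Source:
        """One entry in a (hypothetical) source catalog."""
        name: str
        start_url: str
        method: str            # e.g. "crawler", "scraper", "browser", or "api"
        fetch_every_hours: int

    def load_catalog(path: str) -> list:
        """Load the catalog from a CSV file with one row per source."""
        with open(path, newline="", encoding="utf-8") as handle:
            return [
                Source(
                    name=row["name"],
                    start_url=row["start_url"],
                    method=row["method"],
                    fetch_every_hours=int(row["fetch_every_hours"]),
                )
                for row in csv.DictReader(handle)
            ]

From there, a dispatcher can route each source to the right acquisition technique, whether that is a crawler, a scraper, a browser automation script, or a third-party API call.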

After acquiring the data you need, the next step will be cleansing and formatting your raw data, so that you'll have the highest-quality content ready for data mining. I'll discuss data preparation tools and methods in the next part of this blog series. Read on!

- Paul

> Continue to Part 3: Cleanse & Format the Data
