
Data Mining Tools and Techniques for Harvesting Data from the Internet

“Cruising the Data Ocean” Blog Series - Part 1 of 6

Paul Nelson
Innovation Lead

“Maybe we can use data from the Internet?”

Have you ever said that sentence? In my recent experience, it’s coming up more and more. After all, the Internet holds so much incredible information; if only it could be downloaded and processed, just think of how valuable it could be.

Web data mining is a growing field which can provide powerful insights to help drive sales, understand customers, meet mission goals, and create new business opportunities.

In this blog series, I’ll be discussing multiple use cases as well as essential data mining tools and techniques for harvesting Internet data to support business analytics and intelligence.

In this first part of the series, let’s have a high-level look at some business use cases for extracting web data and how to identify the right data for your needs from the “data ocean.”

Use Cases for Extracting Data from the Web

At Search Technologies, we’ve been helping customers extract data from the Internet almost since our company was founded. And the use cases are endless. Here are some examples:

Learn more about your customer

  • What is the CEO of your customer’s company saying?
  • What is your customer’s financial situation, and what are their key initiatives?
  • What are your customers tweeting and posting about recently?

Learn more about your competitors

  • What are your competitors doing?
  • What are they selling?
  • Are they doing anything new? Unique?

Find new customers and sales targets

  • What’s happening in the world?
  • Where should you target your sales?

Learn more about the government

  • What rules and regulations affect your company?
  • What is the government thinking about doing that might affect your business?
  • What are available grants and business opportunities?

Find things that are being sold and the people who sell them

  • To compare prices
  • To look for new business opportunities
  • To look for illegal activities and things which should not be sold

Supplement your internal offerings with external content

  • So users can “stay inside” your offering without having to consult external databases

Translate between external language and internal language

  • Often, the words and phrases used by your community are different from the ones used inside your company
  • Consulting external sources can help “translate” between external language and internal language

Watch what people are saying about you

  • Identify and mitigate potential customer issues before they go viral
  • Track the effectiveness of your ad campaigns 
  • Track product and brand activity and sentiment

Web Data Mining and Analytics Projects at Search Technologies

Here are some example customer projects that Search Technologies has been involved in:

  • Scan through news articles to identify what companies are planning to expand or relocate - for a US state business development group
  • Use 10-K filings, shareholder meeting records, and annual reports to learn key company initiatives - for an industry association and a top-five consulting firm
  • Search for people selling illegal pets - for an agriculture department enforcement agency
  • Gather mining jobs and mining articles for a vertically-focused web search - for a mining publisher
  • Read through government rules and regulations - for an industry lobbying organization
  • Read through speeches and public statements for a candidate for office - for a political party
  • Download required textbooks for university classes to know what books their students will need - for a textbook rental company
  • Download construction rules and regulations - for a large construction company
  • Identify and track conferences with location and organization - for a conference support company

If you have any questions about these use cases or are looking to implement your own web data mining initiatives, contact us for further discussions. 

How to Identify Useful Data Sources on the Web

For most of us, it’s impractical to download all the data on the web. Therefore, you must first identify the data sources you want to target. Data, of course, covers a very wide range of quality, volume, applicability, and accessibility.



  • Paid data providers:

- These are websites willing to sell you their data.

- All have APIs for searching, filtering, and downloading content. Their available data includes news stories (from large and small news organizations around the world, both global and local), company reports, annual reports, financial filings, worldwide patents, marketing and market reports, corporate communications, and so on.

  • Niche websites: Stack Exchange (e.g., Stack Overflow, which has data dumps and an API), GitHub, and others
  • Coding schemes: these are often good starting points for content analysis.

- Industry coding: NAICS (North American Industry Classification System) and SIC (Standard Industrial Classification)

- Job coding: Standard Occupational Classification (SOC)

  • The World Wide Web:

- Of course, you can manually identify web pages (“seed URLs”) for a crawler to crawl.

- Or, you can get a set of websites from a search engine, for example, Bing or Google Custom Search (note that there is a cost for more than 100 or so searches per day). The websites returned by these search engines can then be crawled with a web crawler.

- Finally, you can also get seed URLs from other data sets, such as Wikidata, Twitter, and Reddit.
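The search-engine route above can be sketched in a few lines of Python. This is a minimal illustration rather than production code: it assumes the Google Custom Search JSON API, and the API key and search-engine ID shown are placeholders you would replace with your own credentials.

```python
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_API_KEY"      # placeholder: a Google API key
ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder: a Custom Search engine ID


def extract_seed_urls(response_json):
    """Pull the result links out of a Custom Search JSON response."""
    return [item["link"] for item in response_json.get("items", [])]


def search_for_seeds(query):
    """Query the Custom Search JSON API and return result URLs as seeds.

    Requires valid credentials and network access; note the free tier
    is capped at roughly 100 queries per day.
    """
    params = urllib.parse.urlencode(
        {"key": API_KEY, "cx": ENGINE_ID, "q": query})
    url = "https://www.googleapis.com/customsearch/v1?" + params
    with urllib.request.urlopen(url) as resp:
        return extract_seed_urls(json.load(resp))


# The returned URLs would then be handed to a web crawler as seed URLs:
# seeds = search_for_seeds("mining industry news")
```

The same pattern applies to the other seed sources mentioned: query an API (Wikidata, Twitter, Reddit), extract the URLs from the response, and feed them to the crawler.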

Once you’ve identified the sources from which you need data, the next step is to use available data mining tools and techniques to acquire the content effectively. I’ll discuss this step in the next part of my blog series. Read on!

- Paul

> Continue to Part 2: Acquire the Data