Data Mining Tools and Techniques for Harvesting Data from the Internet
“Cruising the Data Ocean” Blog Series - Part 1 of 6
“Maybe we can use data from the Internet?”
Have you ever said that sentence? In my recent experience, it comes up more and more. After all, the Internet holds so much incredible information; if only it could be downloaded and processed, just think of how valuable it could be.
Web data mining is a growing field which can provide powerful insights to help drive sales, understand customers, meet mission goals, and create new business opportunities.
In this blog series, I’ll be discussing multiple use cases as well as essential data mining tools and techniques for harvesting Internet data to support business analytics and intelligence. I’ll cover how to:
- Part 1: Identify the data
- Part 2: Acquire the data
- Part 3: Cleanse and format the data
- Part 4: Understand the data in its natural language
- Part 5: Work with the results
- Part 6: Do quality analysis
In this first part of the series, let’s have a high-level look at some business use cases for extracting web data and how to identify the right data for your needs from the “data ocean.”
Use Cases for Extracting Data from the Web
At Search Technologies, we’ve been helping customers extract data from the Internet almost since our company was founded. And the use cases are endless. Here are some examples:
Learn more about your customer
- What is the CEO of your customer’s company saying?
- What is your customer’s financial situation, and what are their key initiatives?
- What are your customers tweeting and posting about recently?
Learn more about your competitors
- What are your competitors doing?
- What are they selling?
- Are they doing anything new? Unique?
Find new customers and sales targets
- What’s happening in the world?
- Where should you target your sales?
Learn more about the government
- What rules and regulations affect your company?
- What is the government thinking about doing that might affect your business?
- What are available grants and business opportunities?
Find things that are being sold and the people who sell them
- To compare prices
- To look for new business opportunities
- To look for illegal activities and things which should not be sold
Supplement your internal offerings with external content
- So users can “stay inside” your offering without having to consult external databases
Translate between external language and internal language
- Often, the words and phrases used by your community differ from those used inside your company
- Consulting external sources can help “translate” between the two vocabularies
Watch what people are saying about you
- Identify and mitigate potential customer issues before they go viral
- Track the effectiveness of your ad campaigns
- Track product and brand activity and sentiment
Web Data Mining and Analytics Projects at Search Technologies
Here are some example customer projects that Search Technologies has worked on:
- Scan through news articles to identify what companies are planning to expand or relocate - for a US state business development group
- Use 10-K filings, shareholder meeting materials, and annual reports to identify key company initiatives - for an industry association and a top-five consulting firm
- Search for people selling illegal pets - for an agriculture department enforcement agency
- Gather mining jobs and mining articles for a vertically-focused web search - for a mining publisher
- Read through government rules and regulations - for an industry lobbying organization
- Read through speeches and public statements for a candidate for office - for a political party
- Download required-textbook lists for university classes to know what books students will need - for a textbook rental company
- Download construction rules and regulations - for a large construction company
- Identify and track conferences with location and organization - for a conference support company
If you have any questions about these use cases or are looking to implement your own web data mining initiatives, contact us for further discussions.
How to Identify Useful Data Sources on the Web
For most of us, it’s impractical to download all the data on the web. Therefore, you must first identify the data sources you want to target. Sources vary widely in quality, volume, applicability, and accessibility. Here are some common categories:
- Curated public sources: Wikipedia (available in convenient XML dump files – see Search Technologies' Azure Search demonstration using Wikipedia), Wikidata, and Wiktionary
- Social media sources: Twitter, Facebook, Reddit, Instagram, Pinterest, Google+
- Government data: US Government Publishing Office, United States Code, Data.gov
- Medical and health: Medline, MeSH, and CPT and ICD codes
- Company content: company websites can be crawled directly (Wikidata is a good “jumping-off point” for finding website addresses); also AnnualReports.com and EDGAR
- Third-party aggregators: Thomson Reuters, Factiva, NewsCred, and LexisNexis
  - These are websites willing to sell you their data.
  - All have APIs for searching, filtering, and downloading content. Available data includes news stories (from news organizations around the world, both global and local), company reports, annual reports, financial filings, worldwide patents, marketing and market reports, corporate communications, and so on.
- Niche websites: Stack Exchange (e.g. Stack Overflow, which has data dumps and an API), GitHub, and others
- Coding schemes: these are often good starting points for content analysis.
  - Job coding: Standard Occupational Classification (SOC)
- The World Wide Web:
  - Of course, you can manually identify web pages (“seed URLs”) for a crawler to crawl.
  - Or, you can get a set of websites from a search engine such as Bing or Google Custom Search (note that queries beyond roughly 100 per day incur a cost). The websites these engines return can then be crawled with a web crawler.
  - Finally, you can also draw seed URLs from other data sets, such as Wikidata, Twitter, and Reddit.
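To make the Wikipedia option above concrete, here is a minimal sketch of streaming an XML dump with Python’s `xml.etree.ElementTree.iterparse`, so a multi-gigabyte file never has to fit in memory. The tiny inline sample mimics the export format; a real dump uses the full MediaWiki export schema (with an XML namespace), so the tag paths below would need adjusting.

```python
# Stream a MediaWiki-style XML dump page by page instead of loading it whole.
import io
import xml.etree.ElementTree as ET

SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>Data mining</title>
    <revision><text>Data mining is the process of ...</text></revision>
  </page>
  <page>
    <title>Web crawler</title>
    <revision><text>A Web crawler is an Internet bot ...</text></revision>
  </page>
</mediawiki>"""

def iter_pages(stream):
    """Yield (title, text) pairs, freeing each finished <page> element."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            text = elem.findtext("revision/text")
            yield title, text
            elem.clear()  # release memory held by the completed page

pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP)))
print([title for title, _ in pages])
```

The same streaming pattern works for other large XML sources, such as government publishing feeds.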
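For the niche sites with APIs, such as Stack Exchange, harvesting usually starts by composing a query URL. The sketch below only builds the URL (endpoint and parameter names follow the public Stack Exchange API v2.x); the actual fetching, authentication keys, and paging are left out.

```python
# Build a Stack Exchange /questions query URL for harvesting recent questions.
from urllib.parse import urlencode

def questions_url(site, tag, page=1, pagesize=100):
    """Return a Stack Exchange API URL for questions with a given tag."""
    params = {
        "site": site,          # e.g. "stackoverflow"
        "tagged": tag,         # restrict results to one tag
        "order": "desc",
        "sort": "creation",    # newest questions first
        "page": page,
        "pagesize": pagesize,  # the API serves at most 100 items per page
    }
    return "https://api.stackexchange.com/2.3/questions?" + urlencode(params)

print(questions_url("stackoverflow", "web-scraping"))
```

A harvester would loop over `page` until the API reports no more results, respecting the service’s rate limits.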
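Once you have seed URLs, a crawler’s core loop is: fetch a page, extract its out-links, and queue those links as the next round of URLs. Here is a minimal sketch of the link-extraction step using only the standard library; the inline HTML stands in for a fetched page, and a production crawler would also honor robots.txt, deduplicate URLs, and throttle requests.

```python
# Extract absolute out-links from a page: the seed-expansion step of a crawler.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

SAMPLE_HTML = """<html><body>
  <a href="/about">About</a>
  <a href="https://example.org/news">News</a>
</body></html>"""

extractor = LinkExtractor("https://example.com/index.html")
extractor.feed(SAMPLE_HTML)
print(extractor.links)
```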
Once you’ve identified the sources you need, the next step is to acquire the content effectively using available data mining tools and techniques. I’ll discuss that step in the next part of this blog series. Read on!