Quality Analysis in Data Mining Projects
“Cruising the Data Ocean” Blog Series - Part 6 of 6
This blog is a part of our Chief Architect's "Cruising the Data Ocean" series. It offers a deep-dive into some essential data mining tools and techniques for harvesting content from the Internet and turning it into significant business insights.
In my previous posts, I discussed how to identify, acquire, cleanse, and extract meaning from Internet content and use it to build your business applications. But how do you ensure that your system always returns the highest-quality results? This is where quality analysis plays an essential role in your web data mining project.
Errors are introduced in many different ways when processing content from the Internet, so quality analysis must be planned into your project from the very beginning. Most quality analysis boils down to checking two things:
- Do you have everything? This is the “completeness” or “coverage” check.
- Is what you have correct? This is the “accuracy” check.
Quality Analysis Techniques
Some techniques for doing quality analysis in data mining include:
- Count checks – check that record counts are accurate
- Check counts against the original website or other third parties where possible. Do not check counts against Internet search engine results.
- Check counts from run to run
- Check for counts that are too small or too large
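As an illustrative sketch, the run-to-run and too-small/too-large count checks might look like this in Python; the function name and thresholds are hypothetical, and should be tuned to your own collection:

```python
def check_counts(current_count, previous_count, max_drift=0.10,
                 min_expected=1, max_expected=10_000_000):
    """Flag record counts that look suspicious relative to the last run.

    max_drift, min_expected, and max_expected are illustrative
    thresholds -- tune them to your own collection.
    """
    problems = []
    if current_count < min_expected:
        problems.append("count too small")
    if current_count > max_expected:
        problems.append("count too large")
    if previous_count:
        drift = abs(current_count - previous_count) / previous_count
        if drift > max_drift:
            problems.append(f"count changed {drift:.0%} since last run")
    return problems

# A 50% drop from the previous run should be flagged:
print(check_counts(5_000, 10_000))  # ['count changed 50% since last run']
```

A check like this is cheap enough to run after every crawl, so an unexpected drop is caught immediately rather than weeks later.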
- Validation checks – check metadata values for existence and accuracy by looking for:
- Nulls or zero-length strings
- Values out of range (for integers or floats)
- Strings that are too small or too large
- Strings that should match regular expressions
- Coordinated values (e.g. check metadata fields against each other when values are correlated)
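A minimal sketch of these validation checks, using a hypothetical `validate_record` helper and a hand-rolled schema format (a real project might use a schema-validation library instead):

```python
import re

def validate_record(record, schema):
    """Check one metadata record against a simple schema.

    schema maps field name -> dict of optional constraints:
    'min'/'max' for numbers, 'min_len'/'max_len' for strings,
    'pattern' for a regular expression the value must fully match.
    This is an illustrative sketch, not a full validation library.
    """
    errors = []
    for field, rules in schema.items():
        value = record.get(field)
        if value is None or value == "":
            errors.append(f"{field}: null or empty")
            continue
        if isinstance(value, (int, float)):
            if "min" in rules and value < rules["min"]:
                errors.append(f"{field}: below range")
            if "max" in rules and value > rules["max"]:
                errors.append(f"{field}: above range")
        if isinstance(value, str):
            if "min_len" in rules and len(value) < rules["min_len"]:
                errors.append(f"{field}: too short")
            if "max_len" in rules and len(value) > rules["max_len"]:
                errors.append(f"{field}: too long")
            if "pattern" in rules and not re.fullmatch(rules["pattern"], value):
                errors.append(f"{field}: bad format")
    return errors

schema = {"age": {"min": 0, "max": 130},
          "zip": {"pattern": r"\d{5}"}}
print(validate_record({"age": 250, "zip": "1234"}, schema))
# ['age: above range', 'zip: bad format']
```

Coordinated-value checks (comparing correlated fields to each other) would follow the same pattern, with rules that receive the whole record rather than a single value.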
- Distribution analysis – check metadata value distributions and look for oddities
- Perform histograms on metadata values
- Check for spikes, weird non-normal distributions, discontinuities, or unexplained gaps
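One way to automate the spike check is sketched below; the `spike_ratio` threshold is an illustrative assumption, flagging any value that occurs far more often than the median histogram bucket:

```python
from collections import Counter

def histogram_anomalies(values, spike_ratio=10):
    """Report values whose frequency spikes well above the median bucket.

    spike_ratio is an illustrative threshold: a value is suspicious if
    it occurs spike_ratio times more often than the median frequency.
    """
    counts = Counter(values)
    freqs = sorted(counts.values())
    median = freqs[len(freqs) // 2]
    return {v: n for v, n in counts.items() if n > spike_ratio * median}

# A default value like 0 often shows up as a spike in an age field:
ages = [0] * 500 + list(range(20, 60)) * 2
print(histogram_anomalies(ages))  # {0: 500}
```

Spikes like this frequently point at a default or sentinel value leaking in from a failed extraction, which a manual scan of a large histogram can easily miss.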
- Randomly sample rich subsets
- Use grep or a search engine to find records that are likely to include patterns of interest. For example, if you are trying to extract people and their ages (see example patterns in part 4 of this series), find all records containing “age”, “years old”, or “birthday”.
- Randomly sample the results and manually check that extraction was performed correctly
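The rich-subset technique can be sketched as a grep-style filter followed by a random sample; the patterns below are the illustrative age example, and the sample size and seed are arbitrary:

```python
import random
import re

def sample_rich_subset(records, patterns, sample_size=20, seed=42):
    """Grep-style filter for records likely to contain an extraction
    target, followed by a random sample for manual review.

    The patterns, sample size, and seed are illustrative; a search
    engine query over the collection serves the same purpose.
    """
    regex = re.compile("|".join(patterns), re.IGNORECASE)
    rich = [r for r in records if regex.search(r)]
    random.seed(seed)
    return random.sample(rich, min(sample_size, len(rich)))

records = ["Alice is 34 years old",
           "Weather report for Tuesday",
           "Bob's birthday is May 2"]
subset = sample_rich_subset(records, ["age", "years old", "birthday"],
                            sample_size=2)
# subset now holds the two age-bearing records for manual inspection
```

Sampling from the rich subset rather than the whole collection concentrates your manual-review effort on records where the extractor actually had something to find.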
- Random sampling
- Randomly sample all records
- Manually check each record from the sample for correct download and extraction
Quality Analysis Goals
- Completeness of bulk download from the Internet
- Do you have all of the records? Count all records and compare against counts provided by the website(s), if available. Do not compare against Internet search engine counts for the site(s): search engine result counts are rough estimates and are grossly unreliable, and Google’s counts are especially bad.
- Are all of the records complete? Look for zero-byte or suspiciously small downloads. Check that a </html> end tag exists for every record (where appropriate). If possible, compare the size reported in the HTTP “Content-Length” header against the actual size of each downloaded record.
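A sketch of these per-record completeness checks; the 1 KB “suspiciously small” threshold is an illustrative assumption:

```python
def check_download(html, content_length=None):
    """Flag a downloaded page that looks truncated or empty.

    The 1 KB minimum is an illustrative threshold; content_length is
    the value of the HTTP Content-Length header, when available.
    """
    problems = []
    size = len(html.encode("utf-8"))
    if size == 0:
        problems.append("zero-byte download")
    elif size < 1024:
        problems.append("suspiciously small download")
    if "</html>" not in html.lower():
        problems.append("missing </html> end tag")
    if content_length is not None and content_length != size:
        problems.append("size differs from Content-Length header")
    return problems

print(check_download("<html><body>oops"))
# ['suspiciously small download', 'missing </html> end tag']
```

Truncated pages are the most common silent failure in bulk downloads, so it pays to run checks like these on every record, not just a sample.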
- Completeness of incremental download from the Internet
- Performing incremental updates is especially fraught with peril.
- You’ll want to make sure that you download all files added or modified since the last scan, including files that were created while the previous scan was running.
- Compare the results of incremental updates with a streaming view of updates, if possible
- Manually check incremental updates against updates from the website, if possible (if, for example, they have a search engine on the site with a “sort by date” function)
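One common defense, sketched below under the assumption of a time-based incremental crawl: start the next scan from the *start* time of the previous scan (minus an illustrative safety overlap) rather than from its end time, and deduplicate the re-fetched overlap downstream:

```python
from datetime import datetime, timedelta

def incremental_cutoff(last_scan_start, overlap=timedelta(hours=1)):
    """Compute the 'modified since' cutoff for the next incremental crawl.

    Using the *start* of the previous scan (minus an illustrative
    overlap) rather than its end ensures files created while that scan
    was running are not silently skipped; the overlap re-fetches some
    records, so deduplicate downstream by record ID.
    """
    return last_scan_start - overlap

# Last scan started at 03:00 on Jan 10; re-fetch everything since 02:00:
print(incremental_cutoff(datetime(2024, 1, 10, 3, 0)))
# 2024-01-10 02:00:00
```

Re-downloading a little too much and deduplicating is far cheaper than diagnosing records that silently fell into the gap between two scans.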
- Accuracy and completeness of tagged metadata extraction
- Check that metadata extracted from HTML tagged content (or other contextual metadata, for example, from HTML parent pages) is accurate and complete
- This is best done with validation checks and distribution analysis.
- Accuracy and completeness of basic linguistic processing
- Check that tokenization and token processing are working correctly
- Check most frequent and least frequent token lists for anomalies
- Check largest and smallest tokens for anomalies
- Randomly sample documents and check token processing
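The token checks above can be automated with a simple frequency report; `top_n` and `max_token_len` are illustrative thresholds:

```python
from collections import Counter

def token_report(tokens, top_n=5, max_token_len=30):
    """Summarize token statistics worth eyeballing for anomalies.

    top_n and max_token_len are illustrative thresholds; very long
    "tokens" usually mean markup or encoding debris leaked into the
    text, and a distorted most-frequent list points at tokenizer bugs.
    """
    counts = Counter(tokens)
    return {
        "most_frequent": counts.most_common(top_n),
        "least_frequent": counts.most_common()[:-top_n - 1:-1],
        "oversized": [t for t in counts if len(t) > max_token_len],
    }

tokens = "the cat sat on the mat".split() + ["&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"]
report = token_report(tokens)
# report["oversized"] flags the HTML-entity run that leaked through
```

The least-frequent list is often the most revealing: tokenizer bugs tend to manufacture one-off garbage tokens that never recur.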
- Accuracy and completeness of entity extraction
- Check histograms of extracted entities for anomalies
- Perform searches for entity parts (such as part of a name or part of a company name). Randomly sample the results of this rich subset and check whether the entities were extracted correctly
- Randomly sample a set of documents and check for correct entity extraction
- Accuracy and completeness of categorization
- Evaluate results against pre-tagged content (typically a held-out percentage of the labeled training data)
- Check histograms of categories for anomalies and sensible distribution. Compare histograms against the training set
- Randomly sample a set of documents and check for correct categorization
- Perform an “all pairs similarity” test against a subset of content. Identify similar pairs which are categorized into different categories. Check a random sample of this rich subset
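A toy sketch of the all-pairs similarity test, using Jaccard similarity over token sets as an illustrative stand-in for whatever similarity measure your pipeline actually uses:

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b)

def cross_category_pairs(docs, threshold=0.5):
    """Find similar document pairs that landed in different categories.

    docs is a list of (doc_id, category, token_set); the Jaccard
    measure and 0.5 threshold are illustrative. This is O(n^2), so
    run it against a subset of the collection.
    """
    suspects = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            id_a, cat_a, toks_a = docs[i]
            id_b, cat_b, toks_b = docs[j]
            if cat_a != cat_b and jaccard(toks_a, toks_b) >= threshold:
                suspects.append((id_a, id_b))
    return suspects

docs = [("d1", "sports",   {"world", "cup", "final"}),
        ("d2", "politics", {"world", "cup", "final", "tickets"}),
        ("d3", "politics", {"election", "results"})]
print(cross_category_pairs(docs))  # [('d1', 'd2')]
```

Near-duplicate documents that land in different categories are exactly the rich subset worth sampling: either the categorizer is inconsistent, or the two categories genuinely overlap and need clearer definitions.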
- Accuracy and completeness of natural language processing extraction
- Perform all of the metadata tests described above: validation checks, distribution analysis, random sample of rich subsets, and random sampling
- Compare results against manually extracted content, if available
Summary and Conclusions
It’s important to understand that this blog series aims to provide a complete, end-to-end view of how Internet content can be acquired, harvested, and processed for internal use.
I’m sure that this all seems daunting and complex, but it doesn’t have to be. It all depends on your requirements and use cases. For example:
- Doing simple metadata extraction from structured content AND the number of sources and content types is low AND the sources and content types are fairly regular: can be implemented quickly - 1 to 2 months
- Doing a macro understanding AND your tolerance for incorrect results is high (for example, with a human in the loop): can be implemented quickly - 1 to 2 months
- Doing a micro understanding AND the number of facts or relationship types is small AND the entities required are modest and fairly regular: can be implemented in a reasonable time - 3 to 4 months
Further, APIs and toolkits for web data mining are improving all the time, so expect development times to shrink further as the technology matures.
I first studied Natural Language Processing in 1986. My professor at the time and I then went on to form our own search engine company, “ConQuest” – for “Concept Questing.” So you could say that Natural Language Processing is in my blood.
Recently, I’ve started noticing more and more use cases of customers wanting to download content from the Internet and leverage it for their own use. In many of these cases, some sort of Natural Language Processing (Macro or Micro) was required.
And now I’m seeing these technologies come together in a powerful way. Natural Language Processing, which has always been something of a “sleeper technology,” has gotten a huge boost recently from personal digital assistants (Alexa, Siri, Google Home, etc.). So many of our customers are now asking: “Why can’t I have that for my company?”
And the answer is: You can!
Just be aware of the many different types of processing that are possible, and of how requirements drive architecture and technology decisions. With that awareness, it is definitely possible to create a working, robust system for acquiring, harvesting, and turning content from the Internet into insightful knowledge.
Search Technologies has been helping customers extract and process content from the Internet for multiple use cases, some of which were discussed in part 1 of this series. If you’re looking to analyze, size, estimate, and design your web data mining application or your Natural Language Processing system, we have knowledgeable architects with extensive search and NLP experience who are ready to help. Connect with us to discuss your use case.