
Cleansing and Formatting Content for Data Mining Projects

"Cruising the Data Ocean" Blog Series - Part 3 of 6

Paul Nelson
Innovation Lead

This blog is a part of our Chief Architect's "Cruising the Data Ocean" series. It offers a deep-dive into some essential data mining tools and techniques for harvesting content from the Internet and turning it into significant business insights.

In the first and second parts of this blog series, I discussed how to identify and acquire content from various Internet sources for your data mining needs. In this third part, I'll provide an overview of some common techniques and tools for data cleansing and formatting. Preparing raw data for a data mining project typically involves these steps:

  • Determine the format (e.g. PDF, XML, HTML, etc.)
  • Extract text content
  • Identify and remove useless sections, such as common headers, footers, and sidebars, as well as legal or commercial boilerplate
  • Identify differences and changes
  • Extract coded metadata
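As an illustration of the first step, here is a minimal sketch of format detection that looks only at the leading bytes of a document. The function name `sniff_format` and the four labels it returns are my own assumptions for this example; a production pipeline would normally combine this kind of check with the HTTP Content-Type header and the file extension.

```python
def sniff_format(data: bytes) -> str:
    """Guess a coarse format label from the first bytes of a document.

    Hypothetical helper for illustration; it covers only a few formats.
    """
    head = data[:1024].lstrip()
    # Drop a UTF-8 byte order mark if one precedes the content.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:].lstrip()
    lowered = head.lower()

    if head.startswith(b"%PDF-"):
        return "pdf"
    # Check HTML before XML so that XHTML pages are labeled as HTML.
    if lowered.startswith(b"<!doctype html") or b"<html" in lowered:
        return "html"
    if lowered.startswith(b"<?xml") or lowered.startswith(b"<"):
        return "xml"
    return "text"


if __name__ == "__main__":
    print(sniff_format(b"%PDF-1.7 ..."))
    print(sniff_format(b"<!DOCTYPE html><html><body>Hi</body></html>"))
```

The result can then be used to route each document to the extractor that suits its format.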

Approaches to Cleansing and Formatting Data from the Internet

There are several different approaches to cleansing and formatting raw data, each with advantages and disadvantages.

Approach 1: Use screen scrapers and/or browser automation tools (discussed in Part 2 of this blog series)

Advantages: extracts metadata from complex structures

Disadvantages: does not scale well to large volumes or a wide variety of content, and typically requires programming

Approach 2: Use text extractors like Apache Tika or Oracle Outside In

Advantages: works on a wide range of file types and formats

Disadvantages: does not extract much metadata (title, description, author) and may not extract content structure (headings, paragraphs, tables, etc.)

Approach 3: Write custom code for each format, using tools such as a SAX parser for XML, Beautiful Soup for HTML, and Aspose for other formats

Advantages: most power and flexibility

Disadvantages: most expensive to implement, because each format requires its own custom code
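To give a feel for Approach 3, here is a minimal sketch of a custom HTML extractor. To keep it self-contained I've used Python's standard-library `html.parser` rather than Beautiful Soup. The class name `MainTextExtractor`, the `extract` helper, and the list of tags to skip are assumptions made for this example, not a recommended configuration.

```python
from html.parser import HTMLParser

# Tags whose content is treated as page furniture (an assumption for this sketch).
SKIP_TAGS = {"script", "style", "header", "footer", "nav", "aside"}


class MainTextExtractor(HTMLParser):
    """Collects the page title and the text found outside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside any skipped element
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text:
            return
        if self.in_title:
            self.title += text
        elif self.skip_depth == 0:
            self.chunks.append(text)


def extract(html):
    """Return (title, list of text chunks) for one HTML document."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, parser.chunks


if __name__ == "__main__":
    page = ("<html><head><title>Quarterly Report</title></head><body>"
            "<header>Site Menu</header><p>Revenue rose.</p>"
            "<footer>Copyright</footer></body></html>")
    print(extract(page))
```

This sketch assumes reasonably well-formed markup: an unclosed `<nav>` or `<footer>` would cause everything after it to be skipped, which is the kind of edge case that makes custom coding expensive.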

Additional Tools

These additional tools can work in conjunction with the basic cleansing and extraction methods above.

Common paragraph removal

  • Identifies common, frequently occurring paragraphs so they can be automatically removed
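
Here is a minimal sketch of how common paragraph removal might work, assuming paragraphs are separated by blank lines and that a paragraph counts as "common" when it appears in at least half of the documents. The function names and the `min_share` threshold are hypothetical choices for this example.

```python
from collections import Counter


def split_paragraphs(text):
    """Split text on blank lines and drop empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def normalize(paragraph):
    """Lowercase and collapse whitespace so near-identical copies match."""
    return " ".join(paragraph.lower().split())


def remove_common_paragraphs(documents, min_share=0.5):
    """Drop paragraphs that occur in at least min_share of the documents."""
    doc_paragraphs = [split_paragraphs(d) for d in documents]

    # Count each paragraph once per document, not once per occurrence.
    doc_counts = Counter()
    for paragraphs in doc_paragraphs:
        doc_counts.update({normalize(p) for p in paragraphs})

    threshold = min_share * len(documents)
    common = {key for key, n in doc_counts.items() if n > 1 and n >= threshold}

    return [[p for p in paragraphs if normalize(p) not in common]
            for paragraphs in doc_paragraphs]


if __name__ == "__main__":
    docs = [
        "All rights reserved.\n\nAlpha story.",
        "Beta story.\n\nAll rights reserved.",
        "Gamma story.",
    ]
    print(remove_common_paragraphs(docs))
```

Exact matching after normalization is the simplest option; boilerplate that varies slightly from page to page (dates, user names) would need fuzzy matching instead.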


Structure mapping patterns

  • These are large, structural patterns which are easy to describe. They are applied to input documents to extract and map metadata.
  • Patterns can be XML, HTML, or text patterns.
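
As a rough illustration of structure mapping, the sketch below pairs each metadata field with a text pattern and applies the patterns to a document. The field names, the regular expressions, and the `map_metadata` helper are all assumptions invented for this example; real patterns would be written against the actual layout of the source documents.

```python
import re

# Each metadata field maps to a pattern whose first group captures the value.
# These patterns assume one specific, hypothetical page layout.
PATTERNS = {
    "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL),
    "author": re.compile(r'<meta\s+name="author"\s+content="([^"]*)"', re.IGNORECASE),
    "published": re.compile(r"Published:\s*(\d{4}-\d{2}-\d{2})"),
}


def map_metadata(document, patterns=PATTERNS):
    """Apply every pattern and return the fields that matched."""
    metadata = {}
    for field, pattern in patterns.items():
        match = pattern.search(document)
        if match:
            # Collapse whitespace inside the captured value.
            metadata[field] = " ".join(match.group(1).split())
    return metadata


if __name__ == "__main__":
    doc = ('<meta name="author" content="Jane Doe">'
           '<h1 class="t">Annual Results</h1><p>Published: 2020-01-15</p>')
    print(map_metadata(doc))
```

Fields whose pattern does not match are simply left out, so the caller can tell which metadata was missing from a given document.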


Optical Character Recognition (OCR)

  • OCR systems extract text from images, so the text can be further processed by machines.
  • There are open-source engines (e.g., Tesseract and OCRopus) as well as good commercial options (e.g., ABBYY and AquaForest).


Which tools to use depends on the type of content being ingested and how much metadata can be extracted from the content structure.

If most of the content is in the web page structure (in tables, for example), then more coding-intensive methods will be required (screen scrapers, browser automation, or structure mapping patterns).

On the other hand, if most of the content is unstructured natural language text, then a text extractor can pull out the content. Depending on your requirements, the extracted content may then need further processing with Natural Language Processing (NLP), which I'll discuss in depth in the next part of this blog series.

- Paul

> Continue to Part 4: Natural Language Processing Techniques