Textual ETL: A Key Component for Big Data Applications
- Traditional ETL operates on structured data that was originally created by computers
- Structured data is highly consistent and predictable in format. Think log files, transaction records, etc.
- By common industry estimates, about 80% of the world's data is unstructured (textual) or semi-structured in nature
- More to the point, much of this 80% was created by humans
- The variability of human-created content stands in huge contrast to the predictable, uniform nature of structured data
- Humans are inconsistent, emotional, complex, and quite simply unique in their content creation behaviour
Bottom line: Traditional ETL methods don't work with textual content.
- If big data systems are to derive actionable insight from the unstructured world, textual ETL lies on the critical path
- It requires a different approach, and different technologies
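To make the contrast concrete, here is a minimal Python sketch. The record layout, the sample notes, and the function names are all hypothetical, invented for illustration; they do not describe any particular product or pipeline. A fixed-format record can be split into typed fields in a few lines, while the same fact (an order number) written by people needs tolerant pattern matching just to be located.

```python
import re
from datetime import datetime

# Hypothetical machine-generated record: fixed field order, fixed formats.
LOG_LINE = "2013-04-02T10:15:00 ORDER 1042 199.99"

def parse_structured(line):
    """Split a fixed-format record into typed fields."""
    timestamp, kind, order_id, amount = line.split()
    return {
        "timestamp": datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S"),
        "kind": kind,
        "order_id": int(order_id),
        "amount": float(amount),
    }

# Hypothetical human-written notes about the same order.
NOTES = [
    "Customer rang re: order #1042, NOT happy about the delay!!",
    "order 1042 - cust. called again, wants a refund",
    "Spoke w/ client about Order No. 1042; they're fine now :)",
]

# Tolerates "order 1042", "order #1042", and "Order No. 1042".
ORDER_REF = re.compile(r"order\s*(?:no\.?|#)?\s*(\d+)", re.IGNORECASE)

def extract_order_ids(text):
    """Find order numbers in free text, whatever the surrounding phrasing."""
    return [int(number) for number in ORDER_REF.findall(text)]

if __name__ == "__main__":
    print(parse_structured(LOG_LINE))
    for note in NOTES:
        print(extract_order_ids(note), "<-", note)
```

Passing any of the notes to `parse_structured` raises a `ValueError`, because the text does not split into exactly four fields. Even the tolerant pattern recovers only the order number; tone, intent, and outcome in each note need further text processing.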
Search Technologies provides consulting, services, and proven software tools, many of which are open source, as the basis of efficient textual ETL solutions.
Contact us for a no-commitment discussion of your textual ETL requirements and ideas.
Unstructured content is fundamentally different from structured data and must be treated appropriately. This involves specialist skills and technology.
At Search Technologies, we've been implementing big data systems for more than five years. We provide Hadoop expertise at competitive daily rates.
Staff blog: "Structuring the Unstructured" describes the crossover from enterprise search technology to the big data world.
A free-to-download white paper providing a foundational strategy for big data and unstructured content processing.
The processing of unstructured content prior to indexing requires a different approach. Techniques typically used with structured content can't cope with the variability and unpredictability of unstructured content.