Unstructured Content Processing Tasks
Content processing, prior to indexing, is the foundation of effective enterprise search systems and search-based applications. It is also important for business insight applications which seek to derive actionable intelligence through the analysis of unstructured data.
Content processing tasks can be broadly split into three areas.
Enrichment: The capture, addition and arrangement of metadata to drive user interface features such as search navigators, sorting by property, and graphical display of search results.
Normalization: Relevancy algorithms work better when they can compare "apples with apples." Because enterprise search systems must deal with a wide variety of document types, lengths and formats, normalization is an important yet often overlooked task.
Cleansing: Many documents contain repetitive or misleading information, such as menu structures or templated headers and footers. If not identified and removed (or at least mitigated), these can cause false positives in search results and skew statistical analyses.
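As a rough illustration of how the three areas fit together before indexing, the following Python sketch runs a small batch of documents through normalization, cleansing and enrichment. It is a minimal example under stated assumptions, not a description of any particular product's pipeline: the function names, the 80% repetition threshold used to spot templated lines, and the metadata fields (source, format, word_count) are all hypothetical choices made for this example.

```python
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    """Normalization: apply Unicode NFKC, collapse whitespace, drop empty lines."""
    text = unicodedata.normalize("NFKC", text)
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def find_boilerplate(docs: list[str], min_share: float = 0.8) -> set[str]:
    """Cleansing, step 1: collect lines that occur in at least `min_share`
    of the documents (an assumed heuristic for menus, headers and footers)."""
    counts: Counter = Counter()
    for doc in docs:
        # Count each distinct line once per document.
        counts.update(set(doc.splitlines()))
    threshold = min_share * len(docs)
    return {line for line, n in counts.items() if n >= threshold}


def cleanse(text: str, boilerplate: set[str]) -> str:
    """Cleansing, step 2: remove the lines identified as boilerplate."""
    return "\n".join(line for line in text.splitlines() if line not in boilerplate)


def enrich(text: str, source_name: str) -> dict:
    """Enrichment: attach simple metadata that a UI could sort or filter on."""
    has_ext = "." in source_name
    return {
        "source": source_name,
        "format": source_name.rsplit(".", 1)[-1].lower() if has_ext else "unknown",
        "word_count": len(text.split()),
        "body": text,
    }


def process(raw_docs: dict[str, str]) -> list[dict]:
    """Run normalization, then cleansing, then enrichment over a batch."""
    normalized = {name: normalize(text) for name, text in raw_docs.items()}
    boilerplate = find_boilerplate(list(normalized.values()))
    return [
        enrich(cleanse(text, boilerplate), name)
        for name, text in normalized.items()
    ]
```

Normalization runs first in this sketch so that templated lines differing only in spacing are recognized as the same line during cleansing. The repetition heuristic only makes sense for a reasonably large batch from the same source: with a single document, every line would be flagged as boilerplate.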
This section of the Search Technologies Web site, which is a work in progress, details some of the common techniques used for enrichment, normalization and cleansing.
CONTENT ENRICHMENT TASKS
- Using taxonomies
- Using seed documents
For further information, or implementation assistance with any of these processes, please contact us.