A glossary of common data preparation tasks


Data preparation is perhaps the most commonly neglected aspect of building search systems, yet good data preparation can have a profound effect on the accuracy and relevance of search results.  This page provides a glossary of the more common processes used to prepare data for indexing into a search engine.


Packaging:  Ingesting randomly distributed data, such as hard drives full of documents, and assembling it into packages of related files.
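
As a rough illustration of packaging, the sketch below groups loose file paths into packages. It assumes, purely for the example, that files sharing a directory and a base name belong together; the function name package_files and the sample paths are hypothetical, and a real system would likely use richer rules.

```python
from collections import defaultdict
from pathlib import PurePath


def package_files(paths):
    """Group file paths into packages keyed by (directory, base name).

    Assumption for this sketch: files that share a directory and a stem
    (e.g. report.pdf and report.txt) are related and belong together.
    """
    packages = defaultdict(list)
    for p in paths:
        path = PurePath(p)
        packages[(str(path.parent), path.stem)].append(path.name)
    return {key: sorted(names) for key, names in packages.items()}


# Hypothetical sample input
packages = package_files([
    "scans/report.pdf",
    "scans/report.txt",
    "scans/invoice.pdf",
])
```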


Format conversion:  Normalizing file formats for processing and indexing purposes, typically to XML.
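
A minimal sketch of format conversion, assuming the source record has already been read into a Python dictionary. The element names document and field are illustrative choices for this example, not a standard schema.

```python
import xml.etree.ElementTree as ET


def record_to_xml(record):
    """Serialize a flat dict as a simple XML document.

    The <document>/<field name="..."> layout is an assumption made
    for this sketch; any consistent target schema would do.
    """
    doc = ET.Element("document")
    for name, value in record.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(doc, encoding="unicode")


# Hypothetical sample record
xml_text = record_to_xml({"title": "Annual report", "year": 2010})
```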


Text extraction:  Isolating the meaningful text from within a document.  For example, extracting the meaningful content of a web page whilst filtering out noise such as menu structures and buzz clouds.
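
One simple way to approximate text extraction from a web page is to skip everything inside elements that usually hold navigation or scripts. The sketch below uses Python's standard html.parser; the list of skipped tags is an assumption for the example, and production extractors typically rely on more robust heuristics.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect text that is not inside common 'noise' elements."""

    # Assumed noise containers for this sketch
    SKIP = {"nav", "script", "style", "header", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page
page = ("<html><body><nav><a href='/'>Home</a></nav>"
        "<p>Search needs clean data.</p></body></html>")
text = extract_text(page)
```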


Merging and Normalizing:  Merging data from multiple sources into a normalized format, adding data where it is missing.  These processes can help ensure that disparate information sources "play well together" during the search process.
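
The following sketch shows one possible shape for merging and normalizing: records from two sources with different field names are mapped onto a shared set of fields, and missing values are filled with defaults. The field names, the mapping, and the default values are all hypothetical.

```python
# Assumed mapping from source-specific field names to shared ones
FIELD_MAP = {
    "name": "title",
    "headline": "title",
    "title": "title",
    "written_by": "author",
    "author": "author",
}

# Assumed defaults used where a source has no value
DEFAULTS = {"title": "", "author": "unknown"}


def normalize(record):
    """Map a source record onto the shared fields, filling gaps."""
    out = dict(DEFAULTS)
    for key, value in record.items():
        target = FIELD_MAP.get(key)
        if target and value:
            out[target] = value.strip()
    return out


# Hypothetical sample sources
source_a = [{"name": " Annual report ", "written_by": "J. Smith"}]
source_b = [{"headline": "Press release", "author": ""}]

merged = [normalize(r) for r in source_a + source_b]
```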


Data Model Design (DMD):  A definition of the desired overall structure of content in terms of metadata fields, representation in the search index, and how content is presented in search results. The DMD is always unique to an application.
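
Because a DMD is specific to each application, the sketch below is only one hypothetical way to write one down: a list of field definitions recording how each field is indexed and whether it appears in search results. The class FieldSpec and every field shown are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str         # metadata field name
    type: str         # e.g. "text" or "date"
    searchable: bool  # is the field represented in the search index?
    displayed: bool   # is the field shown in search results?


# Hypothetical data model for an example application
DATA_MODEL = [
    FieldSpec("title", "text", searchable=True, displayed=True),
    FieldSpec("body", "text", searchable=True, displayed=False),
    FieldSpec("published", "date", searchable=False, displayed=True),
]


def result_fields(model):
    """Names of the fields presented in search results."""
    return [f.name for f in model if f.displayed]
```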


Parsing:  Automatically extracting information into fields from semi-structured documents.
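
As a small example of parsing, the sketch below pulls "Name: value" header lines out of a semi-structured text document and keeps the remainder as the body. The header layout is an assumption for this example; real documents usually need format-specific rules.

```python
import re

# Assumed header layout: "Name: value" on a line of its own
FIELD_LINE = re.compile(r"^(?P<name>[A-Za-z ]+):\s*(?P<value>.+)$")


def parse_fields(text):
    """Extract leading header lines into fields; the rest becomes 'body'."""
    fields = {}
    body = []
    for line in text.splitlines():
        match = FIELD_LINE.match(line)
        if match and not body:
            # Header lines are only recognized before the body starts
            name = match.group("name").strip().lower()
            fields[name] = match.group("value").strip()
        elif line.strip():
            body.append(line.strip())
    fields["body"] = " ".join(body)
    return fields


# Hypothetical sample document
doc = ("Title: Quarterly results\n"
       "Author: J. Smith\n"
       "\n"
       "Revenue rose in the third quarter.")
parsed = parse_fields(doc)
```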


Enrichment:  Gathering data from outside sources and including it within document records.  For example, getting data from classification structures, category trees, entity lists, gazetteers or authority files.
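
A minimal sketch of enrichment using a gazetteer: place names found in a document's text are looked up and the matching entries are added to the record. The tiny in-memory gazetteer and the single-word matching are simplifying assumptions; a real gazetteer would be an external resource with far more entries.

```python
import re

# Hypothetical in-memory gazetteer standing in for an outside source
GAZETTEER = {
    "paris": {"country": "France"},
    "berlin": {"country": "Germany"},
}


def enrich(record, gazetteer):
    """Return a copy of the record with a 'places' field added."""
    enriched = dict(record)
    places = []
    for word in re.findall(r"[a-z]+", record["text"].lower()):
        if word in gazetteer and word not in places:
            places.append(word)
    enriched["places"] = [{"name": p, **gazetteer[p]} for p in places]
    return enriched


# Hypothetical sample record
record = {"id": "1", "text": "The office moved from Paris to Berlin. Paris remains open."}
enriched = enrich(record, GAZETTEER)
```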


Publishing:  Exporting processed documents into a search engine index and/or other repositories.
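
Publishing depends on the target search engine, so the sketch below does not use any particular engine's API. It assumes, for illustration only, that processed documents are exported as one JSON object per line to a file-like sink, a form many indexing tools can consume.

```python
import io
import json


def publish(documents, sink):
    """Write each document as one JSON line; return the number written.

    'sink' is any object with a write() method (a file, a buffer, ...).
    """
    count = 0
    for doc in documents:
        sink.write(json.dumps(doc, sort_keys=True) + "\n")
        count += 1
    return count


# Hypothetical sample documents, written to an in-memory buffer
docs = [{"id": "1", "title": "Annual report"}, {"id": "2", "title": "Press release"}]
buffer = io.StringIO()
written = publish(docs, buffer)
```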


Splitting:  Dividing larger files into more usefully (and more logically) searchable pieces. For example, records from CSV files, articles from magazines or chapters from books.
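
For the CSV case, splitting can be sketched with Python's standard csv module: each row becomes its own searchable record. The id scheme (source name plus row number) and the sample data are assumptions made for this example.

```python
import csv
import io


def split_csv(text, source_name):
    """Turn each CSV row into a separate record with its own id."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"id": f"{source_name}#{i}", **row}
        for i, row in enumerate(reader, start=1)
    ]


# Hypothetical sample file contents
csv_text = "title,author\nFirst book,A. Writer\nSecond book,B. Author\n"
records = split_csv(csv_text, "books.csv")
```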
