Metadata Extraction Architecture for Search
Search Technologies’ Document Preparation Methodology for Search (DPMS) can in some cases be implemented using a search engine's native indexing pipeline tools. These vary in quality, reliability, and openness. Search Technologies therefore maintains and supports a search-product-independent toolkit that enables efficient implementation of DPMS in any environment.
The following diagram shows the overall architecture for DPMS.
Each step in this process is described below:
- Packaging: Combining files from the data sources into logical units for downstream processing
- Document Pre-Processing: Format conversions and text extraction as necessary to enable parsing and entity extraction
- Parsing & Extraction: Extracting metadata and semantic entities, and creating the metadata representation of the document. This can also involve splitting documents into pieces (called "granules") for better search representation
- Enrichment: Typically with the help of external resources such as classification tables and authority files (lists of companies, or members of Congress, social network ratings, etc.)
- Document Post-Processing: For example, stamping metadata into HTML files for quick display during browsing
- Transform and Load: This stage maps metadata to search engine fields for indexing. This step will also:
  - Determine what parts of each document are available to search, browse, or both
  - Determine how content is distributed into relevancy bins – this influences how a document will be relevancy-scored
  - Provide internal-to-external metadata mapping
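The stages above can be sketched as a chain of simple functions. This is a minimal illustration, not the actual DPMS toolkit: the function names, signatures, and the toy metadata logic are all hypothetical, chosen only to show how documents flow from packaging through to search-engine fields.

```python
def package(raw_files):
    # Packaging: group source files into logical units for downstream processing.
    return [{"unit": raw_files}]

def pre_process(unit):
    # Document Pre-Processing: format conversion / text extraction (stubbed here).
    unit["text"] = " ".join(str(f) for f in unit["unit"])
    return unit

def parse_and_extract(unit):
    # Parsing & Extraction: build the metadata representation of the document.
    # A real implementation might also split the document into "granules".
    unit["metadata"] = {"title": unit["text"][:20] or None}
    return unit

def enrich(unit, authority_file):
    # Enrichment: consult an external resource such as an authority file.
    unit["metadata"]["org"] = authority_file.get(unit["metadata"]["title"])
    return unit

def transform_and_load(unit):
    # Transform & Load: map internal metadata to search-engine fields.
    return {"fields": {"title": unit["metadata"]["title"]}}

def run_pipeline(raw_files, authority_file):
    docs = package(raw_files)
    return [
        transform_and_load(
            enrich(parse_and_extract(pre_process(d)), authority_file))
        for d in docs
    ]
```

In a production system each stage would be a configurable component with its own error handling; the point here is only the one-way flow of a document record through the named stages.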
Flexibility and Transparency
Perhaps the most important hallmarks of a great DPMS architecture are flexibility and transparency.
Many search-based application projects must cope with large databases containing wildly disparate document formats. Often, there are few "hard rules" that can be conveniently applied. Consider document titles: titles rarely occur in the same place in all documents. Sometimes the tagging of titles differs, the content format or encoding differs, or a percentage of documents provides no title at all. Such an eclectic combination can appear insurmountable to standard programming approaches, which depend on a full understanding of the data as an a priori requirement before development can start. Further, standard programming techniques must be constantly reworked and reprogrammed in such environments, causing an endless cycle of incremental software releases that saps development resources.
For these reasons, insufficiently flexible architectures such as Spring or Hibernate – essentially any architecture which requires source-code generation, recompilation or the redeployment of binary code – should be avoided. An emphasis must instead be placed on the ability to make changes to the processing flow very quickly, or even better, to allow "flow-through" of data and metadata structures without any custom processing or storage at all.
Preferred architectures therefore emphasize the "X" in XML – namely its extensibility, ease of validation, and ability to operate in the presence of missing elements. "Collection-specific" and "extension" areas within XML schemas allow for easy extension of metadata models. These flexible areas for metadata are loaded into search engines using techniques such as Scope Search (FAST ESP), dynamic fields (Google Search Appliance, Lucene), or custom "Between" operators (Solr/Lucene).
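The tolerance for missing elements and the "extension" areas described above can be illustrated with a small parsing sketch. The schema and element names below are purely illustrative, not the actual DPMS metadata model; the point is that absent elements degrade to `None` rather than failing, and anything under the extension area flows through as dynamic fields.

```python
import xml.etree.ElementTree as ET

record = ET.fromstring("""
<document>
  <title>Annual Report</title>
  <extension>
    <ticker>ACME</ticker>
  </extension>
</document>
""")

def field(elem, path):
    # A missing element yields None instead of raising an error,
    # so documents without that metadata still flow through.
    node = elem.find(path)
    return node.text if node is not None else None

title = field(record, "title")     # present in this record
author = field(record, "author")   # absent: None, not an exception
# Everything under <extension> is passed along as dynamic fields,
# analogous to Lucene/Solr dynamic fields, with no schema change.
dynamic = {child.tag: child.text for child in record.find("extension")}
```

New collections can then add their own elements inside the extension area without any change to the core schema or to the code that processes it.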
Similarly, extremely flexible results presentation can be achieved using on-demand, compiled scripting languages such as Groovy or Scala. These allow end-user presentation to be customized as new collections are added to the system without the need to redeploy any binary code libraries.
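The on-demand compilation idea can be sketched in Python as well (the document names Groovy and Scala; this is an assumed, simplified analogue). A per-collection presentation script is compiled at runtime, so adding a collection requires only a new script, not a binary redeployment. The script registry and `render` convention below are hypothetical.

```python
# Per-collection presentation scripts; in practice these would be
# files stored alongside each collection's configuration.
PRESENTATION_SCRIPTS = {
    "patents": "def render(doc):\n    return doc['title'].upper()",
}

def load_renderer(collection):
    # Compile and execute the collection's script on demand,
    # then pull out its render() function.
    namespace = {}
    code = compile(PRESENTATION_SCRIPTS[collection], collection, "exec")
    exec(code, namespace)
    return namespace["render"]

render = load_renderer("patents")
```

Because the script is compiled when the collection is first used, presentation changes take effect without restarting or redeploying the application.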
Relational Database Architectures
Fixed table designs in a relational database have proven insufficiently flexible for most search engine document processing applications. Search Technologies has been involved with several systems that were originally based on "standard" architectural patterns, starting with loading all data into a relational database using ETL techniques.
While such architectures may work well with small, well understood document collections, they are insufficiently flexible or transparent for large scale search-based applications. Typical issues encountered with RDBMS-centric approaches include:
- Constant reworking of the RDBMS table as new collections are added to the system
- Constant redeployment of code as new documents or data-centric issues are encountered
- Frequent reloads of the entire collection to handle changing needs and requirements
- Inability to set aside subsets of documents for special processing
- Insufficient transparency about which documents are being processed successfully and which are not
Moving to the DPMS architecture has delivered substantial reductions in system deployment cost, along with improvements in processing performance. For example:
- Replacing a custom CMS data model stored in Oracle with a "collection-neutral" model, in which most metadata remains in an external XML file, eliminated the weeks of development time previously required to add each new collection of documents to the search application. As a consequence, both deployment frequency and expense were dramatically reduced.
- Replacing a custom ETL process for loading patent information with a record-at-a-time DPMS architecture, storing most metadata in XML outside the database, but with key metadata in XML blobs inside the database, reduced the time required to create a document loading infrastructure from 1.5 years down to 4 months. In addition, the resulting architecture performed 4 times faster than the original ETL solution, in terms of throughput.
The diagram provided at the top of this page is a simplified architectural overview showing a general framework for use with most document collections. Naturally, each installation requires a tailored process to handle the specific needs of the application’s data and its users’ search goals.
As an example, the following diagram illustrates the component architecture implemented by Search Technologies for a search portal application, where documents are gathered by crawlers from the Internet and then enhanced through additional processing and enrichment metadata provided by the publisher. Note that the framework handles all of the component configuration, job routing, error and quarantined document management, and job and sub-job management required to handle the sophisticated data paths typically found in production systems.
Error Handling and Quarantine
Consistent and predictable error detection is a critical part of the document processing architecture. In most large, real-world databases, a percentage of documents will be corrupt or otherwise deviate from expectations. Such documents must be detected and quarantined.
Documents may be quarantined for any of a number of reasons, including improper format or encoding, accidental truncation, unknown metadata codes, invalid dates, missing elements, etc.
First and foremost, the quarantine is used to implement the necessary corrections to document processing to ensure that the maximum number of documents can be processed automatically. Second, the quarantine is used to build a regression test database to verify the operation of the system as corrections are implemented.
Finally, quarantined documents can also be handled manually:
- The document content can be updated as necessary to enable correct processing
- Corrupted metadata can be manually corrected
Typically, once the quarantine is sufficiently small, corrections are implemented manually to achieve 100% overall document processing success.
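The quarantine mechanism described above can be sketched as follows. The validation rules and record shapes here are illustrative assumptions, not the actual DPMS checks; the essential pattern is that a failed document is set aside with its failure reason recorded, so the quarantine can drive processing fixes and later serve as a regression-test set.

```python
def process(doc, quarantine):
    # Validate and index a document record; on failure, quarantine it
    # with the reason attached instead of aborting the whole run.
    try:
        if not doc.get("text"):
            raise ValueError("missing or empty text element")
        if "date" in doc and len(doc["date"]) != 8:
            raise ValueError("invalid date: %s" % doc["date"])
        return {"indexed": doc["id"]}
    except ValueError as err:
        quarantine.append({"doc": doc, "reason": str(err)})
        return None

quarantine = []
results = [
    process(d, quarantine)
    for d in [
        {"id": 1, "text": "ok", "date": "20240101"},  # processes cleanly
        {"id": 2, "text": ""},                         # quarantined
    ]
]
```

Because each quarantined record carries its document and its reason, the same store can be replayed against a corrected pipeline to verify that fixes work and stay fixed.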