Document Preparation Frameworks and Tools
Search Technologies’ Document Preparation Methodology for Search (DPMS) provides an overall architecture and approach for enhancing enterprise search and analytics. We use a number of software tools developed to speed up indexing and the implementation of new search-based applications. These tools have proved equally valuable when retrofitted to existing search systems. This section describes the main tools typically used in a DPMS engagement.
High Performance Document Preparation Framework
Search Technologies has created a document processing framework which is available, license-fee-free, to our customers. It has the following characteristics:
- High Performance: Thread pools, queues, and job handling are all managed as part of the framework
- Componentized: All functionality is packaged into reusable components which allow for rapid reconfiguration of system functionality as needs change
- Pre-built components: Many components for common operations are available pre-built
- Scriptable: Scripting components can be used to quickly add new functionality without significant coding expense
- Configuration: The framework encompasses all configuration and environmental settings (server names, etc.)
- Dynamic Deployment: The framework is built around OSGi (see http://www.osgi.org/), enabling dynamic deployment of components and configurations, which can be modified in live running systems without requiring shutdown or restart
- Transparent: All components are visible through a web-based administration UI, which also provides component testing, monitoring, and performance measurement functionality
The result is a well-understood, reliable, and well-tested framework for high performance document processing. Available pre-built components include:
- Metadata parsing based on document structure, using the Search Technologies Parser Foundation Classes (see next section)
- Text extraction from a variety of formats (using Apache Tika)
- Metadata manipulation
- Relational Database I/O
- High performance crawler input
- Feeders for inputting data to the system, including:
  - Crawler data feeders (ARC file reader)
  - File system feeders
  - RSS feeders
  - Relational Database feeders
- CSV file reading and sub-document processing
- Classification and tagging based on content scanning and ontology matching
- Content filtering and document weighting based on ontology matching
- Handling of multiple data streams or collections with separate co-existing configurations
- Index transformation components for submission to search engines (tested for Solr/Lucene, the Google Search Appliance and FAST ESP)
- A scripting module using the Groovy language for implementing small updates without Java programming
- Building of Solr/Lucene indexes directly
- "Content Control Database" component for enriching metadata based on document source URLs
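As a rough illustration of the componentized, thread-pooled style of processing described in this section, the sketch below chains small reusable components into a pipeline and runs documents through it on a worker pool. It is written in Python for brevity and is not the framework's own API (the framework itself is built on OSGi); the component names, the document dictionary layout, and the `run` helper are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_text(doc):
    """Hypothetical component: derive a cleaned 'text' field from the raw input."""
    doc = dict(doc)  # copy so each component leaves its input untouched
    doc["text"] = doc["raw"].strip()
    return doc


def add_word_count(doc):
    """Hypothetical component: add a simple metadata field."""
    doc = dict(doc)
    doc["word_count"] = len(doc["text"].split())
    return doc


# A pipeline is just an ordered list of components, so it can be
# reconfigured by editing the list rather than the components.
PIPELINE = [extract_text, add_word_count]


def process(doc, components=PIPELINE):
    """Pass one document through every component in order."""
    for component in components:
        doc = component(doc)
    return doc


def run(docs, workers=4):
    """Process documents on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, docs))


docs = [
    {"id": 1, "raw": "  hello world  "},
    {"id": 2, "raw": "one two three\n"},
]
results = run(docs)
```

Keeping each component a plain function of one document is what makes the list reorderable; a production framework would add queues, error handling, and monitoring around the same idea.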
The Parser Foundation Classes
For metadata extraction from full-text documents, Search Technologies has created a framework of Parser Foundation Classes. Many documents are semi-structured and contain fields or areas that can provide useful metadata such as dates, titles, descriptions, authors, etc. Parser Foundation Classes provide a rapid development framework for such requirements and utilize regular expression rules in a flexible framework that also supports fallback rules. Typically, a parser will try one pattern, evaluate its effectiveness, try a second pattern, and so on, until a pattern is found which is successful. The results of the pattern match are stored in XML DOM fragments.
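The fallback behaviour described above can be sketched as follows. This is a minimal Python illustration and not the Parser Foundation Classes themselves: the rule list, the `extract_field` function, and the `rule` attribute on the resulting fragment are assumptions made for the example, and "evaluating effectiveness" is reduced here to checking for a non-empty match.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fallback rules for a "date" field, most specific first.
DATE_RULES = [
    re.compile(r"^Date:\s*(?P<value>\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE),
    re.compile(r"^Published\s+on\s+(?P<value>\w+ \d{1,2}, \d{4})\s*$", re.MULTILINE),
    re.compile(r"(?P<value>\d{1,2}/\d{1,2}/\d{4})"),
]


def extract_field(text, field, rules):
    """Try each rule in order and return an XML fragment for the first
    rule that yields a non-empty value, or None if every rule fails."""
    for index, rule in enumerate(rules):
        match = rule.search(text)
        if match and match.group("value").strip():
            # Record which rule succeeded so results can be audited later.
            fragment = ET.Element(field, attrib={"rule": str(index)})
            fragment.text = match.group("value").strip()
            return fragment
    return None


sample = "Quarterly Report\nPublished on March 3, 2011\nBody text..."
fragment = extract_field(sample, "date", DATE_RULES)
print(ET.tostring(fragment, encoding="unicode"))
# prints: <date rule="1">March 3, 2011</date>
```

Here the first rule fails because the sample has no `Date:` line, so the second rule supplies the value.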
The Parser Foundation Classes also provide a framework for metadata normalization, again through pattern matching.
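A minimal sketch of pattern-based normalization, assuming the goal is to bring differently formatted dates into a single ISO-style form. The patterns, the `normalize_date` function, and the choice of month/day order for slash-separated dates are illustrative assumptions, not details taken from the framework.

```python
import re

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Each entry pairs a pattern with a function returning (year, month, day).
NORMALIZERS = [
    (re.compile(r"^(\d{4})-(\d{2})-(\d{2})$"),
     lambda m: (int(m[1]), int(m[2]), int(m[3]))),
    (re.compile(r"^([A-Za-z]+) (\d{1,2}), (\d{4})$"),
     lambda m: (int(m[3]), MONTHS[m[1].lower()], int(m[2]))),
    # Assumption: slash-separated dates are month/day/year.
    (re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$"),
     lambda m: (int(m[3]), int(m[1]), int(m[2]))),
]


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no pattern applies.
    Calendar validity (for example, day 31 in February) is not checked."""
    value = raw.strip()
    for pattern, to_parts in NORMALIZERS:
        match = pattern.match(value)
        if not match:
            continue
        try:
            year, month, day = to_parts(match)
        except KeyError:
            continue  # unrecognized month name; fall through to the next rule
        return f"{year:04d}-{month:02d}-{day:02d}"
    return None
```

For example, `normalize_date("March 3, 2011")` and `normalize_date("3/3/2011")` both yield `2011-03-03`, so downstream components see one consistent format.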
Finally, they can be used to split large documents into smaller pieces and to create hierarchies of those pieces (called granules) as necessary to describe the document structure. Once granules are parsed, they can be combined in an XML structure describing entire files. XML for each file can then be combined to provide the XML for a rendition (of multiple files), and finally, metadata from different renditions can be combined to create the complete representation for a document package.
The resulting XML package is then transformed to create the final, persistent XML representation of the document metadata.
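The nesting of granules, files, renditions, and packages described above can be sketched as below. The element and attribute names (`granule`, `file`, `rendition`, `package`, `id`, `name`) are assumptions chosen for the example and are not the framework's actual schema.

```python
import xml.etree.ElementTree as ET


def granule(granule_id, title, children=()):
    """Build one granule element, optionally containing child granules."""
    node = ET.Element("granule", attrib={"id": granule_id})
    ET.SubElement(node, "title").text = title
    for child in children:
        node.append(child)
    return node


def wrap(tag, name, children):
    """Combine already-built elements under a named container element."""
    node = ET.Element(tag, attrib={"name": name})
    node.extend(children)
    return node


# Granules form a hierarchy that mirrors the document structure.
section_1 = granule("s1", "Introduction")
section_2 = granule("s2", "Findings", [
    granule("s2.1", "Method"),
    granule("s2.2", "Results"),
])

# Granules -> file -> rendition -> package.
file_xml = wrap("file", "report.txt", [section_1, section_2])
rendition = wrap("rendition", "text", [file_xml])
package = wrap("package", "report-2011", [rendition])

granule_ids = [g.get("id") for g in package.iter("granule")]
```

A real package would usually hold several files per rendition and several renditions per package; one of each is used here only to keep the example short.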
Tagging and Coding
Content tagging involves scanning through the document for domain-specific terms that provide evidence of document categories or domain relevance. Several recently implemented examples follow:
- Country Tagging: The ontology in this instance contains names of countries and major cities. Documents are tagged for a country if they contain sufficient evidence that the country is strongly referenced within the document. Evidence is stronger for mentions in the title, the first n lines of the document, or if additional evidence is present (for example, mentions of Moscow supporting the tag Russia).
- Domain Evidence Strengthening: Domain evidence, gathered with the support of an ontology, is used to adjust document relevancy. This helps documents from general news corpora that focus on a subject of interest to a specialist audience become more prominent within search results.
- Authority File Enrichment: Authority files (such as lists of members of Congress, or of companies) are used to validate metadata and add hierarchical classification structure to documents. For example, this can enable a simple search term to match all "cars" or all "hybrid cars".
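The country tagging example above can be sketched as a weighted evidence count. The ontology entries, weights, number of lead lines, and threshold below are all invented for illustration. As a simplification, the sketch also counts a city mention the same as a mention of the country itself.

```python
import re

# Hypothetical ontology: each country with the terms that count as evidence.
ONTOLOGY = {
    "Russia": ["Russia", "Moscow", "Saint Petersburg"],
    "France": ["France", "Paris", "Lyon"],
}

# Assumed weights: title mentions count most, then the first n lines.
TITLE_WEIGHT = 3.0
LEAD_WEIGHT = 2.0
BODY_WEIGHT = 1.0
LEAD_LINES = 3    # the "first n lines" of the document
THRESHOLD = 4.0   # minimum score required to apply a tag


def count_mentions(text, term):
    """Count whole-word, case-insensitive occurrences of a term."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def tag_countries(title, body):
    """Return the sorted list of countries whose evidence meets the threshold."""
    lines = body.splitlines()
    lead = "\n".join(lines[:LEAD_LINES])
    rest = "\n".join(lines[LEAD_LINES:])
    tags = []
    for country, terms in ONTOLOGY.items():
        score = 0.0
        for term in terms:
            score += TITLE_WEIGHT * count_mentions(title, term)
            score += LEAD_WEIGHT * count_mentions(lead, term)
            score += BODY_WEIGHT * count_mentions(rest, term)
        if score >= THRESHOLD:
            tags.append(country)
    return sorted(tags)


title = "Moscow talks resume"
body = (
    "Officials from Russia met on Monday.\n"
    "The talks continue.\n"
    "No deal yet.\n"
    "A delegate flew in from Paris."
)
tags = tag_countries(title, body)
```

In this example Russia scores 5.0 (a title mention of Moscow plus a lead mention of Russia) and is tagged, while the single late mention of Paris scores 1.0 and leaves France untagged.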