An Introduction to DPMS
Search Technologies' Document Preparation Methodology for Search (DPMS) is a collection of tools, techniques, and processes that prepare data for use in search-based applications. Most business-critical search applications can be significantly enhanced, or even transformed, by better document processing prior to indexing.
DPMS reduces complexity and makes search applications more maintainable and flexible. DPMS also helps search to deliver user-experience excellence in a controllable, cost-effective way.
DPMS has been developed over the course of more than 40,000 consultant-days of search implementation services.
The overall technical objectives of DPMS are the fusion, normalization, and enrichment of data to be indexed into a search engine.
• Fusion: Combining multiple data streams or document collections into a single, sensibly-searchable set of indexes. Fusion is particularly important for applications aiming to present a “one search box” interface to users, despite the data being disparate.
• Normalization: Cleaning, extracting and normalizing content, metadata and formats.
• Enrichment: Automating the addition of metadata to documents using external or third-party resources. Normalization and enrichment provide a foundation for added-value search features.
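The three objectives above can be illustrated with a minimal Python sketch. It is not Search Technologies' implementation: the field names, the two date formats, and the `DEPARTMENT_BY_AUTHOR` lookup table (standing in for an external enrichment resource) are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical lookup table standing in for an external enrichment resource.
DEPARTMENT_BY_AUTHOR = {"a. smith": "Legal", "b. jones": "Engineering"}


def fuse(*collections):
    """Fusion: merge (source_name, documents) pairs into one stream,
    tagging every document with the collection it came from."""
    for source_name, documents in collections:
        for doc in documents:
            yield {**doc, "source": source_name}


def normalize(doc):
    """Normalization: map differing field names and formats onto one schema."""
    title = doc.get("title") or doc.get("subject") or ""
    author = doc.get("author") or doc.get("from") or ""
    raw_date = doc.get("date") or ""
    iso_date = None
    # Try each assumed source format; leave the date as None if none matches.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            iso_date = datetime.strptime(raw_date, fmt).date().isoformat()
            break
        except ValueError:
            continue
    return {
        "source": doc["source"],
        "title": " ".join(title.split()),            # collapse stray whitespace
        "author": " ".join(author.split()).lower(),  # case-insensitive key
        "date": iso_date,
    }


def enrich(doc):
    """Enrichment: add metadata looked up from the external resource."""
    department = DEPARTMENT_BY_AUTHOR.get(doc["author"], "Unknown")
    return {**doc, "department": department}


def prepare(*collections):
    """Run fusion, normalization, and enrichment over every document."""
    return [enrich(normalize(doc)) for doc in fuse(*collections)]


# Two hypothetical collections with different field names and date formats.
wiki = [{"title": "  Retention   Policy ", "author": "A. Smith", "date": "2011-03-04"}]
mail = [{"subject": "Build server", "from": "B. Jones", "date": "05/03/2011"}]
prepared = prepare(("wiki", wiki), ("mail", mail))
```

After this step both documents share the same fields (`source`, `title`, `author`, `date`, `department`), which is what allows a single index and a single search box to serve both collections.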
DPMS is a process, supported by a technology suite.
• Process: DPMS is a well-practiced process through which Search Technologies’ engineers analyze data intended for indexing into a search engine, then design, implement, and validate its preparation. This process has been used for a wide range of customer engagements, including dozens of extremely challenging document collections.
• Technology Suite: In some cases, the indexing pipeline provided by the search engine can be used to implement DPMS. However, this is not always ideal, so Search Technologies has built a collection of tools, high-performance frameworks, and functional modules, largely based on open source components, that can be used to implement DPMS with any search engine.
Flexibility & Transparency
Perhaps the most important aspects of DPMS are the flexibility and transparency it brings to the building and ongoing enhancement of search-based applications.
• Flexibility: Handling large document databases with widely varying document structures, formats, and metadata requirements.
• Transparency: Including auditing, validation techniques, and quarantined document processing.
Users continue to seek ever-simpler search experiences, yet many search-based applications must deal with huge data complexity. Transparency helps overcome complexity. DPMS ensures that the search engine is being fed the best data possible while allowing budgets and timeframes to be carefully controlled.
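As a rough illustration of the auditing, validation, and quarantined document processing named above, the following Python sketch checks each document individually, sets aside any that fail, and records every decision. The required fields and the shape of the audit entries are assumptions for the example, not part of the DPMS technology suite.

```python
# Hypothetical schema: the fields a document needs before it may be indexed.
REQUIRED_FIELDS = ("id", "title", "body")


def validate(doc):
    """Return a list of problems; an empty list means the document is valid."""
    return [f"missing {name}" for name in REQUIRED_FIELDS if not doc.get(name)]


def process(documents):
    """Split a batch into accepted and quarantined documents.

    One bad document never aborts the batch, and every decision is
    written to the audit log so it can be reviewed later.
    """
    accepted, quarantined, audit_log = [], [], []
    for position, doc in enumerate(documents):
        problems = validate(doc) if isinstance(doc, dict) else ["not a record"]
        if problems:
            quarantined.append(
                {"position": position, "document": doc, "problems": problems}
            )
            audit_log.append((position, "quarantined", "; ".join(problems)))
        else:
            accepted.append(doc)
            audit_log.append((position, "accepted", ""))
    return accepted, quarantined, audit_log


# A hypothetical batch: one good document, one incomplete, one corrupted.
batch = [
    {"id": "1", "title": "Annual report", "body": "Full text"},
    {"id": "2", "title": "", "body": "Orphaned text"},
    None,
]
accepted, quarantined, audit_log = process(batch)
```

Quarantined documents keep their original content and the reasons they were rejected, so they can be repaired and reprocessed without rerunning the whole collection.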
Serving the single search box
The need for DPMS has become greater as search systems look to fulfill user demand for a “single search box”. DPMS enables search-based applications to deliver this one-size-fits-all paradigm while maintaining relevancy, and at the same time provides for the needs of niche user communities such as search experts or specialized departments. The DPMS methodology excels in such environments, performing careful data analysis and metadata extraction to support expert users, while simultaneously providing collection-specific relevancy ranking and index normalization to deliver a superior single-search-box experience to the masses.
Custom Designed for Search Applications
DPMS differs from ETL (Extraction, Transformation, and Loading) for relational databases in that it operates primarily at the document level, instead of the database or table level. In this environment, data errors must be constantly expected and huge data variation is the norm. DPMS is specifically designed for search-based applications and is much more flexible than ETL methods.
DPMS also allows for the transparent handling of corrupted, incomplete, or constantly evolving data – common characteristics of challenging search applications.
Key Issue: DPMS is a better alternative to traditional ETL approaches where the end-user application relies heavily on search functionality.