The DPMS Implementation Process
When starting a new project, Search Technologies applies a formal, well-practiced process to ensure a complete understanding of the data and of the business requirements to be addressed. The initial design and documentation process is summarized below.
The steps of this assessment process are as follows:
- Kickoff: This involves a meeting between the customer’s Domain Experts and Search Technologies’ Architects and Data Analysts to discuss the data and the application’s objectives. The kickoff meeting will typically also address logistical details for data transfer.
- Data Analysis: In this important step, a Data Analyst examines the raw data with a variety of tools to identify its structure, encodings, quirks, cleanliness, and statistical makeup. The end result is a thorough understanding of the characteristics of the raw data, which drives the whole DPMS implementation process. Usually, each data source requires separate analysis.
- Write the DMD: The Data Model Design (DMD) is the controlling document for the entire DPMS engagement. Further details on the DMD follow.
- Search Technologies’ Architect Review: The first review of the DMD is done by an experienced search engine architect who has become familiar with the application’s objectives. The architect may return the DMD to the Analyst for revision, as necessary.
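As an illustration of the Data Analysis step, the following is a minimal sketch of the kind of per-file profile an analyst might compute. It is not a Search Technologies tool: the function name `profile_bytes`, the reported fields, and the UTF-8 then Latin-1 fallback are all assumptions chosen for the example.

```python
from collections import Counter


def profile_bytes(raw: bytes) -> dict:
    """Summarize basic characteristics of one raw data file (hypothetical helper)."""
    # Try strict UTF-8 first; fall back to Latin-1, which accepts any byte value.
    try:
        text = raw.decode("utf-8")
        encoding = "utf-8"
    except UnicodeDecodeError:
        text = raw.decode("latin-1")
        encoding = "latin-1 (fallback)"

    lines = text.splitlines()
    return {
        "bytes": len(raw),
        "encoding": encoding,
        "lines": len(lines),
        # Blank lines and stray control characters are common cleanliness signals.
        "empty_lines": sum(1 for line in lines if not line.strip()),
        "non_ascii_chars": sum(1 for ch in text if ord(ch) > 127),
        "control_chars": sum(1 for ch in text if ord(ch) < 32 and ch not in "\t\n\r"),
        # A crude view of statistical makeup: the most frequent whitespace-split tokens.
        "top_tokens": Counter(text.lower().split()).most_common(3),
    }
```

Running such a profile over a sample from each data source gives a quick first view of encodings and quirks before deeper, source-specific analysis.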
The Data Model Design (DMD)
The Data Model Design (DMD) is the controlling document for the entire process. Search Technologies has written dozens of fully detailed DMD documents, and has found this to be a very successful method for creating document processing systems that are delivered reliably and on time, even where extremely large, varied, or corrupted data sets are involved. Although each DMD is customized for the particular engagement, most DMDs share some common sections, such as:
- Overview and description: This provides an overview of the document collection for the benefit of the implementation team.
- Metadata Schema: All document collections have associated metadata, and this section identifies each metadata element and how that element is represented in the system, as well as the "arity" (multi-valued constraints) and data format of each element.
- File Processing: This contains details on how files are processed, including the gathering of files into packages (if necessary), format conversions, metadata enhancement, splitting or joining of files, storage in a content management system (CMS) or file system, metadata loading or unloading from outside systems, etc.
- Splitting: This section describes how documents are split (as necessary) into sub-documents. For example, magazines may be split into individual articles for better search presentation.
- Parsing and Extraction: This describes how metadata is gathered from full-text representations of the document through a variety of parsing and extraction technologies, for example the Search Technologies Parser Foundation Classes framework or third-party entity extraction software.
- Index Transform: This describes how document metadata is transformed and then mapped into search engine index fields.
- Search Configuration: A list of all fields and query processing tasks necessary to implement the desired search features.
- Results Presentation: This describes how the search engine index fields are formatted for results presentation.
- Browsing and Navigation: Identifies the search queries and index fields required to implement search navigators (aka facets) and other collection browsing paradigms.
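To make the Metadata Schema and Index Transform sections more concrete, here is a minimal sketch of a schema that records each element's arity and its target index field, plus a transform that enforces the arity while mapping values. The `FieldSpec` class, the `to_index_fields` function, and the field names such as `title_t` are hypothetical and not part of any DMD template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str           # metadata element name
    multi_valued: bool  # arity: True allows many values, False at most one
    index_field: str    # hypothetical target search-index field name


# Example schema; the elements and index field names are assumptions.
SCHEMA = [
    FieldSpec("title", False, "title_t"),
    FieldSpec("author", True, "author_ss"),
]


def to_index_fields(metadata: dict, schema=SCHEMA) -> dict:
    """Map document metadata to index fields, enforcing each element's arity."""
    indexed = {}
    for spec in schema:
        if spec.name not in metadata:
            continue
        value = metadata[spec.name]
        # Normalize every value to a list so arity can be checked uniformly.
        values = list(value) if isinstance(value, (list, tuple)) else [value]
        if not values:
            continue
        if not spec.multi_valued and len(values) > 1:
            raise ValueError(
                f"{spec.name} is single-valued but got {len(values)} values"
            )
        indexed[spec.index_field] = values if spec.multi_valued else values[0]
    return indexed
```

Keeping arity in the schema rather than in the transform code means a violation surfaces as an explicit error during processing instead of as a silently malformed index entry.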
Multiple Document Collections or Streams, & Data Fusion
Care must always be taken when multiple document collections or data streams are merged into a single search experience. Getting this aspect of search implementation right is particularly important to providing a useful “one box” search facility to a diverse audience of users.
DPMS provides well-practiced methods for handling different merging scenarios, including:
- Multiple Distinct Document Collections: To maintain the richness of each collection, the system must accommodate multiple metadata schemas, results presentation templates, and collection-specific search navigators and browsing methods. This ensures that users wishing to drill down into specific collections have a rich, productive search experience. Relevancy ranking (through index translation) must also be handled with care to ensure that each collection is fairly represented in “one box” search results.
- Multiple Overlapping Document Streams: This requires careful detection and handling of duplicates, and can require merging the metadata of documents that appear in more than one stream. Depending on the streams to be incorporated, additional attention may need to be paid to metadata and format normalization.
- Outside Enrichment / Multiple Metadata Streams: Sometimes multiple streams will be combined into a single document, for example when social comments or ratings are combined with editorial classifications or categories and with the raw text content of the document itself. This may require the use of fuzzy or inexact lookup technologies.
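The overlapping-streams scenario can be sketched as follows. This is a deliberately simple illustration, not the DPMS method: duplicates are detected by an exact hash of whitespace- and case-normalized text (a real system may need the fuzzy or inexact matching mentioned above), and metadata is merged by collecting the distinct values seen for each element. The names `fingerprint` and `merge_streams` and the document layout (`text`, `metadata`) are assumptions.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash of case- and whitespace-normalized text, used as a duplicate key."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def merge_streams(*streams):
    """Merge document streams, combining metadata of exact-duplicate documents."""
    merged = {}
    for stream in streams:
        for doc in stream:
            key = fingerprint(doc["text"])
            if key not in merged:
                # Keep the text of the first copy seen; later copies only add metadata.
                merged[key] = {"text": doc["text"], "metadata": {}}
            target = merged[key]["metadata"]
            for name, value in doc.get("metadata", {}).items():
                values = target.setdefault(name, [])
                if value not in values:
                    values.append(value)
    return list(merged.values())
```

Storing every merged element as a list of distinct values keeps provenance visible (for example, both source names), at the cost of requiring the metadata schema to treat those elements as multi-valued.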
Customer Review Process
Once the DMD is complete, Search Technologies recommends the following process for DMD review and update: