Conversions, Metadata Markup, Linguistics and Token Processing

Format conversions are occasionally required depending on the source documents and the needs of the system. Recently implemented examples include:

  • Text extraction:  For the purposes of parsing, search engine indexing, or metadata extraction
  • Metadata Markup:  When the search engine returns snippets from larger documents (for example, articles from a magazine), it is often necessary to mark up the snippet with contextual information such as the volume, date, and page number of the magazine from which the article was extracted
  • Document to HTML:  This can include XML to XHTML transformations or, for example, conversion from an obsolete version of Corel WordPerfect into XHTML.  Such conversions make the document text viewable in a simple web browser
  • PDF Digital Signing:  To ensure the authenticity of PDF documents presented by the search application
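
As an illustration of the metadata markup step described above, the following is a minimal Python sketch that wraps a snippet in HTML carrying its source context. The ArticleContext fields, the mark_up_snippet function, and the HTML layout are all hypothetical names chosen for this example; they are not taken from any Search Technologies tool.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class ArticleContext:
    """Hypothetical contextual metadata for an article extracted from a magazine."""
    magazine: str
    volume: int
    date: str
    page: int


def mark_up_snippet(snippet: str, ctx: ArticleContext) -> str:
    """Wrap a search-result snippet in HTML that carries its source context."""
    citation = f"{ctx.magazine}, vol. {ctx.volume}, {ctx.date}, p. {ctx.page}"
    # Escape both the snippet and the citation so that extracted text
    # cannot break the surrounding markup.
    return (
        '<div class="snippet">'
        f"<p>{escape(snippet)}</p>"
        f"<cite>{escape(citation)}</cite>"
        "</div>"
    )


if __name__ == "__main__":
    context = ArticleContext("Example Monthly", 12, "March 2009", 47)
    print(mark_up_snippet("Search engines return snippets.", context))
```

In a real pipeline the context would typically be attached at indexing time, so that the search engine can return it alongside each snippet.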

Linguistics and Token Processing
Some search applications require additional linguistic and token processing, beyond that provided by the search engine itself. This can include:

  • Lemmatization:  An improved form of stemming, lemmatization uses a full-fledged dictionary to reduce words to their base forms (for example, from "tables" to "table"). Non-dictionary-based methods can sometimes produce inappropriate reductions, especially in applications focused on specialist subjects
  • Special Purpose Searches:  For example, exact-case searches, exact-suffix searches, or word fragment searches
  • Document Structure Searches:  This can include features such as searching over XML structure and embedded document fields
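
The difference between dictionary-based lemmatization and rule-based reduction can be sketched in a few lines of Python. This is a toy illustration only: the three-entry LEMMA_DICTIONARY is a hypothetical stand-in for a full-fledged dictionary, and naive_stem is a deliberately crude suffix stripper written for this example, not a real stemming algorithm.

```python
# Hypothetical miniature dictionary: inflected form -> lemma.
LEMMA_DICTIONARY = {
    "tables": "table",
    "indices": "index",
    "analyses": "analysis",
}


def naive_stem(word: str) -> str:
    """Rule-based reduction: strip a trailing "es" or "s" with no dictionary check."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lemmatize(word: str) -> str:
    """Dictionary-based reduction: words not in the dictionary are left unchanged."""
    lower = word.lower()
    return LEMMA_DICTIONARY.get(lower, lower)


if __name__ == "__main__":
    for term in ("tables", "indices", "diabetes"):
        print(term, "->", naive_stem(term), "|", lemmatize(term))
```

Here the suffix stripper truncates a specialist term such as "diabetes", while the dictionary lookup leaves it alone because it has no entry for it, which is the kind of inappropriate reduction the lemmatization approach is meant to avoid.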

Multiple Collection Merging
Care should be taken when merging documents from disparate collections into a single search engine experience. Search Technologies has evolved a number of techniques and tools to handle these situations, including:

  • Collection-directed relevancy scoring:  This allows each collection of documents to determine how it should be relevancy scored.  This enables, for example, a directory of companies to give preference to the company name, while a collection of general articles emphasizes the title text and a collection of place names emphasizes location.  In this way, each collection self-determines the types of search that most appropriately match its data.
  • Flexible metadata indexing:  For situations where metadata fields are required that were unknown when the system was first installed.  New fields can be added without reconfiguring the whole search application
  • Collection-directed presentation and navigation:  Providing the optimum search experience over disparate document collections often requires collection-level direction, so that if a search is limited to a specific collection, additional search navigators, contextual to that collection, are provided to the user.  Browsing paradigms and search results presentation can also be made collection-specific.
  • Multi-Query:  A method for providing highly accurate searches over varying entity types simultaneously. For example, a single search results presentation can show a list of people, a list of TV channels, and a list of TV shows.
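
One way to picture collection-directed relevancy scoring is as a per-collection table of field weights that the scoring function consults for each document. The Python sketch below assumes a simple term-count score and hypothetical collection names, field names, and weights; it is an illustration of the idea, not how any particular search engine or Search Technologies tool implements it.

```python
from typing import Dict, List

# Hypothetical per-collection field weights; each collection declares its own.
COLLECTION_FIELD_WEIGHTS: Dict[str, Dict[str, float]] = {
    "companies": {"company_name": 5.0, "description": 1.0},
    "articles": {"title": 3.0, "body": 1.0},
    "places": {"location": 4.0, "name": 2.0},
}


def score(document: dict, query_terms: List[str]) -> float:
    """Sum weighted term counts using the weights of the document's own collection."""
    weights = COLLECTION_FIELD_WEIGHTS.get(document["collection"], {})
    total = 0.0
    for field, weight in weights.items():
        tokens = str(document.get(field, "")).lower().split()
        for term in query_terms:
            total += weight * tokens.count(term.lower())
    return total


if __name__ == "__main__":
    company = {
        "collection": "companies",
        "company_name": "Acme Tools",
        "description": "Acme sells tools",
    }
    article = {
        "collection": "articles",
        "title": "Garden tools",
        "body": "Acme tools reviewed",
    }
    print(score(company, ["acme"]), score(article, ["acme"]))
```

Because the weights live with the collection rather than in the scoring code, adding a new collection only requires a new entry in the table.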

User expectations of modern search-based applications require new implementation techniques, architectures and development processes to achieve success within reasonable costs and timeframes.  This is the goal of Search Technologies’ Document Processing Methodology for Search (DPMS).

Over and over again, Search Technologies has been called into projects where "standard" software techniques, development processes and architectures have proven insufficiently flexible or transparent to handle the demands of large-scale search engine implementations.

Through DPMS, we have tried to gather a suite of architectural philosophies, data analysis and design processes, and technology tools that are especially helpful for text search engines, while maintaining a strong focus on enhancing data as the foundation on which excellent search applications can be built.

These methods have proven successful in many extremely demanding search environments, some of which had previously appeared to be hopelessly complex or intractable.

For further information, or to discuss your search application, contact us.