Search and Very Large Publishers
Welcome to my very first blog entry ever. I remember when we first made our search engine in 1989 (ConQuest), people said, "What's a search engine?" Now search engines are everywhere! They have become a household word and I've even seen them featured on the comics pages. It's been quite a ride.
I'll be using this blog to chat about my travels as a search engine architect. Mostly I'll be talking about different search engine applications, for example, searching resumes, highly volatile databases, enormous (multi-billion document) databases, high reliability situations, etc. Each of these different applications requires a different search engine architecture, and I'll "brain-dump" my thinking as I architect these solutions.
Today's topic is Very Large Publishers.
Search Technologies has long experience with these types of customers. Personally, I've worked with Mead Data Central (anecdote: the chairman of the board of my old company, Don Wilson, was the first president of Mead Data), ProQuest/CSA, Encyclopaedia Britannica, several divisions of Thomson Reuters, and many others.
My last few weeks have been spent architecting a brand-new search system for the Government Printing Office (GPO), a large publisher of government documents. They have an enormous number and variety of databases (25 databases for initial launch), and each database has a large number of database-specific metadata fields.
To make matters more interesting, they also have a wide variety of users. A typical query from the public might be:
senator obama and the environment
Whereas a query from a librarian who works for the Federal Depository Library Program (FDLP, a division of the GPO) could look like this:
sponsor:(obama or mccain or clinton) and issuedate:range(2007-01,2007-12) and (environment or wetlands or "global warming" or "green technology")
So naturally, it has been quite a challenge to handle all of this variety!
Here are a few guidelines for architecting search systems for large publishers like the GPO.
First, it is important to get the complexity under control. Our architectural design centralizes the parts of the database design that vary from one database to the next into two places: a series of XSL transforms, which map customer metadata fields into FAST index fields, and a pair of simple configuration files. This architecture allows complete flexibility in database management, but in a controlled fashion that is very easy to extend and administer.
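To make the mapping idea concrete, here is a minimal sketch in Python. It is a stand-in for the XSL transforms rather than the actual GPO design: the database names, the element names (Sponsor, DateIssued, Agency, PubDate) and the index field names are all invented for illustration. The point is only that everything database-specific lives in one small table.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-database tables: customer metadata element -> index field.
# Adding a database means adding one entry here, nothing else.
FIELD_MAPS = {
    "bills": {"Sponsor": "sponsor", "DateIssued": "issuedate", "Title": "title"},
    "federal_register": {"Agency": "agency", "PubDate": "issuedate", "Title": "title"},
}

def to_index_fields(database, record_xml):
    """Map one customer record onto the shared index field names."""
    mapping = FIELD_MAPS[database]
    fields = {}
    for element in ET.fromstring(record_xml):
        index_field = mapping.get(element.tag)
        if index_field is not None:  # elements with no mapping are skipped
            value = (element.text or "").strip()
            fields.setdefault(index_field, []).append(value)
    return fields

# Invented sample record, for illustration only.
record = (
    "<Bill>"
    "<Sponsor>Obama</Sponsor><Sponsor>McCain</Sponsor>"
    "<DateIssued>2007-03</DateIssued>"
    "<Internal>not indexed</Internal>"
    "</Bill>"
)
print(to_index_fields("bills", record))
```

Repeated elements become multi-valued fields, and anything without a mapping entry is quietly dropped, so a new metadata field only shows up in the index once someone adds it to the table.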
Second, we can leverage the XML search capability of the FAST search engine, the poorly-named "scope searching" feature. This allows us to specify a large amount of metadata in a simple XML field. Thus, every database can have a different set of metadata for searching without changing the configuration of the search engine or reindexing documents. Leveraging such a flexible search method is crucial to reducing administration costs.
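The idea can be illustrated with a small Python sketch. This is not the FAST API and says nothing about how FAST implements scope searching. It only shows the principle that each document carries its metadata in one XML field and a search can be restricted to a named element inside it. The function names and the sample metadata are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

def build_metadata_field(metadata):
    """Pack arbitrary per-database metadata into one XML string."""
    root = ET.Element("meta")
    for name, values in metadata.items():
        for value in values:
            ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")

def scope_match(metadata_xml, scope, term):
    """Return True if `term` occurs as a word inside any <scope> element."""
    root = ET.fromstring(metadata_xml)
    term = term.lower()
    return any(
        term in (element.text or "").lower().split()
        for element in root.iter(scope)
    )

# Two documents from different (invented) databases with different metadata.
bill = build_metadata_field({
    "sponsor": ["Obama", "McCain"],
    "committee": ["Environment and Public Works"],
})
notice = build_metadata_field({"agency": ["Environmental Protection Agency"]})

print(scope_match(bill, "sponsor", "obama"))
print(scope_match(notice, "sponsor", "obama"))
```

Both documents sit in the same field, yet the bill has sponsors and the notice does not. Neither one required a change to the index layout.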
Finally, Search Technologies is creating a custom query parser to handle the sophisticated query structure required by advanced librarian searchers. It is unfortunate that FAST provides only two query solutions: the "simple" solution (ideal for casual users, but not sophisticated enough for advanced searches) and "FQL" (brittle, and too complex even for advanced searchers). Fortunately, a custom query language can gracefully handle both situations, and is not too hard to write. This won't be the first time that Search Technologies has written a custom query parser! I know of at least two other cases where we have had to do this (Chick-fil-A and Computer Patent Annuities (CPA)).
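To show why such a parser is not too hard to write, here is a minimal recursive-descent sketch in Python that accepts both example queries from earlier in this post. It is an illustration, not the parser being built for the GPO: the grammar (implicit AND between adjacent words, field:value, range(low,high), quoted phrases), the class name and the tuple-based tree are all assumptions. A real parser would also translate the tree into whatever query form the engine accepts.

```python
import re

# Quoted phrases, single punctuation marks, or runs of other characters.
TOKEN = re.compile(r'"[^"]*"|[():,]|[^\s():,"]+')

class QueryParser:
    def __init__(self, text):
        self.tokens = TOKEN.findall(text)
        self.pos = 0

    def peek(self, offset=0):
        i = self.pos + offset
        return self.tokens[i] if i < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise ValueError(f"expected {expected or 'a term'}, got {token!r}")
        self.pos += 1
        return token

    def parse(self):
        node = self.or_expr()
        if self.peek() is not None:
            raise ValueError(f"unexpected {self.peek()!r}")
        return node

    def or_expr(self):
        parts = [self.and_expr()]
        while self.peek() is not None and self.peek().lower() == "or":
            self.take()
            parts.append(self.and_expr())
        return parts[0] if len(parts) == 1 else ("or", parts)

    def and_expr(self):
        parts = [self.term()]
        # Adjacent terms are ANDed together, with or without the word "and".
        while self.peek() not in (None, ")", ",") and self.peek().lower() != "or":
            if self.peek().lower() == "and":
                self.take()
            parts.append(self.term())
        return parts[0] if len(parts) == 1 else ("and", parts)

    def term(self):
        if self.peek(1) == ":":  # field:value restricts the value to a field
            field = self.take().lower()
            self.take(":")
            return ("field", field, self.atom())
        return self.atom()

    def atom(self):
        token = self.take()
        if token == "(":
            node = self.or_expr()
            self.take(")")
            return node
        if token.lower() == "range" and self.peek() == "(":
            self.take("(")
            low = self.take()
            self.take(",")
            high = self.take()
            self.take(")")
            return ("range", low, high)
        if token.startswith('"'):
            return ("phrase", token.strip('"'))
        if token in (")", ",", ":"):
            raise ValueError(f"unexpected {token!r}")
        return ("word", token.lower())

simple = QueryParser("senator obama and the environment").parse()
advanced = QueryParser(
    'sponsor:(obama or mccain or clinton) and issuedate:range(2007-01,2007-12) '
    'and (environment or wetlands or "global warming" or "green technology")'
).parse()
print(simple)
print(advanced)
```

The casual query comes out as a flat AND of words, while the librarian query becomes an AND of a sponsor field, an issuedate range and an OR of subject terms. One grammar serves both audiences, which is the property the two stock query solutions lack.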
Once all of this is implemented (slated for early 2009), the GPO should have a truly kick-ass search engine. Users will be able to find laws and regulations with simple searches, and librarians will be able to answer sophisticated questions like "How many bills with the word 'environment' have been sponsored by both a Democrat and a Republican?"