“We Love Government Documents”
Whenever I say “We love government documents!” in meetings, it always gets a laugh. I think it’s because it’s like saying “I love doing my taxes!” or “I love colonoscopies!” – these are not the sorts of things that most people are supposed to enjoy.
But the truth is that we do love government documents, and we love them for two reasons. First, they are challenging to do right (and we love challenges). Second, they represent everything we’ve always wanted for text search: the free exchange of information for everyone.
Save the World with Government Documents
Ever since I started in text search, I’ve felt that text search would save the world. I know it sounds funny and corny, but it is absolutely true. Even back then (pre-internet, in other words, the stone age) we could see that unstructured information was exploding all over the place, and that if we could only search it, we could crack open the vast stores of human knowledge for everyone to share.
Of course, in the 1980s and 1990s, the jobs we got were mostly for commercial applications, that is, making more money for people who already had tons of money. This wasn’t exactly “saving the world” as I had envisioned, although we did build plenty of fun search engines for encyclopedias, dictionaries, and news reports, all of which were quite useful to the world at large.
But that’s nothing like government documents. Government documents are the ultimate goal for “save the world” text search geeks like us. Government documents are written by the people, for the people. These are the most important documents in the world and the ones that really affect how we live. Knowing the rules and regulations – what chemicals cause cancer, building codes that help houses stay up in hurricanes, safety zones in the ocean and air – can actually save lives. And government hearings, discussions on the floor of Congress, and laws allow us to keep track of our elected representatives, discover fraud and corruption, and become a well-informed and well-educated populace.
The Challenge of Government Documents
But searching government documents is not easy. There are a whole host of reasons why they require a much higher level of care and data preparation than do most document collections.
Reason #1: They have been around for a very long time
There’s nothing like the government for long-term continuity. Some of the government document collections we work with have been around for hundreds of years.
As everyone knows, things change. Language use changes, document conventions change, electronic formats change, and so on. These changes add up over long periods of time, so when indexing government documents (for the highest possible quality), you must accept a wider range of variation in data structure than you will typically find in most standard text search applications.
What does this mean? It means that government documents require architectures and structures which can handle this variability without breaking the bank. Such architectures must adapt to changes in data structure, must allow for numerous fallbacks, and must have procedural processes (quarantine procedures) for handling the small number of exceptional cases for which it is impractical to program algorithms.
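To make the idea of fallbacks and quarantine concrete, here is a minimal sketch in Python. It is only an illustration under assumed conditions: the two date layouts, the parser names, and the `extract_dates` function are all hypothetical and stand in for whatever variations a real collection contains.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

MONTHS = {name: i + 1 for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"])}

def parse_iso_date(text: str) -> Optional[str]:
    """Hypothetical newer layout: dates written as YYYY-MM-DD."""
    m = re.search(r"\b(\d{4})-(\d{2})-(\d{2})\b", text)
    return m.group(0) if m else None

def parse_long_date(text: str) -> Optional[str]:
    """Hypothetical older layout: dates written as 'March 4, 1898'."""
    m = re.search(r"\b([A-Z][a-z]+) (\d{1,2}), (\d{4})\b", text)
    if not m or m.group(1) not in MONTHS:
        return None
    return f"{m.group(3)}-{MONTHS[m.group(1)]:02d}-{int(m.group(2)):02d}"

# Parsers are tried in order; later entries are the fallbacks.
PARSERS: List[Callable[[str], Optional[str]]] = [parse_iso_date, parse_long_date]

def extract_dates(docs: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Return (dates by document id, ids quarantined for manual review)."""
    indexed: Dict[str, str] = {}
    quarantined: List[str] = []
    for doc_id, text in docs.items():
        for parser in PARSERS:
            value = parser(text)
            if value is not None:
                indexed[doc_id] = value
                break
        else:
            # No parser understood this document: set it aside for a person
            # to look at rather than writing yet another special case.
            quarantined.append(doc_id)
    return indexed, quarantined
```

The point of the design is that a new historical layout means adding one more parser to the list, while the truly odd documents end up in the quarantine list instead of breaking the indexing run.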
Reason #2: Semi-Structured Data
Because these documents have been around for so long, and because they are often used by hundreds if not thousands of lawyers and bureaucrats as part of their daily work processes, many of these documents contain embedded metadata structure which is vital for their use.
For example, rules and regulations will typically mention docket numbers, agency names, and sub-agency names which may not be labeled, but which may be placed in the same position in the file for every regulation. The names of members of Congress may appear in all caps when they are speaking, and editorial notes may be placed throughout with special headers.
And documents are often full of references to other government documents. These “standard references” are much more common in government documents due to the legal nature of these documents, and the requirement for traceability. No law can be added to the official legal code unless its source is carefully identified, be that a bill, an executive order, a treaty, or the ruling of a regulatory review board.
So naturally, extracting this semi-structured data is often essential to maintaining the integrity of the data. It is also very often required by expert users (see below), whose deep research needs depend on document references and metadata coding.
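As a rough illustration of this kind of extraction, the sketch below pulls docket numbers, all-caps speaker names, and standard references out of plain text with regular expressions. The patterns and the `extract_metadata` function are assumptions made for the example; real collections vary far more than three patterns can cover.

```python
import re
from typing import Dict, List

# Illustrative patterns only; each real collection needs its own.
SPEAKER = re.compile(r"^(?:Mr|Mrs|Ms)\. ([A-Z][A-Z'\-]+)\.", re.MULTILINE)
DOCKET = re.compile(r"Docket No\. ([A-Z0-9\-]+)")
REFERENCE = re.compile(r"\b(\d+) (U\.S\.C\.|CFR) (\d+(?:\.\d+)?)")

def extract_metadata(text: str) -> Dict[str, List[str]]:
    """Pull unlabeled but conventionally formatted metadata out of text."""
    return {
        # A speaker is a title followed by a surname in capitals at line start.
        "speakers": sorted(set(SPEAKER.findall(text))),
        "dockets": DOCKET.findall(text),
        # Each reference is rebuilt as "title code section".
        "references": [f"{title} {code} {section}"
                       for title, code, section in REFERENCE.findall(text)],
    }

# Invented sample text, not taken from any real document.
SAMPLE = (
    "Docket No. EX-2001-0042\n"
    "Mr. SMITH. I rise in support of the amendment to 40 CFR 60.1.\n"
    "Ms. JONES. The bill amends 42 U.S.C. 1983 as well.\n"
)
```

Calling `extract_metadata(SAMPLE)` returns a dictionary whose values can be indexed as separate fields, so that an expert can search by speaker or jump straight to a cited section.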
Reason #3: Document structures (Table of Contents, Indexes)
Since these documents have been around for so long, typically in print editions, many of them have their own ecosystems of supporting documents. These supporting documents can include multi-level tables of contents and indexes of many different types (name index, subject index, congress member index, etc.).
Further, the documents themselves may be organized into physical volumes and logical hierarchies, such as title, sub-title, chapter, sub-chapter, section, sub-section, part, sub-part, etc. Often these sections are uniquely numbered so they can be used as references, either from other government documents or within the document collection itself.
Translating this structure into an effective search is perhaps the most difficult challenge one faces when processing government documents. It requires many sophisticated techniques, such as building and traversing document hierarchies, propagating metadata up and down the hierarchy, accumulating smaller sections into larger sections (and summarizing metadata such as part-number ranges across smaller sections), and cross referencing from indexes and TOCs into the main document collection so that editorially extracted metadata can be applied directly to the documents to which it applies.
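Two of the techniques above, propagating metadata down the hierarchy and accumulating part-number ranges back up, can be sketched in a few lines of Python. The `Node` class, its field names, and the sample labels are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Node:
    label: str
    metadata: Dict[str, str] = field(default_factory=dict)
    part_number: Optional[int] = None             # set on leaf parts only
    children: List["Node"] = field(default_factory=list)
    part_range: Optional[Tuple[int, int]] = None  # filled in by accumulate_up

def propagate_down(node: Node, inherited: Optional[Dict[str, str]] = None) -> None:
    """Children inherit the parent's metadata unless they override it."""
    node.metadata = {**(inherited or {}), **node.metadata}
    for child in node.children:
        propagate_down(child, node.metadata)

def accumulate_up(node: Node) -> Optional[Tuple[int, int]]:
    """Summarize the part numbers below each node as a (low, high) range."""
    ranges = [r for r in (accumulate_up(c) for c in node.children) if r is not None]
    if node.part_number is not None:
        ranges.append((node.part_number, node.part_number))
    if ranges:
        node.part_range = (min(lo for lo, _ in ranges), max(hi for _, hi in ranges))
    return node.part_range

# Invented hierarchy: one title, one chapter, two parts.
chapter = Node("Chapter I", children=[
    Node("Part 50", part_number=50),
    Node("Part 60", metadata={"agency": "Sub-agency Y"}, part_number=60),
])
title = Node("Title A", metadata={"agency": "Agency X"}, children=[chapter])
propagate_down(title)
accumulate_up(title)
```

After the two passes, every part carries an agency value even though only some were labeled, and the title and chapter each know which range of parts they contain.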
Reason #4: Search for the Experts vs. Search for the Citizens
Finally, for all government documents, “search” faces a fundamental contradiction in purpose: how do you create a search system that works for both the expert user and the average citizen?
Lay users will want a “one box” search which works like Google – just enter a few terms and do the best you can. Such a search will need to depend heavily on careful relevancy ranking, especially across selectively weighted metadata fields.
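As a toy illustration of ranking across selectively weighted metadata fields, the sketch below scores each document by counting query terms per field and multiplying by a field weight. The weights, the field names, and the scoring formula are assumptions for the example; a production engine would use a more careful relevancy model.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical weights: a hit in the title counts more than a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "agency": 2.0, "body": 1.0}

def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def score(doc: Dict[str, str], query: str) -> float:
    """Sum of weighted term counts over all weighted fields."""
    terms = tokenize(query)
    total = 0.0
    for field_name, weight in FIELD_WEIGHTS.items():
        counts = Counter(tokenize(doc.get(field_name, "")))
        total += weight * sum(counts[t] for t in terms)
    return total

def search(docs: List[Dict[str, str]], query: str) -> List[str]:
    """Return ids of matching documents, best score first (ties by id)."""
    scored = [(score(d, query), d["id"]) for d in docs]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ranked if s > 0]

# Invented sample records.
DOCS = [
    {"id": "r1", "title": "Hurricane building codes", "agency": "Housing",
     "body": "Roof standards."},
    {"id": "r2", "title": "Roof inspection notice", "agency": "Housing",
     "body": "Applies to hurricane zones and hurricane shelters."},
    {"id": "r3", "title": "Fishing seasons", "agency": "Oceans",
     "body": "No relevant terms."},
]
```

With these weights, a search for "hurricane" ranks the record with the term in its title above the record that only mentions it in the body, even though the body mentions it twice.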
Expert users will have a deep understanding of the document structure and will want to leverage that structure in their searches. This usually means a number of expert features, such as complex query structures (with field and structure selectors), search within portions of the collection (and portions of documents), quick jumps to document references, case-sensitive searches, exact suffix searches, document-order sorting, and so on.
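Three of these expert features (a field selector, case-sensitive matching, and document-order sorting) can be sketched as follows. The `Section` class and the `expert_search` function are hypothetical names for this example, and the word matching is deliberately simplistic.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Section:
    order: int    # position of the section in the printed document
    heading: str
    text: str

def expert_search(sections: List[Section], term: str,
                  field: str = "text",
                  case_sensitive: bool = False) -> List[Section]:
    """Match a whole word in one chosen field; return hits in document order."""
    needle = term if case_sensitive else term.lower()
    hits = []
    for section in sections:
        # Split the selected field into words and trim trailing punctuation.
        words = [w.strip(".,;:()") for w in getattr(section, field).split()]
        if not case_sensitive:
            words = [w.lower() for w in words]
        if needle in words:
            hits.append(section)
    # Experts usually want results in the order they appear in the document,
    # not in relevancy order.
    return sorted(hits, key=lambda s: s.order)

# Invented sample sections, stored out of order on purpose.
SECTIONS = [
    Section(3, "Definitions", "The Act applies."),
    Section(1, "Purpose", "Agencies shall act promptly."),
    Section(2, "Scope", "This Act governs."),
]
```

A case-sensitive search for "Act" finds only the sections that name the Act itself, while the default search also finds the verb "act"; in both cases the hits come back in document order.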
Combining these two needs into a single user interface takes patience, cooperation, and a keen understanding of how each user interface feature will affect users at both ends of the spectrum. Here is where a long-term history in search can really pay off.
“We Love Government Documents”
And so you see, it’s true: We love government documents. But even more than that, we love people who love government documents, because these are the people who are so grateful for good search systems. “I never use the old search,” one librarian told me the other day. “The new system is so much more powerful.”
And that makes it all worthwhile.