The Mysterious Third Dimension of Scalability: Complexity
Over the last year or so, I've come to realize that there is a mysterious third dimension of scalability, and that is scalability towards ever increasing complexity.
I know what you're thinking: "Complexity? Are you crazy? Systems should be designed to be simple. K.I.S.S., right?"
I understand your point of view, and I am certainly a huge fan of simple, straight-forward architectures. After all, the best way to avoid problems in component X is to create an architecture which doesn't need component X at all.
But there are situations where complexity is a requirement, and it's best to deal with the problem head-on, rather than simply hoping and wishing it won't become an issue. I have two examples:
Example 1: A Hosted Search Site - Hosted search sites can either be internal (for example, all of the divisions within a large pharmaceutical company sharing a central search engine) or external (a service-based search offering). Search Technologies has experience with architecting both such systems.
Hosted search sites, by their very definition, must be able to handle many customers, and the more customers the better. Provide a system which is too feature-sparse and too inflexible, and it will be applicable to too few customers and the site will not be cost effective. There will always be a desire to "add more", and if architected properly from the start, this impulse can be embraced rather than feared.
Example 2: Publishers - The model for growth by publishers has always been to expand their offerings to more and more defined niche markets. This is true for publishers of academic and technical offerings ("Aerospace Power Journal", "Resources for Feminist Research"), general public offerings ("Arizona Foothills Magazine", "Inside Kung-Fu"), or government documents ("Riddick's Rules and Procedures", "Ways and Means Committee Prints").
Publishers will have many different collections of documents, each with its own metadata, display requirements, and relevancy ranking desires. But more importantly, each collection has a unique, defined, and invested customer base. Such a wide variety of customers will each want to search and view their own offering in a way that is most natural for their world.
Architectures Which Manage Complexity
When architecting a search system which must scale to a very wide variety of features and presentations, I strive to achieve the following goals:
- Allow collections of documents to "process themselves"
- Create transparent, extensible data objects and indexes
- Reduce or eliminate sharing
Let's discuss each of these goals individually.
Collections which "process themselves"
Recently, we've been moving more towards architectures where the collections of documents process themselves. If we imagine each document as an object which lives within a collection of documents, that document can have methods for presentation and relevancy ranking. These may be methods just for that document or document type, or (more likely) methods inherited from the collection as a whole.
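This "documents as objects" idea can be sketched in a few lines. Everything here is hypothetical and purely illustrative (the class and method names are not from any particular product): a document uses the ranking and presentation methods inherited from its collection unless it overrides them.

```python
# Hypothetical sketch: documents inherit processing methods from their collection.

class Collection:
    """Default behavior shared by every document in the collection."""
    def rank_fields(self):
        # Which fields matter most when ranking documents in this collection.
        return {"title": "most important"}

    def present(self, fields):
        return f"<b>{fields['title']}</b>"

class MagazineCollection(Collection):
    """A collection can override the inherited defaults for its own world."""
    def rank_fields(self):
        return {"author": "very important", "title": "most important"}

class Document:
    def __init__(self, collection, fields):
        self.collection = collection
        self.fields = fields

    # Unless a document type overrides these, it uses its collection's methods.
    def rank_fields(self):
        return self.collection.rank_fields()

    def present(self):
        return self.collection.present(self.fields)

doc = Document(MagazineCollection(), {"title": "On Writing", "author": "J. Updike"})
print(doc.rank_fields()["author"])  # very important
print(doc.present())                # <b>On Writing</b>
```

The point of the inheritance is that most documents need no code of their own; only the collections (or the rare special document type) carry behavior.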
In practical terms, we have achieved this goal in a variety of ways. First there is a notion of relevancy ranking bins. If the search engine provides a series of these bins, each scored appropriately, then each collection or document can determine for itself which of its own fields should be considered "most important", "very important", or "less important".
Technically, this is usually done using XSL transforms or componentized software invoked via a lookup table keyed on the document type or collection code. This code copies metadata from the document into the appropriate relevancy ranking bins, occasionally transforming the metadata as necessary.
For example, the authors of articles in "The New Yorker Magazine" are very important to the readership, since this magazine often has contributions from well known authors (such as John Updike, Woody Allen, Salman Rushdie, etc.). Therefore, the "author" field might be indexed as "very important". Contrast this to "Travel and Leisure", where the "location" metadata field would naturally be indexed as "very important". In this way, each collection can determine how it should best be searched.
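To make the lookup-table approach concrete, here is a minimal sketch (the collection codes and bin names are invented for illustration): each collection declares which of its fields land in which relevancy bin, and a generic routine does the copying.

```python
# Hypothetical lookup table: collection code -> {metadata field: relevancy bin}.
BIN_ASSIGNMENTS = {
    "newyorker": {"author": "very_important", "title": "most_important"},
    "travel":    {"location": "very_important", "title": "most_important"},
}

def fill_bins(collection_code, metadata):
    """Copy each metadata field into the relevancy bin its collection chose."""
    bins = {}
    for field, value in metadata.items():
        bin_name = BIN_ASSIGNMENTS.get(collection_code, {}).get(field)
        if bin_name:
            bins.setdefault(bin_name, []).append(value)
    return bins

bins = fill_bins("newyorker", {"author": "Salman Rushdie", "title": "A Story"})
print(bins)  # {'very_important': ['Salman Rushdie'], 'most_important': ['A Story']}
```

Note that the copying code knows nothing about authors or locations; adding a new collection means adding one table entry, not new code.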
Second, documents or collections may define how they should be displayed on the search results. Rather than having a single presentation method into which all documents fit themselves, it is possible for each document type or collection to specify its own method for presentation. This allows presentations to be tailored to the needs of an individual collection. This can be as simple as choosing and formatting metadata, or as complex as displaying thumbnails and other multimedia.
Search Technologies has achieved this with clever use of presentation scripting. This works as follows: 1) The user interface requests the HTML for presenting a document, 2) The search API layer fetches the document's type or collection code and uses this to locate the presentation method (the presentation method is specified as a script in a configuration file for easy modification), 3) The API executes the script which returns the formatted HTML to be included in the search results.
Such an architecture allows for the ultimate flexibility in search results presentation, and allows for each document and/or collection to tailor its presentation appropriately.
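The three steps above can be sketched as follows. The configuration format and the use of eval here are illustrative assumptions, not how any particular product does it; a real system would sandbox these scripts.

```python
# Hypothetical sketch of the presentation-script lookup described above.
# A "configuration file" maps collection codes to presentation scripts;
# here the scripts are plain Python expressions evaluated over the document.

PRESENTATION_CONFIG = {
    "magazine": "'<h3>' + doc['title'] + '</h3><i>' + doc['author'] + '</i>'",
    "report":   "'<h3>' + doc['title'] + '</h3>'",
}

def render_result(doc):
    # 1) The UI requests HTML; 2) the API looks up the script by collection code...
    script = PRESENTATION_CONFIG[doc["collection"]]
    # 3) ...and executes it, returning formatted HTML for the results list.
    return eval(script, {"doc": doc})

doc = {"collection": "magazine", "title": "Inside Kung-Fu", "author": "Various"}
print(render_result(doc))  # <h3>Inside Kung-Fu</h3><i>Various</i>
```

Because the scripts live in configuration rather than code, a new collection can change its results presentation without redeploying the system.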
Third, collections may choose any of a number of other presentation features, such as specifying navigators (aka facets or drill-down parameters) to be displayed when searched, or choosing search fields to be presented on an advanced search page.
Finally, for some customers we go even further. Documents can split themselves up into pieces and index each piece separately (this is good for things like magazines which contain multiple articles), documents can parse themselves into metadata, and perform other types of processing such as PDF digital signing and HTML conversion.
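As an illustrative sketch of the "split themselves up" case (the record structure and field names here are invented), a magazine issue might yield one indexable record per article, each inheriting the issue's metadata:

```python
# Hypothetical: a compound document splits itself into indexable pieces.
def split_into_pieces(issue):
    """Yield one indexable record per article, inheriting issue metadata."""
    for n, article in enumerate(issue["articles"], start=1):
        yield {
            "id": f"{issue['id']}-{n}",
            "magazine": issue["magazine"],
            "title": article["title"],
            "body": article["body"],
        }

issue = {
    "id": "af-2011-06",
    "magazine": "Arizona Foothills Magazine",
    "articles": [
        {"title": "Desert Hikes", "body": "..."},
        {"title": "Summer Dining", "body": "..."},
    ],
}
pieces = list(split_into_pieces(issue))
print([p["id"] for p in pieces])  # ['af-2011-06-1', 'af-2011-06-2']
```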
Transparently Extensible Data Objects and Indexes
These days, my favorite acronym is XML. And of those three letters, by far the most important is X.
X = eXtensible, and this is a glorious feature when architecting for managed complexity.
When I talk about "transparently eXtensible data objects", I mean that documents and collections can add their own metadata fields, and that these metadata fields can flow unencumbered through the system. Like "oil", Mozart would say.
XML is a wonderful medium for this. One can add elements to XML - typically in predefined "collection specific" or "document specific" areas - without affecting downstream processing, assuming that your system is architected properly. These new metadata fields can then flow through the system and magically find their ways into search engine indexes, or onto search results.
The key is to ensure that all of "the parts in the middle" are both data neutral and transparently extensible; that is, they allow for new and/or changing metadata elements without any reconfiguration or (especially) reprogramming.
There are various technologies and features which aid in creating such systems. The first is XML, which can be processed with XSLT, the standard language for manipulating XML documents. XSLT works very well in these situations since it can easily do things like block copy metadata attributes without having to know the details.
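As a sketch of what "data neutral" means in practice, here is a pipeline stage (written in Python rather than XSLT, for brevity; the element name "collectionSpecific" is an invented convention) that block-copies whatever fields appear in a document's collection-specific area, without knowing their names in advance:

```python
import xml.etree.ElementTree as ET

def copy_collection_fields(doc_xml):
    """Block-copy every element under <collectionSpecific> into the index
    record without naming the fields in advance (the XSLT equivalent would
    be an identity-style copy of that subtree)."""
    root = ET.fromstring(doc_xml)
    record = {}
    specific = root.find("collectionSpecific")
    if specific is not None:
        for field in specific:
            record[field.tag] = field.text
    return record

doc = """<document>
  <title>Ways and Means Committee Prints</title>
  <collectionSpecific>
    <congress>112</congress>
    <chamber>house</chamber>
  </collectionSpecific>
</document>"""
print(copy_collection_fields(doc))  # {'congress': '112', 'chamber': 'house'}
```

A collection that later adds a third field to its area flows through this stage untouched; no reconfiguration, no reprogramming.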
Some search engines have special features that can help as well. FAST allows for XML fields (called "Scope Fields" in FAST parlance) which can index any arbitrary XML document and allow for unconstrained searches within it. For example, I could perform a search like this:
xml:mods:extension:congMember:(@chamber:senate and name:mikulski)
which would match XML of roughly this shape (reconstructed from the query above):

    <mods>
      <extension>
        <congMember chamber="senate">
          <name>mikulski</name>
        </congMember>
      </extension>
    </mods>
Of course, such a search expression is overwhelming for the typical user, so usually a custom query parser or advanced search form is recommended (ideally, the fields available to the query parser and advanced search form are specified by the collection as well).
Lucene has taken a different approach, in that it does not require any pre-defined search fields. Field names can be specified by the document or collection at index time.
So, if a document needs a "fiscalyear" field, a transform, say, can emit an element of roughly this shape (the value here is illustrative):

    <fiscalyear>2008</fiscalyear>

which is then automatically indexed into a "fiscalyear" field in the Lucene indexes. Lucene does not require "fiscalyear" to be predefined, so each document and/or collection can have its own metadata fields.
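Lucene itself is Java, but the essential idea (field names supplied at index time rather than predeclared in a schema) can be sketched with a toy inverted index; everything here is illustrative:

```python
from collections import defaultdict

class TinyIndex:
    """Toy schema-less index: any field name may appear at index time."""
    def __init__(self):
        # (field, term) -> set of document ids; no field registry needed.
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        for field, value in fields.items():
            for term in str(value).lower().split():
                self.postings[(field, term)].add(doc_id)

    def search(self, field, term):
        return self.postings.get((field, term.lower()), set())

idx = TinyIndex()
# "fiscalyear" was never predeclared anywhere; the document simply supplies it.
idx.add("doc1", {"title": "Budget Report", "fiscalyear": "2008"})
idx.add("doc2", {"title": "Annual Review"})
print(idx.search("fiscalyear", "2008"))  # {'doc1'}
```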
Another technique for transparent data structures is packing and unpacking (known as "serialization" and "deserialization" in programmer parlance). This is typically used for fields which are display-only. A collection or document can say "I need these 4 fields for my search results", and can pack them into a "save me aside for display" field (we like to call this the "resultsbundle"). Then, when the search result needs to be presented, the fields can be unpacked from the resultsbundle and then made available for presentation.
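A minimal sketch of the resultsbundle idea, using JSON as the packing format (the field names are invented; "resultsbundle" is the term from the text, not a product feature):

```python
import json

def pack_resultsbundle(doc, display_fields):
    """Pack the display-only fields into a single stored field."""
    return json.dumps({f: doc[f] for f in display_fields if f in doc})

def unpack_resultsbundle(bundle):
    """At presentation time, recover the fields for the results page."""
    return json.loads(bundle)

doc = {"title": "Riddick's Rules and Procedures", "edition": "1991",
       "body": "large body text that need not be stored for display"}
bundle = pack_resultsbundle(doc, ["title", "edition"])
print(unpack_resultsbundle(bundle))
# {'title': "Riddick's Rules and Procedures", 'edition': '1991'}
```

Because the bundle is opaque to the index, each collection can pack a different set of fields without the middle layers knowing or caring.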
And a final technique is the "generic" fields – often used for things like navigators (aka facets or drill-down parameters). Ten (or so) such fields can be predefined in the indexes, available for individual collection or document use. Once indexed, the collections can choose which fields should be displayed as navigators when the user has chosen the collection.
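A sketch of how a handful of predefined generic fields might be given collection-specific meaning as navigators (the collection codes, field names, and labels are all invented for illustration):

```python
# A few predefined generic fields live in the index; each collection decides
# what (if anything) each one means for faceted navigation.
NAVIGATOR_LABELS = {
    "law":      {"generic1": "Congress", "generic2": "Chamber"},
    "magazine": {"generic1": "Author", "generic2": "Issue Date"},
}

def navigators_for(collection_code, doc):
    """Return the facet labels and values to display for this collection."""
    labels = NAVIGATOR_LABELS.get(collection_code, {})
    return {label: doc[g] for g, label in labels.items() if g in doc}

doc = {"generic1": "112", "generic2": "house", "title": "Committee Print"}
print(navigators_for("law", doc))  # {'Congress': '112', 'Chamber': 'house'}
```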
Reduce or Eliminate Sharing
This is not so much a search engine goal as a general architectural goal. In systems designed for high levels of complexity and variety, sharing should be reduced and/or eliminated where possible.
Simple examples include configuration files which contain lists of collections. It is much better to have separate configuration files, one for each collection where possible.
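As a small sketch of the one-file-per-collection approach (the directory layout and JSON format are assumptions for illustration), the system discovers collections by scanning a directory rather than parsing one shared list:

```python
import json
import tempfile
from pathlib import Path

# One configuration file per collection, rather than one shared list:
# adding or removing a collection touches only its own file.
def load_collection_configs(config_dir):
    return {p.stem: json.loads(p.read_text())
            for p in Path(config_dir).glob("*.json")}

with tempfile.TemporaryDirectory() as d:
    Path(d, "magazine.json").write_text('{"navigators": ["author"]}')
    Path(d, "law.json").write_text('{"navigators": ["congress"]}')
    configs = load_collection_configs(d)
    print(sorted(configs))  # ['law', 'magazine']
```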
Binary libraries can, of course, be shared, as long as multiple versions can exist in the system simultaneously. There is no reason to upgrade collections which are working just fine. Please, don't fix what's not broken.
Finally, collections need to be individually deployable. This is the entire reason why we eliminate sharing. If each collection becomes an independent actor within the system, then these actors can be individually upgraded, added to, or removed without affecting other collections. It is crucial for the overall system design to take this into account.
Conclusion: Complexity Is Fun!
When first approaching a new system with a wide ranging and seemingly infinite set of requirements and functions, I think the first instinct is to just give up.
But, of course, that's no good, and so the second instinct is to tackle the first few problems, hoping, I think, that after the first few collections, customers, or requirements are solved, all the rest will just naturally work out.
But one can do better, and one should do better. Naturally it's impossible to predict all future requirements and situations when first designing such systems in which complexity scaling is an issue.
But one should at least be able to determine the areas of the system where there will likely be the most variety. If you concentrate on those areas, and design architectures which isolate the complexity and variability into well-defined places, then you are on your way to creating an architecture which is designed for complexity scalability.