The Six Commandments of Search Implementation
I was recently asked to do a presentation on "Search Best Practices". Naturally, this is a subject that I think about all the time. What makes a good search system? What works? What doesn't?
Personally, I don't care much for the phrase "Best Practices". It's a squishy phrase which has become overused - often to justify what is nothing more than an opinion.
So let's step it up. What follows are "Search Commandments". These aren’t just opinions. These are search system design rules which have proven themselves over and over again in countless search applications.
Commandment #1: Thou Shall Not Join
Search is fast. Millions of records can be searched in a fraction of a second. How is this possible? Because there are no joins.
Search engines operate by splitting the problem into many smaller problems, and then executing all of these in parallel:
This is only possible if you don’t join tables together. Joining tables is a fabulously powerful and indispensable RDBMS technique. But it is a terrible idea when implementing a search engine. It will slow your search results to the point where you no longer meet user expectations.
How do you avoid joins? Generally, this means flattening (de-normalizing) all of your tables into a single table structure.
We realize that this means more indexing (sometimes a lot more) and larger indexes (sometimes a lot larger), but that is the price we pay to maintain sub-second performance over millions of records. In (nearly) every situation, scaling indexing and indexes is easier and more practical than doing joins across multiple search results.
Of course, we realize that there are some search engines which do provide embedded joins, and in specialized circumstances (e.g. small databases, low query rates, small numbers of users) these joins may be acceptable. But in general, any system of substantial size will need to avoid joins as much as possible.
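To make the flattening idea concrete, here is a minimal sketch in Python. The `products` and `vendors` tables, field names, and values are all hypothetical; the point is that the "join" is done once, at indexing time, so each search document is self-contained and no join is needed at query time.

```python
# Hypothetical example: flattening a "products" table and its related
# "vendors" table into self-contained search documents at index time.

products = [
    {"id": 1, "name": "Handlebar", "vendor_id": 10},
    {"id": 2, "name": "Saddle", "vendor_id": 11},
]
vendors = {
    10: {"vendor_name": "Acme Cycles", "country": "US"},
    11: {"vendor_name": "Moto Parts", "country": "JP"},
}

def flatten(product, vendors):
    """Copy the joined vendor columns directly into the search document."""
    doc = dict(product)
    doc.update(vendors[product["vendor_id"]])
    return doc

search_docs = [flatten(p, vendors) for p in products]
# Each document now carries everything a query or result page needs,
# at the cost of repeating vendor data across many documents.
```

This is the indexing/size trade-off described above: the vendor data is duplicated into every product document, but queries stay sub-second because nothing has to be joined at search time.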
Commandment #2: Treat Thy Search Engine Not As Thy RDBMS
Search Engines are not relational databases, and you should not approach them as if they were. Specifically:
- Don’t: Expect joins (see previous discussion)
- Don’t: Use ETL – Extract, Transform, Load
- Don’t: Use Spring / Hibernate
- Don’t: Use your search engine as a repository
- Don’t: Process large results with complex business logic
The message here is that a search engine is primarily an index, and not that great as a repository for data storage. Further, the “RDBMS ecosystem” (development tools, processes, techniques, typical approaches to solutions, etc.) is generally not that useful for search engines. Indeed, we’ve been involved in the rescue of numerous distressed search systems where the fundamental problems were caused by the application of RDBMS “best practices”.
Why Not Extract, Transform, and Load (ETL)?
ETL operations typically operate over whole tables or large batches of data, applying a single operation to all records before going on to the next operation. This works well if the data is relatively clean and well structured, and it provides good performance within an RDBMS environment.
Unfortunately, the raw input for typical search engine applications, document data, is dirty and unstructured. This means that most ETL operations will fail most of the time when operating over typical search engine data sets. This leads to endless frustration and data loading delays.
Instead, Search Technologies recommends processing records one at a time with an integrated quarantine process for aberrant data records. This architecture will easily beat standard ETL techniques in quality, completeness, and time to market.
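A minimal sketch of this record-at-a-time architecture, assuming a hypothetical `normalize` transform: one aberrant record is set aside in a quarantine for later inspection, instead of failing (or silently corrupting) the whole batch the way a table-wide ETL step would.

```python
# Hypothetical sketch of record-at-a-time processing with a quarantine.

def normalize(record):
    """A stand-in transform; real pipelines chain many of these."""
    return {"id": record["id"], "title": record["title"].strip().lower()}

def process(records):
    indexed, quarantined = [], []
    for record in records:
        try:
            indexed.append(normalize(record))
        except (KeyError, AttributeError) as err:
            # Aberrant record: keep it, with the reason, for later review.
            quarantined.append({"record": record, "error": repr(err)})
    return indexed, quarantined

good, bad = process([
    {"id": 1, "title": "  Annual Report "},
    {"id": 2},                      # missing title  -> quarantined
    {"id": 3, "title": None},       # wrong type     -> quarantined
])
```

One dirty record no longer blocks the other million; the quarantine becomes a worklist for data-quality fixes rather than a 2 a.m. production incident.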
Why Not Spring / Hibernate?
If you’re a Java programmer working with relational databases, you probably love Spring and Hibernate. And yes, these tools make it quite easy to manipulate and move records of data around your application.
But not easy enough. Search engine documents have large and complex metadata schemas which are far more dynamic than can be gracefully handled by Spring/Hibernate architectures:
- Hundreds and sometimes thousands of complex structured metadata fields
- Not all fields are known ahead of time, new fields are being discovered all the time
- Many fields are carried through the system “just in case they are needed someday”
- Fields are often dynamically created or dynamically provided by outside sources with no prior notification
The problem with Spring/Hibernate is that changes in fields require recompiling and redeploying the software. Search engine systems which depend on Spring/Hibernate architectures will spend endless time in test & deployment cycles.
Instead, Search Technologies recommends an XML (or JSON) based architecture. The eXtensibility of XML is ideal for dynamic environments. Properly designed, an XML-based architecture will allow for dynamically generated fields to flow through the system, to be created or deprecated as needed, all without causing re-deployments of the base source code.
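The following is a minimal sketch of the idea, using JSON rather than XML for brevity. The field names and the `enrich` stage are hypothetical; the point is that pipeline stages treat documents as open key/value data, so a field that was never declared anywhere still flows through with no recompilation or redeployment.

```python
# Hypothetical sketch of a JSON-based document model: fields are plain
# key/value data, so new fields from outside sources pass through freely.
import json

def enrich(doc):
    """Stages read and write fields by name; unknown fields pass through."""
    doc["title_length"] = len(doc.get("title", ""))
    return doc

incoming = json.loads(
    '{"id": "d1", "title": "Quarterly Report", "new_field_2024": "x"}'
)
outgoing = enrich(incoming)
# "new_field_2024" was never declared in any schema or class,
# yet it survives the pipeline intact.
```

Contrast this with a Spring/Hibernate-style mapped class, where `new_field_2024` would require a code change, a recompile, and a redeployment before it could pass through the system.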
Why Can’t My Search Engine be a Repository?
Search Engines are not repositories. They do not have locking, two-phase commit, or transaction journals. They cannot be depended upon to safely hold data in the face of system or hardware failure.
In addition, search engine indexes may not be the most efficient place to store large blocks of data – such as original document data. It may be best to store documents in a simple file system, RDB, or Content Management System instead.
Of course, you will likely want a text rendition of the data for “highlighted teasers,” those snippets that you see under each search result. But other than this, we generally do not recommend storing document data in the index unless it is used for search.
Beware of Processing Large Search Result Sets with Complex Business Logic
This is an extension of the “no joins” rule. Basically, to maintain performance, you should be very careful not to select large result sets (i.e. over 1,000 documents) from the search engine and then process them with complex business logic.
The issue is that selecting large result sets from the search engine is very slow, and for many search engines it may not even be possible (or at least not possible without multiple transactions).
Wherever possible, we prefer performing this logic either as part of the search expression or during document processing, before the document is indexed. 99% of the time, this is possible. If you are in the rare 1% which must have complex business logic for search results, it may be worth customizing the search engine from the inside out (i.e. modifying the engine itself, which can be done with open source search systems such as Solr/Lucene), but beware that this is a difficult task which should only be attempted by experts.
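Here is a minimal sketch of moving business logic to indexing time. The "is this document current?" rule, the field names, and the dates are hypothetical; the point is that the logic runs once per document when it is indexed, so the query becomes a cheap field filter instead of a post-search pass over thousands of results.

```python
# Hypothetical sketch: compute business logic once at indexing time
# and store the answer as a field, instead of fetching large result
# sets and applying the logic at query time.
from datetime import date

def add_index_time_fields(doc, today=date(2012, 1, 1)):
    """ISO-8601 date strings compare correctly as plain strings."""
    doc["is_current"] = doc["expires"] >= today.isoformat()
    return doc

docs = [
    add_index_time_fields({"id": 1, "expires": "2015-06-30"}),
    add_index_time_fields({"id": 2, "expires": "2010-01-15"}),
]
# The search query then becomes a simple filter, e.g. is_current:true,
# which the engine can evaluate over millions of documents in milliseconds.
```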
Commandment #3: Store in Thy Search Engine Only What Is Most Needed for Thy Search Results
To maintain engine performance and stability, you should be careful to only store a minimum amount of data in your index. Otherwise, the engine will end up sifting through large amounts of data, index caching will be adversely affected, and performance will degrade.
Generally speaking, your index should only contain:
- Data which is indexed to execute searches
- Data which is necessary for the presentation of search results
Here are some examples of situations I’ve encountered where excess data was stored in the index:
- “We didn’t know what they wanted, so we just put it all in there.”
  - Implement better processes to determine content requirements before data is loaded.
- “We’re serving the entire document from the search engine.”
  - Use a real repository instead.
- “We have over 300 fields!”
  - Consider using dynamic fields, indexing XML data directly, or combining fields where possible.
In all of these cases, we recommend reducing the amount of data in the index to improve performance.
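A minimal sketch of the whitelisting this implies, with hypothetical field names: each source record is stripped down to (a) fields that are searched and (b) fields needed to render the result, before it goes anywhere near the index.

```python
# Hypothetical sketch: keep only searched and displayed fields;
# everything else stays in the repository of record.
SEARCHED = {"title", "body"}
DISPLAYED = {"title", "teaser", "url"}

def to_index_doc(record):
    """Drop any field that is neither searched nor displayed."""
    keep = SEARCHED | DISPLAYED
    return {k: v for k, v in record.items() if k in keep}

record = {
    "title": "Widget Manual",
    "body": "Full extracted text ...",
    "teaser": "Full extracted text ...",
    "url": "/docs/widget",
    "raw_pdf_bytes": "<large binary payload>",  # belongs in a repository
    "audit_trail": "...",                       # never searched or shown
}
doc = to_index_doc(record)
```

Making the whitelist explicit also forces the "why is this field in the index?" conversation to happen before data is loaded, rather than after performance has degraded.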
Commandment #4: Be Ye Not Careless As To What Is a Document
This is the most difficult of all of the commandments. Determining “what is a document” when architecting a search engine implementation is more of an art than a science.
Before you begin, you should ask yourself some questions:
- What is the best granularity to index?
- What granularity is required to avoid joins? What granularity is required to reduce the number of documents in the index to a manageable number?
- What is the most usefully searchable unit?
- When the user does a search, what unit of result will likely be most useful for them? What unit would they most likely want read or print?
- What is the most semantically consistent unit? What unit has good, semantically useful titles?
- What unit of search is generally about the same subject? Are there larger units which may contain multiple, diverse subjects that should be split into pieces before indexing?
- Where can you reasonably break a document?
- Do you have physical files or embedded marks which can be used to divide up documents? Are related documents stored together? Should multiple files be merged together?
- What is the user expecting to search over?
- When a naïve user approaches the search engine – keeping in mind that they have no insight into the physical mapping of the data – what granularity would they expect to be retrieved?
- Will the results be swamped?
- When displaying the results, will the results be swamped by a lot of small, generally similar documents?
- Will some documents be retrieved for all searches?
- Big files like dictionaries, indexes, tables, etc. will typically contain such a wide variety of matching terms that they will always be retrieved. Perhaps these files should be split into pieces?
- Will there be too much “cross talk” in the search?
- “Cross talk” is when you get a two word query where one word matches in one part of the document and another word matches in another, unrelated area of the document.
- For example, you could search for a name such as “Bill Fowler” which matches a single document which contains both “Bill Johnson” and “Debbie Fowler”.
- Dividing documents into smaller pieces (or using other embedded markers) can often help avoid cross talk issues.
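The "Bill Fowler" example above can be sketched in a few lines. The document structure and matching function are hypothetical simplifications (a real engine matches against an inverted index, not raw strings), but they show why section-level indexing eliminates the false hit:

```python
# Hypothetical sketch of the cross-talk fix: index each section of a
# large document as its own search unit, so a multi-word query matches
# only when all the words occur in the same section.
def split_into_sections(doc):
    """One index entry per section; a shared parent_id supports grouping."""
    return [
        {"parent_id": doc["id"], "section": i, "text": text}
        for i, text in enumerate(doc["sections"])
    ]

big_doc = {
    "id": "report-7",
    "sections": ["... Bill Johnson attended ...",
                 "... Debbie Fowler wrote ..."],
}
units = split_into_sections(big_doc)

def matches(unit, words):
    return all(w in unit["text"] for w in words)

# "Bill Fowler" no longer matches: neither section contains both words.
hits = [u for u in units if matches(u, ["Bill", "Fowler"])]
```

Had the whole report been indexed as one document, the query would have matched it, since "Bill" and "Fowler" both occur somewhere in its text.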
The following are some of the different types of document granularity organizations we’ve had to deal with:
- Magazines: Should you index the entire issue as a single entry, or each article separately?
- Generally it was felt that indexing each article made more sense and provided more semantically coherent units for search.
- The United States Code: This database has units at many different levels: Title, Chapter, Sub-Chapter, Part, Sub-Part, and Section.
- In this case, since people often cite the individual section (as in, “Title 35 USC Section 13.2”), it was decided to split up the file into the smallest possible unit at the section level.
- Parts Catalog: Should each “product type” (for example, “Handlebars”) be indexed, or each SKU (for example, “black handlebar for a Kawasaki 3200”)?
- Since SKUs are associated with prices and are more useful to the user (in other words, users are typically searching for handlebars for a specific bike), we recommended indexing at the SKU level.
- Job Search Application: Should we index a record for each applicant, or each résumé (each applicant can have multiple résumés)?
- Since recruiters are looking for people (not résumés), it was decided to index each applicant as a separate entry, not each résumé.
- Terrorist Names or Aliases: How should we search for people names, when each person can have multiple aliases?
- In this case, since we were doing pattern searches over name fragments, we decided to index the aliases as separate documents. This reduces cross talk in the search expression, which is much more problematic with pattern searches than with full-text searches.
- Clutter in the search results was reduced with field collapsing (aka, results grouping).
Finally, when trying to determine the optimum granularity of the documents in your index, you’ll need to consider what search features are available from your search engine. Many of the issues described above (lack of joins, cross talk, swamping of search results) can be alleviated using special search engine features, including:
- Full XML searching (also known as “Scope Searching”) to preserve some embedded structure within the document
- Field collapsing / results grouping / duplicate removal
- Multiple fields
Commandment #5: Let There Be But a Single Purpose for Each of Thy Search Engine Fields
When doing search engine architecture, be really clear as to the purpose of each field. If you discover that a single field has multiple uses, consider splitting it into multiple fields.
This rule is intended to improve overall system flexibility. For example, what if a field is used for both X & Y? What should you do if a collection needs only X, or only Y? Or what if a collection of documents needs different values for X and Y?
The message here is that, over time, your search engine will be asked to handle many different document collections, of many different types.
Flexibility is key. Allow each document to express itself in the most natural fashion, and your system will be more accurate overall and will be better able to handle the diverse needs of your various user communities.
Example 1: A “collection code” is used for two purposes: 1) How documents are processed, and 2) How documents are grouped together into collections for end-user consumption.
Recommendation: Split this field into two fields: “Processing Code” and “Access Collection Code”
Example 2: “Title” is used to show the document title to the user in the search results, and also to boost documents which have query hits on “high value content”.
Recommendation: Split this field into two fields: “title” (for results display) and “grank1” (for holding high value content, which may include an edited version of the title plus content from other fields).
Example 3: A media company uses “asset type” to show what icon to display in the search results and also to determine what documents are searched for certain types of searches.
Recommendation: Split this field into two fields: “Search Type” (for controlling search results) and “Asset Type” (for displaying the icon).
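Example 1 can be sketched as a small document-processing step. The field names follow the recommendation above; the transform itself is a hypothetical illustration of how the split happens at processing time:

```python
# Hypothetical sketch of Example 1: one overloaded "collection_code"
# field is split into two single-purpose fields before indexing.
def split_collection_code(doc):
    code = doc.pop("collection_code")
    doc["processing_code"] = code         # drives how documents are processed
    doc["access_collection_code"] = code  # drives end-user collection grouping
    return doc

doc = split_collection_code({"id": "d9", "collection_code": "LEGAL"})
# The two fields start out identical, but can now diverge independently:
# a collection can be re-grouped for users without changing processing.
```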
In summary, remember that fields for search engines are used to control search. This is different than fields in relational databases which are really more about canonical data storage. We encourage using multiple fields for all of the situations described above.
Commandment #6: Ask Not Too Much of Thy Engine
How much customization of your search engine are you doing? If you are customizing it too much, this is an indication that you may be asking too much of your search engine.
At Search Technologies, we like to keep our search engines as close to “off the shelf” as possible. This leaves the search engine in its sweet spot, handling the tasks for which it was originally designed.
Another way to look at this is to ask yourself how many non-search functions are being handled by your search engine. Are you using the right tools for the job?
If you look at what search engines claim to be able to handle, you will get a diagram like this:
This is the “we can do it all” diagram where the search engine can handle all content acquisition, document processing, storage, and application interfaces.
A more realistic allocation might look like this:
In this second diagram, the search engine does indexing and search, and other parts of the application are handled using more appropriate tools:
- Document Processing Engine – For content acquisition and metadata normalization. These tools can handle high volume document processing, parsing, extraction, and other tasks. They tend to be much more flexible and have easier deployment strategies than putting this into the search engine itself.
- Repository – As mentioned previously, search engines are not good at document or data storage. An outside database (Content Management System and/or Relational Database) would better serve this function.
- Web Application Server – The end-user application tools provided by search engines are fine for simple user interfaces, but will likely disappoint for most UI customization tasks. Most users quickly outgrow these tools.
Remember that we are trying to do more than just create a system that works. We are also trying to create a system that:
- Can be replicated across multiple environments.
- Can be easily extended
- Uses best practices appropriate for each domain
Take, for example, the end-user interface. Several search engines provide XSLT as a method for customizing the off-the-shelf user interface. This is a perfectly reasonable approach for engines which return XML data.
However, while it is physically possible to create a wide range of user interfaces with just XSLT, serious interfaces will ultimately require a UI framework (Web Parts, Java Server Faces, etc.) which provides better componentization and more scalable development methodologies.
First, notice that there are only 6 commandments rather than the traditional 10. Search, for as long as it’s been around, is still a fledgling technology for which there are no universal standards. Maybe someday it will grow up to the point where it will have a full complement of 10 commandments (here’s hoping).
Along the same lines, let me say that almost never do I encounter a “standard” search engine implementation. The world is a very big place, and every customer has a unique set of user and business requirements and data structures.
Yes, it is certainly possible to just lump your data into a search engine with a default configuration and a default interface – but where’s the fun in that? You will end up with mediocre, confusing results that don’t match actual user expectations or move your business forward.
Instead, we recommend taking a deeper look at the data and the end-user story. A thorough analysis of these – right at the beginning of the project – will enable you to choose the best method for representing the data in the search engine. You will then have the foundation for a corporate-wide resource that can be truly glorious.