The Unique Challenge of Searching for Names
Or How to Turn a Generic Search Engine into a World Class Directory Search
Recently I’ve been involved in several projects that require searching over lists of names, and so I’ve been thinking a lot about the unique challenge that is searching for names.
People love lists, and lists of things, especially lists of names (or “directories”), are very common throughout society. We have lists of TV shows, movies, terrorists, inventors, companies, and lots more. It’s clear that such searches are often the mission critical interface for many common applications and in some cases may be critical to society at large.
Why Document Search is Different to Name Search
Standard search engines are designed to search over documents and not names, and the two are very different. I often despair when I hear standard search engine vendors talk about directory search.
First of all, documents are much bigger – and size does matter. There’s lots more information in a document, which means that TF-IDF statistics (Term Frequency – Inverse Document Frequency, stalwart ranking criteria of many search engines) are much more reliable for documents than names. Further, a document has many more opportunities to match the user’s query than a name. For example, spelling errors in the document (as opposed to errors in the query, which is a different problem) are less of an issue, because a single word in a document is such a small percentage of the data.
The same cannot be said of name search. In a name, a single token misspelled affects a large proportion (or even all) of the searchable content. Imagine if more than half of the words in a document were misspelled! How would a search application cope with that?
Along the same lines, standard search engine statistics (such as the aforementioned TF-IDF) generally get in the way of name search. In document search, the standard thinking is that very frequent words, such as the word “very”, are relatively unimportant because they are used so frequently and in so many different ways. The same cannot be said of name search, where every word is critical. For example, even if “John” is the most frequent word in your name database, it is still just as important as “Rajeev”, because these words are an equally accurate representation of a person’s first name.
Humans love to discuss things, and the names of those things are so frequently used in conversation that we naturally start to abbreviate them to improve the bandwidth of our conversations. How long did it take before “International Business Machines Corporation” became IBM? Or “Federal Express” became “FedEx”? Probably no longer than an hour or two? Our penchant for abbreviations is not limited to company names, it includes Genres (SitCom, SciFi), people (Brangelina, Dubya), not to mention standard first name abbreviations such as Bob, Katy, Chris and Bill), movies (ST-TNG, LOTR), formats (CD, HD), TV series (Idol, DWTS, AYSTAFG) and tons of others. Names that contain these abbreviations may become unsearchable unless appropriate measures for handling these exception cases are incorporated into the system design.
It should also be noted that names are much more position sensitive than documents. Generally, although words at the beginning of regular documents (excluding the title) are slightly more important than words in the middle or the end, the difference is not that great – and depending on the type of search, words at the top may actually be less useful (as in more generic, introductory, or ambiguous) than words in the middle.
The opposite is true of names, where position is quite important. For corporations in particular, the words at the beginning of the name (FedEx, Staples, Wal-Mart) are much more important than the tokens at the end of the name (Inc, LLC, Limited, etc.). The same is true for people names if they are in the last-name-first format.
Position sensitivity extends to sub-string matches as well. People will tend to remember the beginning of a name much better than the end of a name. For example, when there’s a name such as “Hammarskjöld”, people will likely remember the “Hammars” part more than the “skjöld” part.
And finally, search results which are more sensitive to the start of the name will tend to “make more sense” to the user than results which are equally weighted across the name. For example, users will generally prefer a search on “star” which returns “Star Wars”, “Star Trek” and “A Star is Born”, over “Lone Star”, “Pawn Stars”, or “Survivor All Stars”.
Alchemy: How to Turn a Generic Search Engine into World Class Directory Search
I use the word alchemy intentionally, since there is something magical about creating a really good directory search engine for lists of names. Thanks to Google, I think the world has been trained for a certain kind of search result, and so when someone enters a name like “Bill Fowler” into a search engine and gets a name back like “William Fowlkers”, or you enter some random characters such as “despho” and you get back “Desperate Housewives,” I think the first reaction is: “Wow! How did it do that?”
At Search Technologies, we like transparency and we don’t like leaving relevancy scoring to chance. What I mean by this is rather than simply throwing the names at a search engine with “fuzzy” turned on, we generally prefer categorizing the type of matches and ensuring that each type retrieves documents properly before being combined together. Each category is carefully tested for accuracy and then weighted and combined with the other categories to create the complete search.
Therefore the term “alchemy” is appropriate in this context as well – to describe how we can mix various techniques in different concentrations and combinations to create the ideal directory-style search over names.
Before You Begin
The techniques outlined below require some standard search engine features which are available in most search engines today. Specifically:
- Query term weighting: This allows terms, phrases, and groups of terms within the query to be individually weighted
- Access to the indexing pipeline: Required to manipulate and ”over-index” names in a variety of ways before they are submitted to the indexer. This can also be accomplished with an external document processing framework
- 1 to N document indexing: An indexing feature enabling a single input document to be become multiple entries in the search index. Again, this often best accomplished using an external document processing framework to prepare the data prior to indexing
- Field Collapsing: When doing 1 to N document indexing, you’ll need a field collapsing feature to prevent multiple variations of the same name from being presented to the user
- Document and Field Weighting: This allows the incorporation of notions of document popularity into the relevancy ranking algorithm, so that ambiguous searches will retrieve the most popular variations first
Does your search engine have all of these features? Many of the leading search engines do. Do you know how to use them? Great! Then we’re all set to go.
Match Type #1: Exact Matches
So, what are the different types of matches? The first is the exact match. If the user enters the exact name, you should retrieve it. For example, “Star Wars” should retrieve “Star Wars”, over “Star Wars Retrospective” or “Star Blazers, Series 3: The Bolar Wars”. Similarly, “Meredith Nelson” is a completely different person than “Nelson Meredith”.
Match Type #2: Varying Combinations of the Same Tokens
The next type of match is a variety of token matches which mostly match the name. Generally, all tokens must be present in the name although they may be out of order, may have additional words in the name, etc. This type can be further broken down, for example, we tend to prefer names whose prefix matches the user’s query over other types, and names with a smaller number of additional terms beyond what was specified by the user should take preference over those with more additional terms (this is often handled automatically by search engines with the term-frequency statistic).
Match Type #3: Common Synonym Substitutions
The third type of match is to expand names with appropriate synonyms. For example, “Bob” to “Robert” or “Co” to “Company”. Nearly all types of names have these sorts of common abbreviations and a standard synonym dictionary – either applied as the document is being indexed, or applied as expansions to the query – will substantially improve the results.
However, care must be taken to prefer the original term where possible. For example “Bob Hope” is in such common usage that most people would think you were talking about an entirely different person if you said “Robert Hope”.
Comment: Spell Checking
Most search engines provide some sort of spell checking, at least for the user’s query. These tend to work by looking at the frequency of usage of the query terms in the database. If the user has entered a rare term, and a small spelling variation is more popular, then the system will recommend the more popular term (with “Did you mean?”).
Unfortunately, these spell checking algorithms usually fail when searching for names for a couple of reasons. First, they work against the notion of “mining the long tail”. In other words, sometimes it is those rare circumstances that are the most valuable – either to the user or to the database provider.
But perhaps most important, spell checking of the query does absolutely nothing for spelling variations in the documents. If there is a misspelled word in a name, that name will be, for all intents and purposes, completely hidden. Spell checkers will never suggest a misspelling to the user, and so such names can never be found.
And so, we’ll need to apply spell checking at index time, in order to make these sorts of documents accessible. Sometimes this can be done, but 1) these algorithms are often too slow to be performed at index time, and 2) they suffer from the “two pass indexing curse”. In other words, you may not know how to correctly spell a word until all of the names have been indexed – at which point it’s too late unless you implement a two-pass indexing algorithm.
Match Type #4: Substring Matching
So, rather than spell checking, we tend to recommend using substring matching, also referred to as “fuzzy spelling” or “N-Gram searches”.
This is where we take advantage of the size of names, which tend to be very small, like 10 to 50 characters, when compared to standard documents which are on average many times larger. Since the size of the name is so small, we can easily consider dicing up the names into pieces and indexing the pieces. This “over-indexing” will usually not impose a substantial impact on overall system resources.
Commercial-Off-The-Shelf (COTS) N-Gram or fuzzy spelling algorithms come in a wide array of implementations, and most of them are pretty crude and don’t give very good results. At Search Technologies, we like to use the following algorithm:
- Divide up the name into all possible combinations of overlapping 2-grams, 3-grams, and 4-grams. Index these into separate fields of the document
- Do the same with the query
- Weight the query N-Grams as follows: 1. Larger N-Grams (i.e. the 4-grams) are stronger than smaller ones (i.e. the 2-grams). 2. N-Grams closer to the start of the term are stronger than those closer to the end, and 3. N-Grams right at the start of the word should count for extra
- Further, if supported by your search engine, “N of M” matching operators can help improve performance and decrease the overall search space for sub-string matches. For example, the query could require that at least 60% of the N-grams are found in the candidate name before a match is declared.
The above algorithm gives very good results even if your search engine does not support fuzzy spelling.
You can further improve these algorithms with sound-alike n-gram variations, for example to substitute “ph” for “f” in all appropriate n-grams or “au” for “ou” and other similar substitutions. These expansions can be performed at index time, which will give you the benefit of the expansions without slowing down queries (much), although accuracy concerns may require that you index these variations into a separate field to ensure that exact substring matches come up higher in the search results.
Similar substitutions could be made for transliterations between characters of different character sets.
Match Type #5: Indexing of Aliases or Other Alternatives
Earlier we talked about handling name alternatives, for example “IBM” instead of “International Business Machines” or “SciFi” instead of “Science Fiction”. The best way to handle these alternatives in a true directory search application is to index each one as a separate entry. This has several advantages:
1. All of the above matching techniques can be applied to the alternative at an equal level of quality to the original – providing the highest possible accuracy when matching against alternatives.
2. Indexing alternatives as separate index entries eliminates “crossover”, where words from the main name and the alternative are accidentally put together. For example, “Terry Gene Bollea” is also known as “Hulk Hogan”. Indexing these alternatives as two separate index entries prevents searches on “Terry Hogan” or “Hulk Bollea” from being returned as valid results.
3. Crossover becomes much more critical when using substring matching. If the alternative and the main term are indexed into the same document, crossover of 2-gram, 3-gram and 4-gram patterns across alternatives will make substring matches worthless, unless the alternatives are indexed separately
Unfortunately, indexing alternatives in this way puts additional burden on your infrastructure and query features. First, you will need a way to index the main document and the alternatives as separate entries (this can be accomplished with a document processing framework which prepares the names before they are submitted to the indexer). Second, if there are changes/deletions to the main term, these changes will need to be propagated to all of the alternative terms, increasing index maintenance complexity. And finally, results will need to be collapsed on a main term ID, so that only one search result for a particular name will be retrieved.
Match Type #6: Popularity Weighting
Finally, we should discuss the impact of popularity on directory search applications. For consumer-related searches especially, most of the users will be searching for, roughly, the same sorts of items. After all, “American Idol” and “Doctor Who” are very popular TV shows, and so these are most likely to be requested by users searching for TV shows.
Therefore, directory search engines may need to include popularity weighting into the algorithm. Fortunately, most search engines have a way to provide a “static boost” or “document weight” into the index.
Naturally, it takes some experimentation to determine the proper boost for entries in the index based on popularity. Optimal methods for determining these strengths also exist. For example, one can use query and usage logs to determine which documents are the most popular, and adjust boost values accordingly. Alternatively, weights can come from third-party sources (for example, box office receipts), and then mapping that data to document boost can be done with an iterative process where one tries a particular boost formula and then statistically evaluates the results for a sample set of user queries.
Putting it All Together
To put it all together, concentrate on data processing and query processing as separate sub-systems in your final directory search application.
Data processing is required to perform synonym expansions, n-gram substring splitting, indexing of alternatives, and popularity weighting. Some search engines provide some of these features automatically, but most likely you will need a data processing framework of some sort to perform exactly the modifications that are required by your application prior to submission to the search engine.
And don’t forget that your data processing framework will need to handle updates, which (as discussed above) becomes complicated when there are multiple aliases stored as separate entries in the index for a single name.
On the query processing side, the user-entered query will need to be processed to implement exact matches, variable token matches, prefix matches, synonym matches (unless done at index time), and substring matches. Each of these different types of matches will need to be independently weighted so that the more explicit and accurate matches will be weighted over the more ambiguous and less accurate techniques. The end result will likely be a pretty large query of dozens of search terms, all properly weighted to give you the best possible return.
And when it all comes together, that’s when you get that “wow” moment. And for all of us at Search Technologies, those are the moments that make search such a fun place to be.