Improve Search with 10 Advanced Techniques You May Not Have Heard of
For your enterprise search, e-commerce site search, and other search-driven applications
Today’s organizations put a strong emphasis on data quality and retrieval because both are critical to delivering value to customers, improving operations, and increasing business value. With multiple data formats, sources, and security models in play, search is no longer important only to certain business segments; everyone needs it. All businesses will need search-driven applications, such as digital publishing, business research, web portals (both internal and external), online directories (of employees, products, parts, vendors, customers), e-commerce, and many more.
We’ve been working on search engines for a very long time. Along the way, we have discovered and refined a lot of advanced search engine features which can bring enhanced user experience and competitive advantages. We apply a formal, well-practiced process to ensure a complete understanding of an organization’s data and how it helps users work towards business goals.
Here are 10 advanced search engine tuning techniques that companies often either overlook or find very challenging to implement. But with proper setups, these can provide tremendous improvements to search and data analytics.
1. Index-time joins
Having trouble denormalizing your documents so they can be searched efficiently? Denormalization is often the only way to get sub-second search over very large tables. Do you have multiple streams of updates from multiple sources that need to be fused together? Search Technologies has created the staging repository to solve these problems. It performs index-time joins and ensures that the denormalized record is always kept up to date, no matter which input changes.
The staging repository will receive multiple streams of input from multiple tables, and will automatically join and reprocess records as each input stream changes.
For example, suppose you have two tables: Employees and Business Groups. Every employee works within a number of business groups, and so these two tables are typically joined together and denormalized to allow users to search over employees using Business Group attributes.
The staging repository will automatically join these two tables at index time, so a set of denormalized search engine indexes is maintained. Any change to an employee will be joined with that employee’s latest business group attributes. Any change to a business group will automatically re-process all affected employees. It is super cool.
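The employee and business group example can be sketched in a few lines of Python. This is only an in-memory illustration of the idea, not the staging repository itself; the class name `StagingRepository`, its methods, and the document layout are all hypothetical.

```python
from collections import defaultdict


class StagingRepository:
    """Toy in-memory stand-in (hypothetical) for an index-time join."""

    def __init__(self):
        self.employees = {}               # employee_id -> {"name", "group_ids"}
        self.groups = {}                  # group_id -> group attributes
        self.members = defaultdict(set)   # group_id -> employee_ids (reverse lookup)
        self.index = {}                   # employee_id -> denormalized document

    def _reindex(self, emp_id):
        # Join the employee with the latest attributes of each of its groups.
        emp = self.employees[emp_id]
        self.index[emp_id] = {
            "id": emp_id,
            "name": emp["name"],
            "groups": [dict(self.groups[g]) for g in emp["group_ids"]
                       if g in self.groups],
        }

    def upsert_employee(self, emp_id, name, group_ids):
        # Forget old memberships so later group changes do not touch this employee.
        old = self.employees.get(emp_id)
        if old:
            for g in old["group_ids"]:
                self.members[g].discard(emp_id)
        self.employees[emp_id] = {"name": name, "group_ids": list(group_ids)}
        for g in group_ids:
            self.members[g].add(emp_id)
        self._reindex(emp_id)

    def upsert_group(self, group_id, **attrs):
        # A group change re-processes every employee that belongs to the group.
        self.groups[group_id] = {"id": group_id, **attrs}
        for emp_id in sorted(self.members[group_id]):
            self._reindex(emp_id)
```

The reverse lookup from group to employees is what makes the second direction cheap: when a business group changes, only its members are rebuilt rather than the whole index.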
As more data warehouses and large business systems move to search, index-time joins are increasingly becoming a critical system requirement.
2. Custom aggregations
Search engines today are being used for real-time analytics, an area which was previously handled by relational databases and OLAP (On-Line Analytical Processing). Search engines can provide business analytics in seconds (rather than hours) and are much more flexible, vastly more scalable, and easier to use.
Ever wish there was an aggregation statistic you could get from the search engine which is simply not available? We can build it for you based on your unique data sources and data structures. Modern search engines allow for plug-ins to create custom aggregations. We have experience in creating these aggregations, including highly scalable, distributed aggregations.
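To give a feel for what makes an aggregation distributable, here is a minimal Python sketch of a variance statistic. Each shard builds a small partial state, and the partial states are merged into a final answer. It does not use any particular engine's plug-in API; the `VarianceState` class is a hypothetical name, and variance is just an example statistic.

```python
from dataclasses import dataclass


@dataclass
class VarianceState:
    """Partial state that one shard can compute on its own (hypothetical)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0   # running sum of squared deviations from the mean

    def add(self, x):
        # Welford's single-pass update for one new value.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Combine two partial states without revisiting the raw values.
        if other.count == 0:
            return VarianceState(self.count, self.mean, self.m2)
        if self.count == 0:
            return VarianceState(other.count, other.mean, other.m2)
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return VarianceState(n, mean, m2)

    def variance(self):
        # Population variance of everything seen so far.
        return self.m2 / self.count if self.count else 0.0


def aggregate(shards):
    """Run the per-shard step, then reduce the partial states into one."""
    total = VarianceState()
    for values in shards:
        state = VarianceState()
        for v in values:
            state.add(v)
        total = total.merge(state)
    return total
```

The key design point is that `merge` needs only the three numbers in each state, so the amount of data sent between nodes stays constant no matter how many documents each shard holds.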
3. Normalized and custom relevancy ranking math
A relevancy score can be very helpful in determining how well your engine is performing. But the scores produced by standard formulae (like TF/IDF) typically vary from 0.0 to 25.0 or more (maybe even a lot more), making it difficult to judge changes in accuracy.
To help solve this problem, Search Technologies has completely re-invented relevancy ranking math so that all values are normalized from 0.0 to 1.0. This makes displaying and comparing results, especially across queries, much easier.
But normalized scoring is just the first step. Search Technologies can create custom relevancy ranking mathematics for any purpose. This can be used for a wide range of special needs, such as date biasing, popularity ranking, personalization, machine learned relevancy, and statistical spam detection.
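The exact math is not described here, so the following Python sketch is only one possible illustration, not the Search Technologies formula. It assumes a simple saturation function to bound a raw score and an exponential half-life for date biasing. The function names, the `pivot`, the 30-day half-life, and the `date_weight` are all hypothetical.

```python
def normalize(raw_score, pivot=5.0):
    """Map a raw score in [0, inf) into [0, 1); a raw score equal to pivot maps to 0.5."""
    if raw_score <= 0:
        return 0.0
    return raw_score / (raw_score + pivot)


def recency(age_days, half_life_days=30.0):
    """Date bias in (0, 1]: 1.0 for a brand-new document, halving every half-life."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def combined_score(raw_score, age_days, date_weight=0.2):
    """Weighted blend of two bounded signals, so the result also stays within [0, 1]."""
    return ((1.0 - date_weight) * normalize(raw_score)
            + date_weight * recency(age_days))
```

Because every input to the blend is already bounded, further signals such as popularity can be added the same way without pushing the final score outside the 0.0 to 1.0 range.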
4. Create custom search operators
Are there advanced, custom search operators you’d like to have that are not provided by the search engine? We have done lots of them, including XML search, bounded wildcards, distance ring sorting (sorting geographic results into logarithmic bins: 0-5 miles, 5-10 miles, 10-50 miles, 50-100 miles, 100-500 miles, etc.), and many other special cases.
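As an illustration, distance ring sorting can be sketched in Python as follows. The ring edges come from the example above; the function names and the `(doc_id, distance_miles, relevance)` tuple layout are assumptions made for this sketch.

```python
from bisect import bisect_right

# Upper edges of the rings from the example: 0-5, 5-10, 10-50, 50-100, 100-500 miles.
RING_EDGES_MILES = [5, 10, 50, 100, 500]


def ring(distance_miles):
    """Return the ring number: 0 for under 5 miles, 1 for 5 up to 10, and so on."""
    return bisect_right(RING_EDGES_MILES, distance_miles)


def sort_by_ring(hits):
    """Sort hits by ring first, then by descending relevance within each ring.

    Each hit is a (doc_id, distance_miles, relevance) tuple (assumed layout).
    """
    return sorted(hits, key=lambda hit: (ring(hit[1]), -hit[2]))
```

Within a ring, exact distance is deliberately ignored, so a highly relevant result 4 miles away outranks a weak one 3 miles away.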
5. Translate queries from one search engine to another
This is very helpful for maintaining legacy applications when you move from one search engine to another. Doing this properly will help you preserve application search features that have worked well for your user community and will save you from having to program new query functionality from scratch.
We have sophisticated query parsers which parse the original query into an intermediate, engine-agnostic structure, and query builders which then translate that intermediate structure into the destination query language. These tools have eliminated months of development effort from many search engine migration projects.
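The parse-then-build approach can be sketched in Python as below. The legacy syntax here (`field=value` terms joined by `&` and `|`) is invented for the example, and the `Term` and `Bool` classes are hypothetical. The two builders emit a Lucene-style query string and an Elasticsearch-style `bool` query body. A real translator would also need escaping, nesting, ranges, and much more.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Term:
    field: str
    value: str


@dataclass
class Bool:
    op: str                  # "AND" or "OR"
    clauses: List["Node"]


Node = Union[Term, Bool]


def parse_legacy(query: str) -> Node:
    """Parse the invented legacy syntax; '&' binds tighter than '|'."""
    def parse_term(text):
        field, value = text.strip().split("=", 1)
        return Term(field.strip(), value.strip())

    def parse_and(text):
        parts = [parse_term(p) for p in text.split("&")]
        return parts[0] if len(parts) == 1 else Bool("AND", parts)

    parts = [parse_and(p) for p in query.split("|")]
    return parts[0] if len(parts) == 1 else Bool("OR", parts)


def to_lucene(node: Node) -> str:
    """Build a Lucene-style query string (no escaping of quotes in this sketch)."""
    if isinstance(node, Term):
        return f'{node.field}:"{node.value}"'
    joined = f" {node.op} ".join(to_lucene(c) for c in node.clauses)
    return f"({joined})"


def to_es(node: Node) -> dict:
    """Build an Elasticsearch-style bool query body as a plain dict."""
    if isinstance(node, Term):
        return {"match": {node.field: node.value}}
    key = "must" if node.op == "AND" else "should"
    return {"bool": {key: [to_es(c) for c in node.clauses]}}
```

Because both builders consume the same intermediate structure, supporting another destination engine means writing one more builder while the parser stays untouched.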
6. Show exactly where memory is being used, and how much
For systems with extreme scalability (billions of records), memory usage is critical. There are tools and methods for determining memory usage and allocation down to the field level.
What data fields and search functions use up the most memory in your search engine? How can you change your queries, index structure, search engine configuration, and search behavior to conserve memory?
This function is critical to ensure your search engine’s peak performance and uptime whenever massive scalability is required.
7. Relevancy improvement
We spend a lot of time looking at query results and asking ourselves: “Does this look good? Is it better than before? Why or why not?”
When working to improve search engine accuracy, you need to double (and triple, and quadruple) check every query, every structure, and every statistic. We have found dozens of problems and bugs throughout search systems, such as:
- The query structures weren’t quite right
- The weights weren’t set
- Issues with date mathematics
- A bug in the lemmatizer
- Acronym lemmatization
- Random thesaurus expansion issues
- Document security issues
- … and so on
Constantly testing and measuring search engine scores helps us refine those query structures and address issues that cause poor performance.
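One simple way to make "is it better than before?" measurable is to score each query against a set of judged-relevant documents and compare runs. The sketch below assumes precision at k as the metric, which is a common choice and not necessarily the one used in our own process. The function names and data layout are hypothetical.

```python
def precision_at_k(results, relevant, k=10):
    """Fraction of the top k positions filled by documents judged relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in results[:k] if doc_id in relevant)
    return hits / k


def compare_runs(judgments, before, after, k=10):
    """Per-query change in precision@k between two runs.

    judgments: query -> set of relevant doc ids
    before/after: query -> ranked list of doc ids returned by each configuration
    """
    return {
        query: precision_at_k(after.get(query, []), relevant, k)
        - precision_at_k(before.get(query, []), relevant, k)
        for query, relevant in judgments.items()
    }
```

A negative value for any query flags a regression worth inspecting by hand, even when the average across all queries improves.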
View our on-demand webinar for a full discussion (and live demo) of relevancy ranking algorithms.
8. Dynamic, searchable fields
Wouldn’t it be nice to update a field in RAM without re-indexing the document? This is quite possible and we have done it for several customers. We do this by maintaining the field bit lists in RAM and providing REST endpoints to the search engine for updating these bit lists with zero latency. Flags are stored in a NoSQL store for persistence and reliability.
Dynamic fields are perfect for features such as read/unread flags (per user), search result folders (per user), ‘hide this document’ (in the search results, per user), dynamic inventory availability, dynamic status updates, etc.
Dynamic field flags can be searched and returned in search results.
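The bit list idea can be illustrated with a small Python sketch. It keeps one bit per document for each (user, flag) pair, using a Python integer as the bit list. The REST layer and the NoSQL persistence are left out, and the `DynamicFlags` class and its method names are hypothetical.

```python
class DynamicFlags:
    """In-memory per-user flags, one bit per document number (hypothetical)."""

    def __init__(self):
        self._bits = {}   # (user, flag) -> int used as a bit list

    def set(self, user, flag, doc_num, value=True):
        # Flip a single bit; the indexed document itself is never touched.
        key = (user, flag)
        bits = self._bits.get(key, 0)
        mask = 1 << doc_num
        self._bits[key] = (bits | mask) if value else (bits & ~mask)

    def is_set(self, user, flag, doc_num):
        return bool((self._bits.get((user, flag), 0) >> doc_num) & 1)

    def filter(self, user, flag, doc_nums, want=True):
        """Keep only the documents whose flag equals `want`, e.g. unread ones."""
        return [d for d in doc_nums if self.is_set(user, flag, d) == want]
```

For example, after `flags.set("alice", "read", 3)`, calling `flags.filter("alice", "read", [1, 2, 3], want=False)` returns the documents Alice has not read yet.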
9. Computing and saving document enrichments
Do you want to enrich your documents before they are indexed? Do you want to save these enrichments someplace where they won’t get thrown away? Do you want to compute these enrichments offline, in the background? A staging repository can do all that.
The staging repository maintains separate “scopes” where enrichments can store data. These scopes are safely maintained in the staging repository even as the document is updated or changed.
Typical enrichments can include:
- Extracting text from images using Optical Character Recognition (OCR)
- Semantic document vector analysis
- Complex entity extraction
- Document semantic analysis (using offline techniques such as semantic co-occurrence analysis)
- Categorization and clustering
- ... and so on.
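The scope idea can be sketched in Python as follows. The point is only that source updates and enrichment results live in separate places, so re-ingesting a document does not throw away work that was computed offline. The `StagedDocument` class, its methods, and the scope names are hypothetical.

```python
class StagedDocument:
    """Toy staged record with separate enrichment scopes (hypothetical)."""

    def __init__(self, doc_id, content):
        self.doc_id = doc_id
        self.content = dict(content)   # fields from the source system
        self.scopes = {}               # scope name -> enrichment output

    def update_content(self, content):
        # A source update replaces the content only; the scopes survive.
        self.content = dict(content)

    def put_enrichment(self, scope, data):
        # Each enrichment writes to its own scope, so they never collide.
        self.scopes[scope] = data

    def to_index_document(self):
        # Merge source fields and all saved enrichments at index time.
        doc = {"id": self.doc_id}
        doc.update(self.content)
        doc.update(self.scopes)
        return doc
```

A background job could call `put_enrichment("ocr", ...)` hours after ingestion, and the next indexing pass would pick it up through `to_index_document()`.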
10. Fuzzy name searching
Do you need to find names, even when they are incorrectly or poorly spelled? This is a common use case for patents, bank accounts, no-fly lists, terrorist databases, and master data models.
Unfortunately, standard search engines are designed to search over documents, not names, and the two are very different. In a name, a single misspelled token can affect a large proportion (or even all) of the searchable content.
Your search engine can be configured to index all of the pieces of the name and provide excellent fuzzy name matching over enormous databases of names. Rather than simply throwing the names at a search engine with “fuzzy” turned on, we prefer categorizing the types of matches and ensuring that each type retrieves documents properly before the types are combined. Each category is carefully tested for accuracy and then weighted and combined with the other categories to create the complete search.
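A rough Python sketch of the categorize, weight, and combine idea is shown below. The three categories, their weights, and the 0.8 similarity threshold are all assumptions made for illustration, and `difflib.SequenceMatcher` from the standard library stands in for whatever fuzzy comparison a production system would use.

```python
from difflib import SequenceMatcher

# Hypothetical category weights; in practice each would be tuned and tested.
WEIGHTS = {"exact": 1.0, "fuzzy": 0.8, "initial": 0.6}


def token_match(query_token, name_token):
    """Classify how one query token matches one name token."""
    if query_token == name_token:
        return "exact", 1.0
    if len(query_token) == 1 and name_token.startswith(query_token):
        return "initial", 1.0
    ratio = SequenceMatcher(None, query_token, name_token).ratio()
    if ratio >= 0.8:
        return "fuzzy", ratio
    return None, 0.0


def name_score(query, candidate):
    """Average, over query tokens, of the best weighted match in the candidate."""
    query_tokens = query.lower().split()
    name_tokens = candidate.lower().split()
    if not query_tokens:
        return 0.0
    total = 0.0
    for q in query_tokens:
        best = 0.0
        for n in name_tokens:
            category, strength = token_match(q, n)
            if category:
                best = max(best, WEIGHTS[category] * strength)
        total += best
    return total / len(query_tokens)
```

Keeping the categories separate means each one can be tested on its own, for instance checking that initials match before worrying about how misspellings are weighted against them.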
Read more about fuzzy name matching methodology.
Almost never do I encounter a “standard” search engine implementation. Every organization has unique user needs, business requirements, and data structures. Search engines are such a widely applicable and complex technology that I believe most businesses lose a lot of the practical, everyday advantages by not having a search engine consultant. The functionalities discussed here are just scratching the surface of all the things that can be done to make search better. Having a search expert to help identify what’s needed for your unique search engine needs will decrease risk, reduce re-work, save costs, increase search quality, and drive business value.