
How Are We Building Smarter Search Engines in the Big Data Age?

Commentary on Gartner's Article: 'Insight Engines' Will Power Enterprise Search That Is Natural, Total and Proactive

Paul Nelson
Innovation Lead

A Smarter Search Engine Starts with Understanding the “Exceptions”

Innovation, it seems, comes from unexpected places (after all, if it came from the expected places, then it wouldn’t be innovative, would it?). Recently, a number of things have come together to put search and search engines in a whole new light, and they have come from an unexpected place: exception handling.

[Figure: search engine exception handling]

When looking at a search engine in this way, you begin to see exceptions all over the place.

I’m not talking about software exceptions (such as Java Exceptions or Throwables), but rather exceptions as in “the exception to the rule.” In other words, how do we handle those rare (but often important) situations in which the standard operation of the search engine is not correct?

The graphic on the right illustrates the methods for handling “exceptions to the rule” inside search engines, that is, the ways of “fixing” certain queries or search results.

What’s happened to me recently is that I’ve started looking at all of these items as trying to solve the same problem. This realization has helped me look at the problem differently. The result of this thinking is a body of work, currently in progress at Search Technologies, to create tools and methods that combine all of these techniques into a single, holistic system which also opens up brand new vistas of enterprise search functionality.

And, quite incredibly, we ended up with a system for creating intelligent digital assistants for everyone.


Search as Your Digital Assistant:  Siri, Google Now, Cortana, and 'Insight Engines'

And why are we doing this? Because our customers want it.

[Figure: search as digital assistants]

For years now, customers have been asking for question answering systems like Siri. With Google Now and Cortana, these systems are becoming ever more widespread and therefore more in demand. And recently, Gartner has also started discussing 'Insight Engines,' a new technology that’s redefining the market around search, providing natural, total, and proactive search and insight discovery.

This just tells me that there’s really a groundswell for question answering systems.

And always, when asked about such systems, I’ve said “not now, maybe later.” Why? Because I was scared. I couldn’t see my way to a solution which was practical.

The key problem is one of domain understanding. Generic question answering systems (e.g. Siri, Google Now) only understand a very broad, generic domain. Things like movies, birth-dates, geography, etc. But that’s not what our customers want (whether they realize it or not).

After all, each of our customers wants to create a search application of their own world, whether it is search for intranet portals, e-commerce, recruiting, media & publishing, or public sector content. They have their own language, their own acronyms, their own business processes, their own way of doing things – and they expect their digital assistant to understand their unique domain and to answer questions like “Where is the TPS form?” or “How many widgets were sold to small businesses in EMEA last quarter?”

And so any question answering system would need to be heavily tuned (read:  enormously expensive) to be able to handle problems like this.

But now I’m thinking:  Yeah, okay. I think we can do this. Wow, I really think this is possible! 

And throughout my life, ever since my first search engine, a Natural Language Processing (NLP) based engine, I have always felt that understanding the query was the key to the highest possible quality search.


The New Natural Language Processing for Search Engines

I am an NLP person through and through. In graduate school I took NLP courses and created syntactic (and semantic) state machines and actually created software to implement transition networks for sentence diagraming, semantic analysis, anaphoric reference, chunking, segmentation, you name it. But all of those old techniques were so brittle and expensive that they never made it into the mainstream.

What’s changed is a new way to look at natural language processing – two levels of simplification – which simply matches text against a large database of patterns and, through this matching, creates understanding. It is a sort of RISC (reduced instruction set computing) form of NLP. Instead of trying for a deep understanding of sentence structure and internal meaning, we create large databases of patterns and match the query against those patterns.

For example:

  • “TPS” > “Tabular Performance Sheet” > FormType
  • “small business” > CustomerCategory
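To make the idea concrete, here is a minimal Python sketch of matching a query against a pattern database. It is only an illustration under stated assumptions, not the actual engine: the `PATTERNS` dictionary, the `tag_query` function, the plural "small businesses" entry, and the "EMEA" > Region entry are all hypothetical. Only the “TPS” and “small business” patterns come from the examples above.

```python
import re

# Hypothetical pattern database: surface phrase -> (canonical form, semantic tag).
# Only "tps" and "small business" come from the article; the rest are assumed.
PATTERNS = {
    "tps": ("Tabular Performance Sheet", "FormType"),
    "small business": ("small business", "CustomerCategory"),
    "small businesses": ("small business", "CustomerCategory"),
    "emea": ("EMEA", "Region"),
}

# Longest phrase (in words) in the database; bounds the lookup window.
MAX_PHRASE_LEN = max(len(phrase.split()) for phrase in PATTERNS)


def tag_query(query):
    """Return (matched text, canonical form, tag) tuples, longest match first."""
    tokens = re.findall(r"\w+", query.lower())
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in PATTERNS:
                canonical, tag = PATTERNS[phrase]
                matches.append((phrase, canonical, tag))
                i += n  # skip past the matched tokens
                break
        else:
            i += 1  # no pattern starts here; move on
    return matches


print(tag_query("Where is the TPS form?"))
print(tag_query("How many widgets were sold to small businesses in EMEA last quarter?"))
```

Note that there is no parsing of sentence structure here at all: the "understanding" comes entirely from dictionary lookups, which is why adding a new pattern is a data-entry task rather than an engineering one.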

And this is the trend of intelligent search systems today:

[Figure: old vs. new search systems]


Using Big Data to Create Patterns

But Paul, you say, how does this help? After all, you still need to create an enormous number of patterns manually. Isn’t that enormously expensive?

This is an entirely valid concern. Fortunately, there are several responses:

  1. Creating patterns is much less expensive because it no longer requires experts.
  2. Creating patterns is much less expensive because we have a spiffy new interface.
  3. Creating patterns for companies is less expensive than creating patterns for the world.
  4. You get immediate benefit with just a few patterns.
  5. We can use big data to create patterns!

So yes, many patterns may (ultimately) be required, but since each pattern is (vastly) less expensive to create, and we can get immediate benefit with just a few patterns, this makes this new system viable for all search engine users.

One might think that all of these methods for handling exceptions are antithetical to big data. After all, big data is all about assembling masses of data and performing broad statistical analysis on this volume to derive insights and algorithms to predict future behavior.

However, it has now become clear to me that the two methods work well together: We use big data to create a database of patterns.

Patterns can come from anywhere. They can be manually entered. They can be extracted (using text mining techniques) from content. They can be extracted from Wikipedia, Geonames, or Freebase. Or they can be derived from user queries. They can be extracted from our customers’ business systems. Or any combination of the above.

Many of these techniques require big data to handle large numbers of tokens, large query logs, etc. The outputs of these processes are patterns, dictionaries, tags, etc., which are input to the pattern matching engine and drive query understanding.
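As one toy example of deriving patterns from user queries, the Python sketch below counts word n-grams across a query log and keeps the frequent ones as candidate patterns for someone to review and tag. This is a simple stand-in that I am assuming for illustration, not the analytics we actually run: the `candidate_phrases` function, its `max_n` and `min_count` parameters, and the sample log are all hypothetical, and a real log would need a distributed job rather than an in-memory counter.

```python
import re
from collections import Counter


def candidate_phrases(queries, max_n=3, min_count=2):
    """Count word n-grams (1..max_n) in a query log; return the frequent ones."""
    counts = Counter()
    for query in queries:
        tokens = re.findall(r"\w+", query.lower())
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    # Keep only phrases seen at least min_count times, most frequent first.
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]


# Hypothetical query log, for illustration only.
query_log = [
    "tps form",
    "where is the tps form",
    "tps form download",
]

for phrase, count in candidate_phrases(query_log, min_count=3):
    print(phrase, count)
```

The frequent phrases that come out of a step like this are only candidates. They would still need cleansing and a tag (FormType, CustomerCategory, and so on) before they are useful to the pattern matching engine.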

One of our customers already has over 12 million patterns, again generated via big data analytics, manual cleansing, or a combination of the two.


How Will 'Insight Engines' Transform Search?

Our goal here, as always, is to transform the enterprise search industry. Everything we do at Search Technologies is with an eye towards moving the industry forward, and this is no exception.

We intend to use these ideas to make a huge step forward: towards truly intelligent search engines. This will enable all sorts of functionality which was not practical previously:

  • Question answering systems 
  • An interface to business systems 
  • Targeted e-commerce searches (read more on optimizing e-commerce search functionalities)
  • Intelligent digital assistants 


Making Search Easy for Users: Encoding the Language of Your World

Ultimately, of course, we are creating a digital understanding of your world. This digital understanding provides the bridge between language (queries, requests, actions, and content) and the business objects which make up everything that is pertinent to your company.

And isn’t that amazing? A computer which can talk your language? Which understands your requests and your needs? Slap a speech-to-text system on top and you have your very own, customized, personal digital assistant. Cool.

We believe what’s discussed here is just scratching the surface of what such a digital assistant enabled by search and big data analytics could possibly achieve.

Imagine the possibilities!

- Paul