The Big Data Revolution Hits E-Commerce Search

I’ve recently had the great privilege of working with some iconic brands on their eCommerce search platforms, and what we’re putting together with these companies is truly revolutionary. 


E-Commerce Search:  The current system is broken

When one steps back and takes a good look at eCommerce search today, the process is ripe for re-engineering. 

The process, in a nutshell, looks like this:

  1. Find some queries which aren’t performing well
  2. Fix them by manually choosing what to display instead

This is a lot like running around a leaky house fixing drips by placing pots and pans to catch the water.

There are all sorts of problems with this approach:

  • Humans are biased and they don’t really know if their fix will work

When a human looks at a query to fix, they are making an assumption about the intent of the customer. This assumption may be correct, or it may be completely off. And when it’s wrong, it's a big deal because…

  • The query fix goes right to production

This is a most astounding situation, really. Query fixes in these systems are often pushed straight to production. And testing a query fix in a staging environment is really no better, because the only so-called ‘test’ is that the fix works as expected, not that the fix actually improves conversion. Currently the only way to know the fix is working is to …

  • Watch how the query fix behaves in production

Watching how users react in the production environment is the only way (currently) to know if the query is actually ‘fixed’ or not. And even if it is working today, it may not work tomorrow because…

  • It’s a changing world

New products are released, and terminology is coopted for different purposes. The world keeps moving and this means constantly maintaining this enormous database of manual query adjustments. This takes a lot of effort, which means only the most frequent queries get the proper attention. Which makes me wonder…

  • What about the long tail?

In these systems, the long tail often accounts for 40-60% of all queries. With the manual query-fixing process, there is only time to fix the most common queries. This means that 40-60% of the query traffic gets no attention at all.

Of course, there are times when manually setting query results to handle truly exceptional cases is needed, especially for key brand-related queries and for new product launches. But for everything else, the exception is the rule. This is a problem. Administrators are ‘fixing’ eCommerce search by trying to improve the queries, one-by-one. How exhausting!

Let’s start by measuring accuracy off-line and not in production

And let’s start off by making an obvious, but necessary statement:

Queries do not buy products. People buy products.

I’ve been railing for some time now about this over-focus on the query. Honestly, I don’t really care about the query; it is merely the carrier of intent (and an imperfect one at that). What I care about is the person behind the query – are they satisfied? Do they get to the products they want?

So let’s understand the person.

The great thing about the Big Data revolution is that we now have techniques and methods to process the massive amount of log data involved, and to truly understand the person. We can track their every click, every purchase, every ‘add-to-cart’ and every search. This is a gold-mine of information which can be leveraged to test our queries and our search engine modifications, before changes are pushed to production.

And so, we can test the query by looking at the users who execute the query. What things did they find interesting? What products did they purchase? What products were purchased by other, similar users? And so we make a simple statement: Before we fix a query, let’s at least check to see if the products we show are likely to be of interest to the users who executed the query. Isn’t that simple? So why is no one doing it? The data is available, it just needs to be captured and used intelligently.

Similarly, we can test the engine by sending every query to the engine and seeing if it provides products that users actually want in the search results. It really is that simple. The process is to go user-by-user. Check each query executed by the user against the products the user has purchased (or looked at). If the engine returns products the user wants, this is a win for everyone, and the scores go up. See my analytics whitepaper for more information on the details of this process. 
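As a rough illustration of this user-by-user check, here is a minimal Python sketch. The `search` callable, the shape of the log data, and the scoring rule (the fraction of searches whose top-k results contain a product the user bought or viewed) are all assumptions made for the example; the whitepaper's actual scoring may differ.

```python
from typing import Callable, Dict, List, Set

def offline_accuracy(
    user_queries: Dict[str, List[str]],   # user id -> queries the user ran
    user_products: Dict[str, Set[str]],   # user id -> products bought or viewed
    search: Callable[[str], List[str]],   # hypothetical engine call: query -> ranked product ids
    k: int = 10,
) -> float:
    """Fraction of (user, query) pairs whose top-k results contain a wanted product."""
    hits = 0
    total = 0
    for user, queries in user_queries.items():
        wanted = user_products.get(user, set())
        for query in queries:
            total += 1
            if wanted.intersection(search(query)[:k]):
                hits += 1
    return hits / total if total else 0.0


# Toy stand-in for the engine under test (hypothetical data).
def toy_search(query: str) -> List[str]:
    index = {"red shoes": ["p1", "p2"], "laptop bag": ["p9"]}
    return index.get(query, [])

score = offline_accuracy(
    {"u1": ["red shoes"], "u2": ["laptop bag", "red shoes"]},
    {"u1": {"p2"}, "u2": {"p7"}},
    toy_search,
)
# Only u1's search surfaced a product that user wanted: 1 hit out of 3 searches.
```

Because the score is computed entirely from logs, the same function can be run before and after an engine change to compare the two without touching production.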

It’s time to ditch web analytics for search

And now, forgive me, but I just want to take a moment to bitch about web analytics.

Web analytics is sucking the life out of your valuable data

I’m sorry, but it’s true. Your user activity data is so incredibly valuable, and standard web analytics tools (Adobe SiteCatalyst, Google Analytics, sorry guys) take this valuable data and process it so hard that what comes out is practically useless. Web analytics turns filet mignon into hamburger by grinding it up into useless statistics. It really does.

I want your raw logs and raw event data.

I don’t want your SiteCatalyst/Omniture/Google Analytics data. That stuff is useless to me (if you can even export it). What I want is your raw log data. Tell me what every user searched and what every user clicked – with user ID, and when. With this data, I can craft a search engine that gives people what they want, and then I can test it – off line, to ensure that when we put it into production, it’s going to work. 

Use Machine Learning to improve all queries for all users

When we look at queries, we have this graph in mind:

In this graph, the number of times a query is executed in a given period (e.g. the frequency of the query last month) is plotted on the graph, and all queries are sorted by this number. The graph shows that a small number of queries get executed many times (the queries on the left), and that a very large number of queries only get executed a few times (the ‘long tail’).

With standard eCommerce practice today, it’s only the queries in the green shaded area which get attention. This means that no one, human or machine, is really looking at improving the queries in the long tail, and these queries can account for 40-60% of your query traffic.

But that’s not the only problem, for there are plenty of queries even in the green area which do not work for users. By “not work”, I mean that the user entered the query and didn’t click on anything or purchase anything:

For these queries (sometimes called ‘abandoned queries’), the user entered a query, didn’t receive the results they wanted, and then did something else (perhaps executed another query). Abandoned queries will often account for 70% of all common queries.

Does anyone ever look at abandoned queries? Not in standard eCommerce shops. What usually happens is that the marketing team checks the overall conversion of a query and if it’s reasonable (e.g. 1-2%) then they move on to the next query.

This can mean that 80-90% of all queries executed are ignored by standard eCommerce practice in typical eCommerce shops. 
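To see where your own traffic falls, both numbers can be computed straight from raw search events. The sketch below is a minimal example under assumed inputs: the `SearchEvent` record and the `head_size` cut-off (how many top queries count as "common") are hypothetical, and a real log would need sessionization first.

```python
from collections import Counter
from typing import Dict, List, NamedTuple

class SearchEvent(NamedTuple):
    query: str
    clicked: bool     # did the user click any result for this search?
    purchased: bool   # did the user purchase from this search?

def traffic_report(events: List[SearchEvent], head_size: int) -> Dict[str, float]:
    """Share of searches in the long tail, and share of searches abandoned."""
    total = len(events)
    if total == 0:
        return {"tail_share": 0.0, "abandoned_share": 0.0}
    counts = Counter(e.query for e in events)
    # The "head" is the head_size most frequently executed queries.
    head = {q for q, _ in counts.most_common(head_size)}
    tail_searches = sum(1 for e in events if e.query not in head)
    abandoned = sum(1 for e in events if not (e.clicked or e.purchased))
    return {
        "tail_share": tail_searches / total,
        "abandoned_share": abandoned / total,
    }

events = [
    SearchEvent("shoes", True, False),
    SearchEvent("shoes", False, False),
    SearchEvent("shoes", True, True),
    SearchEvent("mens waterproof trail shoes 11", False, False),
    SearchEvent("blue suede loafers", True, False),
]
report = traffic_report(events, head_size=1)
# 2 of 5 searches are long-tail, and 2 of 5 searches were abandoned.
```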

Enter Machine Learning and Big Data

So how can Big Data and machine learning help?

1.  Big data can analyze every query entered by every user

  • We don’t just look at common queries, we look at all queries
  • We also consider queries which returned no good results (abandoned queries)

2.  We use other activity by the user to help determine what to return

  • What else did the user click on? (via browse, or for other queries)
  • We can use this information as a guide to determine what to show in the future

3.  We can automatically show the best documents for common queries

  • For common queries, we can put the most popular (most viewed or purchased) products by users who executed the query at the top of the results
  • Alternatively, we can put at the top the products where the probability of purchase multiplied by the margin is the highest

In other words, promote those products with the greatest probability of returning the most revenue.

  • Note that it needs to be top products for the users who executed the query, not simply the top products overall.
  • Unfortunately, this only works for common queries (those which occur over 15 times, typically), because these are the only ones which have sufficient data to provide accurate results

4.  We can create an optimized formula to predict probability of click 

(or probability of add-to-cart, or probability of purchase, or a combination thereof)

  • This prediction is done with machine learning predictive analytics
  • This technique automatically improves all queries, including long tail and rarely executed queries 
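Item 3 above (promoting the products with the highest probability of purchase multiplied by margin, for users who ran the query) can be sketched in a few lines of Python. The input layout and the function name are hypothetical, and the purchase probability is estimated naively as purchases divided by impressions; the 15-execution threshold is the one mentioned in the text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

MIN_QUERY_COUNT = 15  # from the text: below this there is too little data

def top_products_by_expected_margin(
    impressions: List[Tuple[str, str, bool]],  # (query, product id, purchased?)
    query_counts: Dict[str, int],              # query -> number of executions
    margin: Dict[str, float],                  # product id -> margin per sale
    query: str,
) -> List[Tuple[str, float]]:
    """Rank products shown for `query` by P(purchase) * margin."""
    if query_counts.get(query, 0) <= MIN_QUERY_COUNT:
        return []  # not a common query; leave ranking to the engine
    shown: Dict[str, int] = defaultdict(int)
    bought: Dict[str, int] = defaultdict(int)
    for q, product, purchased in impressions:
        if q == query:  # only users who executed this query count
            shown[product] += 1
            bought[product] += int(purchased)
    scored = [(p, bought[p] / shown[p] * margin.get(p, 0.0)) for p in shown]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical data: A converts better, but B earns more per impression.
impressions = (
    [("tent", "A", True)] * 2 + [("tent", "A", False)] * 2
    + [("tent", "B", True)] + [("tent", "B", False)] * 3
)
ranked = top_products_by_expected_margin(
    impressions, {"tent": 20}, {"A": 10.0, "B": 40.0}, "tent"
)
# B scores 0.25 * 40 = 10.0 and A scores 0.5 * 10 = 5.0, so B ranks first.
```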

There are key advantages to these techniques over standard eCommerce practices:

  • They are automatic.

The amount of manual adjustment of query results is substantially reduced

  • They work the same for all languages

This is critical for multi-national companies that need to scale their shopping experience to four, five, or twenty-five languages

  • They analyze all queries

Including all abandoned queries, and all long-tail queries

  • They consider the individual needs of all users.

Every user who executes a query will be considered 

But how does Machine Learning do all this?

Machine learning is, essentially, a massive pattern recognition machine. It recognizes patterns in huge tables of data, and automatically creates optimized formulas to best predict whether a product will be clicked or purchased.

And so, the process for using machine learning to optimize search for eCommerce is this:

1. Gather a massive table of data:

  • The columns are numbers or category codes for products, queries, users, or comparisons
  • Every row is a document shown to the user for a particular query (i.e. every document returned by every query executed by every user over some period of time)

2. Include “clicked” or “purchased” columns.

  • For every row, add a column which has “1” if the user ultimately clicked on the document
  • Add another column which has “1” if the user purchased the item

3. Spend a bunch of time making sure all of the data in the table is as accurate and complete as possible

  • This step typically requires 80% of the effort

4. Run a bunch of machine learning algorithms against the table

  • Different algorithms are better suited for different types of data, and so it’s often best to just experiment to determine the best algorithm
  • Typical algorithms include decision trees, support vector machines, and logistic regression
  • More complex algorithms (the so-called ‘meta’ algorithms) include random forests and AdaBoost
  • One must be careful to choose algorithms which can be feasibly implemented in a search engine (after all, this is the ultimate goal)

5. Test the algorithm

  • This is done by randomly dividing the data into a “training” set (usually 75% of the data) and a “test” set (usually 25% of the data).
  • For every algorithm computed by predictive analytics, fit it on the training set and then see how well it works on the held-out test set
  • Keep experimenting until you find the best algorithm that correctly predicts if the user will click on the document 
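The steps above can be sketched end to end in plain Python. This is only a toy under stated assumptions: the table is synthetic (two made-up signals per row, with "clicked" derived from their sum), and logistic regression is fitted with a hand-written gradient descent loop so the example has no dependencies. A real project would more likely use a machine learning library and far more signals.

```python
import math
import random
from typing import List, Tuple

Row = Tuple[List[float], int]  # (signals, clicked)

def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights: List[float], bias: float, signals: List[float]) -> float:
    """Predicted probability of a click for one row of signals."""
    return sigmoid(bias + sum(w * s for w, s in zip(weights, signals)))

def train(rows: List[Row], epochs: int = 500, lr: float = 1.0):
    """Fit logistic regression with full-batch gradient descent."""
    n_signals = len(rows[0][0])
    weights, bias = [0.0] * n_signals, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_signals, 0.0
        for signals, clicked in rows:
            error = predict(weights, bias, signals) - clicked
            grad_b += error
            for j, s in enumerate(signals):
                grad_w[j] += error * s
        bias -= lr * grad_b / len(rows)
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias

def accuracy(rows: List[Row], weights: List[float], bias: float) -> float:
    correct = sum((predict(weights, bias, s) >= 0.5) == bool(c) for s, c in rows)
    return correct / len(rows)

# Steps 1-2: a synthetic table of signals plus a "clicked" column.
rng = random.Random(0)
table: List[Row] = []
for _ in range(200):
    signals = [rng.random(), rng.random()]
    table.append((signals, int(signals[0] + signals[1] > 1.0)))

# Step 5: random 75% / 25% split into training and test sets.
rng.shuffle(table)
cut = int(len(table) * 0.75)
training_set, test_set = table[:cut], table[cut:]

# Step 4: fit on the training rows only, then score on the held-out rows.
weights, bias = train(training_set)
test_accuracy = accuracy(test_set, weights, bias)
```

Scoring on rows the model never saw during fitting is what makes the number a fair estimate of how the formula will behave on new queries.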

The following is what the table of data would look like:

In the above diagram, the data in the table are called “signals” because, ideally, they ‘signal’ whether or not a user will ultimately click or purchase a product. But don’t let the word ‘signal’ throw you (it’s a data scientist term). Signals are merely numbers or codes which describe the product, the query, the user, or how well the query matches the product.

Signals for eCommerce machine learning fall into one of four categories:

1. Product Signals

Could include:  Total revenue for the product, total popularity, product category, product release date, price, weight, shipping cost, document content vectors, etc.

These signals are indexed along with every product in the search engine index.

2. Query Signals

Could include:  Query size, query popularity, popularity of the terms in the query (across all queries and across all documents), query type (i.e. query intent), etc.

These signals are computed for the query and provided along with the query.

3. Comparison Signals

Could include:  The search engine organic score, vector comparisons (between the user's query and the document), percentage of the query found in the document, minimum distance between all query terms in the document, etc.

It’s also useful to include the position of the document in the search result set as a column of the table (this is only included for search results clicked by the user). This helps normalize out the tendency of users to simply click on the first result in the query.

These signals are computed for every document found by the search engine.

4. User Signals

Could include:  How long the user has been on the site, the geographic location of the user, clusters to which the user is assigned, vectors of recent user activity, the referring page (or query, or advertisement) which brought the user to the site, and any other information gathered about the user (age, gender, etc.) which is available for analysis.

For machine learning, these signals are gathered from log analysis and other user data.

In a live system, these signals will be computed in real-time as the user clicks around the system, and then provided along with the query to the search engine when the user does a search.

Not all signals are required. Generally, you start off with the easiest signals to gather, and then see how well they work, and then continue adding signals and tuning the engine as more signals become available and as you get a deeper understanding of user behavior.

In addition, the machine learning process itself will identify the signals that are the most useful. Those signals which are not useful will not be incorporated into the resulting algorithm, and they can then be removed and deprecated, saving time and effort. 
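To make the idea of a signal concrete, here is a small Python sketch that computes two of the comparison signals listed above: the percentage of the query found in the document, and the minimum distance between query terms (taken here to mean the shortest window of tokens containing every query term). The whitespace tokenizer and the function names are assumptions for illustration only.

```python
from typing import Dict, List, Optional

def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer; a real system would reuse the engine's analyzer.
    return text.lower().split()

def query_coverage(query: str, document: str) -> float:
    """Fraction of distinct query terms that appear in the document."""
    terms = set(tokenize(query))
    if not terms:
        return 0.0
    return len(terms & set(tokenize(document))) / len(terms)

def min_span(query: str, document: str) -> Optional[int]:
    """Length in tokens of the shortest window containing every query term,
    or None if some query term is missing from the document."""
    terms = set(tokenize(query))
    tokens = tokenize(document)
    if not terms or not terms.issubset(tokens):
        return None
    best = len(tokens)
    counts: Dict[str, int] = {}  # query terms inside the current window
    left = 0
    for right, tok in enumerate(tokens):
        if tok in terms:
            counts[tok] = counts.get(tok, 0) + 1
        # Shrink the window from the left while it still covers all terms.
        while len(counts) == len(terms):
            best = min(best, right - left + 1)
            drop = tokens[left]
            if drop in counts:
                counts[drop] -= 1
                if counts[drop] == 0:
                    del counts[drop]
            left += 1
    return best

description = "waterproof hiking boots for men with leather trim"
row = {
    "query_coverage": query_coverage("leather boots", description),
    "min_span": min_span("leather boots", description),
}
# Both terms are present; "boots ... leather" spans 5 tokens.
```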

But what does Machine Learning produce?

The goal of machine learning is to produce a formula. This formula combines all of the data signals in such a way as to predict the likelihood that the user will click on the document after they do a specified search.

The formula can be insanely complicated, with all sorts of decision branches, logarithms, etc., but (if done correctly) it will be the optimal formula: the one most likely to bring the best documents to the top of the results for any query.

And what do we do with this formula?

We put it into the search engine to replace the standard relevancy formula (this will typically require creating a custom relevancy scoring plug-in).

The standard relevancy formula in search engines (TF-IDF) was created in the 1970s, and has stayed pretty much the same all the way up until today. The standard formula is an ad-hoc formula not based on any sort of statistical rigor.

Armed with Big Data, we can use the steps above to compute an optimal relevancy formula which is tuned for your users and your data. Once we have this formula, we plug the new relevancy formula directly into the search engine and when we do, the engine will produce the best possible results (based on available data).

Note that, since the engine will compute the formula on all documents which match the query (which could be tens of thousands, if not millions), it will need to be very efficient. This often places constraints on the type of formula you can have.
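As a sketch of what a cheap formula looks like at query time, the Python below scores every matching document with a simple weighted sum of signals and keeps only the top k. The weights, the bias, and the signal names are invented for the example (a real formula would come out of the training step), and an actual engine would run this inside its scoring plug-in rather than in Python.

```python
import heapq
import math
from typing import Dict, List, Tuple

# Hypothetical learned weights and bias, for illustration only.
WEIGHTS = {"organic_score": 2.0, "popularity": 1.0, "query_coverage": 1.5}
BIAS = -3.0

def click_probability(signals: Dict[str, float]) -> float:
    """One multiply-add per signal, then a logistic squash."""
    z = BIAS + sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

def top_k(matches: Dict[str, Dict[str, float]], k: int) -> List[Tuple[str, float]]:
    """Score all matching documents lazily and keep the k best."""
    scored = ((doc_id, click_probability(sig)) for doc_id, sig in matches.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

matches = {
    "d1": {"organic_score": 0.2, "popularity": 0.1, "query_coverage": 0.5},
    "d2": {"organic_score": 0.9, "popularity": 0.8, "query_coverage": 1.0},
    "d3": {"organic_score": 0.5, "popularity": 0.9, "query_coverage": 0.5},
}
best_two = top_k(matches, k=2)
# d2 scores highest, then d3; d1 is dropped.
```

A weighted sum like this costs a handful of arithmetic operations per document, which is why linear formulas are a common fit for this constraint; `heapq.nlargest` avoids sorting the full match set.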

Handling ‘null’ queries / Bridging the language gap

“Stop right there,” you say. “Everything you’re talking about involves re-ranking documents which come back from the search engine. What about results which are missing? What about NULL queries, i.e. queries which return zero results? How do we make those better?”

And you would be absolutely correct. NULL queries and missing results are typically symptoms of what I’m starting to call the “language gap”:

When  {user’s language} ≠ {product description language}  =>  {language gap}

Fortunately, there are several ways in which Big Data can help bridge the language gap between the user’s language and the product description language: 

1. Evaluate product performance

An easy first step to improving the language gap is to write better descriptions or enhance product descriptions with search keywords. But for which products?

And so we need some way to evaluate performance from a product perspective. Fortunately, the search accuracy analytics process (see my analytics whitepaper) can be mined to find poorly performing products:

  • Products which never (or rarely) appear in the search results
  • Products which never (or rarely) are clicked
  • Products which appear in the results more frequently than average, but are clicked less frequently than average (we call these “spoilers”, because they use up more than their fair share of real-estate in the search results)
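Finding spoilers is a simple aggregation once impressions and clicks are counted per product. The Python sketch below assumes those two counts are already available as dictionaries (the names are hypothetical) and interprets "less frequently than average" as a raw click count below the mean; comparing click-through rates would be an equally reasonable reading.

```python
from typing import Dict, List

def find_spoilers(impressions: Dict[str, int], clicks: Dict[str, int]) -> List[str]:
    """Products shown more often than average but clicked less often than average."""
    if not impressions:
        return []
    n = len(impressions)
    avg_shown = sum(impressions.values()) / n
    avg_clicked = sum(clicks.get(p, 0) for p in impressions) / n
    return sorted(
        p for p, shown in impressions.items()
        if shown > avg_shown and clicks.get(p, 0) < avg_clicked
    )

# Hypothetical counts: B dominates the results pages but is rarely clicked.
spoilers = find_spoilers(
    impressions={"A": 100, "B": 900, "C": 200},
    clicks={"A": 30, "B": 5, "C": 40},
)
```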

If you have the actual content of the product descriptions in your Big Data machine along with the user logs, there are some useful content --> query metrics you can generate:

  • Words in product descriptions which are never (or rarely) in queries
  • Product descriptions which have the largest percentage of rarely searched words
  • Product descriptions which have too many frequently searched words (spoilers)

And finally, if you have product performance from external sources (e.g. from external financial systems for brick-and-mortar sales, or sales from other eCommerce sites), you can create some additional product performance metrics:

  • Products returned in search results at a much lower percentage than external sales
  • Products returned in search results at a much higher percentage than external sales (spoilers)
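One way to express this comparison is as a ratio between each product's share of search impressions and its share of external sales. The sketch below is a minimal Python example with hypothetical inputs; what counts as "much lower" or "much higher" is left as a judgment call for whoever reads the ratios.

```python
from typing import Dict

def share(counts: Dict[str, float]) -> Dict[str, float]:
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def exposure_ratio(
    search_impressions: Dict[str, float],  # product id -> times shown in results
    external_sales: Dict[str, float],      # product id -> units sold elsewhere
) -> Dict[str, float]:
    """Search share divided by external sales share, per product.
    Well below 1: under-exposed in search. Well above 1: a possible spoiler."""
    s, e = share(search_impressions), share(external_sales)
    return {p: s.get(p, 0.0) / e[p] for p in e if e[p] > 0}

ratios = exposure_ratio(
    search_impressions={"A": 50, "B": 950},
    external_sales={"A": 500, "B": 500},
)
# A sells as well as B elsewhere but gets a small slice of search exposure.
```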

Note that all of these techniques will work for all languages, a critical criterion, but only as long as the text tokenization and text processing algorithms are not locked inside the search engine (Open Source search engines are useful here).

2. Use query chains

Users will also tell you, by the queries they execute, how to bridge the language gap. One way to do this is with query chains.

Basically, if you evaluate the list of queries the user executes (i.e. the query chain), across many search sessions, chains which occur many times can help you “short-circuit” the query process and therefore help bridge the gap.

For example, the user may search for:  

‘Ravens quarterback’ [no results] -->

‘Baltimore quarterback’ [no results] -->

‘Flacco’ [got results, got click]

(Joe Flacco plays the quarterback position for the Baltimore Ravens American Football team)

If the above sequence (or its sub-sequences) occurs frequently, then we can bridge the gap between “Ravens quarterback” and “Flacco”, so that when the user searches the former, we automatically include the results for the latter.
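Mining these chains from session logs can be sketched as follows in Python. The session format (an ordered list of queries with a flag for whether the user clicked) and the `min_count` cut-off are assumptions for the example; the idea is simply to count how often a failed query is followed, in the same session, by a query that succeeded.

```python
from collections import Counter
from typing import Dict, List, Tuple

Search = Tuple[str, bool]  # (query, got a click?)

def mine_query_chains(
    sessions: List[List[Search]], min_count: int = 2
) -> Dict[Tuple[str, str], int]:
    """Count (failed query -> later successful query) pairs within sessions."""
    pairs: Counter = Counter()
    for session in sessions:
        failed: List[str] = []
        for query, clicked in session:
            if clicked:
                # Link every failed query in this chain to the one that worked.
                for earlier in failed:
                    pairs[(earlier, query)] += 1
                failed = []  # the chain ends at the first success
            else:
                failed.append(query)
    return {pair: n for pair, n in pairs.items() if n >= min_count}

full_chain = [
    ("ravens quarterback", False),
    ("baltimore quarterback", False),
    ("flacco", True),
]
short_chain = [("ravens quarterback", False), ("flacco", True)]
bridges = mine_query_chains([full_chain, full_chain, short_chain])
# "ravens quarterback" -> "flacco" is seen 3 times,
# "baltimore quarterback" -> "flacco" twice.
```

The resulting pairs could then feed a rewrite table so that a search for the first query also includes results for the second.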

3. Fuzzy spelling

A lot of the mismatch between user language and product language can be attributed to spelling mismatches. Often this is taken care of inside of the search engine, but its handling of this is imprecise. Big Data can do a better job:

a. Queries executed sequentially which are similarly spelled

b. Query words which are similarly spelled to content words

c. Content words which are similarly spelled to query words

The point is that doing this sort of analysis in a large batch job inside a Big Data framework can do a much better job than a real-time search engine spell checker can do. If done in Big Data, we can afford to implement a more accurate, longer-running algorithm.

We like N-Gram analysis (which counts the number of shared 2-grams, 3-grams, prefix-2-grams, etc.) for finding similarly spelled items. The results of this analysis would be written to dictionary files which will automatically augment the queries with alternative spellings.
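A minimal version of this n-gram comparison is sketched below in Python. The specific choices are assumptions for illustration: character 2-grams by default, start and end markers so that shared prefixes and suffixes count, the Dice coefficient as the similarity measure, and a threshold of 0.75. The all-pairs loop is fine for a toy, but a batch job over real vocabularies would need an index over n-grams.

```python
from typing import Dict, List, Set

def char_ngrams(word: str, n: int = 2) -> Set[str]:
    padded = f"^{word.lower()}$"  # markers so prefixes and suffixes count
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over character n-gram sets, from 0.0 to 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

def spelling_dictionary(
    query_words: List[str], content_words: List[str], threshold: float = 0.75
) -> Dict[str, List[str]]:
    """Map each query word to similarly spelled content words."""
    result: Dict[str, List[str]] = {}
    for q in query_words:
        matches = [
            c for c in content_words
            if c != q and ngram_similarity(q, c) >= threshold
        ]
        if matches:
            result[q] = matches
    return result

alternatives = spelling_dictionary(
    query_words=["flaco", "jersy"],
    content_words=["flacco", "jersey", "helmet"],
)
# "flaco" maps to "flacco" and "jersy" maps to "jersey".
```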

4. Use semantic similarity metrics

One way to identify similar products is to create a content vector (list of weighted terms) and then compare all product vectors to find similar products. This will work if you have reasonably-sized descriptions (i.e. a paragraph or two instead of just a sentence).

Once we know similar products by content, then when the user finds product X1, we can also show them similar-product-X2 (and vice-versa) in the search results. To avoid confusion, they could be styled together in the results (e.g. showing small thumbnails for “other similar products” below).

On the other axis (to help with spoilers), you can collapse similar products together into a single result as well, again using small thumbnails for “similar products”. 
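The vector comparison can be illustrated with a short Python sketch. Here the content vector is a plain term-frequency count and similarity is the cosine between two vectors; both are simplifying assumptions, and a production system would likely weight terms (for example by rarity) and avoid comparing every pair of products directly.

```python
import math
from collections import Counter
from typing import Dict

def content_vector(description: str) -> Dict[str, float]:
    """Term-frequency vector for a product description (unweighted)."""
    return {term: float(n) for term, n in Counter(description.lower().split()).items()}

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

x1 = content_vector("lightweight waterproof hiking boots")
x2 = content_vector("waterproof hiking boots with ankle support")
x3 = content_vector("stainless steel kitchen knife")

similar = cosine(x1, x2)    # three shared terms, so clearly related
unrelated = cosine(x1, x3)  # no shared terms
```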

5. Match to external content

And, finally, we can start bringing in external content to help with the language gap. In other words, go out into the web and find where users are talking about your products, ingest that content, and then match it up to your products.

Potential sources of external content can include:

a. Tweets about your products or product description text

b. Blogs about your products or product description text

c. Wikipedia pages about your products or product description text

d. Search results from public web sites on your product

Once ingested into your system you can start mining that content. For example, find public pages which match a poorly-performing query, find the products which the public pages reference, and then make the connection. 

The eCommerce Architecture of the Future

Putting this all together, we discover that our new Big Data centric architecture for eCommerce becomes quite simple:

And this illustrates the magic of Big Data analysis for eCommerce, namely the ability to fuse and perform analytics over widely varying types of content:  Products, user events, searches, external blogs, financial reports, etc. All of this data flows into the Big Data framework and is used to gain insights over products and users, and these insights are plugged into the search engine to provide the best possible results.

And did I mention that it works over all languages?

-- Paul