Analytics, Search, and the Data Driven Organization
Search, Personalization, Relevancy Ranking, and Predictive Analytics All Play a Part in a Data-Driven Organization
So I started this blog entry on “creating the data driven organization” thinking that I was inventing something super cool, only to discover that there’s already an O’Reilly book on the subject, and a ton of industry analyst research. Apparently, it is this year’s hot business trend.
Oh well. It won’t be the first time that I've ‘invented’ something which had already been invented (better) by someone else. And so instead, I’d like to tie this notion of a data driven organization to something more concrete and practical: How to get better search results.
But first, let’s talk about analytics, personalization and relevancy ranking, and why it’s so important to be a data-driven organization.
The Process Cycle of the Data Driven Organization
It has taken me a while, but I believe I can reduce my thoughts to a cyclical process of four words: Gather, Explore, Predict, Interact.
This process captures my vision of how businesses, business processes, and consumer interfaces will operate in the future. Note that I will use “consumer” to represent both "employee" (for internal facing interactions) and "customer" (for external facing interactions).
In addition, this process captures my view of how search engine projects should progress, and it is through search engine relevancy ranking (which is really a combination of predictive analytics + real-time consumer interaction) that I have come to understand this process more fully, and appreciate its value to businesses of all types.
Gather = Logs & Data Warehouse
The goal of this step is to gather and normalize as many little pieces of information about the consumer (employee or customer) as possible from any possible source.
The buzzword today is log analytics, but I’m less interested in discussing logs (which are merely carriers of event information) than I am in talking about business events. These business events provide valuable information about the consumer, which must be captured and saved so it can be leveraged to optimize the company/consumer interface.
It is hard to overstate the value of gathering all data about the consumer, no matter how inconsequential it may seem at first. This is discussed later in this blog.
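In practice, “gather and normalize” often boils down to mapping every raw record, whatever system it came from, into one common business-event schema. Here is a minimal sketch in Python; the field names are hypothetical and would be mapped to whatever your systems actually emit:

```python
from datetime import datetime, timezone

def normalize_event(raw: dict) -> dict:
    """Normalize a raw log record into a common business-event schema.
    The input field names ("ts", "user_id", "session", ...) are
    assumptions -- adapt them to your actual log formats."""
    return {
        "timestamp": raw.get("ts") or datetime.now(timezone.utc).isoformat(),
        "consumer_id": raw.get("user_id") or raw.get("uid"),  # who acted
        "session_id": raw.get("session"),                     # activity grouping
        "event_type": raw.get("action", "unknown"),           # what happened
        # keep everything else -- seemingly inconsequential data included
        "detail": {k: v for k, v in raw.items()
                   if k not in {"ts", "user_id", "uid", "session", "action"}},
    }

event = normalize_event({"ts": "2015-06-01T12:00:00Z", "user_id": "u42",
                         "session": "s9", "action": "view", "sku": "A-100"})
```

Note that the `detail` catch-all deliberately preserves fields we don’t yet understand; the Explore phase decides later which of them carry signal.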
Explore = Analytics & Visualization
Once you have the data, you need to explore, visualize and analyze it.
This is where tools such as Kibana and R are helpful, to gain insights about the consumer which can be revealed by the data, and to explore theories about the consumer which may lead to further insights or actionable predictions.
This “explore” phase emphasizes flexibility and data mining. The goal is to identify signals embedded in the data which are highly correlated to consumer needs and consumer insights.
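Identifying “highly correlated” signals can start as simply as computing a correlation coefficient between a candidate signal and a success outcome. A toy sketch with made-up data (in a real project this is where R or a notebook environment earns its keep):

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical data: a candidate signal per session vs. whether the
# session ended in a purchase (1) or not (0).
signal  = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
outcome = [1,   1,   0,   1,   0,   0]
r = pearson(signal, outcome)  # strongly positive -> worth investigating
```

A high correlation on a toy sample proves nothing by itself, of course; it is the cue to dig further, not a finished prediction.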
Predict = Relevancy Formula
Now that we have a better handle on the user signals embedded in the data, we can use these signals to make predictions. Typically the goal is to create predictions which lead to success, such as products purchased, useful content delivered, support questions answered, satisfied customers, jobs filled, resources matched, successful projects completed, etc.
Big data and machine learning are critical to this process, but so are common sense and careful evaluation and analysis. It is far too easy to throw a table of signals at a prediction algorithm and just “let the magic happen.” It takes a dedicated and critical mind to evaluate and test the algorithms to guard against spurious predictions and over-fitting.
The result of the prediction phase is a formula which predicts success. In the search world, this is called a “relevancy formula”, which attempts to predict which content is relevant to the user, based on inputs (aka signals).
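At its simplest, such a formula is just a weighted combination of signals that produces a predicted-relevance score per document. A sketch, where the signal names and weights are purely illustrative (in practice the weights come out of the Predict phase, via regression or a learning-to-rank method):

```python
def relevancy_score(signals: dict, weights: dict) -> float:
    """Weighted sum of per-document signals -> predicted relevance.
    Signal names and weight values here are illustrative assumptions."""
    return sum(weights.get(name, 0.0) * value
               for name, value in signals.items())

weights = {"text_match": 1.0, "popularity": 0.4, "recency": 0.2}
doc_a = {"text_match": 0.8, "popularity": 0.5, "recency": 0.1}
doc_b = {"text_match": 0.6, "popularity": 0.9, "recency": 0.9}
# doc_b's weaker text match is outweighed by popularity and recency
```

Real relevancy formulas are rarely this linear, but even production models reduce to the same shape: signals in, one comparable score out.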
Interact = Search & Personalization
The final step is to put the new formula into action. At Search Technologies, we have often done this by plugging custom operators directly into the search engine, which, in effect, converts the search engine into a large-scale matching engine (you need access to the search engine source code to achieve this, which is why open-source engines such as Solr or Elasticsearch work well). It is through the real-time search engine that the user interacts with the prediction formula. As the user progresses through the system, their inputs are gathered and provided to the search engine, which returns relevant, personalized results that help guide them through their navigation or purchase activities.
For employees, it is the engine which matches their needs against available content and answers with the goal of providing exactly what they need to do their job, when they need it. In some scenarios (recruiting, customer support), the engine may be integrated with the work activity itself, providing potential answers at the exact moment when they’re needed the most.
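To make the “interact” step concrete, here is a toy re-ranker that blends an engine’s base score with a per-user affinity learned from gathered events. Everything here is a sketch with invented field names; in the projects described above this logic would live inside a custom Solr or Elasticsearch scoring operator rather than in application code:

```python
def rerank(results, user_profile, boost=0.5):
    """Re-rank results by blending base relevance with per-user
    category affinity. 'category', 'base_score', and the boost factor
    are all illustrative assumptions."""
    def personalized(r):
        affinity = user_profile.get(r["category"], 0.0)
        return r["base_score"] + boost * affinity
    return sorted(results, key=personalized, reverse=True)

results = [
    {"id": "d1", "category": "camping", "base_score": 1.0},
    {"id": "d2", "category": "cycling", "base_score": 0.9},
]
# A user whose gathered events show a strong cycling interest:
ranked = rerank(results, {"cycling": 0.8})  # d2 overtakes d1
```

Doing this inside the engine, rather than post-hoc in the application, is what lets personalization scale to the full result set instead of just the first page.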
Search is Everywhere
What’s amazing to me, as a search person from the ‘90s, is how search plays a substantial role in almost every phase of this cycle.
From Relevancy Ranking to Predictive Analytics
I have recently been spending more and more time on big data projects for search engine accuracy evaluation (see my white paper Search Accuracy Analysis), and on creating operators for new relevancy ranking models.
An important thing that I’ve always understood (but never explicitly stated, in print) is that relevancy ranking is a form of predictive analytics.
The goal of search engine relevancy ranking, after all, is to produce documents (or products, or web pages) which are relevant to the user. In this sense, the search engine is attempting to predict what documents are relevant to the user at a particular point in time, and based on a particular set of inputs.
What has changed recently is that new big data, statistical analysis and machine learning techniques are revolutionizing the art of relevancy ranking, and turning it into a statistically-valid data science. This is one reason why it is so amazing to be working with search engines right now – during this revolution in big data and statistical analysis.
But even more important is how relevancy ranking and predictive analytics are being tied, more and more, to creating business value:
- It determines what products are shown --> Customer purchases
- It connects people to things they need --> Customer satisfaction
- It provides content of interest --> Customer engagement
Oftentimes I meet with business owners who nod sympathetically when I try to explain how important relevancy ranking is to their customer satisfaction scores and their bottom line. But clearly I haven’t figured out how to communicate this message effectively, because frequently, projects for engine scoring, metrics analysis, and big data predictive analytics then languish in the “wish list” category.
This is frustrating, because the simple truth is:
Money spent on relevancy ranking and predictive analytics will be returned 10x.
A few years ago, my thinking was: “Relevancy is important, but people don’t do it because it’s too expensive to achieve.” We’ve been working on reducing the costs of engine scoring and relevancy improvements (Open Source helps because we can put new relevancy models directly into the search engine). Today we have more tools, big data frameworks, and processes, so it’s getting less expensive and more do-able every day.
So today, it is probably not a cost issue, and further, the ROI is as clear as ever. Perhaps the problem is simply that predictive analytics is a new and, frankly, "nerdy" subject. Here are two recent customer projects where a full ROI was achieved within a few months. When I explain predictive analytics and relevancy ranking improvements, I say:
“We did this for customer X and they realized a 7.5% improvement in conversion rate on queries which gave them $4.5m in additional revenue for a $350K investment, an ROI of 12.8x."
"For customer Y we improved their system which realized a 6% gain in sales which garnered $480m in additional revenue for an investment of $2m, delivering an ROI of 240x.”
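The ROI figures above are nothing more exotic than additional revenue divided by investment, which is easy to verify:

```python
def roi(additional_revenue: float, investment: float) -> float:
    """ROI as a simple revenue/investment multiple."""
    return additional_revenue / investment

roi_x = roi(4_500_000, 350_000)      # customer X: ~12.8x
roi_y = roi(480_000_000, 2_000_000)  # customer Y: 240x
```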
But then I get blank looks and statements like “oh, that’s very interesting,” and I leave the room frustrated by my inability to communicate the urgency.
The world is being re-invented around us, and big data, machine learning and predictive analytics (all inputs to relevancy ranking) are leading the charge. Now is the time to start these projects!
Anyway, that’s why I write these blogs: to try to spread the word.
But then the data is missing
When business owners are eventually persuaded, we quickly move to a second source of frustration.
"Let's do a big data project!" they say.
“Great!” I respond. “Where’s your data?”
“Uh…” they respond. “I think that Jeremy might have some? Doesn’t he?”
The problem is that companies are not, historically, data gathering organizations. And so we discover, time and time again, that organizations simply don’t have the data needed to understand the consumer (employee or customer).
Every project created today should have data gathering as a first-tier requirement. Inside the data is an understanding of the user. With that understanding we can serve them better, increase revenue, reduce expenses, and make happy and productive consumers.
Really, just about every event and interaction should be captured, including:
- Activity on the website (pages viewed, items viewed)
- Activity on other websites and web pages
- Clicks on advertisements, referral links, partner arrangements, etc.
- Purchase activity from digital sources and brick-and-mortar sources
- Knowledge of group membership, where available, and understanding of group activity
  - For example, if the user is known to be a member of an organization, social group, office group, ISP, etc.
- Activity from external sources (social media, public information)
- Activity on other business systems and applications
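Making data gathering a “first-tier requirement” mostly means instrumenting every interaction to emit a record like the one below. The schema is a suggestion, not a standard; the essential point is that page views, clicks, purchases, and support questions all end up somewhere analyzable, such as an append-only JSON-lines log:

```python
import json
import time

def make_event(event_type, consumer_id=None, session_id=None, **detail):
    """Build one consumer-event record. Field names are a suggested
    convention -- the point is that every interaction gets recorded."""
    return {
        "ts": time.time(),          # when it happened
        "type": event_type,         # page_view, click, purchase, ...
        "consumer_id": consumer_id, # employee or customer identifier
        "session_id": session_id,   # fallback grouping key
        "detail": detail,           # everything else, kept verbatim
    }

# e.g. one line per event appended to events.jsonl for later analysis:
event = make_event("page_view", consumer_id="u1", session_id="s9",
                   page="/tents")
line = json.dumps(event)
```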
And then the data is incomplete
The next challenge concerns log data. People create a lot of logs, but unfortunately, most of the time, they don't contain user IDs (or the IDs are inside the POST data and never reach the server that does the logging).
The fact that an event occurred is much less powerful when you don’t know who initiated the event. A user ID (an encrypted or hashed one is fine) is the most useful; a session ID is next best. Only when we can group events into activity sets can we gain insight about the user, to use for personalization.
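Grouping events into activity sets is a straightforward fold over the log once an identifier is present; events with neither a user ID nor a session ID are simply lost to personalization work. A sketch, using the same hypothetical field names as above:

```python
from collections import defaultdict

def group_into_sessions(events):
    """Group events into per-user activity sets. Prefer a (hashed)
    user ID; fall back to a session ID; events with neither cannot
    be attributed and are set aside."""
    sessions = defaultdict(list)
    unattributed = []
    for e in events:
        key = e.get("user_id") or e.get("session_id")
        if key:
            sessions[key].append(e)
        else:
            unattributed.append(e)
    return dict(sessions), unattributed

events = [
    {"user_id": "u1", "action": "search", "q": "tents"},
    {"user_id": "u1", "action": "click", "doc": "d7"},
    {"session_id": "s2", "action": "search", "q": "bikes"},
    {"action": "search", "q": "anonymous"},  # no ID: lost to analysis
]
sessions, lost = group_into_sessions(events)
```

The size of `lost` relative to `sessions` is itself a useful health metric for your logging pipeline.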
The problem is that traditionally, logs have been used for justifying expenditures. “We made the changes, and website traffic was up 40%, and abandonment rates were down 5%!”
These are useful aggregate numbers for getting general ideas, but not for understanding user patterns in ways which can be leveraged for personalization, relevancy ranking, or matching.
And so, make sure your logs and data contain all of the obvious metadata fields, especially identifiers (see below).
And make sure your logs are accessible. Recently we’ve encountered problems where logs are locked up in third-party servers (Omniture, Site Catalyst) and are not easily downloaded for the purposes of big data predictive analysis.