Big Data, Personalization and the No-Search of Tomorrow
The Future of Search is Personalization
We have long thought about personalization for search, but it has rarely been implemented.
I think the primary reason for this is that it has always felt so ad-hoc: tweaking relevancy ranking parameters based on the user’s location, past history, job title, etc. How would we know if we were making things better or worse? Maybe we would be providing documents that were based on personalization factors, thinking we were doing the user a favor, but instead we just got in the way?
But a lot has changed.
First, Google has started to personalize their searches. And whatever Google does, the rest of the search world follows.
Second, customers are now designing personalized results into their user interfaces. Personalization is becoming a requirement.
Third, we now have Big Data, which provides a statistically valid process for optimizing how results are personalized. We can now create the formulas and have confidence that they will improve the user’s experience and provide a positive ROI.
And finally, we are creating our own relevancy models and putting those models directly into the search engine (with custom search operators) so that the results being served from the engine are automatically sorted by probability of relevancy. While not available in all engines, these relevancy models have become increasingly powerful and sophisticated. No longer are they simple ADD & MUL operators with TF/IDF; now we have all sorts of functions and mathematics, including vector comparisons.
And so now we enter the age of personalized search, big data, and “matching engines”.
The Process to Personalization
Search Technologies has been using big data processing and machine learning to improve search results for a while now, but it is only recently that we have come to understand that these same techniques can be used for personalization. Further, we believe this elevates personalization from an ad-hoc, trial-and-error process into a statistically valid, data-driven process which can deliver consistent and measurable return on investment.
In our view, there are two steps towards implementing personalization.
Step 1: Use Historical Data to Generate Statistical Models
This step requires gathering historical data from all users and using this data within a big data framework to generate statistical models which predict the probability that a user will find an item (document, web page, product, destination, etc.) to be useful.
The process is cyclical with evaluation steps at every juncture:
“Signals” in the above process can include vectors, scalars, clusters, categories, etc. that represent (as much as possible) dimensions of personalization which correlate highly with the probability of interest. Note that generating good signals usually represents 90% or more of the total effort for the project.
Note also that the process is on-going, cyclical, and requires evaluation at every step. This is a data mining process which requires a deep understanding of the user population and the content offering of the system. Insights will lead to statistical models, which lead to more insights, which lead to better models in an on-going process of refinement and increased relevancy.
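To make Step 1 a little more concrete, here is a minimal sketch of fitting a statistical model to historical data. It assumes a plain logistic regression trained by gradient descent, which is only one of many possible model types, and the two signal names (topic similarity, recency score) and the tiny click history are made up for illustration. A real project would run this kind of fit inside a big data framework over far more data and signals.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, epochs=2000, lr=0.5):
    """Fit weights and bias by batch gradient descent on log loss."""
    n_features = len(rows[0])
    n_rows = len(rows)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of log loss with respect to the raw score
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n_rows for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_rows
    return weights, bias

def predict(weights, bias, x) -> float:
    """Estimated probability that the user finds the item useful."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical history: each row is [topic_similarity, recency_score],
# and the label records whether the user clicked the item (1) or not (0).
history = [
    [0.9, 0.8], [0.8, 0.3], [0.7, 0.9],
    [0.2, 0.4], [0.1, 0.9], [0.3, 0.1],
]
clicked = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(history, clicked)
p_high = predict(weights, bias, [0.85, 0.5])  # strong topic match
p_low = predict(weights, bias, [0.15, 0.5])   # weak topic match
```

The evaluation step the process calls for would then compare predictions like these against held-out interactions before any model is put into the engine.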
Step 2: Implement Personalization
Once valid, verified statistical models for personalization are available, they need to be used, in real time, to personalize the results for individual users. This is done as follows:
Ideally, all of these steps should occur as the user progresses through the system. Historical data is combined with data from the initial connection, and then every event (click, search, purchase, etc.) adds to the model and generates a new set of signals which are added to the search & match engine.
In practice, some signals may be too onerous to compute in real time and may need to be done in the background, for example, as a periodic batch process which prepares signals for the user’s next encounter with the system.
This is where the rubber meets the road. Generating high quality signals which correlate well with the user’s level of interest, and which can be computed in real time, is a non-trivial problem which requires ingenuity, patience, and constant evaluation.
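As an illustration of how every event can add to the user's signals, the sketch below keeps one weighted vector of tokens per user and updates it on each interaction. The class name, the decay factor, and the event weights are all assumptions made for this example, not a description of any particular engine.

```python
from collections import defaultdict

class UserProfile:
    """Weighted token vector updated on every event (hypothetical design)."""

    def __init__(self, decay: float = 0.9):
        self.decay = decay                 # assumed per-event fade factor
        self.weights = defaultdict(float)  # token -> current weight

    def record_event(self, tokens, event_weight: float = 1.0) -> None:
        # Fade older interests first, then boost the tokens seen in this event.
        for token in self.weights:
            self.weights[token] *= self.decay
        for token in tokens:
            self.weights[token] += event_weight

    def top_signals(self, n: int = 3):
        """Return the n heaviest (token, weight) pairs, heaviest first."""
        ranked = sorted(self.weights.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]

profile = UserProfile(decay=0.5)
profile.record_event(["camping", "tents"])                  # e.g. a click
profile.record_event(["tents", "stoves"], event_weight=2.0)  # e.g. a purchase
top_two = profile.top_signals(2)
```

An update this cheap can run on every click. Signals that are too expensive for this path would be prepared by the periodic batch process instead and merged into the profile at the user's next visit.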
A New Type of Search
In this new, hyper-personalized world, we are no longer doing traditional search where the user enters a few keywords, and all that we do is find documents matching those keywords.
Now we are constructing a series of “signals” and matching all content to these signals in real time. Signals can be:
- Weighted Vectors of tokens
- These can be lists of products, lists of items clicked, lists of categories, industry sectors, topical interests, etc.
- Items in the list are weighted to represent their estimated importance to the user.
- Simple single-dimensional scalars
- Time on the system, date and time of last interaction, number of interactions, interactions per second, etc. are available to all systems.
- Other implementations will have numbers such as salary, age, total product spend, average product spend per year, number of items purchased, etc.
- Complex, multi-dimensional values
- This category includes multi-dimensional factors such as location (latitude, longitude).
And, of course, the search terms entered by the user.
These signals are then combined into a prediction formula which is determined using statistical analysis in a big data machine. The formula combines the signals (often using non-linear methods such as decision trees or support vector machines) to create a single prediction of the probability of relevancy: a number between 0.0 and 1.0 which estimates the likelihood that the user will find the content relevant.
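Here is a rough sketch of how the three kinds of signals might be turned into features and combined into one probability. For simplicity it uses a logistic formula rather than the decision trees or support vector machines mentioned above, and the feature names, the coefficients, and the 100 km distance scale are invented for the example. In practice the coefficients would come from the statistical analysis, not from hand tuning.

```python
import math

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two weighted token vectors."""
    dot = sum(w * b.get(token, 0.0) for token, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def haversine_km(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def relevancy_probability(user: dict, item: dict, coef: dict) -> float:
    distance = haversine_km(*user["location"], *item["location"])
    features = {
        "interest_match": cosine(user["interests"], item["topics"]),  # weighted vectors
        "nearness": 1.0 / (1.0 + distance / 100.0),                   # multi-dimensional signal
        "activity": math.log1p(user["interaction_count"]),            # simple scalar
    }
    z = coef["bias"] + sum(coef[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # squash the score into the range 0.0 to 1.0

# Hypothetical coefficients; a real system would learn these from data.
coef = {"bias": -2.0, "interest_match": 3.0, "nearness": 1.0, "activity": 0.2}
user = {"interests": {"hiking": 0.8, "tents": 0.5},
        "location": (38.9, -77.0), "interaction_count": 12}
item_a = {"topics": {"tents": 1.0, "hiking": 0.4}, "location": (39.0, -77.1)}
item_b = {"topics": {"cookware": 1.0}, "location": (34.0, -118.2)}

p_a = relevancy_probability(user, item_a, coef)
p_b = relevancy_probability(user, item_b, coef)
```

The search terms entered by the user would simply be one more feature in the same formula.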
Personalization requires Data on the Person
The recommended process for personalizing search results (see above) starts with gathering as much data as possible about all users.
Ideally, all possible interactions with all users are monitored, recorded, gathered, cleansed, and normalized from all possible sources, and aggregated together into a series of signals creating a view of the user which is then used as the search.
This should include as much of the following as possible:
- Personal data
- Current location, home location, age, gender, initial contact date, etc. (whatever is available and allowed to be gathered)
- If the customer is an employee, this can also include employee-business data such as seniority, office location, manager, business unit, office group, job title, etc.
- Activity on the web site (pages viewed, items viewed)
- Activity on other web sites and web pages
- Clicks on advertisements, referral links, partner arrangements, etc.
- Purchase activity from digital sources and from brick-and-mortar sources
- Knowledge of group membership, where available, and understanding of group activity
- For example, if the user is known to be a member of an organization, social group, office group, ISP, etc.
- Activity from external sources (social media, public information)
- Activity on other business systems and applications
Suddenly we have gone from too little data to too much data! Considering a user’s entire history on-line, every click, comment, document, app download, past search, etc. we now have a wealth of data which can be mined to learn more about that person.
All of this data is converted into a “query” (of sorts). Note that this “query” is not at all like traditional queries (a few keywords). Instead, it is a series of signals, vectors, and category / cluster assignments, which together provide a holistic representation of the user for use (along with the user's query) in relevancy ranking the results based on personal interests.
Search, not Batch
Many organizations are pursuing elements of personalization through pre-processing. In other words, they are using Hadoop big data machines to analyze all users and interactions in large batch jobs to identify products and content for users. Recommendation engines are a prime example of this approach.
Search in the “batch” world still has a role, but it is primarily for lookup of pre-computed results and not for true relevancy ranking.
Use Search for Dynamic Personalization
The problem with the batch approach is that it is insufficiently dynamic. It depends on big data jobs which are re-computed every evening, or every weekend. This is a problem for new users just entering the system, or for users whose recent activity is a diversion from historical patterns (which is all of us, really, since when do we go to a website to do the exact same thing over and over?).
Further, it tends to look at a narrow dimension of inputs (because these are mathematically tractable at large scale) rather than the sum-total of the user’s activity (containing many signals and dimensions).
Search neatly solves these problems by dynamically building queries with an emphasis on the user’s most recent activity. As the user progresses through the system, we build up an on-going and up-to-the-click profile of vectors, signals, category assignments etc. which is used to relevancy rank all content for the user.
Much more than just “Search”
People think of search engines in terms of “searching for stuff”, where the word “search” implies “I am missing something, and I am actively looking for it.”
But such an understanding of a “search engine” is much too constraining for today’s machines. The word “search” is limiting our abilities to see what these machines can do.
The truth is, what we call “Search Engines” are really massive-scale matching engines. They match a description of a desired outcome to a vast database of possible answers. They do this in fractions of a second (or a small number of seconds) with large and complex inputs.
Further, the “description” to these engines can be large, complex, poorly formed, incomplete and at least partly erroneous. The matching engine will match what it can, as best it can, and provide the results which are the most likely to be of interest to the user.
The Future of Search is No Search
One of the reasons I have become so enamored with personalization is that I see it as being a pivotal business process for handling a critical trend in the search engine industry.
Specifically, queries are getting smaller:
- 1970 – 1993: Large, complex Boolean queries
- 1993 – 2010: Two or three keywords
- 2010 – Today: Two or three characters
- Tomorrow: Nothing
It’s true! I believe that:
The Future of “Search” is No Search at all.
Of course I’m talking about “average” or “most frequent” usage. After all, large and complex Boolean queries are still being used today in academia and government intelligence. But most people today expect to be able to launch a search box and just enter a few keywords.
And now, people launch the search box and enter just a few characters. Matching queries or matching results are shown in a drop-down underneath the search box. This feature goes by many, many names (we really should standardize), including “type ahead”, “query completion”, “predictive search”, and “search suggestions”.
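For readers who have not built one, a type-ahead feature can be sketched in a few lines. This version assumes a table of past queries with popularity counts (the queries and counts below are made up), finds the ones that start with the typed characters using a sorted list, and orders them by popularity. Production systems typically add personalization and typo tolerance on top of this.

```python
import bisect

class TypeAhead:
    """Prefix lookup over past queries, ranked by popularity (illustrative)."""

    def __init__(self, query_counts: dict):
        self.counts = {q.lower(): c for q, c in query_counts.items()}
        self.sorted_queries = sorted(self.counts)

    def suggest(self, prefix: str, limit: int = 5):
        prefix = prefix.lower()
        # In a sorted list, every query sharing the prefix sits in one
        # contiguous run that starts at the insertion point of the prefix.
        i = bisect.bisect_left(self.sorted_queries, prefix)
        matches = []
        while i < len(self.sorted_queries) and self.sorted_queries[i].startswith(prefix):
            matches.append(self.sorted_queries[i])
            i += 1
        # Most popular first; ties are broken alphabetically.
        return sorted(matches, key=lambda q: (-self.counts[q], q))[:limit]

# Hypothetical query log summary: query text -> number of times it was run.
engine = TypeAhead({
    "personalization": 120,
    "personal loans": 300,
    "perl": 40,
    "search": 500,
})
two_chars = engine.suggest("pe")
no_chars = engine.suggest("", limit=1)
```

Note what happens with an empty prefix: every query matches, so the ordering alone decides what the user sees.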
So, over time we have gone from queries with hundreds of characters down to queries with just a few characters. How is this possible? The answer is relevancy ranking. As the size of the query decreases, systems become more and more dependent on relevancy ranking:
In such an environment, the only way to be successful is to 1) gather lots of data, 2) use big data to understand the user’s needs, and 3) implement high-quality, probabilistic relevancy models.
This is the only way we can possibly get down to queries which require no search at all.
Search without the Search Box: Getting to the “No Search” Search
There are two parts to creating the “No Search” search:
- Gather inputs from other sources (other than the search box).
- Create a truly fabulous relevancy ranking algorithm, using personalization and big data.
Once you have done these two steps, you can relevancy rank the entire database. In this way, we can provide search results without any search whatsoever.
This is how we get down to doing searches without queries. Users no longer enter a search at all. Instead, the machine simply ranks the entire database (every single document) for the user and gives them the results.
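In code, the idea looks roughly like this. The sketch assumes the user is represented by a weighted vector of tokens and uses a simple weighted overlap as a stand-in for a real relevancy model. The profile, the document ids, and the topic weights are hypothetical.

```python
import heapq

def score(profile: dict, doc_topics: dict) -> float:
    """Weighted overlap between user signals and document topics.

    A deliberately simple stand-in for a trained relevancy model.
    """
    return sum(w * doc_topics.get(token, 0.0) for token, w in profile.items())

def no_search_results(profile: dict, catalog: dict, k: int = 3):
    """Rank every document for this user; no query text is involved."""
    scored = ((score(profile, topics), doc_id) for doc_id, topics in catalog.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

# Hypothetical user signals and a tiny content database.
profile = {"tents": 2.5, "stoves": 2.0, "camping": 0.5}
catalog = {
    "doc-tent-guide": {"tents": 1.0, "camping": 0.5},
    "doc-stove-review": {"stoves": 1.0},
    "doc-tax-forms": {"finance": 1.0},
    "doc-camp-meals": {"stoves": 0.5, "camping": 1.0},
}
top_docs = no_search_results(profile, catalog, k=2)
```

A loop like this touches every document, which is fine for a toy catalog. At scale, this is the work a search engine does for us when the relevancy model runs directly inside it.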
Implications of the “No-Search” Search
Fully eliminating the search box has major implications in terms of business structure, and the relationship between the consumer (customer or employee) and the organization. It is a sea change which opens up all sorts of new possibilities.
From Passive/Pull to Active/Push [Change in Relationship]
No longer depend on your consumer to do a search. Instead, actively send them search and match results.
If the consumer is an internal employee, don’t depend on the employee to do the search as needed. Instead, automatically provide results for them to do their job when and where they need them. Eliminate training on search. Simplify processes.
For example, current systems depend on the employee to sit down, figure out what to do, and then do a search for the materials required. New systems will tell the employee what to do and will provide all of the materials they need, automatically, to help them do their job.
If the consumer is a customer, lower the barriers and gates to connecting them with content and products. Eliminate the need for them to formulate, enter and submit a search. Provide them relevant results as part of normal activity. Provide highly targeted streams and notifications to the user as they go through their day and understand the exact likelihood that the user will find the content to be relevant.
Changing to a System of Continuous Improvement [Change in Business Structure]
Most organizations today are hampered by a “project orientation.” Projects are identified, researched, estimated, and socialized within the organization. They are budgeted for a fiscal year, implemented, completed, and moved to production.
The problem with this approach is two-fold: 1) Understanding and leveraging digital data for personalization is a difficult task which requires a substantial amount of data research, data mining, and experimentation, and 2) The business environment and the consumer are always changing, and therefore so must the algorithms.
And so we recommend that organizations commit to a funded, on-going process of metrical analysis (engine scoring) and continuous improvement. Instead of a “search project”, we recommend an “on-going search improvement / personalization process.” Only with such a process can an organization tap the full potential that comes with the deep understanding of, and predictive capability about, its consumers that we are recommending.
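As one example of what an engine score could look like, the sketch below computes mean reciprocal rank from logged sessions. This particular metric and the session format are assumptions chosen for illustration. The point is only that each release of the relevancy model gets a number that can be tracked over time.

```python
def mean_reciprocal_rank(sessions) -> float:
    """Average of 1/rank of the clicked item across sessions.

    Each session is (ranked_ids, clicked_id). A session whose click is not
    in the ranked list contributes 0.
    """
    if not sessions:
        return 0.0
    total = 0.0
    for ranked_ids, clicked_id in sessions:
        if clicked_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(clicked_id) + 1)
    return total / len(sessions)

# Hypothetical log: what the engine showed, and what the user chose.
sessions = [
    (["a", "b", "c"], "a"),  # clicked the first result
    (["a", "b", "c"], "c"),  # clicked the third result
    (["a", "b"], "z"),       # clicked something the engine never showed
]
engine_score = mean_reciprocal_rank(sessions)
```

Comparing this score before and after each change to the signals or the model is what turns personalization into a measurable, continuous process rather than a one-time project.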
Give up Control to the Consumer [Change in Relationship]
When we discuss using the computer to optimize the interaction with the user, often we get a reaction along the lines of: “Let the machine determine what products to show? Let the machine determine how to organize the web page? Are you kidding?” Organizations have a hard time giving up control to an abstract relevancy ranking engine that decides what to present to the user, when, and where.
But this is the wrong way to think of it. After all, the machine (when working properly) is a mathematical representation of the user and the user’s behavior. It is a mirror to the user, if you will, determining the user’s desires based on the user’s digital activity and digital interactions.
And so, it is not really the “machine” that decides, but the consumer. The machine simply reflects the consumer’s expressed desire. And so we are not talking about corporations giving up control to a relevancy ranking machine, but instead, giving up control to their consumers, and letting the customer say “this is what I want”. In this world, the user is king (or queen, or princess) and the machine is optimized to give the user what they want.
But still, large, traditional, retail and manufacturing organizations are uncomfortable with this new dynamic. They have experts who determine the organization of the store, the placement of products, the prioritization of products presented to the consumer, the placement of boxes on the shelf, etc. They have extended this approach to the digital world where companies feel they must manually control all aspects of their web and mobile presentation to the customer.
But the problem is that digital interactions are too vast, too complex, too hidden for humans to understand. We used to be able to study video of dozens of people as they walked through a store. We would watch focus groups on monitors to see how they interact with the product.
Such strategies no longer work in the digital world, where every website can instantly attract millions of customer interactions. When swamped with such data, manual approaches are insufficient to capture the richness of interaction and the wide diversity of storylines, use cases, and personal goals. Only a machine, using big data techniques, can do this.
This is not to downplay the importance of the brand, and the positive value perceptions it embodies in the mind of the user. The brand and the value proposition are critical to the long-term strategy and health of the organization. If we always catered to the basest instincts of the user, the lowest common denominator, then where would we be? All selling touristy trinkets and T-Shirts probably (or worse).
No, instead our goal is to elevate the discussion to a higher level. Let humans make the big decisions and let the machines make the small decisions. Humans should devise the metrics which achieve the desired goals (revenue, value, brand), and machines should be used to optimize consumer digital interactions to meet those goals.
In other words, don’t sweat the small stuff. That’s what we have computers for.
Personalization is Now, Personalization is Mobile
By far the primary driver of personalization today is the mobile device.
If you have a large computer screen and you can enter long text queries, then personalization and top-quality relevancy scoring is less important. Users can enter long queries with many terms, execute multiple queries if the first one doesn’t work right, and can be presented with many scrollable pages of results with hundreds of content items.
This paradigm completely breaks down with mobile devices:
- Entering text queries on mobile devices is awkward and error prone.
- Mobile devices can only show a very small amount of content.
With a mobile device you are lucky to get one or two search characters (and you would prefer zero characters), and the amount of content you can show is one or (maybe) two items.
In such a world, what is a search engine to do? Only with the best possible relevancy ranking, which requires personalization, can organizations be successful in such a difficult environment.
Let’s get started
I’ve tried, in this blog article, to sketch out a vision of how personalization will take over the world of search. Today, we have personalization, of sorts, but only for the “average” user, and it is ad-hoc, single-dimensional, and (typically) untested and not statistically valid.
Recently, Search Technologies has been pushing the boundaries of search into fully-fledged matching engines using big data to create signals for matching, and creating new search engine operators to process these statistical models directly inside the search engine.
It is only recently that I have come to realize how these technologies are now perfectly positioned to provide a huge leap forward in search, using results personalization, the consequences of which could completely re-invent the relationship between the user and the search engine.
And while Search Technologies may not have created a personalization engine for your exact industry, user model, or content model, I'm confident that our experience with matching engines, custom relevancy models, and big data can enable new processes through which these amazing new personalization engines can be created.
So let’s get started.