
Data Capture Recommendations for the Data Driven Company

Paul Nelson
Innovation Lead

This is the second part of the blog “Analytics, Search and the Data Driven Organization.” In the first part, I discussed how search, personalization, relevancy ranking and predictive analysis all play a part in a data driven organization.

Fundamentally, all companies should be gathering and saving all data, on all interactions, with all customers and employees. This data should include:

  • An indication of the event (click, purchase, log-in, comment, search, view, etc.)
  • An identifier of the user who initiated the event
  • An identifier of the session within which the event occurred
  • The nature and characteristics of the event
  • Identifiers of business objects affected by the event
  • Identifier cross-references where required
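
As a rough illustration, the fields listed above could be carried in a single event record. This is only a sketch: the class name `Event`, its field names, and the `known_ids` helper are hypothetical, not a schema from any particular logging system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One captured interaction (hypothetical schema, for illustration only)."""
    event_type: str                    # click, purchase, log-in, comment, search, view, ...
    timestamp: datetime                # when the event occurred
    session_id: str                    # session within which the event occurred
    user_id: Optional[str] = None      # None while the user is anonymous
    cookie_id: Optional[str] = None    # set for returning users
    attributes: dict = field(default_factory=dict)  # nature and characteristics of the event
    object_ids: tuple = ()             # business objects affected (product, page, ...)

    def known_ids(self) -> dict:
        """Return every identifier present on this event, for cross-referencing."""
        ids = {
            "session_id": self.session_id,
            "cookie_id": self.cookie_id,
            "user_id": self.user_id,
        }
        return {name: value for name, value in ids.items() if value is not None}


# Example: an anonymous click on a product page.
click = Event(
    event_type="click",
    timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
    session_id="s1",
    object_ids=("product-42",),
)
```

A log-in event would carry the session, cookie, and user IDs together, which is what makes the cross-reference possible.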

Often this data is in log files, but sometimes it is captured in other business systems (for example, financial transactions) and stored in the data warehouse.

Fundamentally, it’s all about identifiers. 

And yes, I know how hard it is to have consistent identifiers across an organization. Organizations are so focused on getting the job done (creating and delivering a product or service, and getting paid for it) that things like consistent IDs and data gathering, which seem tangential, are often left out in the cold.

Let’s take an example from a consumer products company. The following is just a sample of what this company should be recording:

  • Every click on every public-facing web site by every user, with:
    • The session ID if the user is a first-time anonymous user
    • The cookie ID if the user is a returning user
    • The user ID if the user is logged in
    • A method to cross-reference between all of these IDs (e.g. a log-in event with all IDs)
    • The ID of the web page or application item clicked
    • The product ID and/or product-group ID of the item clicked
    • The URL of the source of the click (Google search, referrer URL)
    • The date & time of the click
    • The IP address and estimated location of the user when the click was performed
  • Every add-to-cart & purchase of every product from the eCommerce site, with all of the information above, plus:
    • The ID (or multiple IDs and SKUs, as appropriate) and characteristics of the product purchased
    • The product group
    • Other user-identity information, where available and allowed by law and commercial agreement (address, name, payment information, etc.)
  • Purchases from brick-and-mortar stores, including:
    • The ID (or IDs) and characteristics of the product purchased
    • The product group
    • Other user-identity information where available and allowed by law and commercial agreement
    • The physical location and date-time of the purchase
    • The register location and employee ID who recorded the purchase
  • Downloads and events for mobile-phone apps…
  • Product registration activity…
  • Customer support calls, e-mails, and support events…
  • Twitter mentions, Facebook mentions, fan activity…

And so on.

If consistent identification across all of these sources is impossible, then linking identifiers are the next best thing. For example, an anonymous web site visit will start with events identified only by session ID, which are then linked to a user ID once the user logs in.
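
A minimal sketch of that linking step, assuming events are plain dictionaries with hypothetical `session_id` and `user_id` keys: any event that carries both IDs (such as a log-in) teaches us which user a session belongs to, and earlier anonymous events in that session can then be attributed after the fact.

```python
def build_session_to_user(events):
    """Map session IDs to user IDs using events that carry both identifiers."""
    mapping = {}
    for ev in events:
        if ev.get("session_id") and ev.get("user_id"):
            mapping[ev["session_id"]] = ev["user_id"]
    return mapping


def resolve_identity(event, mapping):
    """Return the best available identity for an event.

    Falls back to a session-scoped identity when no user is known.
    """
    if event.get("user_id"):
        return event["user_id"]
    session_id = event["session_id"]
    return mapping.get(session_id, "session:" + session_id)


events = [
    {"event_type": "click", "session_id": "s1"},                    # anonymous
    {"event_type": "log-in", "session_id": "s1", "user_id": "u9"},  # links s1 to u9
    {"event_type": "click", "session_id": "s2"},                    # never logs in
]
links = build_session_to_user(events)
identities = [resolve_identity(ev, links) for ev in events]
```

Real systems usually need more than one hop (cookie to session to user), but the idea is the same: keep the linking events, and resolve identities in a later pass.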


Data Auditing

Let me take a moment to emphasize that good, complete data gathering depends on data auditing and sanity checking.

There are all sorts of reasons why data capture might be incomplete, including missing logging hooks, log rotation, log overwrites, dropped events, network failures, incomplete transfers, full disks, aggregation servers being down, etc.

And so, gathered data must be routinely checked. There are several ways to do this:

  1. Sanity checks (counting total number of events, number of events by type, comparison of event counts across systems, etc.) 
  2. Test event injection by performing operations with test accounts and checking the data for those accounts
  3. Checking a subset of events by gathering raw files from the beginning of the event chain, and checking them against post-processed events at the end of the chain (typically in the big data framework).
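
The first and third checks can be sketched as a comparison of event counts by type between the raw source and the post-processed store. The function names, the `event_type` key, and the `tolerance` parameter are assumptions made for this example.

```python
from collections import Counter


def count_by_type(events):
    """Count events per event type."""
    return Counter(ev["event_type"] for ev in events)


def audit_counts(raw_events, processed_events, tolerance=0.0):
    """Compare per-type counts at the start and end of the event chain.

    Returns {event_type: (raw_count, processed_count)} for every type whose
    counts differ by more than `tolerance` (a fraction of the raw count).
    """
    raw = count_by_type(raw_events)
    processed = count_by_type(processed_events)
    problems = {}
    for event_type in sorted(set(raw) | set(processed)):
        expected, actual = raw[event_type], processed[event_type]
        if abs(expected - actual) > expected * tolerance:
            problems[event_type] = (expected, actual)
    return problems


raw = [{"event_type": "click"}] * 3 + [{"event_type": "purchase"}]
processed = [{"event_type": "click"}] * 2 + [{"event_type": "purchase"}]
report = audit_counts(raw, processed)  # one click was lost along the way
```

Running a check like this on a schedule, and alerting when the report is not empty, catches most silent capture failures early.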


Exploration & Prediction

At this point, you might be asking yourself, “Well, this is all great, Paul, but exactly how will you be using all this data? What algorithms will you use? How will it improve my interaction with the customer?”

And the truth is, I don’t know. Not exactly. This is where “exploration” comes in.

As we’ve mentioned previously, the raw business and event data feeds the entire process. The steps in exploration and prediction are cyclical, and many cycles may be required to achieve the best possible results.

The goal of exploration is to identify signals that are highly correlated with user interest. The steps in the process are as follows:

  • Cleanse and Refine – Raw data is raw: it contains many irrelevant events (e.g. system monitoring and test events, keep-alive events, heartbeats, etc.) that need to be removed, and it arrives in a wide variety of formats that need to be parsed
  • Group by User – At Search Technologies we are aggressively user-focused. The first step is therefore grouping activity by user, usually by session and/or user ID. Often this step requires substantial ID cross-referencing
  • Cluster Users – Next, there is usually some method of clustering users: by topic, industry sector, activity set, etc. We like this because it’s easier to assign a new user to a cluster than to personalize on too little data, and clusters have more activity (i.e. more samples) for better aggregated analysis
  • Explore – Exploration is a free-form stage where various signals are checked against relevancy, satisfaction, and other metrics. This typically involves a lot of plotting and histograms of numbers against any sort of available dimension
  • Predict – Once we have a promising set of signals, we construct a predictive model around them (using machine learning). The model will identify the optimal combination of signals which will predict when an item (piece of content, product, web page, work order, job description, etc.) is relevant to the user. Typically the thing we are predicting against is also available in the logs (relevant content), or in financial data (purchases, add to cart)
  • Evaluate – The predictor must be evaluated (on a new set of data, set aside for this purpose) to determine its accuracy 
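
The first two steps above might look something like the following sketch. The set of noise event types and the dictionary keys are hypothetical; a real pipeline would derive them from its own logs.

```python
from collections import defaultdict

# Hypothetical event types that carry no information about user interest.
NOISE_TYPES = {"heartbeat", "keep_alive", "monitoring", "test"}


def cleanse(events):
    """Drop system-generated events that say nothing about user behavior."""
    return [ev for ev in events if ev["event_type"] not in NOISE_TYPES]


def group_by_user(events, session_to_user=None):
    """Group events by user ID, falling back to session ID for anonymous users.

    `session_to_user` is an optional cross-reference from session ID to user ID.
    """
    session_to_user = session_to_user or {}
    grouped = defaultdict(list)
    for ev in events:
        key = (
            ev.get("user_id")
            or session_to_user.get(ev["session_id"])
            or ev["session_id"]
        )
        grouped[key].append(ev)
    return dict(grouped)


events = [
    {"event_type": "heartbeat", "session_id": "s0"},
    {"event_type": "click", "session_id": "s1"},
    {"event_type": "purchase", "session_id": "s1", "user_id": "u9"},
    {"event_type": "view", "session_id": "s2"},
]
by_user = group_by_user(cleanse(events), session_to_user={"s1": "u9"})
```

The grouped activity is then the input to clustering and exploration.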

Note that this is a continuous improvement cycle. Predictors and signals can be (and should be) continuously improved and iteratively evaluated to produce the best results.
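
One simple way to score a predictor on held-out data is a hit rate: for each held-out interaction, was the item the user actually chose among the top few items the predictor ranked for that user? The sketch below assumes a hypothetical `rank_items(user_id)` callable that returns a ranked list of item IDs; it is one possible metric, not necessarily the one used in any given project.

```python
def hit_rate_at_k(rank_items, held_out, k=10):
    """Fraction of held-out (user_id, item_id) pairs found in the top k results.

    `rank_items` is any callable mapping a user ID to a ranked list of item IDs.
    `held_out` must not have been used to build the predictor.
    """
    if not held_out:
        return 0.0
    hits = 0
    for user_id, relevant_item in held_out:
        if relevant_item in rank_items(user_id)[:k]:
            hits += 1
    return hits / len(held_out)


def popular_items(user_id):
    """Toy predictor: recommends the same items to everyone."""
    return ["a", "b", "c"]


held_out = [("u1", "a"), ("u2", "z"), ("u3", "c"), ("u4", "c")]
score_top3 = hit_rate_at_k(popular_items, held_out, k=3)
score_top1 = hit_rate_at_k(popular_items, held_out, k=1)
```

Tracking a number like this from one cycle to the next is what makes the improvement measurable.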


Interaction = Search Relevancy & Personalization

In the “interaction” phase, we actually bake all of our insights and hard work into a production product that interacts directly with the end-user (customer or employee).

This will inevitably involve some compromises:

  • Not all types of data may be available for all users (especially brand-new users)
  • Real-time signals may not be computed in exactly the way that we like
  • Prediction models may be too complicated to perform for real-time searches
  • Implementing relevancy and personalization requires changes to actual web sites, which may be under the control of other organizations with different goals and priorities

Search Technologies has been actively working to reduce these barriers by implementing custom operators into search engines to include prediction models for the purposes of personalization, high-accuracy relevancy ranking, and search-driven web sites. These custom operators smooth the deployment process, and provide the critical missing link between the world of big data predictions, and the real-time world of consumer (customer and employee) engagement.


Creating a Metrics-driven, Continuous Improvement Culture

To make all of this work, the way we normally do business in IT must change from a project orientation to a culture devoted to ongoing, metrics-driven, continuous improvement.

Most organizations I meet (and this includes my own company) are “project-based.” They create a project for a new product, a new software tool, a web site refresh, etc., and then implement the project. Projects are socialized, funded and tracked, and when the project is finished there is a post-mortem.

What’s missing in this process is on-going maintenance and continuous improvement. It is unfortunate that most organizations are not “set up” to recognize, appreciate, and therefore fund initiatives which require a sustained, iterative effort.

Or maybe this is human nature? After all, I find it easier to “be the hero” (massive one-time cleanup of the basement) than to be “mister steady” (weekly maintenance and cleanup of the bathroom).

The digital manufacturing floor

But many companies do have an area devoted to continuous improvement, called the manufacturing floor, where their products are produced. Perhaps we need to invent the “digital manufacturing floor”?

Of course, we have “production systems”, where software systems are put on-line and maintained by administrators. But production systems are too often closed off from continuous improvement. They are, instead, intended to be fixed systems, running and responding to real-time requests.

And so, perhaps we need the notion of a digital manufacturing floor, which contains the production systems, but also contains methods for continuous improvement which can be moved into production on a continuous basis.

In other words, a culture which incorporates continuous improvement into the daily production environment.

Testing and Evaluation

Too often, companies reach and interact with consumers on a trial & error basis. New content, new services, new algorithms are pushed to the consumer (customer or employee) without pre-testing, real-time monitoring, or real-time evaluation. The only indications of success are revenue and usage, and by the time these numbers are validated, it is already too late.

In this blog I am advocating for a metrics-driven continuous improvement process. Systems must be tested against users’ past behavior to ensure that future algorithms and presentations will perform at least as well. This requires organizations to re-engineer their digital delivery processes:

  • Gather user data, as much as possible, from everywhere
  • Compute accuracy and statistics for the existing system
  • Compare accuracy and statistics for any proposed system, before it is moved to production, against prior user behavior
  • Perform A/B testing of proposed changes, in production with live users, before they are rolled out to all users
  • Gather and monitor logs or business event streams in real-time
  • Continuously evaluate, improve, test and measure
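
For the A/B testing step, a common approach is to assign each user to a variant deterministically by hashing the user ID, so that the same user always sees the same variant without any stored state. The sketch below illustrates that general technique; the function name, the experiment label, and the default 10% treatment share are assumptions for this example.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_share=0.1):
    """Deterministically assign a user to "treatment" or "control".

    Hashing the experiment name together with the user ID gives each
    experiment its own independent split of the user base.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same variant for a given experiment.
variant = assign_variant("u9", "new-ranking-model")
```

Logging the assigned variant on every event is what later allows the two groups to be compared on the metrics gathered above.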

Unfortunately, very few of the organizations I talk to actually follow these recommendations. Everyone wants to implement A/B testing, for example, but few actually do, and even fewer plan for it and make it a standard practice.

One of the purposes of this blog is to encourage business owners to take these issues seriously. A metrics-driven process must be planned for from the start, and have upper-level management support.

Metrics-driven evaluation, testing, and continuous improvement are the cornerstones of a data-driven organization. This is a dependable, reliable process which provides measurable, ongoing, predictable improvements in revenue and customer satisfaction, with a clear and measurable return on investment.

It is well worth the effort.


Creating the Data Driven Organization

I suppose that by this point you’re wondering, ‘why does a search guy care so much about gathering and using data?’

Well, fundamentally, business data contains the raw materials for understanding the user, and understanding the user is the key to providing great search.

I have always felt that a search engine is that magical pivot point between the mental and the physical.

To me, it’s obvious why data is so important: A better understanding of the user leads to better search. And better search leads to more product purchases, higher user satisfaction, and better brand associations.

And I think we can all agree that’s the ideal scenario.


-- Paul