
Metadata Capture: Don’t Miss the Low Hanging Fruit

Metadata is important in most search applications, and easily captured metadata can greatly enhance the user experience.

The majority of cool new functions in a typical search results interface rely on the presence of appropriate metadata. Yet in many search engine implementations, simple opportunities to capture existing metadata are missed. Although there are a variety of reasons why this occurs, it is perhaps most common where the customer has a laser-like focus on plug-and-play.

The moral, if there is one, might be this: Just because you can plug-and-play, it doesn’t mean that you should.

This article was provoked by a recently witnessed extreme version of a common scenario.

Perhaps the most common repository type, when it comes to data sources, is the ubiquitous file share. In your average enterprise search scenario, the file share holds more indexable content than any other source (although the value of the content may be relatively low). The customer in question had already implemented a plug-and-play data connector, which was working perfectly. The search experience, however, was poor, and they were loath to roll the system out beyond a test phase. At this stage, we became involved.

The customer thought they had a relevancy problem, and that we should tune the parameters of the search engine’s algorithm to solve it. Doing so might have made some progress, but relevancy was not the core issue. The data set was about 7 million documents occupying 4TB of space. We examined the data in detail (using Aspire-based automated tools) at project initiation. In doing so, we made a number of observations about the data set which might be the subject of a follow-on article, but here, let’s focus on the metadata.

There were more than 700,000 directories on the file share, and embedded in many of the pathnames was useful metadata. A typical file path was of the form:


Of course, with a plug-and-play approach, none of that metadata was being captured. What makes this an extreme example is the nature of certain files, which in the customer’s opinion were very important to search users. These spreadsheet files:
  • Were many gigabytes in size (I personally had never seen such large spreadsheets, and the customer had hundreds that were 5GB+)
  • Contained no useful text for indexing purposes, just a lot of numbers. For example, within a typical large .xls file, or in its title, the company name was not mentioned; there was only generic text, words such as balance, consolidated, and total.
So the search indexes were being clogged by hundreds of millions of numbers that are very seldom searched for, meaning that more resources were needed to run the system, and yet these documents would never be found by a real-world search.

The solution was simple: capture the file path information into metadata and associate it with the file. This was done after a little thought. In addition to indexing text from the file path, we provided search navigation options based on the captured metadata.
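As a rough illustration of the idea, here is a minimal Python sketch of turning path segments into metadata fields. The directory layout (region, then client, then year), the field names, and the `path_metadata` function are all assumptions made up for this example; the customer's actual path scheme and the tooling used on the project are not shown here.

```python
from pathlib import PureWindowsPath

# ASSUMPTION: a hypothetical layout of <share>\<region>\<client>\<year>\<file>.
# The real directory scheme is not described in the article.
FIELD_NAMES = ["region", "client", "year"]


def path_metadata(path: str) -> dict:
    """Derive metadata fields from the directory segments of a file path."""
    p = PureWindowsPath(path)
    # parts[0] is the drive or UNC share anchor; the last part is the file name.
    dirs = p.parts[1:-1]
    # Pair each directory level with a field name; extra levels are ignored.
    meta = dict(zip(FIELD_NAMES, dirs))
    meta["filename"] = p.name
    # Searchable text built from the path, so a query on the client name
    # can match a file whose body contains only numbers.
    meta["path_text"] = " ".join(dirs + (p.stem,))
    return meta


example = path_metadata(r"\\fs01\finance\EMEA\Acme Corp\2011\consolidated.xls")
print(example["client"], example["year"])
```

Fields such as `client` and `year` could then feed navigators, while `path_text` is indexed alongside whatever text the file itself yields.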

It was necessary to take a different approach to metadata acquisition from different parts of the file share. Partitioning was possible at a high level, and four basic strategies resulted, ranging from capturing metadata and including it in navigators, to ignoring file paths on areas of the disk where they added no value.
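One way to picture this partitioning is a lookup from top-level folder to strategy. The sketch below is hypothetical: the folder names, the `PathStrategy` enum, and the `strategy_for` function are invented for illustration, and only the two ends of the range (navigators and ignoring the path) come from the description above. The middle option is an assumed example, not a record of the strategies actually used.

```python
from enum import Enum
from pathlib import PureWindowsPath


class PathStrategy(Enum):
    NAVIGATORS = "capture metadata and expose it in navigators"
    INDEX_ONLY = "index path text, no navigators"  # assumed middle option
    IGNORE = "ignore the file path"


# ASSUMPTION: hypothetical top-level folders on the share.
RULES = {
    "clients": PathStrategy.NAVIGATORS,
    "projects": PathStrategy.INDEX_ONLY,
    "scratch": PathStrategy.IGNORE,
}


def strategy_for(path: str, default: PathStrategy = PathStrategy.IGNORE) -> PathStrategy:
    """Pick a metadata strategy from the first directory below the share."""
    parts = PureWindowsPath(path).parts
    # parts[0] is the share anchor and the last part is the file name,
    # so a top-level folder exists only when there are more than two parts.
    top = parts[1].lower() if len(parts) > 2 else ""
    return RULES.get(top, default)


print(strategy_for(r"\\fs01\share\Clients\Acme\2011\totals.xls").name)
```

Keeping the rules in a small table like this means that a new area of the share can be assigned a strategy without touching the extraction code.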

The implementation services involved were modest compared to the total system cost. Search speed improved a little, and user satisfaction was transformed.