
A New Stage In Enterprise Search Indexing Performance

That creaking sound coming from your enterprise network might be the search server starting a full crawl. Scheduling a crawl, whether full or incremental, is not something IT professionals take lightly. As stated in the Search Technologies white paper Search Accuracy Analytics, “The difficult truth is that all search engines require tuning, and all content requires processing. The more tuning and the more processing you do, the better your search results will be.”

Crawls for enterprise search engines such as Solr, Elasticsearch, the Google Search Appliance and SharePoint are essential to this process, although they can look and feel like a denial-of-service attack! Because of the load they place on the network and on content repositories, crawls can inhibit efforts to improve quality and innovate. If it is deemed “too expensive” to reprocess content in order to test whether a new content processing feature will enhance user satisfaction, search engineers lose the agility they need to provide users with the most relevant results and an outstanding user experience.

The concept of a fast local repository that search engines can take advantage of is not new; what is new is the availability of low-cost, dynamic storage to do this, along with inexpensive and powerful big data tool stacks such as Cloudera, Elasticsearch ELK, and Apache projects such as Hadoop and Spark. Search Technologies is currently implementing a “Staging Repository”, based on Aspire, as part of the search solution for some of our customers, to support continuous quality practices while making search a much friendlier presence in the IT infrastructure.

 

Example Staging Repository Architecture

On one of these projects, for a large pharmaceutical company, crawling and processing (including OCR and entity tagging) about 60,000 documents directly from a file system took nearly 5 hours. Using a local staging repository, the same content processing took about 1 hour, despite running on a less powerful server that has yet to be tuned for even better performance. Think of what you could do when content can be reprocessed in minutes rather than hours or days: add entity identification to support new classification or faceting; support semantic analysis for improved search result quality; bring files that contain only images of text into your search through a background OCR process; and so much more!
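
To make the idea concrete, here is a minimal Python sketch of the staging pattern: the slow source is read once, raw content is kept in a local store, and every later processing pass runs against the local copies only. This is an illustration under stated assumptions, not the Aspire-based implementation. The names `StagingRepository`, `crawl_into_staging` and `reprocess` are hypothetical, the store is a plain directory of JSON files, and the source is assumed to be any iterable of `(doc_id, content)` pairs.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, Tuple


class StagingRepository:
    """Hypothetical local store of raw documents, keyed by source ID."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, doc_id: str) -> Path:
        # Hash the ID so any source path or URL maps to a safe file name.
        digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.json"

    def put(self, doc_id: str, content: str) -> None:
        # Writing the same ID again overwrites the staged copy.
        record = {"id": doc_id, "content": content}
        self._path(doc_id).write_text(json.dumps(record), encoding="utf-8")

    def items(self) -> Iterator[Tuple[str, str]]:
        for path in sorted(self.root.glob("*.json")):
            record = json.loads(path.read_text(encoding="utf-8"))
            yield record["id"], record["content"]


def crawl_into_staging(
    source: Iterable[Tuple[str, str]], staging: StagingRepository
) -> int:
    """Read the (slow) source once and copy every document into staging."""
    count = 0
    for doc_id, content in source:
        staging.put(doc_id, content)
        count += 1
    return count


def reprocess(
    staging: StagingRepository, process: Callable[[str], dict]
) -> Dict[str, dict]:
    """Run a processing step over staged copies; the source is not touched."""
    return {doc_id: process(content) for doc_id, content in staging.items()}
```

With this shape, trying out a new processing step (a stand-in here would be something like `lambda text: {"tokens": len(text.split())}`) means calling `reprocess` again with a different function, while the original file system or repository sees no additional load.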

Users of your search solution are constantly asking for a new and improved enterprise search experience, based on their experience with Google, Bing and other Internet search providers. A staging repository can be an important component in helping you meet this demand.

 

Steve Denny & John-Henry Gross
