Improving Indexing Speed using a Big Data Architecture
Large reductions in the time taken to re-index provide search systems with increased agility
With search systems, it is necessary from time to time to re-index the whole document corpus. For example, when:
- Indexes become corrupt
- New features are added to the search engine
- New or different content processing capabilities are added
- Content normalization issues are found and need fixing
THE TRADITIONAL APPROACH
In almost all currently deployed search systems, a full or even partial re-index requires the whole process to be repeated, including the extraction of each document from its repository.
Real-world indexing speeds can be limited by a number of factors, although the raw indexing capability of the search engine is seldom one of the limits. Common limiting factors include:
- Extracting documents from the repository
- Opening documents (for example, to extract text from a PDF file)
For example, extracting Microsoft Office documents from a typical Content Management System, running on a separate subnet, will typically limit the indexing rate to just a few documents per second. At, say, five documents per second, re-indexing a corpus of 10 million documents takes over three weeks.
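The arithmetic behind that estimate is straightforward. A quick sketch (the five-documents-per-second rate is an illustrative assumption, not a measured figure):

```python
def reindex_days(corpus_size: int, docs_per_second: float) -> float:
    """Days needed to re-index a corpus at a sustained indexing rate."""
    seconds = corpus_size / docs_per_second
    return seconds / 86_400  # seconds per day

# Repository extraction limits throughput to a few documents per second,
# so a 10-million-document corpus takes weeks to re-index from source.
print(round(reindex_days(10_000_000, 5)))  # → 23 (days, i.e. over three weeks)
```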
THE DATA CACHING APPROACH
Where extraction from repositories is the primary limiting factor (and it often is), caching the data outside the repositories, in a more easily accessible yet fully secure structure, provides significant gains. Hadoop can be used for this purpose.
Combine Hadoop caching with a traditional indexing approach, and indexing rates of a few hundred documents per second can be achieved. This means that 10 million documents will be re-indexed in about a day.
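The essential idea of the cache: extract each document from its repository once, store the content together with its security metadata in the Big Data store, and serve all subsequent re-indexes from that cache. The sketch below uses a plain Python dictionary as a stand-in for the HDFS-backed store, and the document IDs and ACL format are hypothetical:

```python
# Stand-in for an HDFS-backed cache: document ID -> (content, access list).
# In a real deployment this would live in Hadoop, not in process memory.
cache: dict[str, tuple[bytes, list[str]]] = {}

def cache_document(doc_id: str, content: bytes, acl: list[str]) -> None:
    """Store extracted content plus its access-control list, so that
    re-indexing never needs to touch the source repository again."""
    cache[doc_id] = (content, acl)

def fetch_for_reindex(doc_id: str) -> tuple[bytes, list[str]]:
    """Re-index path: a fast read from the cache, not the repository."""
    return cache[doc_id]

# The first (and only) extraction from the repository populates the cache.
cache_document("doc-1", b"quarterly report text", ["group:finance"])

# Every subsequent re-index reads from the cache at local-disk speed.
content, acl = fetch_for_reindex("doc-1")
```

Keeping the access-control list alongside the content is what makes the cache "fully secure": the original repository permissions travel with the document into every rebuilt index.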
THE INTEGRATED APPROACH
Greater gains still come from running the indexing process itself as part of a Big Data Map/Reduce job. This can be accomplished with any search engine whose indexing engine can be installed and launched within the Big Data framework.
Using an integrated approach, indexing speeds of a few thousand documents per second can be achieved, reducing the time to index 10 million documents to just a few hours.
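To illustrate the map/reduce shape of such a job: the mapper tokenizes cached documents and emits (term, document) pairs, and the reducer merges them into postings lists. This is a minimal illustrative inverted-index builder in plain Python, not any particular engine's indexing API:

```python
from collections import defaultdict

def map_phase(documents: dict[str, str]) -> list[tuple[str, str]]:
    """Mapper: emit a (term, doc_id) pair for every distinct token."""
    pairs = []
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            pairs.append((term, doc_id))
    return pairs

def reduce_phase(pairs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Reducer: merge pairs into a postings list per term."""
    index = defaultdict(list)
    for term, doc_id in sorted(pairs):  # the framework sorts by key
        index[term].append(doc_id)
    return dict(index)

docs = {"d1": "search engine indexing", "d2": "big data indexing"}
index = reduce_phase(map_phase(docs))
print(index["indexing"])  # → ['d1', 'd2']
```

Because the map phase runs in parallel across the cluster, next to the cached data, throughput scales with the number of nodes rather than with repository extraction speed.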
TAKING THE LOAD OFF INFRASTRUCTURE
Re-indexing from source imposes a heavy burden both on repositories and on network bandwidth in general. The cached and integrated approaches summarized above greatly relieve this stress.
THE IMPORTANCE OF AGILITY
The ability to re-index quickly, without causing additional stress to corporate infrastructure, increases search system agility. This means, for example:
- Content processing fixes and improvements can be implemented immediately
- Large indexing tasks do not need to be scheduled months in advance
- Weekly sprints that create a brand-new index each week can be extremely helpful during initial project implementation, and support an ongoing philosophy of continuous incremental improvement
Contact us for an informal discussion about improving your indexing speed.