
A File System Staging Repository for Search Engines

A Modern Architecture for Faster Content Processing and a Seamless Search Experience

Derek Rodriguez
Technical Architect Sr. Manager

How can a file system staging repository reduce content processing time from days or weeks to minutes or seconds, and ultimately improve users' search experience?

Watch our customer's success story told from our architect's first-hand experience.



The Challenge of the Traditional Search Architecture

Let’s consider a typical search application architecture and its components.

  • Content acquisition stages crawl and harvest content. 
  • Content processing stages prepare content for indexing in the search engine.
  • Search engine provides indexing and query services over the content.
  • And the search application allows users to search and browse the indexed content. 

This traditional architecture is straightforward and commonly used, but it has two limitations.

First, the content acquisition and content processing stages are tightly coupled, which means that reindexing content always requires recrawling the original content sources. Imagine the massive amount of business data today and the time it would take to recrawl all of it. Depending on the volume of your content and the capacity of your host system, a recrawl may take days or weeks. And, you guessed it, businesses and customers don't want to (and can't afford to) wait that long to have their content ready for search. In addition, because a full recrawl ties up time and IT resources and may disrupt business processes, this architecture may delay or prevent search engine enhancement tasks, including engine scoring and relevancy tuning.

The second challenge of this architecture is a disconnect between the search application and the user experience. It’s very likely that you’ve experienced this situation before: after you enter a term in a website’s search box and hit that “search” button, you’re taken to a completely separate window to view the search results. This disconnect can frustrate users who are accustomed to intuitive search interfaces like Google, and even more so if they can’t find what they need on the first attempt.


New User Expectations, Democratization of Data, and the Rise of Staging Repositories

Now let’s think about your own search experience on Google or Amazon. Everything is so seamless and easy, isn’t it? Indeed, these giants have set the standard for today’s search experience: users have come to expect a single application with an intuitive interface for searching, browsing, and viewing results. This is also one reason why many enterprise search applications seem to lag behind.

Our Aspire team solved this problem by developing a File System Staging Repository service. The concept of using a staging repository for faster content processing is not new to the search industry, but it’s gaining significance now that the “cloud” era brings massive data volumes and more low-cost, dynamic storage options, along with powerful big data tools, such as Cloudera, the Elastic (ELK) Stack, and Apache projects like Hadoop and Spark.


How a File System Staging Repository Architecture Performs Differently

The Aspire File System Staging Repository addresses the challenges of the traditional search architecture in order to deliver a fast, seamless indexing and search experience. Here's how it works:

[Figure: Aspire File System Staging Repository architecture]

  • Adding the File System Staging Repository to an architecture allows us to decouple content acquisition and content processing stages. In the content acquisition stage, Aspire’s connectors acquire content from multiple data sources and store it in the File System Staging Repository. 
  • Aspire’s content processing pipeline reads content from the repository and prepares it for indexing in an organization's existing search engine. Because the File System Staging Repository is local to Aspire, content can be re-indexed very quickly, with no additional recrawling of the original sources. Rather than taking days or weeks, indexing can be completed in minutes or seconds.
  • The File System Staging Repository also provides services to search applications. Search applications can request content from the repository (such as HTML, PDF, and Office documents) for rendering to users. With the File System Staging Repository, applications can now provide an integrated search, browse, and content experience in a single search interface.
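
The three roles described above can be illustrated with a minimal Python sketch. This is not Aspire's implementation or API; the class name `FileSystemStagingRepository`, the `reindex` helper, and the on-disk layout (one directory per document, named by a hash of its source URL) are all hypothetical choices made for illustration. It only shows the idea: connectors write once, the processing pipeline re-reads from local disk, and applications fetch stored content for rendering.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, Iterator, Tuple


class FileSystemStagingRepository:
    """Hypothetical staging repository: one directory per staged document."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _doc_dir(self, source_url: str) -> Path:
        # Hash the source URL so any URL maps to a safe directory name.
        digest = hashlib.sha256(source_url.encode("utf-8")).hexdigest()
        return self.root / digest

    def store(self, source_url: str, content: bytes, content_type: str) -> None:
        """Acquisition side: a connector saves a document it has crawled."""
        doc_dir = self._doc_dir(source_url)
        doc_dir.mkdir(exist_ok=True)
        (doc_dir / "content.bin").write_bytes(content)
        metadata = {"source_url": source_url, "content_type": content_type}
        (doc_dir / "metadata.json").write_text(json.dumps(metadata), encoding="utf-8")

    def fetch(self, source_url: str) -> Tuple[Dict[str, str], bytes]:
        """Application side: return a staged document for rendering."""
        doc_dir = self._doc_dir(source_url)
        metadata = json.loads((doc_dir / "metadata.json").read_text(encoding="utf-8"))
        return metadata, (doc_dir / "content.bin").read_bytes()

    def documents(self) -> Iterator[Tuple[Dict[str, str], bytes]]:
        """Processing side: iterate over every staged document on local disk."""
        for meta_path in sorted(self.root.glob("*/metadata.json")):
            metadata = json.loads(meta_path.read_text(encoding="utf-8"))
            yield metadata, (meta_path.parent / "content.bin").read_bytes()


def reindex(
    repo: FileSystemStagingRepository,
    process: Callable[[Dict[str, str], bytes], dict],
    index: Callable[[dict], None],
) -> int:
    """Re-run processing and indexing from the staged copies, with no recrawl."""
    count = 0
    for metadata, content in repo.documents():
        index(process(metadata, content))
        count += 1
    return count
```

In this sketch, changing the processing logic (for example, to tune relevancy fields) only requires calling `reindex` again with a new `process` function; the original content sources are never contacted. The `index` callback stands in for whatever client sends documents to the search engine.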


In Action: Faster Indexing and Better Search for the Library of Congress' Cataloger's Desktop Application 

The File System Staging Repository is a game changer for the Library of Congress' search application.

This architecture allows Aspire’s connectors to harvest content, whether from file systems, web crawls, or other document types, and store it in the File System Staging Repository, allowing for very efficient re-indexing any time it is needed.

The system crawls 300+ resources, indexes them, and makes them available for unified search, browsing, and viewing. The new staging system allows the Library of Congress to crawl content more efficiently and keep Cataloger’s Desktop up to date for 10,000+ librarians at 1,000+ subscribing institutions worldwide.

Learn more about our search improvement projects for the Library of Congress' Cataloger’s Desktop here.

--  Derek