Popular Use Cases for the Aspire Content Processing Framework
Aspire content processing can be used for any application which requires unstructured or partially-structured content to be cleaned, filtered, enriched, normalized, or otherwise processed. Aspire also handles structured data easily.
A selection of use cases for Aspire, based on the activities of our current customers, is summarized below.
A BOLT-ON INDEXING PIPELINE
Aspire is used to pre-process content before it is indexed into search engines such as Solr/Lucene, the Google Search Appliance, or Amazon CloudSearch. These search engines do not currently provide a full-function indexing pipeline, and in some circumstances it is advantageous to have one available.
Common uses include:
- Cleaning "dirty data" prior to indexing. This can significantly improve the search experience by removing false positives
- Extracting metadata from source content to drive search navigation
- Enriching content through entity extraction or categorization to drive search features
- Organizing metadata for results sorting by property (price, distance, date, department, etc.)
- Filtering content before indexing. In other words, being selective about what is allowed into the search index
- Splitting large documents into "sensibly searchable units"
- Aggregating content from multiple sources into "virtual documents" for indexing purposes
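As a rough illustration of how cleaning, filtering, and splitting can be chained ahead of an indexer, here is a minimal sketch in plain Python. It does not use Aspire's API. The document shape (a dict with `id` and `text`), the stage names, and the five-word split size are all assumptions made for the example.

```python
import re

def clean(docs):
    """Cleaning stage: strip markup remnants and collapse whitespace."""
    for doc in docs:
        text = re.sub(r"<[^>]+>", " ", doc["text"])
        yield {**doc, "text": re.sub(r"\s+", " ", text).strip()}

def drop_empty(docs):
    """Filtering stage: only documents with remaining text reach the index."""
    for doc in docs:
        if doc["text"]:
            yield doc

def split_sections(docs, max_words=5):
    """Splitting stage: break long documents into smaller searchable units."""
    for doc in docs:
        words = doc["text"].split()
        for n, start in enumerate(range(0, len(words), max_words)):
            chunk = " ".join(words[start:start + max_words])
            yield {**doc, "id": f'{doc["id"]}#{n}', "text": chunk}

def run_pipeline(docs, stages):
    """Feed the output of each stage into the next, in order."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)

if __name__ == "__main__":
    raw = [
        {"id": "a", "text": "<p>one two  three</p> four five six"},
        {"id": "b", "text": "<br/>  "},  # empty after cleaning, so filtered out
    ]
    for unit in run_pipeline(raw, [clean, drop_empty, split_sections]):
        print(unit["id"], "->", unit["text"])
```

Each stage is an ordinary generator function, so stages can be reordered, removed, or replaced without touching the others.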
SEARCH ENGINE INDEPENDENCE
We have a number of customers who have built custom search applications with Aspire, where a primary customer motivation is to gain search engine independence. For example:
- Customers who know that they will migrate to a new search engine within a few years. With Aspire, much of the customization investment in the current system can be reused with the new search engine when the time comes to move, which avoids cost and provides continuity of service
- Customers running multiple search engines, who can save money by consolidating content processing into a single, independent framework
- Customers using legacy systems such as Verity K2 or RetrievalWare who have invested a significant amount of effort in custom code. Using Aspire, we help these customers carry that investment over into Java, a widely used programming language, as a stepping stone for moving to a new search engine
- Customers who need search-engine-independent data connectors for important content repositories
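The idea behind engine independence can be sketched as follows: processing code depends only on a small writer interface, and each search engine gets its own adapter. This is plain Python, not Aspire's API. `IndexWriter`, `EngineA`, `EngineB`, `normalize`, and `index_all` are hypothetical names, and both "engines" are in-memory stand-ins.

```python
from abc import ABC, abstractmethod

class IndexWriter(ABC):
    """Hypothetical engine-neutral interface the processing code targets."""

    @abstractmethod
    def write(self, doc: dict) -> None:
        ...

class EngineA(IndexWriter):
    """Stand-in for one engine: stores documents keyed by id."""

    def __init__(self):
        self.docs = {}

    def write(self, doc):
        self.docs[doc["id"]] = doc

class EngineB(IndexWriter):
    """Stand-in for a second engine: stores sorted field/value pairs."""

    def __init__(self):
        self.records = []

    def write(self, doc):
        self.records.append(sorted(doc.items()))

def normalize(doc):
    """Engine-independent processing: tidy the title field."""
    return {"id": doc["id"], "title": doc.get("title", "").strip().title()}

def index_all(docs, writer):
    """Process every document and hand it to whichever engine is plugged in."""
    count = 0
    for doc in docs:
        writer.write(normalize(doc))
        count += 1
    return count

if __name__ == "__main__":
    docs = [{"id": "1", "title": "  annual report "}]
    for writer in (EngineA(), EngineB()):
        print(type(writer).__name__, "indexed", index_all(docs, writer))
```

Under this arrangement, moving to a new engine means writing one new adapter. The `normalize` step, which stands for the customization investment, is reused unchanged.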
Sometimes, large search projects fail because of "complexity creep". Evolving user demands and ever-changing data sets cause patch after patch to be added to the index pipeline code, which eventually becomes unmanageable.
Aspire was designed to mitigate complexity by providing transparency into processes and by taking a componentized approach to application design. Individual components can be built, tested, deployed, and upgraded in isolation.
Aspire is thread-safe and multi-threaded, and it supports distributed processing. Its architecture enables index-pipeline bottlenecks to be mitigated. The result is a system that scales well in terms of both latency and throughput. This benefits search applications where close-to-real-time indexing is needed, or where data volumes are very large.
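To make the component and threading ideas concrete, here is a small sketch in plain Python using the standard library's thread pool. It is an illustration under stated assumptions, not a description of Aspire's internals. The component names and document fields are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def lowercase_title(doc):
    """Component 1: returns a new dict, so no state is shared between threads."""
    return {**doc, "title": doc["title"].lower()}

def add_word_count(doc):
    """Component 2: independent of component 1 and testable on its own."""
    return {**doc, "words": len(doc["text"].split())}

# Hypothetical component chain; each entry can be swapped without touching the rest.
COMPONENTS = [lowercase_title, add_word_count]

def process(doc):
    """Run one document through every component in order."""
    for component in COMPONENTS:
        doc = component(doc)
    return doc

def process_batch(docs, workers=4):
    """Process documents concurrently; pool.map keeps the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, docs))

if __name__ == "__main__":
    batch = [
        {"title": "Annual Report", "text": "revenue grew this year"},
        {"title": "Press Release", "text": "new product"},
    ]
    for result in process_batch(batch):
        print(result)
```

Because each component is a pure function over a single document, it can be unit-tested in isolation and is safe to call from several threads at once. In Python specifically, threads mainly help when components wait on I/O such as fetching content from a repository.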
HIGHLY CUSTOMIZED SEARCH SOLUTIONS
Where practical and affordable, customers prefer to use commercial off-the-shelf (COTS) software. Usually, some configuration or customization of COTS systems is necessary, and the degree of customization required depends on the environment and the customer's specific needs:
- How diverse, incompatible, poor in quality, or fast-evolving are the content sources?
- How diverse are the users' requirements? Some users may be very specialized, while others "just want it to work like Google"
- How complex is the security regime?
- How many "exceptions" need to be handled? Experienced builders of enterprise search systems understand that managing exceptions is time-consuming and potentially problematic, regardless of which search engine is being used
Aspire content processing appeals to experienced enterprise search implementers who know that managing processes efficiently, over the lifetime of the system, is the key to ROI. Aspire provides a flexible, agile and pragmatic alternative for building highly customized search applications on the leading search engines.
THE INFORMATION ACCESS LAYER
Some customers see advantages in developing an independent information access layer which serves not only search, but also content management, business insight and other "big data" applications that consume unstructured content. The content processing necessary to support search applications - cleaning, enriching and normalizing - serves these other applications equally well.