Aspire Content Processing for Large Publishers
Efficient content processing gives large publishers a business advantage
All government and commercial organizations rely on information, to some extent, to support their business objectives.
In publishing, information is the business. It stands to reason that publishers, and especially large publishers, are often in the vanguard when it comes to innovative information processing.
Search Technologies has worked with numerous large publishers during the past 5 years, helping them to implement search systems and address search-related information processing challenges. Our customers include:
- The United States Government Printing Office, whose Federal Digital System is the online system of record for US federal government information.
- CPA Global, provider of one of the most sophisticated patent search systems currently available, comprising 80 million patents from 95 patent offices worldwide.
A growing proportion of Search Technologies’ publishing customers use the Aspire content processing framework. This article summarizes the key benefits and motivations for doing so.
What is Content Processing?
Content processing is an important part of any sophisticated search system, but it is particularly important when the system is directly involved with creating or sustaining revenue streams.
Modern search engines rely heavily on metadata to drive value-added search functionality, and content processing plays a pivotal role in capturing, creating, cleaning and normalizing the metadata that supports those search functions.
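As an illustration of the cleaning and normalizing step, the following Python sketch tidies a single metadata record. It is not Aspire code: the field names (title, published, authors), the accepted date formats, and the function name normalize_metadata are all assumptions made for this example.

```python
import re
from datetime import datetime

# Date formats this sketch accepts (an assumption; real feeds vary widely).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_metadata(raw: dict) -> dict:
    """Return a cleaned copy of a raw metadata record (hypothetical fields)."""
    cleaned = {}

    # Clean: collapse runs of whitespace in the title and trim the ends.
    cleaned["title"] = re.sub(r"\s+", " ", str(raw.get("title", ""))).strip()

    # Normalize: convert the publication date to ISO 8601, or None if unparseable.
    cleaned["published"] = None
    date_text = str(raw.get("published", "")).strip()
    for fmt in DATE_FORMATS:
        try:
            cleaned["published"] = datetime.strptime(date_text, fmt).date().isoformat()
            break
        except ValueError:
            continue

    # Normalize: split a semicolon-separated author string, drop empty
    # entries and duplicates, and keep the original order.
    authors = [a.strip() for a in str(raw.get("authors", "")).split(";")]
    cleaned["authors"] = list(dict.fromkeys(a for a in authors if a))

    return cleaned
```

For example, a record with the title "  Patent   Search Systems ", the date "05/03/2010" and the authors "Smith, J.; Doe, A. ;Smith, J." comes back with the title "Patent Search Systems", the date "2010-03-05" and the author list ["Smith, J.", "Doe, A."]. A search engine can then build date filters and author facets on consistent values.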
Content processing is also used to filter or restructure content prior to indexing. Many large publishers recognized the importance of content processing some years ago.
Where the data is structured in nature, commercial ETL tools have been adopted. However, where the data is unstructured “content”, large publishers have not been able to find appropriate commercial products to address this need, and some have consequently developed systems in-house. For these publishers, Aspire provides a commercially supported alternative with a lower overall cost of ownership.
The majority of large publishers have addressed their content processing needs by choosing a search engine with a sophisticated, built-in indexing pipeline (FAST ESP, for example). However, there are important reasons to consider a stand-alone content processing platform instead, including improved agility and flexibility, and a lower cost of ownership.
A Brief Description of Aspire
Aspire provides a framework-plus-components architecture in which:
- The framework is robust, thread-safe, can be elastically scaled with additional hardware, and supports any number of “pipeline” configurations
- Pipelines are built from one or more components which are individually testable and deployable
- A growing library of ready-made components supports rapid deployment by addressing common content processing tasks
- A range of component “archetypes” is available, on which new components can be quickly developed
Aspire works with any search engine or database.
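The framework-plus-components idea can be sketched in a few lines of Python. This is a generic illustration of the pattern, not Aspire's actual API (Aspire itself is built on Java and OSGi); the names Pipeline, strip_whitespace and add_word_count are hypothetical.

```python
from typing import Callable, Dict, List

# A document is modeled as a plain dictionary of fields (an assumption).
Document = Dict[str, object]
# A component is any function that takes a document and returns a new one.
Component = Callable[[Document], Document]


def strip_whitespace(doc: Document) -> Document:
    """Component: collapse runs of whitespace in the body text."""
    body = " ".join(str(doc.get("body", "")).split())
    return {**doc, "body": body}


def add_word_count(doc: Document) -> Document:
    """Component: add a word_count metadata field derived from the body."""
    return {**doc, "word_count": len(str(doc.get("body", "")).split())}


class Pipeline:
    """Runs a document through an ordered list of components."""

    def __init__(self, components: List[Component]) -> None:
        self.components = list(components)

    def process(self, doc: Document) -> Document:
        for component in self.components:
            doc = component(doc)
        return doc


# One of any number of possible pipeline configurations.
pipeline = Pipeline([strip_whitespace, add_word_count])
result = pipeline.process({"id": "doc-1", "body": "  hello   world "})
```

Because each component is an ordinary function with no dependency on the others, it can be tested on its own, and a new pipeline configuration is simply a different list of components.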
Agility and Flexibility
Publishers are constantly looking for new opportunities to exploit their content assets. For example:
- Addressing new markets, based on subsets of existing content
- Providing new services or packages to existing audiences
- Using unique, high-value content within Web marketing initiatives to drive additional, targeted traffic for advertisers, or to increase the subscription base (in a controlled and snippeted form where necessary, to protect subscription revenues)
A high degree of agility and flexibility enables publishers to try new ideas much more readily and at very low cost.
Lower Cost of Ownership
Aspire is a purpose-made system, built to support sophisticated and potentially complex content processing environments.
Aspire was originally developed by Search Technologies to support long-term services engagements with our consulting customers. Low cost of ownership has therefore been a key goal for Aspire from the beginning.
Low cost of ownership results from a combination of system transparency and the ability to divide complex processing tasks into individually deployable and testable components.
Aspire is based on open source technologies, including Maven for code and component management, and OSGi, the pluggable framework for Java. Aspire is free of license fees for Search Technologies consulting customers and is maintained under a commercial-grade service level agreement.
Summary
- The agility and flexibility provided by Aspire enable publishers to more easily develop new revenue opportunities and improve the exploitation of their content for Web marketing purposes
- The architecture and pricing model of Aspire provide a technical environment in which content processing systems can be continuously developed and maintained for a low total cost of ownership