Aspire Content Processing - Technical Overview
A brief overview follows. A fuller description can be downloaded here.
FRAMEWORK & COMPONENT ARCHITECTURE
Aspire comprises a content processing framework plus a growing number of plug-in components, and can be visualized as follows.
Aspire is built on OSGi, a standard for managing dynamic modules in Java. It allows for modules (packaged into Java JARs and called 'bundles') to be dynamically loaded and unloaded and linked to other modules in a live system without requiring shutdown or restart.
OSGi is a mature standard, established more than ten years ago. Its aim is to be “universal middleware”. The OSGi Alliance is a nonprofit organization coordinated by more than twenty large companies that are committed to Java and use it extensively. These include IBM, Oracle, SAP AG, Hitachi and Siemens AG (see full list here).
Features of the Aspire framework include:
- Automatic threading of document processing jobs
- Dynamic configuration changes including the addition or upgrade of components
- Rich built-in XML processing methods including XPath and XSLT
- Web-based administration and control interfaces
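To illustrate the kind of XPath processing mentioned in the list above, here is a minimal sketch that uses only the standard Java XML APIs (javax.xml.xpath), not Aspire's own methods. The class name XPathFieldSketch and the element names doc, title and body are hypothetical and chosen only for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Aspire, only the JDK's XPath API.
final class XPathFieldSketch {

    // Evaluates an XPath expression against an XML string and returns the
    // string value of the first match ("" when nothing matches).
    static String extract(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or XML: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Assumed document shape, for illustration only.
        String doc = "<doc><title>Quarterly report</title><body>Revenue grew.</body></doc>";
        System.out.println(extract(doc, "/doc/title"));
    }
}
```

A component built along these lines would pull one field out of a document's XML representation and hand the value to the next step.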
Aspire is multi-threaded and thread-safe. Its performance scales linearly as hardware is added. Pipelines can be tuned and load-balanced as necessary to meet performance goals.
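The text does not describe how Aspire schedules jobs internally, so the following is only a generic sketch of running independent document jobs on a fixed-size thread pool using java.util.concurrent. The class ThreadedJobsSketch, the method processAll and the upper-casing task are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration: a plain thread pool, not Aspire's scheduler.
final class ThreadedJobsSketch {

    // Runs the task on every document using the given number of worker
    // threads and returns the results in submission order.
    static List<String> processAll(List<String> documents,
                                   Function<String, String> task,
                                   int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String document : documents) {
                Callable<String> job = () -> task.apply(document);
                pending.add(pool.submit(job));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // blocks until that job is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for jobs", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A job failed", e.getCause());
        } finally {
            pool.shutdown(); // let worker threads exit once jobs finish
        }
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("alpha", "beta", "gamma");
        System.out.println(processAll(docs, String::toUpperCase, 2));
    }
}
```

Because each job touches only its own document, the thread count can be raised to match available cores without changing the task code, which is the property that makes this style of processing straightforward to tune.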
Aspire is structured as a series of pipelines of document processing components. Each individual component does a potentially small, self-contained task within the pipeline. Components communicate with each other via XML or JSON. Example component tasks include:
- Fetch a document from a file system, with full security credentials
- Analyze a document for a specific data type and extract text into a new XML field
- Categorize a document against a taxonomy
- Associate the document with metadata from an outside source, based on document properties
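The pipeline idea described above can be sketched in plain Java as a list of small stages applied in order. This is not Aspire's component API: the class PipelineSketch, the stage methods extractText and categorize, the field names raw, text and category, and the use of a Map in place of the XML or JSON that real components exchange are all assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical illustration of a processing pipeline; a Map stands in for
// the XML or JSON document passed between components.
final class PipelineSketch {
    private final List<UnaryOperator<Map<String, String>>> stages = new ArrayList<>();

    // Appends a stage and returns this pipeline so calls can be chained.
    PipelineSketch add(UnaryOperator<Map<String, String>> stage) {
        stages.add(stage);
        return this;
    }

    // Passes a copy of the document through every stage in order.
    Map<String, String> process(Map<String, String> document) {
        Map<String, String> current = new LinkedHashMap<>(document);
        for (UnaryOperator<Map<String, String>> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    // Example stage: strips markup from "raw" and stores it in a new "text" field.
    static Map<String, String> extractText(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        String raw = doc.getOrDefault("raw", "");
        out.put("text", raw.replaceAll("<[^>]*>", " ").trim().replaceAll("\\s+", " "));
        return out;
    }

    // Example stage: assigns a category from a toy one-keyword taxonomy.
    static Map<String, String> categorize(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        String text = doc.getOrDefault("text", "").toLowerCase();
        out.put("category", text.contains("invoice") ? "finance" : "general");
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("raw", "<p>Invoice  42</p>");
        Map<String, String> result = new PipelineSketch()
                .add(PipelineSketch::extractText)
                .add(PipelineSketch::categorize)
                .process(doc);
        System.out.println(result.get("text") + " / " + result.get("category"));
    }
}
```

Each stage reads only the fields it needs and adds its own, so stages can be reordered, replaced or tested in isolation.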
Cleaning and Quality Control
Search systems perform optimally when fed with clean, consistent data. Aspire provides capabilities to remove unwanted or misleading information prior to indexing. This includes templated information (headers and footers) and menu structures which can cause false positives in search results. It is also important to have methods to measure quality and detect changes to the incoming data. Aspire components contain a built-in “quarantine” system which detects issues and isolates problematic documents. Once content issues are detected, they can be diagnosed and addressed.
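The quarantine idea can be pictured with a small sketch. How Aspire's built-in quarantine actually detects issues is not described here, so the class QuarantineSketch, its two checks (empty text and the Unicode replacement character as a sign of encoding damage) and its field names are purely hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: routes problem documents aside with a reason
// so they can be diagnosed later instead of being indexed.
final class QuarantineSketch {
    final List<String> accepted = new ArrayList<>();
    final Map<String, String> quarantined = new LinkedHashMap<>(); // id -> reason

    // Checks one document and files it as accepted or quarantined.
    void submit(String id, String text) {
        String problem = check(text);
        if (problem == null) {
            accepted.add(id);
        } else {
            quarantined.put(id, problem);
        }
    }

    // Returns a description of the first problem found, or null if none.
    static String check(String text) {
        if (text == null || text.trim().isEmpty()) {
            return "empty text";
        }
        if (text.indexOf('\uFFFD') >= 0) {
            return "replacement characters found (possible encoding damage)";
        }
        return null;
    }

    public static void main(String[] args) {
        QuarantineSketch q = new QuarantineSketch();
        q.submit("doc-1", "A clean document.");
        q.submit("doc-2", "   ");
        q.submit("doc-3", "Broken \uFFFD text");
        System.out.println("accepted=" + q.accepted);
        System.out.println("quarantined=" + q.quarantined.keySet());
    }
}
```

Recording the reason alongside each isolated document is what makes later diagnosis practical: a spike in one reason points at a change in the incoming data.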
New Aspire components can be written in Java and fully tested in isolation. Component archetypes supporting Groovy scripting are a standard part of Aspire. Search Technologies also maintains a library of ready-made components covering a range of common processing tasks.
- Many projects can be delivered through the configuration of existing components
- Third-party products can be encapsulated as Aspire components using a Java wrapper
- Components can call on external web services to provide processing capabilities
Aspire provides a highly productive environment for creating and maintaining sophisticated systems.
Aspire enables the management of any number of pipelines. Typically, each separate data source or data type will have its own pipeline.
Aspire can also process structured data. Aspire provides off-the-shelf connectors for ingesting database information.