A Search-centric Architecture for Government Archives

This page introduces a white paper, which can be downloaded here.

The white paper presents a comprehensive architectural plan for implementing a standards-based government documents archive. The plan uses open source components designed for secure, long-term preservation and for ease of electronic search, access, and dissemination.

This architecture plan differentiates itself in a number of important ways:

  • It is proven. This plan has been implemented and is in production today in major government organizations, most notably the United States Government Printing Office
  • It is standards based. Archive standards such as the Open Archival Information System (OAIS) reference model and metadata standards such as those from the Library of Congress (also a Search Technologies customer) are integral to the design
  • Search quality is a key component of the architecture. A substantial portion of the architecture is dedicated to secure, high-quality search, distribution, and archive access
  • It can be implemented entirely with open-source components, so no software license purchases are necessary
  • It is designed for long-term preservation of your documents
  • It includes automated methods for document cataloging and metadata extraction
  • It includes an implementation methodology for smooth and reliable delivery and deployment of the system, aimed at achieving the highest possible quality for every collection of documents
  • It anticipates and incorporates the needs of multiple communities of interest, including researchers, lawyers, government employees and interested parties from the community at large

Government organizations that adopt this plan will realize key benefits to support their mission goals, including:

  • Lower risk: Documents are safe in the archive. Modifications to documents are carefully tracked and controlled
  • Reduced printing costs: Electronic dissemination from the archive removes the need to distribute many types of collections physically, lowering or eliminating printing costs
  • Increased content utilization and wider dissemination of information: The plan emphasizes high-quality search and access methods so that all interested communities can easily locate and access documents
  • Reduced document preparation costs: Finding aids (indexes, tables of contents, etc.) can be generated automatically from metadata, or may no longer be required in an electronic format
  • Reduced cataloging costs: Metadata can often be extracted directly and used to catalog documents, reducing or eliminating manual cataloging efforts
  • Open source increases longevity and reduces royalty costs: The use of non-proprietary, open-source components allows the foundation software to be archived along with the archive data. This enables the archive itself to be archived and recreated as needed on future hardware
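
To illustrate how a finding aid might be generated automatically from metadata, as the document preparation benefit above describes, here is a minimal Python sketch. The record fields (`title`, `collection`) and the sample records are hypothetical placeholders, not part of the white paper's design; a real archive would read them from its catalog.

```python
from collections import defaultdict

# Hypothetical catalog records; field names and values are illustrative only.
records = [
    {"title": "Sample Hearing Transcript B", "collection": "Hearings"},
    {"title": "Sample Annual Report", "collection": "Reports"},
    {"title": "Sample Hearing Transcript A", "collection": "Hearings"},
]


def build_finding_aid(records, group_by="collection"):
    """Group document titles under each value of `group_by`, sorted alphabetically."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_by]].append(record["title"])
    return {key: sorted(titles) for key, titles in sorted(groups.items())}


def render_finding_aid(aid):
    """Render the grouped titles as a plain-text index."""
    lines = []
    for heading, titles in aid.items():
        lines.append(heading)
        lines.extend(f"  - {title}" for title in titles)
    return "\n".join(lines)


finding_aid = build_finding_aid(records)
print(render_finding_aid(finding_aid))
```

Because the index is derived from metadata alone, it can be regenerated whenever documents are added, rather than being prepared by hand.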

A key aspect of this plan is its heavy reliance on accepted archive, cataloging, and metadata standards. Standards compliance means the archive communicates via established, documented protocols, which helps maximize the longevity of the archive.

The primary standards incorporated into this architecture include:

  • OAIS – The Open Archival Information System reference model
  • MODS/MARC – Library standards for descriptive metadata
  • PREMIS – The preservation metadata standard
  • XML – for metadata and markup
  • XML Schema (XSD) – for metadata validation
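
As a rough sketch of what descriptive metadata in one of these standards can look like, the following Python example builds a small MODS record as XML. The element names (`titleInfo`, `title`, `name`, `namePart`, `originInfo`, `dateIssued`) and the namespace come from the MODS schema; the helper `build_mods_record` and the sample values are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def _q(tag):
    """Return the namespace-qualified form of a MODS tag name."""
    return f"{{{MODS_NS}}}{tag}"


def build_mods_record(title, author, date_issued):
    """Build a minimal MODS record (hypothetical helper, not from the white paper)."""
    mods = ET.Element(_q("mods"))
    title_info = ET.SubElement(mods, _q("titleInfo"))
    ET.SubElement(title_info, _q("title")).text = title
    name = ET.SubElement(mods, _q("name"))
    ET.SubElement(name, _q("namePart")).text = author
    origin = ET.SubElement(mods, _q("originInfo"))
    ET.SubElement(origin, _q("dateIssued")).text = date_issued
    return mods


# Sample values are placeholders, not real catalog data.
record = build_mods_record("Sample Committee Report", "Sample Committee", "2010-01-15")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Python's standard library checks only that the XML is well formed; it does not validate against an XSD. An archive following this plan would presumably validate each record against the MODS schema with a schema-aware parser before ingest.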

Other, optional standards include:

  • PDF/A – The PDF standard for long-term preservation
  • XHTML – For low-resolution, textual content representation
  • METS – Metadata Encoding and Transmission Standard for document contents packaging
  • Z39.50 – For inter-library access to the archive catalog
  • OAI-PMH – The Open Archives Initiative Protocol for Metadata Harvesting
  • RSS/Atom – For dissemination of archive updates
  • XSL-FO – For formatting XML content into print-ready presentation renditions

These standards are listed as “optional” because their use depends on the requirements of the organization that hosts the archive and on the needs of its designated community.
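
As one illustration of the optional dissemination standards listed above, this Python sketch shows how a harvester might construct an OAI-PMH `ListRecords` request. The `verb`, `metadataPrefix`, `from`, and `set` parameters are defined by the OAI-PMH protocol; the base URL `https://archive.example.gov/oai` and the helper `list_records_url` are hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Selective harvesting: only records changed on or after this date.
        params["from"] = from_date
    if set_spec:
        # Restrict harvesting to one set (for example, one collection).
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# The endpoint below is a placeholder, not a real archive address.
url = list_records_url("https://archive.example.gov/oai", from_date="2010-01-01")
print(url)
```

The sketch only builds the request. A real harvester would also issue the HTTP call, parse the XML response, and follow any `resumptionToken` the repository returns for long result lists.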