Packaging Government Documents
Packages are the atomic unit of data representation within the archive. Each package can represent a single printed document, a committee report, a title of statutory law, a day’s worth of published regulations, or any other logically appropriate document structure.
Packages are physical in nature and represent a grouping of the files which make up a document. Note that the logical representation of the package presented to the public search user may be different. The mapping from the physical on-disk representation to the logical representation that users see is performed by the publish program.
Only fully complete packages are allowed to be stored into the archive for long-term preservation. These packages are called Archival Information Packages (AIPs). To be complete, an AIP must contain:
- All Original Files – All files as originally submitted by the producer
- All Renditions – If multiple renditions of the content exist, these should all be contained
- Complete Descriptive Metadata – This includes all appropriate metadata required to place this package in context within the wider world, and to allow for appropriate access methods by the interested community
- Reference Information – This includes all necessary identifiers (e.g. ISBN numbers, Government docket IDs, report numbers, internal archive identifiers, call letters, etc.) necessary to uniquely identify the package
- Fixity Information – Fixity information includes the necessary digital signatures to ensure that the package contents have not been modified
- Provenance Traceability – Includes a record of all modifications made to the package as far back as is possible, including prior to submission to the archive, if such data is available
- Packaging Information – Includes an inventory of all the files contained within the package, along with their file formats and encoding structures
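The completeness rule above can be sketched as a simple check. This is a minimal illustration only: the component names and the dictionary layout are hypothetical and do not come from the archive's actual schema. Renditions are allowed to be empty, since the requirement applies only when renditions exist.

```python
# Hypothetical component names, one per requirement in the list above.
REQUIRED_COMPONENTS = (
    "original_files",
    "renditions",
    "descriptive_metadata",
    "reference_information",
    "fixity_information",
    "provenance",
    "packaging_information",
)

# Components that must be declared but may legitimately be empty.
MAY_BE_EMPTY = {"renditions"}


def missing_components(package: dict) -> list:
    """Return the names of required components that are absent or empty."""
    missing = []
    for name in REQUIRED_COMPONENTS:
        if name not in package:
            missing.append(name)
        elif not package[name] and name not in MAY_BE_EMPTY:
            missing.append(name)
    return missing


def is_complete_aip(package: dict) -> bool:
    """A package qualifies as an AIP only when nothing is missing."""
    return not missing_components(package)
```

A store operation could call `is_complete_aip` before accepting a package and report the result of `missing_components` back to the producer when the check fails.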
In practice, a package is structured as directories and sub-directories of files, including:
- Primary Metadata File – The archive XML file which contains all of the metadata about the package
- Provenance Log – The log of modification events which trace the history of the package
- Secondary Metadata Files – Standards compliant XML files, usually built from the primary file and provenance log
- Submission Directory – A directory of the original files submitted by the producer
- Rendition Directories – Directories of additional renditions created for the package, for example, print PDF renditions of originally submitted Microsoft Word files
- Access Directories – Directories of files specifically for improving search and access to the package. This can include split or joined pieces of the package, as appropriate. For example, if the package represents a magazine, the access directory might include a separate file for every article within the magazine, so that articles can be individually indexed and searched.
What is important about the archive package is that it is complete and fully self-contained. For example, one should be able to ZIP up a package directory (with all of its metadata files, rendition directories, submission files, etc.) and unpack it on another system with no loss of information.
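The ZIP round trip described above can be sketched as follows, using only the Python standard library. The function names are illustrative assumptions, not part of the archive software. The check compares file paths and SHA-256 content digests only; it does not cover filesystem attributes such as permissions or timestamps, which ZIP may not preserve exactly.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def inventory(package_dir: Path) -> dict:
    """Map each file's package-relative path to its SHA-256 digest."""
    return {
        path.relative_to(package_dir).as_posix(): hashlib.sha256(
            path.read_bytes()
        ).hexdigest()
        for path in sorted(package_dir.rglob("*"))
        if path.is_file()
    }


def round_trip(package_dir: Path, work_dir: Path) -> Path:
    """ZIP the package, unpack it elsewhere, and return the unpacked directory."""
    archive = shutil.make_archive(
        str(work_dir / "package"), "zip", root_dir=str(package_dir)
    )
    unpacked = work_dir / "unpacked"
    shutil.unpack_archive(archive, str(unpacked))
    return unpacked


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        pkg = root / "pkg"
        (pkg / "submission").mkdir(parents=True)
        (pkg / "metadata.xml").write_text("<package/>")
        (pkg / "submission" / "report.doc").write_bytes(b"original bytes")
        work = root / "work"
        work.mkdir()
        unpacked = round_trip(pkg, work)
        print("identical:", inventory(unpacked) == inventory(pkg))
```

Comparing the two inventories is one way to confirm that nothing was lost or altered in transit.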
Offsite Storage, Backups, Mirrors
Precisely because the package structure is fully self-contained, packages can be easily exported and distributed to other systems. Note that this is different from typical CMS architectures, where important descriptive metadata is stored in relational databases, requiring a complex synchronized transfer of data from two different sources. Instead, the architecture design allows for package contents to be simply gathered into a container (TAR file, ZIP file, or similar) and transferred to the remote system.
In each of these cases (offsite storage, backups, and mirrors), a stream of packages and package updates can be distributed as necessary to remote systems. Since the transaction level of the archive is scoped to the package, all remote systems are guaranteed to be transaction-consistent – either they have ingested the complete package contents or they have not. There is never a case where metadata or cataloging does not match the archive contents (a problem which occurs frequently with the replication techniques deployed in most standard CMS architectures).
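One way a remote system could achieve this all-or-nothing behavior is to unpack each incoming package into a staging directory, verify it, and only then publish it with a single rename. The sketch below is an assumption about how such an ingest step might look, not a description of the archive's actual mechanism. The `ingest` function, its parameters, and the `verify` callback are hypothetical, and the sketch handles new packages only, not updates to packages already ingested.

```python
import os
import shutil
import tempfile
from pathlib import Path


def ingest(archive_file: Path, archive_root: Path, package_id: str, verify) -> Path:
    """Unpack a package into staging, verify it, then publish it with one rename."""
    final_dir = archive_root / package_id
    if final_dir.exists():
        raise FileExistsError(f"package {package_id} is already ingested")
    # Staging lives under archive_root so the rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=".staging-", dir=str(archive_root)))
    try:
        shutil.unpack_archive(str(archive_file), str(staging))
        if not verify(staging):
            raise ValueError(f"package {package_id} failed verification")
        # The package becomes visible under its final name in a single step.
        os.rename(staging, final_dir)
    except BaseException:
        # On any failure, discard the partial copy so it is never visible.
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return final_dir
```

Readers of the remote archive would look only at non-staging directories, so they see either the whole package or nothing. The `verify` callback is the natural place for a fixity check against the package inventory.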
Search and Access
Search is the de facto standard for locating information throughout the world. Therefore, any archival system intended for end-user consumption by a designated community must have best-of-breed search and access capabilities in order to fulfill its mission. Simply put, documents which cannot be located and retrieved are no different, from the end-user perspective, than documents which have been destroyed.
The architecture plan contains a number of features designed specifically for enhanced public access:
- File processing methods to prepare documents for search
  - Splitting – Dividing large documents into appropriately searchable pieces
  - Joining – Combining documents together where this benefits the search experience
  - Metadata extraction – Using parsers, authority files, cataloging records
- Sophisticated search query structures for handling the needs of both lay-users and academic or expert researchers
- Publishing to the access sub-system is fully designed into the archive from the start
- An emphasis is placed on metadata extraction, to ensure that search and access have all metadata necessary to accurately locate and distribute documents
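As an illustration of the splitting step listed above, the sketch below divides a plain-text document into one searchable piece per heading. The heading convention (a line starting with "# ") is purely an assumption for the example. Real documents would need format-specific parsers, and this sketch ignores any text that appears before the first heading.

```python
import re

# Hypothetical convention: a piece starts at any line of the form "# Title".
HEADING = re.compile(r"^# (?P<title>.+)$", re.MULTILINE)


def split_document(text: str) -> list:
    """Split text into one {"title", "body"} piece per heading."""
    matches = list(HEADING.finditer(text))
    pieces = []
    for i, match in enumerate(matches):
        # A piece runs from the end of its heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pieces.append(
            {
                "title": match.group("title").strip(),
                "body": text[match.end():end].strip(),
            }
        )
    return pieces
```

Each piece could then be written as a separate file in an access directory and indexed individually, with the title carried along as extracted metadata.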
Example Search Features
The following is a typical list of search features which can be implemented. The exact combination of features to be used will be determined during the initial system assessment in consultation with consumer stakeholders.
This is not a complete list of search features. Many more features are available and can be included if needed.
- Advanced Query Language – Many archive systems are used heavily by academic researchers and members of the library community, who depend on advanced query languages to carefully craft their search requests
- Browse – Browse is a method for scanning through document lists from the archive without performing searches. Browse structures are typically ordered based on the natural ordering for the collection of documents being browsed. Examples include date-order browsing, first-letter title browsing, alphabetical by sub-organization, print-order browsing, etc.
- Type-ahead features – These features suggest common queries and make it easier for users to enter metadata values into advanced search forms without having to refer to help pages
- Guided Navigation – “Navigators” or “facets” allow search results to be categorized by metadata. They are typically returned in response to searches and provide a simple method for lay users to narrow their results without requiring prior knowledge of the database structure
- Sorting – Sorted search results are often required by researchers to perform exhaustive scans of archive data
- RSS / Atom Feeds – Allow users to use standard RSS or Atom readers to automatically check for updates to the archive
- Content Detail Pages – Present the document (i.e. the archive package) and its content for public consumption. Often constructed as a transform of the package metadata
- Historical documents search – If documents can be tagged as “is historical”, search can be structured such that default searches are performed only over the latest data. Historical data (for example, older versions of statutory law or agency regulations) would still be available but only from advanced search pages and for research purposes
- Published Search APIs – Search and access APIs can be published for third parties to send searches to the archive. These APIs can be standards-based, for example Z39.50 or OAI-PMH, or can be a simple RESTful interface
- Access IP throttling – If publishing APIs for external search, archives may wish to implement IP throttling so that abusive users cannot dominate system resources
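As one example from the list above, IP throttling can be sketched with a sliding-window counter per client address. This is a minimal single-process illustration under assumed names (`IpThrottle`, `allow`). A production deployment would more likely enforce limits at a gateway or in a shared store, and would also expire state for idle addresses.

```python
import time
from collections import defaultdict, deque


class IpThrottle:
    """Allow at most max_requests per IP address within a sliding time window."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self._hits = defaultdict(deque)  # IP address -> timestamps of recent requests

    def allow(self, ip: str) -> bool:
        """Return True and record the request if the IP is under its limit."""
        now = self.clock()
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A search API handler would call `allow` with the caller's address before running a query and return an error response (for example, HTTP 429) when it returns False.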