Implementing SolrCloud at the U.S. National Archives
Presented at the 2014 Solr/Lucene Revolution by Paul Nelson, Chief Architect at Search Technologies
Few organizations face the problem confronting the U.S. National Archives and Records Administration (NARA): how to scale over the next several years to handle more than 1.2 billion documents and 7 petabytes of storage. NARA's holdings were once referred to as the "Mount Everest" of data, and most of its tens of billions of records had not been searchable or available to the public without a physical visit to a NARA facility. Now, all of that is changing.
On November 14th in Washington, DC, Search Technologies’ Chief Architect Paul Nelson addressed an audience at the 2014 Solr/Lucene Revolution, discussing the challenging aspects of implementing SolrCloud in the cloud for NARA’s billion+ document archive.
The presentation focused on the architecture and development of the National Archives’ “Online Public Access” (OPA) initiative, a public search interface for browsing both catalog information and online content, which is being completely rebuilt using SolrCloud and hosted in the cloud.
He addressed the challenges, goals, and benefits of the new architecture. Standout aspects of the architecture include:
- Content processing for a wide range of content types
- Handling and searching over social media content (tags, comments, transcriptions, translations)
- Scalability to billions of records
- Metadata-sensitive search features unique to large publishers and archives