Enhancing the Library of Congress Cataloger's Desktop with Log Analytics and Advanced Search Features
The Library of Congress (“LC”) is the oldest federal cultural institution and the largest library in the world, with millions of books, recordings, photographs, maps and manuscripts in its collections. LC also provides leadership to libraries throughout the world. As part of its mission to serve the library community, LC developed the Cataloger’s Desktop (“Desktop”) application, a searchable information delivery system consisting of 300+ pre-selected information sources. It is an authoritative service used by over 10,000 librarians at approximately 1,000 subscribing institutions worldwide.
Since we started to support the Cataloger’s Desktop in 2009, we’ve been helping LC implement cutting-edge enhancements to provide a better search experience to Desktop users. After moving Desktop from a legacy search platform to the open source Solr search engine in 2014, we continued making improvements to the application and recently released a number of updates, including log analytics, query suggestions, and metadata enhancement.
System Auditing and Log Analytics with the Elastic Stack
Soon after migrating Desktop from FAST to Solr, we implemented log analytics and system audit capabilities using Elasticsearch.
Data model design for log analytics
While log analytics may seem cut-and-dried, proper planning is needed to ensure the solution meets requirements. At Search Technologies, we use a process called Search Data Model Design to plan all of our projects, including our log analytics projects, because search engines drive our analytics solutions.
At the beginning of this project, we spent time learning about and understanding LC’s reporting and audit needs so we could design the correct solution. We discovered requirements in four areas:
- End-user search behaviors
- Resource usage
- Concurrent and peak login activity
- Account access behavior that may indicate intrusion attempts, such as recurring failed login attempts and locked accounts
Once we understood the requirements, we created the search data model, which defines every field of the logs to be captured in the solution. This information is used to build the Elasticsearch template for our analytics indexes.
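As a rough illustration of how a data model can be turned into an index template, the sketch below builds a template body as a Python dictionary. The field names, the `desktop-logs-*` index pattern, and the `build_index_template` helper are hypothetical and are not the actual Desktop schema. The layout follows the composable index template format of recent Elasticsearch versions; older versions use a different structure.

```python
import json

# Hypothetical data model: field name -> Elasticsearch field type.
LOG_DATA_MODEL = {
    "@timestamp": "date",
    "session_id": "keyword",
    "user_id": "keyword",
    "institution": "keyword",
    "event_type": "keyword",    # e.g. search, view, login, login_failed
    "query_text": "text",
    "result_count": "integer",
    "resource_id": "keyword",
}


def build_index_template(data_model, index_pattern="desktop-logs-*"):
    """Turn a field/type mapping into an index template body."""
    properties = {name: {"type": ftype} for name, ftype in data_model.items()}
    return {
        "index_patterns": [index_pattern],
        "template": {"mappings": {"properties": properties}},
    }


template = build_index_template(LOG_DATA_MODEL)
print(json.dumps(template, indent=2))
```

Keeping the data model in one place like this means the template can be regenerated whenever a reporting requirement adds or changes a field.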
Next, we had to validate our system and application logs to make sure we were capturing the right information. Many logs required no changes. The application logging facility needed some changes to add the right fields for tracking various end-user behaviors within and across sessions.
With proper planning and log formats behind us, we could ingest content and generate reports. The architecture uses several Elastic Stack components for log ingestion:
- Filebeat – Acquires messages from logs in real time and routes them to Logstash
- Logstash – Accepts the raw messages from Filebeat and converts portions of each log into search engine fields published to Elasticsearch
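To make the "raw message to fields" step concrete, here is a small Python sketch of the kind of parsing a Logstash filter performs. In Logstash itself this is normally configured with filters such as grok rather than written in Python. The log line format, the field names, and the `parse_log_line` function are assumptions made for illustration only.

```python
import re

# Hypothetical application log format (an assumption for illustration):
# 2016-05-01T12:00:00Z session=abc123 user=jdoe event=search query="map cataloging" hits=42
LINE_PATTERN = re.compile(
    r'(?P<timestamp>\S+) session=(?P<session_id>\S+) user=(?P<user_id>\S+) '
    r'event=(?P<event_type>\S+)(?: query="(?P<query_text>[^"]*)")?'
    r'(?: hits=(?P<result_count>\d+))?'
)


def parse_log_line(line):
    """Convert one raw log line into a dict of fields, or None if it does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    # Drop optional fields that were absent from this line.
    fields = {k: v for k, v in match.groupdict().items() if v is not None}
    if "result_count" in fields:
        fields["result_count"] = int(fields["result_count"])
    return fields


sample = '2016-05-01T12:00:00Z session=abc123 user=jdoe event=search query="map cataloging" hits=42'
print(parse_log_line(sample))
```

Lines that do not match return `None`, so a caller can route them to a separate location for review instead of indexing partial documents.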
For this project, we needed to generate regular reports to be sent by email to our customer and to our Managed Services team. A Python query and reporting framework serves this need well. The Elasticsearch Curator tool helps keep the log analytics Elasticsearch indexes optimized and archived.
Custom reporting components deliver monthly and on-demand reports of system usage to LC, including:
- Resource usage (number of documents viewed per resource) – the Cataloger’s Desktop indexes over 200,000 documents from 318 sources and websites.
- Most common queries for understanding user needs.
- Zero-match queries – these reports are helpful for identifying search problems and trends in user interest.
- Synonym expansions – these are helpful for confirming that the synonym features are working properly.
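A report such as the zero-match query list can be expressed as an Elasticsearch aggregation. The sketch below only builds the request body and summarizes a response; it does not contact a cluster. The field names (`event_type`, `result_count`, `query_text.keyword`, `@timestamp`) and both function names are hypothetical, and the existence of a `keyword` sub-field on the query text is assumed.

```python
def zero_match_report_body(start, end, size=25):
    """Build a search body that counts the most frequent queries with no results."""
    return {
        "size": 0,  # only the aggregation is needed, not individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"event_type": "search"}},
                    {"term": {"result_count": 0}},
                    {"range": {"@timestamp": {"gte": start, "lt": end}}},
                ]
            }
        },
        "aggs": {
            "top_queries": {
                "terms": {"field": "query_text.keyword", "size": size}
            }
        },
    }


def summarize_buckets(response):
    """Extract (query, count) pairs from a terms aggregation response."""
    buckets = response["aggregations"]["top_queries"]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]


body = zero_match_report_body("2016-05-01", "2016-06-01")
print(body["aggs"])
```

Removing the `result_count` filter from the same body would give a "most common queries" report, so one builder with a flag could plausibly serve both.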
Further system auditing and reporting deliver notifications of aggregate usage of the system and anomalous events to LC and Search Technologies' Managed Services team:
- Peak concurrent usage per day and by licensing institution
- Password lockouts and frequent password changes, as possible indications of intrusion attempts
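Both audit measures reduce to simple computations once the events are available. The following sketch shows one way to compute peak concurrent sessions with a sweep over login and logout events, and to flag accounts with repeated failed logins. The input shapes, the threshold of five failures, and the function names are assumptions for illustration, not the rules used in the deployed system.

```python
from collections import Counter


def peak_concurrent_sessions(sessions):
    """sessions: iterable of (login_time, logout_time) pairs with comparable values."""
    events = []
    for start, end in sessions:
        events.append((start, 1))    # a session opens
        events.append((end, -1))     # a session closes
    # Sort by time; at equal times, closes (-1) are processed before opens (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def suspicious_accounts(failed_login_users, threshold=5):
    """Return accounts whose failed-login count meets or exceeds the threshold."""
    counts = Counter(failed_login_users)
    return sorted(user for user, n in counts.items() if n >= threshold)


print(peak_concurrent_sessions([(1, 10), (2, 3), (2, 9), (4, 5)]))
print(suspicious_accounts(["alice"] * 5 + ["bob"] * 2))
```

Grouping the sessions by day or by licensing institution before calling `peak_concurrent_sessions` would yield the per-day and per-institution figures.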
Proper planning and the Elastic Stack made this a straightforward implementation that is easy to manage and supports the needs of our customer and the Managed Services team.
Query and Title Suggestions
During 2016, Search Technologies implemented a Query and Title Suggestions Service for Desktop using Elasticsearch. This solution provides Desktop users with type-ahead suggestions of query terms and resources matching keystrokes entered during search.
Planning and design
Planning included requirements gathering to understand customer needs and generating mockups to depict the eventual implementation. Once again we pursued a Data Model Design process to ensure that the Elasticsearch index templates and query rules were optimized to support the requirements.
Broadly speaking, the requirements for this project were to:
- Suggest queries that are guaranteed to succeed
- Execute the user’s query when the user clicks the selection
- Suggest titles that match users’ search terms
- Redirect the user to the title landing page on-click
- Be flexible and forgiving when users misspell terms, using fuzzy match logic
- Suggestions should be returned in relevancy order
- Matches of whole tokens should come first
- Boost popular queries slightly
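The ranking requirements can be illustrated with a small scoring function. This is a plain-Python sketch of the ordering rules (whole-token matches first, then prefix matches, then fuzzy matches, with a slight popularity boost); it is not the Elasticsearch query logic used in the service. The score values, the 0.75 similarity cutoff, and the function names are all assumptions.

```python
import difflib
import math


def score_suggestion(typed, suggestion, popularity=0):
    """Score one candidate suggestion against what the user has typed so far."""
    typed = typed.lower().strip()
    tokens = suggestion.lower().split()
    if typed in tokens:
        base = 3.0                      # whole-token match ranks first
    elif any(tok.startswith(typed) for tok in tokens):
        base = 2.0                      # ordinary type-ahead prefix match
    else:
        # Fuzzy fallback: compare the typed text with same-length token prefixes.
        best = max(
            (difflib.SequenceMatcher(None, typed, tok[: len(typed)]).ratio()
             for tok in tokens),
            default=0.0,
        )
        base = best if best >= 0.75 else 0.0
    if base == 0.0:
        return 0.0
    # Logarithmic boost keeps popular queries from overwhelming match quality.
    return base + 0.1 * math.log1p(popularity)


def rank_suggestions(typed, candidates, limit=5):
    """candidates: dict of suggestion text -> popularity count."""
    scored = [(score_suggestion(typed, text, pop), text)
              for text, pop in candidates.items()]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [text for _, text in scored[:limit]]


candidates = {"map cataloging": 10, "maps and atlases": 50, "music scores": 5}
print(rank_suggestions("map", candidates))
print(rank_suggestions("catlog", candidates))
```

Because the boost is logarithmic and small, a very popular prefix match still ranks below a whole-token match, which is one way to read "boost popular queries slightly".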
Suggestions content preparation
Content preparation includes several automated steps to ensure the suggestions remain fresh and accurate:
- Python-based workflows acquire suggestions from the log analytics architecture (see above) and content metadata
- Duplicates are removed, terms are cleansed, and stopword lists are enforced.
- Configurable update cycle (daily, weekly, hourly, etc.)
- Customizable pruning mechanism to exclude old queries from suggestions
- Desktop excludes queries older than 90 days from the suggestions index
- Old indexes are archived and then deleted using Elasticsearch Curator
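The preparation steps above can be sketched as a single pass over logged queries. The input tuple shape, the stopword list, and the function names are hypothetical. Two interpretations are assumed here: "enforcing" the stopword list means discarding queries made up entirely of stopwords, and only queries that returned results are kept so that suggestions can succeed.

```python
import re
from datetime import datetime, timedelta

STOPWORDS = {"a", "an", "and", "of", "the"}  # hypothetical stopword list


def normalize(query):
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", query.lower()).split())


def prepare_suggestions(logged_queries, now, max_age_days=90):
    """logged_queries: iterable of (query_text, timestamp, hit_count) tuples."""
    cutoff = now - timedelta(days=max_age_days)
    counts = {}
    for text, timestamp, hits in logged_queries:
        if timestamp < cutoff or hits == 0:
            continue  # prune old queries and queries that found nothing
        cleaned = normalize(text)
        words = cleaned.split()
        if not words or all(w in STOPWORDS for w in words):
            continue  # nothing useful left after cleansing
        # De-duplicate while keeping a frequency count for popularity boosting.
        counts[cleaned] = counts.get(cleaned, 0) + 1
    return counts


logs = [
    ("Map Cataloging!", datetime(2016, 7, 1), 12),
    ("map cataloging", datetime(2016, 7, 15), 9),
    ("the", datetime(2016, 7, 20), 100),
    ("old query", datetime(2016, 1, 1), 5),
    ("no results", datetime(2016, 7, 30), 0),
]
print(prepare_suggestions(logs, now=datetime(2016, 8, 1)))
```

Passing a different `max_age_days` corresponds to the customizable pruning window; 90 is the value the post cites for Desktop.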
[Figure: Query and Title Suggestions Workflow]
Desktop’s Suggestions Service launched in the summer of 2016 and has been helping end-users get to needed content quickly.
[Figure: Query and Title Suggestions Type-Ahead Presentation]
Metadata Enhancement
During the fall of 2016, we used Aspire Text Analytics Components to enrich metadata in Desktop. Aspire is a search-engine-independent content processing framework developed by Search Technologies to handle unstructured data.
The goal of the project was to help users find cataloging documentation related to various types of materials such as books, maps, electronic resources, music, or artworks.
The project included three steps:
- Acquiring terminologies and vocabularies describing types of materials from the field of librarianship
  - Ultimately, 37 sources were acquired and cleansed during expert review.
- Aspire entity extraction methods identify documents that contain the terms or phrases
  - Adding Material Type metadata to the search engine document (e.g. books, maps, music, etc.)
  - Adding the matched terms to a search engine ‘entities’ metadata field (e.g. gilt binding, geodetic datum, concerto)
  - The process executes during ordinary content crawls and requires no extra processing.
- User interface and relevancy improvements
  - Generating a user interface facet based on Material Type metadata
  - Providing a relevancy boost for matches on the entities' tags
  - Publishing the matching entities to query suggestions
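The extraction step can be pictured as dictionary-based phrase matching. The sketch below is a plain-Python stand-in and does not use the Aspire API. The three vocabulary entries reuse the example terms mentioned above, but the pairing of each term with a material type, the output field names, and the `tag_document` function are assumptions for illustration.

```python
import re

# Illustrative vocabulary: term or phrase -> material type (pairings are assumed).
VOCABULARY = {
    "gilt binding": "books",
    "geodetic datum": "maps",
    "concerto": "music",
}


def tag_document(text, vocabulary=VOCABULARY):
    """Return material types and matched entities found in a document's text."""
    lowered = text.lower()
    entities = []
    for term in sorted(vocabulary):
        # Word boundaries prevent matches inside longer words.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            entities.append(term)
    material_types = sorted({vocabulary[t] for t in entities})
    return {"material_type": material_types, "entities": entities}


print(tag_document("Record the gilt binding, then describe the concerto."))
```

The two output fields map onto the uses listed above: `material_type` could drive a facet, while `entities` could receive a relevancy boost and feed query suggestions.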
The recent enhancements enable users to find the right information faster and more easily. In particular:
- Topics regarding specific material types are extracted from the content and made more visible to users.
- These topics drive facets to help users find needed information quickly.
- Query suggestions are enriched with terminologies from the discipline, also helping users find needed information quickly.
Three Enhancements Working Together
These three mini-projects served different purposes, but they work together:
- Log Analytics captures query logs for reporting.
- Metadata Enhancement enriches content in the main search engine through entity extraction.
- The Query Suggestions feature benefits from the two other projects, drawing successful queries from Log Analytics as well as the metadata added to content in the Metadata Enhancement project.
Email Notifications for Saved Searches
Recently, we added email notifications, also called “Alerts,” to Cataloger’s Desktop. For years, end users have been able to save favorite searches, bookmarks, and shortcuts in Cataloger’s Desktop. With Alerts, when a user saves a search, they can set an alert frequency: monthly, quarterly, or not at all.
Each month or quarter, an automated job retrieves all saved alerts, executes the associated saved searches to identify documents indexed since the last alert, and sends citations to the email address on file for the user.
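The core of such a job is deciding which alerts are due and which documents are new since the last notification. The sketch below shows that logic only; running the saved search and sending email are left out. The record shapes (`frequency`, `last_sent`, `indexed_on`) and the function names are hypothetical.

```python
from datetime import date

FREQUENCY_MONTHS = {"monthly": 1, "quarterly": 3}  # "none" means no alert


def months_between(earlier, later):
    """Whole calendar months from one date to another."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def is_due(alert, today):
    """True when enough months have passed since the alert was last sent."""
    months = FREQUENCY_MONTHS.get(alert["frequency"])
    if months is None:
        return False
    return months_between(alert["last_sent"], today) >= months


def new_documents(search_results, last_sent):
    """Keep only documents indexed after the previous alert."""
    return [doc for doc in search_results if doc["indexed_on"] > last_sent]


alert = {"frequency": "quarterly", "last_sent": date(2017, 1, 1)}
results = [
    {"id": 1, "indexed_on": date(2016, 12, 15)},
    {"id": 2, "indexed_on": date(2017, 1, 20)},
]
if is_due(alert, date(2017, 4, 1)):
    print(new_documents(results, alert["last_sent"]))
```

If `new_documents` returns an empty list, a job built this way could skip the email and still update `last_sent`, though whether Desktop does so is not stated here.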
This feature offers Cataloger’s Desktop users a new way to personalize their experience and use the system to keep up-to-date in the field.
Thanks for reading this blog post about our recent search enhancements for the LC Cataloger’s Desktop project. We’re working on a Recommender for Desktop scheduled for release later this year. Stay tuned for our next update!