Aspire + Azure Cognitive Services: Transforming Unstructured Data Preparation for Azure Search
Demo and Reference Architecture
Amid the rise of cloud-based search-as-a-service platforms in recent years, Azure Search has emerged as a major player. Responding to businesses’ demand for faster, better insights from ever-growing data, cloud-based search platforms provide a scalable, flexible infrastructure for increasing speed-to-value. But for any search solution, cloud or on-premises, good insight discovery always starts with good data preparation. Organizations seeking to leverage Azure Search first need to find answers to data questions, such as:
- How do I get my content out of SharePoint (and other disparate sources) and process it for search?
- How do I enrich content, especially unstructured files, before indexing it into Azure Search?
- How do I write content to Azure Search?
- How do I extract entities, such as companies and people, from my unstructured full-text files?
- How can I ensure my users have access to only the documents they have permission to view?
Is there an approach that addresses these data preparation questions? There is. By integrating Accenture’s Aspire with Microsoft Azure Cognitive Services, we can accelerate unstructured content acquisition and significantly enhance content processing and enrichment for Azure Search.
The end results of this integration are presented in this demo (see demo instructions at the end of this blog), a collaborative effort between our team at Accenture Applied Intelligence and the Microsoft Azure Search team.
Wondering how we got there? Read on to learn more about Aspire, Azure Cognitive Services, and our demo architecture and methodology.
Aspire Content Processing Framework and Azure Cognitive Services
What capabilities do Aspire and Cognitive Services bring to ensure disparate content is efficiently acquired, processed, and enriched prior to indexing into Azure Search?
Aspire is a search-engine-independent content processing framework designed specifically for unstructured and semi-structured data. The framework supports complex content enrichment, contains a staging repository for efficient indexing, enables document-level security, and provides connectors for acquiring data from multiple content sources.
Azure Cognitive Services are a set of intelligent on-demand cloud services that can be connected to websites/applications via APIs and customized for business requirements. These services incorporate AI to better understand and cater to users’ needs, helping them use data to solve business problems faster.
Let’s look at the benefits of integrating Aspire and Azure Cognitive Services in the following hypothetical demo scenario. We’ll then dive into the methodology of the entire integration process.
Increasing Business Value by Improving Search: A Demo Scenario
Contoso Financial (a fictional organization) provides financial recommendations to investors. As part of this work, its advisors need to sift through large amounts of financial content. Much of this content is stored in unstructured files, such as annual reports, financial disclosures, press releases, and internal research, both on-premises and in the cloud, which makes it hard to search.
Finding relevant information to support research can be a tedious process, let alone the effort it takes to find insights from this content. Using Aspire connectors, we can acquire all of Contoso’s financial information from disparate sources. Then, by combining Aspire’s content processing capabilities with Azure Cognitive Services, we can enrich the content with important entities, such as the people and organizations mentioned in the content.
The processed content is then indexed into Azure Search via an Aspire publisher, allowing Contoso's advisors to effectively search, explore, and derive insights from the content. When financial advisors have relevant information, they can deliver better value to their customers by providing more informed, well-researched recommendations.
Demo Development Methodology
Below is our architectural diagram outlining the methodology and components involved, from content acquisition, processing, enrichment, and indexing, to results presentation.
1. Ingesting unstructured content
Unstructured documents, such as PDFs, Office docs, and images, can be ingested and processed in Aspire. Regardless of the original content source (SharePoint, file systems, or any other system), content can be acquired efficiently and securely using Aspire connectors, which support over 27 unstructured content sources. In our Contoso demo scenario, EDGAR 10-K documents (company annual reports) are ingested with an Aspire connector that fetches content directly from an Azure Blob container.
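To make the Blob ingestion step more concrete, here is a minimal Python sketch of how a crawler might enumerate the documents in a container through the Blob service's List Blobs REST operation. This is not the Aspire connector's implementation, which the post does not describe. The account name, container name, and sample XML are hypothetical, and authentication (a SAS token or signed headers) is omitted.

```python
import urllib.parse
import xml.etree.ElementTree as ET
from typing import List


def list_blobs_url(account: str, container: str, prefix: str = "") -> str:
    """Build the URL for the Blob service 'List Blobs' REST operation."""
    query = {"restype": "container", "comp": "list"}
    if prefix:
        query["prefix"] = prefix
    return (
        f"https://{account}.blob.core.windows.net/"
        f"{urllib.parse.quote(container)}?{urllib.parse.urlencode(query)}"
    )


def parse_blob_names(listing_xml: str) -> List[str]:
    """Extract blob names from a List Blobs XML response body."""
    root = ET.fromstring(listing_xml)
    return [name.text for name in root.findall("./Blobs/Blob/Name")]


# Trimmed, hand-written sample of a List Blobs response (not real demo data).
SAMPLE_LISTING = """<EnumerationResults ContainerName="edgar-10k">
  <Blobs>
    <Blob><Name>10k/biomarin-2017.pdf</Name></Blob>
    <Blob><Name>10k/alcoa-2017.pdf</Name></Blob>
  </Blobs>
</EnumerationResults>"""

print(list_blobs_url("contosodemo", "edgar-10k", prefix="10k/"))
print(parse_blob_names(SAMPLE_LISTING))
```

Each returned name would then be fetched and handed to the text extraction stage described in the next step.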
2. Configuring Aspire to process content and extract text
The acquired content is first processed in Aspire, which extracts text from the full-text EDGAR 10-K documents.
3. Integrating Aspire with Azure Cognitive Services for Text Analytics
The processed content in Aspire is then sent to Azure Cognitive Services via REST APIs for further enhancements. In our demo scenario, Azure Text Analytics – a part of Cognitive Services – is used for:
- Extracting entities from the sample EDGAR 10-K documents and linking them to Wikipedia entries and Bing IDs, which helps distinguish between people, locations, and organizations
- Extracting key phrases
In addition to the features above, Azure Text Analytics also provides language detection and sentiment analysis, which can work with text extraction to help enable sophisticated search and visualization.
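As an illustration of the REST integration described in this step, the sketch below builds (but does not send) a Text Analytics request in Python. It rests on several assumptions that are not from the demo: the v3.0 endpoint paths, a per-document character limit of 5,120 (10-K filings are long, so the text is split into chunks), English-language content, and the hypothetical endpoint and key. Per-request batch size limits also apply and are not handled here.

```python
import json
import urllib.request
from typing import List

# Assumption: per-document character limit; verify it for the API version in use.
MAX_CHARS = 5120


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split text on whitespace into chunks of at most max_chars characters.

    A single token longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_request(endpoint: str, key: str, operation: str,
                  doc_id: str, text: str) -> urllib.request.Request:
    """Build a POST request for one Text Analytics operation.

    operation is a path suffix such as "entities/recognition/general",
    "entities/linking", or "keyPhrases" (v3.0 paths, assumed here).
    """
    documents = [
        {"id": f"{doc_id}-{i}", "language": "en", "text": chunk}
        for i, chunk in enumerate(chunk_text(text))
    ]
    return urllib.request.Request(
        url=f"{endpoint.rstrip('/')}/text/analytics/v3.0/{operation}",
        data=json.dumps({"documents": documents}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": key,
        },
        method="POST",
    )


# Hypothetical endpoint and key; the request is built but not sent.
req = build_request(
    "https://example.cognitiveservices.azure.com/", "<your-key>",
    "keyPhrases", "biomarin-10k", "BioMarin Pharmaceutical Inc. annual report",
)
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would return a JSON body with one result per chunk ID, which the pipeline would merge back into a single set of entities or key phrases for the source document.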
4. Indexing the enriched content into Azure Search
Once the content is fully processed and enriched, it is indexed into Azure Search via Aspire’s Azure Search publisher and made available to end-users via a search UI. In our demo scenario, users can access and analyze the EDGAR 10-K documents via search and visualization.
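The Aspire publisher handles this step in the demo. For readers who want to see roughly what an indexing call involves, here is a hedged Python sketch of the payload for the Azure Search "index documents" REST operation. The service name, index name, API version, and field names (`id`, `content`, `people`, `organizations`) are assumptions chosen for illustration, not the demo's actual schema.

```python
import base64
import json
from typing import Dict, Iterable


def make_key(source_path: str) -> str:
    """Derive a document key from a source path.

    Document keys may not contain characters such as '/', so the path is
    URL-safe Base64 encoded (a common workaround, assumed here).
    """
    return base64.urlsafe_b64encode(source_path.encode("utf-8")).decode("ascii")


def index_url(service: str, index: str, api_version: str = "2020-06-30") -> str:
    """URL of the 'index documents' operation for a given service and index."""
    return (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/index?api-version={api_version}"
    )


def build_index_batch(docs: Iterable[Dict]) -> Dict:
    """Wrap documents in a batch; mergeOrUpload inserts new or updates existing."""
    return {"value": [{"@search.action": "mergeOrUpload", **doc} for doc in docs]}


# Hypothetical enriched document; field names are illustrative only.
enriched = {
    "id": make_key("10k/biomarin-2017.pdf"),
    "content": "BioMarin Pharmaceutical Inc. annual report ...",
    "people": [],
    "organizations": ["BioMarin", "Daiichi Sankyo"],
}

print(index_url("contoso-search", "edgar-10k"))
print(json.dumps(build_index_batch([enriched]), indent=2))
```

The batch would be sent as a JSON POST body with an `api-key` header carrying an admin key for the search service.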
5. Displaying and visualizing search results
An Azure Search UI can be built depending on the organization’s requirements. In this demo scenario, the search results can be filtered by the facets (for example, people, organizations, and locations) identified during entity extraction. Based on the user’s search criteria, the application can produce corresponding visualizations, such as graphs.
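Facet filtering of this kind can be expressed in the body of an Azure Search query. The Python sketch below builds such a body, assuming the entity fields are indexed as filterable, facetable string collections. The field names and the facet count are hypothetical, and the demo's own UI code is not shown in this post.

```python
import json
from typing import Dict, List


def odata_quote(value: str) -> str:
    """Quote a string literal for an OData filter (single quotes are doubled)."""
    return "'" + value.replace("'", "''") + "'"


def build_search_body(query: str, selected: Dict[str, List[str]],
                      facet_fields: List[str], facet_count: int = 10) -> Dict:
    """Build a search request body with facets and facet-based filters.

    selected maps a collection field name to the facet values the user
    clicked; every selected value must match (clauses are ANDed).
    """
    clauses = [
        f"{field}/any(x: x eq {odata_quote(value)})"
        for field, values in selected.items()
        for value in values
    ]
    body = {
        "search": query,
        "facets": [f"{field},count:{facet_count}" for field in facet_fields],
    }
    if clauses:
        body["filter"] = " and ".join(clauses)
    return body


body = build_search_body(
    "BioMarin",
    selected={"organizations": ["Daiichi Sankyo"]},
    facet_fields=["people", "organizations", "locations"],
)
print(json.dumps(body, indent=2))
```

The response to such a query includes a facets section with value counts per field, which a UI can render as the clickable categories.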
In our example below, the graph displaying the organizations associated with "BioMarin" includes two that are highly correlated with BioMarin Pharmaceutical: "Daiichi Sankyo" and "Asubio Pharma Co., Ltd."
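The post does not say how the graph measures correlation. One simple possibility, offered here only as an assumption, is to count how often two entities appear in the same document. The Python sketch below does that over made-up entity lists; the data is illustrative and not taken from the demo index.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def related_entities(docs: Iterable[List[str]], target: str,
                     top_n: int = 5) -> List[Tuple[str, int]]:
    """Rank entities by the number of documents they share with target."""
    counts: Counter = Counter()
    for entities in docs:
        unique = set(entities)  # count each entity once per document
        if target in unique:
            counts.update(unique - {target})
    return counts.most_common(top_n)


# Made-up organization lists, one per document (illustration only).
docs = [
    ["BioMarin", "Daiichi Sankyo", "Asubio Pharma Co., Ltd."],
    ["BioMarin", "Daiichi Sankyo"],
    ["Alcoa", "Walmart"],
]

for name, shared_docs in related_entities(docs, "BioMarin"):
    print(f"{name}: {shared_docs} shared document(s)")
```

The resulting pairs and counts could serve as edges and edge weights in a relationship graph.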
It’s also worth mentioning a security feature that is not implemented in the demo. For organizations requiring specific access restrictions, document-level security techniques can be applied during content processing to ensure users can only find and access the content intended for them. A security filter can be integrated with the .NET web application to help identify the appropriate user/group memberships for each piece of content.
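Since security trimming is not part of the demo, the following is only a sketch of one common pattern: store the permitted group IDs on each document and add a filter built from the current user's group memberships to every query. It is written in Python for consistency with the other examples, although the post mentions a .NET web application. The `group_ids` field name is hypothetical.

```python
from typing import Optional, Sequence

DELIMITER = "|"


def security_filter(group_ids: Sequence[str], field: str = "group_ids") -> str:
    """Build an OData filter matching documents visible to any given group."""
    if not group_ids:
        # Fail closed: a user with no groups should not run an unfiltered query.
        raise ValueError("user has no group memberships; deny access")
    for group_id in group_ids:
        if DELIMITER in group_id:
            raise ValueError(f"group id contains delimiter: {group_id!r}")
    joined = DELIMITER.join(group_ids).replace("'", "''")
    return f"{field}/any(g: search.in(g, '{joined}', '{DELIMITER}'))"


def combine_filters(user_filter: Optional[str], sec_filter: str) -> str:
    """AND the security filter with any filter coming from the UI."""
    if user_filter:
        return f"({user_filter}) and ({sec_filter})"
    return sec_filter


sec = security_filter(["finance-advisors", "research"])
print(combine_filters("organizations/any(x: x eq 'BioMarin')", sec))
```

The key design choice is that the filter is built server-side from the authenticated user's groups, so the client cannot widen its own access.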
Powering More Intelligent Search
Combining Azure Cognitive Services with Aspire’s sophisticated content processing capabilities provides a high-performing data preparation pipeline for Azure Search. This pipeline helps improve data acquisition and enrichment, accelerate information discovery, and ultimately increase business value.
To see how the enriched data from Aspire and Azure Cognitive Services are presented in Azure Search, follow the instructions below to check out our demo.
- Visit our demo site (Google Chrome currently provides the best viewing experience)
- In the search box, enter a company name, such as "BioMarin," "Alcoa," "Accenture," "Walmart," etc.
- For each company query:
  - The search results will display any EDGAR 10-K documents related to the company. Each result record contains the text and entities extracted during content processing.
  - The faceted categories will display the People, Organizations, Locations, and Key Phrases that are found within the company's indexed EDGAR 10-K documents.
- To visually explore the relationships between the company and any related people, organizations, locations, and key phrases, click on the "Explore Graph Relationships" link right above the search box. Then, use the drop-down list next to the search box to specify organizations, people, locations, or key phrases.
- To go back to the search results view from the graph view, click on the "Back to Document Search" link right below the search box.
The development of this demo was led by:
- Paul Nelson, Innovation Lead, Accenture Applied Intelligence
- Eduardo Quirós-Campos, Functional & Industry Analytics Sr. Manager, Accenture Applied Intelligence
- Arturo Vargas, Functional & Industry Analytics Analyst, Accenture Applied Intelligence
- Maynor Alvarado, Functional & Industry Analytics Sr. Manager, Accenture Applied Intelligence
- Liam Cavanagh, Principal Program Manager, Azure Search, Microsoft
- Anupam Sharma, Sr. Technical Program Manager, Microsoft