Searching Wikipedia with Azure Search
A Demo by Search Technologies
Azure Search leverages Microsoft’s Azure cloud infrastructure to deliver robust search-as-a-service solutions without the need to manage that infrastructure yourself. With Azure Search, you can use a simple REST API or .NET SDK to bring your data into Azure and start configuring your search application.
We’ve built cloud-based search and big data applications in the Azure hosting environment, but Azure Search is still a relatively new player in the search market. So, just as our Chief Architect, Paul Nelson, previously created a Wikipedia on Amazon CloudSearch demo, we decided to give Wikipedia content a facelift: a demo for searching Wikipedia with Azure Search.
Check out our demo at wikipedia.searchtechnologies.com/azure/.
If you wonder how this Azure Search experience was developed, here’s a sneak peek into our project behind the scenes.
Think of the massive amount of content you can find on Wikipedia. How did we go about getting all that data? We created an Aspire connector to download dump files directly from Wikipedia into the Aspire Staging Repository, which then streamed that data directly to Azure Search. The transfer was seamless because no disks were needed to store the data at any point.
A little background in case you wonder “What is Aspire?” Aspire is Search Technologies’ proprietary Content Processing Framework designed to handle both structured and unstructured data. The framework supports complex content processing, provides a staging repository for efficient indexing, and includes pre-built data connectors for acquiring data from multiple sources. Learn more about Aspire here.
The connector allowed us to acquire over 5 million English articles in XML format, including their thumbnail image URLs, from Wikipedia. It also supports update processing, meaning that new data can be downloaded to the Aspire Staging Repository as pages are added or updated on Wikipedia.
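To illustrate the streaming idea, here is a minimal Python sketch of reading pages from a Wikipedia-style XML dump one at a time, without loading the whole file. It is not the Aspire connector itself. The function names `local_name` and `iter_pages` are hypothetical, the sample XML is a tiny stand-in for a real dump, and the download and decompression steps are left out.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (title, wikitext) pairs from a dump stream, one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the finished page's contents to limit memory use


# Tiny in-memory stand-in for a dump file (assumed structure, for illustration only).
sample = io.BytesIO(
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>Olympic Games</title><revision><text>Body text</text></revision></page>"
    b"<page><title>Prague</title><revision><text>City text</text></revision></page>"
    b"</mediawiki>"
)
pages = list(iter_pages(sample))
```

Because each page is yielded as soon as its closing tag is parsed, a downstream step can forward it immediately, which is the general pattern behind streaming content onward without writing it to disk.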
The open source DKPro (DKPro Wikipedia Library), a special parser for Wikipedia syntax, was a great tool for parsing Wikipedia content and extracting metadata such as article update time, category, title, and description.
In his previous CloudSearch for Wikipedia demo, Paul discussed how complicated data processing can be when it comes to large databases like Wikipedia.
“Wikipedia pages have templates inside of templates inside of external links inside of internal links, and so on. It’s a mess, but frankly that’s typical of any large, richly textured data set written by human beings.”
– Paul Nelson, Chief Architect, Search Technologies
This Wikipedia-specific parser gave us faster, more efficient access to this data maze and made it easier to parse.
Categorization is a natural way to sort content more efficiently. And the better we do entity detection, the better data will be categorized. Given the variety and volume of Wikipedia content, we built enhanced entity detection into our custom Wikipedia Aspire connector.
Together with the metadata provided by DKPro, these algorithms enabled us to detect and categorize 40% of all the acquired Wikipedia content by Person, Place, or Organization.
How exactly do the categorization algorithms work?
- To detect "Person," look for date of birth and date of death
- To detect "Place," look for GPS coordinates
- To detect "Organization," look for founders
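The three rules above can be sketched in a few lines of Python. This is only an illustration, not the connector's actual code: the metadata key names (`birth_date`, `death_date`, `coordinates`, `founders`) are hypothetical, and treating either date as sufficient for "Person" and checking the rules in this order are both assumptions.

```python
from typing import Optional

# Hypothetical metadata keys; the real connector's field names are not given in this article.
PERSON_KEYS = {"birth_date", "death_date"}
PLACE_KEYS = {"coordinates"}
ORGANIZATION_KEYS = {"founders"}


def detect_type(metadata: dict) -> Optional[str]:
    """Return 'Person', 'Place', or 'Organization', or None if no rule matches."""
    # Only count keys that actually carry a non-empty value.
    present = {key for key, value in metadata.items() if value}
    if present & PERSON_KEYS:
        return "Person"
    if present & PLACE_KEYS:
        return "Place"
    if present & ORGANIZATION_KEYS:
        return "Organization"
    return None
```

Articles that match none of the rules return `None`, which is consistent with only part of the content receiving a Type.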
We also experimented with DBpedia, an open source effort to extract structured data from Wikipedia, to categorize more data, more efficiently. However, a challenge with DBpedia was that it slowed down indexing as we processed more data, so our solution to this was...
Aspire Staging Repository
We relied on the Aspire Staging Repository for quick re-indexing of processed content. Once data was cleansed and normalized, it was stored in the staging repository, enabling us to re-index it faster.
The staging repository is a bridge between data sources (Wikipedia content in this case) and the search engine (Azure Search). As we changed the search engine configurations, we would only need to re-index data available in the staging repository without going back to re-crawl the original data sources. This approach solved the slow indexing issue by reducing indexing time from days to hours.
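As a rough sketch of this pattern, the Python below keeps processed documents in a store and re-indexes from that store alone. It is a simplified in-memory stand-in, not the Aspire Staging Repository or its API; the class `StagingRepository` and the function `reindex` are hypothetical names.

```python
from typing import Callable, Dict, Iterator


class StagingRepository:
    """In-memory stand-in for a staging repository (hypothetical, not Aspire's API)."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}

    def store(self, doc_id: str, processed_doc: dict) -> None:
        # Save the cleansed, normalized document; a later update replaces the old copy.
        self._docs[doc_id] = processed_doc

    def stream(self) -> Iterator[dict]:
        # Yield every staged document, for example during a full re-index.
        yield from self._docs.values()


def reindex(repo: StagingRepository, publish: Callable[[dict], None]) -> int:
    """Send all staged documents to a search engine via `publish`; return the count."""
    count = 0
    for doc in repo.stream():
        publish(doc)
        count += 1
    return count
```

The point of the design is that `reindex` never touches the original source: after a search engine configuration change, only the already-processed documents are replayed.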
Learn about how the staging repository works in this video-blog.
The Microsoft Azure Search API was used to build a custom Aspire publisher, which matched and then streamed data and metadata fields, such as titles, descriptions, thumbnails, etc., from the Aspire Staging Repository to Azure Search.
Azure Search supports bulk inserts of up to 1,000 documents in a single batch. Our custom Aspire publisher took advantage of this, which significantly reduced our indexing time.
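A minimal Python sketch of the batching step follows. It is not the Aspire publisher: the helper names `batches` and `to_index_payload` are hypothetical, the document fields are illustrative, and the request body shape (a `value` array whose items carry an `@search.action`) follows the Azure Search document indexing REST operation as we understand it. Sending the request is left out.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, List

MAX_BATCH_SIZE = 1000  # per-batch document limit mentioned above


def batches(docs: Iterable[dict], size: int = MAX_BATCH_SIZE) -> Iterator[List[dict]]:
    """Yield lists of at most `size` documents, consuming the input lazily."""
    iterator = iter(docs)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def to_index_payload(chunk: List[dict]) -> str:
    """Build the JSON body for one indexing request, marking each document for upload."""
    actions = [{"@search.action": "upload", **doc} for doc in chunk]
    return json.dumps({"value": actions})
```

With this arrangement, 5 million documents would go out in roughly 5,000 requests rather than 5 million.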
Also note that the number of documents you can index into Azure depends on the type of data plan purchased from Microsoft. In our demo, over 5 million Wikipedia documents were downloaded, processed, and indexed. Once you have taken an inventory of the documents to be indexed for search, you can use the Azure pricing calculator to estimate your total data allowance cost.
For linguistics features, we leveraged Lucene stemmers. For example, the stemmers would recognize "Olympic" and "Olympics" as two variations of the same word rather than two different words.
Query and Relevancy Ranking
As with any search engine, the relevancy algorithms in Azure Search need to be defined and tuned continuously. Query speed also improves as more queries are entered, thanks to caching.
One Azure Search feature is Multiple Scoring Profiles, which allowed us to set up a range of profiles and switch between them dynamically to adjust relevancy.
Most of our relevancy tuning work for this demo involved handling metadata fields correctly and defining the right relevancy weights for titles and phrases.
For instance, there are five different scoring profiles in our demo:
These scoring profiles have been very helpful for ongoing relevancy testing and tuning. Potentially we can deploy more complex scoring profiles to adjust the engine scoring algorithms and further improve relevancy.
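As an illustration of switching profiles per query, here is a short Python sketch. The two profiles, their names, and their weights are hypothetical and are not the demo's actual five profiles. The profile structure (`text.weights`) and the `scoringProfile` query parameter follow the Azure Search REST API as we understand it, and the API version is passed in by the caller rather than assumed.

```python
from urllib.parse import urlencode

# Hypothetical profiles; the demo's actual profiles and weights are not listed in this article.
SCORING_PROFILES = [
    {"name": "titleBoost", "text": {"weights": {"title": 5, "description": 1}}},
    {"name": "descriptionBoost", "text": {"weights": {"title": 1, "description": 3}}},
]


def search_query_string(query: str, profile: str, api_version: str) -> str:
    """Return the query string for a search request that applies one scoring profile."""
    known = {p["name"] for p in SCORING_PROFILES}
    if profile not in known:
        raise ValueError(f"unknown scoring profile: {profile}")
    return urlencode({"search": query, "scoringProfile": profile, "api-version": api_version})
```

Because the profile is chosen per request, the same index can be queried with different weightings side by side, which is what makes ongoing relevancy comparisons practical.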
See a live Search Engine Scoring demo in this on-demand webinar.
The user interface configuration is just as critical as the back-end configuration. Our goal was to provide a visually appealing, modern interface that is also intuitive and helpful to users.
The mobile responsive UI was built on Bootstrap, a web development framework that supports mobile-first web initiatives.
Following responsive UI design best practices, we:
- Created wire frames for desktop and mobile views.
- Developed a modern, flat design – in this case, the demo theme was inspired by Microsoft's Windows 10 look. You can work with Bootstrap or other platforms to build a UI that matches your brand requirements.
- Converted pictures into HTML templates.
- Connected the templates to an Apache Velocity back end. Velocity is an open source template engine that enabled us to create clean code for our graphic template and apply it consistently across the demo pages.
In addition to the Person, Place, Organization categories (Type) discussed above, we also leveraged the pre-defined categories of each Wikipedia article to provide easier related-content browsing.
Articles can be sorted (or related articles can be found) using the Category widget on the left of the results page or the Category button at the bottom right of each search result.
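A category widget like this is typically backed by a faceted, filtered query. The Python sketch below builds such a query string under several assumptions: the index has a field named `category` that holds a collection of strings, the function name `category_query_string` is hypothetical, and the `facet` and `$filter` parameters follow the Azure Search REST API as we understand it. The demo's actual field names and queries are not shown in this article.

```python
from urllib.parse import urlencode


def category_query_string(query: str, category: str, api_version: str) -> str:
    """Query string that facets on `category` and filters results to one category value."""
    escaped = category.replace("'", "''")  # OData escapes a single quote by doubling it
    params = {
        "search": query,
        "facet": "category",
        # Assumes `category` is a string collection, so the OData any() lambda is used.
        "$filter": f"category/any(c: c eq '{escaped}')",
        "api-version": api_version,
    }
    return urlencode(params)
```

The facet parameter returns category counts for the widget, while the filter narrows results once a user selects a category.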
More Than Simple Search – A Platform for Cloud Machine Learning and Advanced Analytics
The cloud has made the development of search and big data applications easier and more efficient. With Microsoft providing the core infrastructure and search API, we were able to experiment with the new Azure Search, acquiring large amounts of cloud storage on demand and configuring it to match what we envisioned.
While each search engine has its pros and cons, we think Azure Search will further emerge as a prominent player in the expanding search-as-a-service space. Among what we found most useful was the ability to take advantage of a staging repository for efficient indexing and the Multiple Scoring Profiles for search relevancy improvements.
Machine learning and predictive analytics can also be incorporated to improve search accuracy (a more in-depth discussion is in our Search Engine Scoring webinar). Open source big data technologies such as the Hadoop ecosystem provide a solid toolbox for these tasks. Take our demo, for example: in addition to building a search application, we could leverage Azure's managed cloud environment to integrate machine learning and advanced analytics via Cortana Intelligence Suite. Organizations can become more agile by building end-to-end IT applications, including search and big data analytics, on a secure, on-demand enterprise cloud infrastructure.
Big data opens up new ways to advance search, but as always, improving search applications starts with insights from the user experience and query intent. So if you haven’t yet, take a look at our demo for searching Wikipedia with Azure Search and email firstname.lastname@example.org to send us your feedback.
This demo is a Search Technologies project led by:
- Paul Nelson, Chief Architect
- Tomas Dockal, Engineer
- Pavel Stastny, Engineer
- Petr Kudela, Engineer