By John-Henry Gross, Product Manager, Search Technologies
I was reading the latest National Geographic and, like most of us, thought, “why don’t my pictures look this good?” The camera manufacturers tell you to just “point and shoot” to get great pictures. The real answer is that professional photographers apply a whole series of processing steps to the initial raw photographs to get the results we see published. They use tricks of the trade such as cropping, dodging, burning, and pushing, once practiced in darkrooms with light, lenses, filters, and chemicals, and now done digitally on computers with Adobe Photoshop and similar software.
This led me to think about a parallel with text search: some additional processing of raw content can make a big difference in result sets.
We have all taken a picture where someone or something unwanted appears and as a result distracts from the overall composition and message. With photos, the solution is to simply crop out the portion of the image we don’t want.
We have the same problem with some text content we want to ingest and index for search.
Let’s take the example of crawling and ingesting current news from internet sites like BBC.com and nytimes.com. Even if you are crawling at the individual article level, there is often extraneous content such as ads, top stories, and text links to other articles that distracts from the central story and can cause irrelevant search results. Imagine an article page about the recent Oscar Pistorius legal case: if indexed “as is”, it could be returned for a search of “Egyptian elections”.
Oscar may have other problems, but the elections in Egypt are not one of them! Savvy users would recognize why this happened (hint: look at the Top Stories section), but they would still be annoyed. Using the equivalent of the photographer’s cropping technique, the HTML from the crawler can be passed to a content processing framework so that only the text in the main article sections (the main article DIVs, for you techie people) is handed to the search indexer, along with the URL of the whole page. Doing this would eliminate this whole class of problem.
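As a minimal sketch of that “cropping” step, here is a parser that keeps only the text inside a designated article DIV and discards sidebars like Top Stories. It uses only the Python standard library; the class name “article-body” is a hypothetical placeholder, since every site needs its own selector.

```python
# Sketch: "cropping" a crawled page down to its main article DIV.
# Standard library only; the class name "article-body" is an
# illustrative assumption -- each site needs its own selector.
from html.parser import HTMLParser

class ArticleCropper(HTMLParser):
    def __init__(self, target_class="article-body"):
        super().__init__()
        self.target_class = target_class
        self.depth = 0    # nesting depth inside the target div (0 = outside)
        self.chunks = []  # text fragments collected inside the target div

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # track nested divs so we close correctly
        elif tag == "div" and dict(attrs).get("class") == self.target_class:
            self.depth = 1       # entered the main article div

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def crop_article(html):
    parser = ArticleCropper()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ('<div class="top-stories">Egyptian elections: results due</div>'
        '<div class="article-body"><p>Oscar Pistorius appeared in court.</p></div>')
print(crop_article(page))  # -> Oscar Pistorius appeared in court.
```

Only the article body reaches the indexer; the “Egyptian elections” teaser in the sidebar never makes it into the index, so the page stops matching that query.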
Sometimes when I take a picture, important content is lost in the shadows or washed out, like someone’s face or the architectural detail of a building. What professional photographers do is take a series of pictures at different exposures and use pieces from these supplementary pictures to add back the missing content. If you have an iPhone, this is what the HDR option on the camera is used for.
The analogous trick for text search is to bring in metadata from supplementary sources, such as file paths, directory names, authors, creation dates, access control lists, and file extensions, to supplement the standard content being indexed for each document, giving us a more complete picture of the article. In some cases this metadata might be the only text we have for indexing!
For example, a zoo with a file server containing photographs of all the animals in its collection, organized in a directory structure based on the classic taxonomic ranks of Kingdom, Phylum, Class, Order, Family, Genus, and Species, might want to make these images searchable on its public-facing website.
A good file system connector would send back the URL and directory structure information for each picture; however, we still need some text for the search indexing system. A content processing framework such as Search Technologies’ Aspire product can analyze the URL and directory information for each photo and create a text document to be passed to the search engine for indexing into the appropriate fields. This would allow a scientist to enter the search “Panthera tigris altaica” and get the photo of the Siberian tiger.
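A minimal sketch of that path-to-fields step might look like the following. The directory layout follows the taxonomic convention described above; the exact path and field names are illustrative assumptions, not a real connector’s output.

```python
# Sketch: turning a zoo photo's file path into searchable metadata fields.
# The directory convention (one level per taxonomic rank) follows the
# example above; path layout and field names are illustrative assumptions.
from pathlib import PurePosixPath

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def path_to_fields(url_path):
    parts = PurePosixPath(url_path).parts
    # Drop the leading "/" and the file name; what remains is the taxonomy
    levels = [p for p in parts if p != "/"][:-1]
    fields = dict(zip(RANKS, levels))
    fields["filename"] = PurePosixPath(url_path).name
    return fields

doc = path_to_fields(
    "/Animalia/Chordata/Mammalia/Carnivora/Felidae/Panthera/tigris/siberian_tiger_01.jpg"
)
print(doc["genus"], doc["species"])  # -> Panthera tigris
```

Each rank lands in its own field, so the search engine can match a query like “Panthera tigris” against documents that would otherwise contain no text at all.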
If we take a further step and use the same content processing framework to call out to a database of common names associated with scientific names, such as the Integrated Taxonomic Information System (ITIS.gov), we can add those names to the text being indexed for each photo, allowing students in an elementary school class studying endangered animals to enter the search “Siberian tiger” and get the same photo.
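Sketched in code, the enrichment step might look like this. A real pipeline would query a service such as ITIS over the network; here a small local lookup table stands in for that call, and its contents are purely illustrative.

```python
# Sketch: enriching indexed metadata with common names keyed by
# scientific name. A real system would call out to a service like ITIS;
# this local lookup table is an illustrative stand-in for that call.
COMMON_NAMES = {
    "Panthera tigris": ["tiger", "Siberian tiger", "Bengal tiger"],
}

def enrich(doc):
    sci_name = f"{doc['genus']} {doc['species']}"
    enriched = dict(doc)  # copy so the caller's record is untouched
    enriched["common_names"] = COMMON_NAMES.get(sci_name, [])
    return enriched

record = enrich({"genus": "Panthera", "species": "tigris"})
print(record["common_names"])
```

Once the common names are indexed alongside the scientific ones, the schoolchild’s query and the scientist’s query resolve to the same photo.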
It would be great if all we needed to do was install and configure any of the great search engines available today, “point and shoot” at our content, and get beautiful search results. The truth is that, just like photography, it usually takes some back-end, darkroom-like content processing magic to turn a search solution into the National Geographic cover shot it can be.