The Google Search Experience for Your Intranet
The most common user request of all: Can't we just have Google on our Intranet?
- The Google Search Appliance has an excellent reputation for plug-and-play search applications, typically involving a single data source
- In the enterprise environment, multiple and heterogeneous data sources pose additional challenges
This article looks briefly at how Google.com has been customized for the Web search environment, and draws lessons to guide implementation of the Google experience for enterprise search and intranet applications.
GOOGLE.COM - A CUSTOMIZED ENVIRONMENT
Google continues to enjoy a world-leading reputation for Web search relevancy. A key reason for this is the core indexing and search technology, developed over fifteen years by the world's largest team of search engine engineers. The Google Search Appliance (GSA) benefits from this investment and provides strong core relevancy, speed, and scalability.
However, there are important differences between large-scale Web search and enterprise search. Any organization hoping to provide a Google-like search experience behind the firewall should consider these differences.
The key issues can be summarized as follows:
- Content Normalization: On the Web, millions of marketers deliberately write Web pages to be found. This means adhering to standardized HTML formats, and ensuring that metadata such as the Title and Description fields are complete and accurately represent the page content. In contrast, behind the corporate firewall, documents are seldom written to be found, and typically lack useful metadata. Content normalization and metadata improvement can be addressed, but doing so requires a plan.
- Content Processing: Google.com performs a large amount of content manipulation and analysis prior to indexing. For example, it examines the interlinking of Web pages and computes a "PageRank" score for each Web page. This helps Google to place pages and sites on a graduated scale, from those that are authoritative on a subject to those that are less so. Content pre-processing is also required behind the firewall, where multiple data sources are involved and, unlike the Web, content is far from homogeneous. Interlinking seldom exists or, where it does, applies to only part of the data set. A proxy for link-based authority is needed, and can usually be found. Tools and best practices exist to achieve this aim, but it requires a plan and some implementation effort.
- Data Connectivity: Google.com uses a highly customized Web crawling infrastructure to gather Web pages for indexing. It has the advantage that authors want to get found, so standards are adhered to, and website owners publish various types of sitemap to help the Googlebot gather pages efficiently. In the enterprise search environment, things are not that simple. Multiple repositories must be "crawled" to gather documents for indexing, and they typically impose document-level security ACLs (unlike public Web pages) and offer limited methods of bulk content export. Failure to connect in a sustainable manner means that content is not made available to be searched.
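The link analysis mentioned under Content Processing can be illustrated with a minimal sketch. This is the simplified textbook form of PageRank (power iteration with a damping factor), not a description of how Google.com or the GSA computes scores in production. The function name, the parameter defaults, and the example intranet pages are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    All names and defaults here are illustrative assumptions.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is shared by everyone.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in pages}

        # Each page passes an equal share of its rank to every page it links to.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank

    return rank


# Hypothetical intranet pages, used only to exercise the function.
example_links = {
    "home": ["policies", "directory"],
    "policies": ["home"],
    "directory": ["home"],
    "orphan": ["home"],
}
example_rank = pagerank(example_links)
```

In this example the "home" page, which every other page links to, receives the highest score, while "orphan", which nothing links to, receives the lowest. Behind the firewall, where such links are sparse, a different authority signal would have to stand in for the link graph.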
These are not the only differences between Web search and enterprise search. But they are important, and a discussion of how they can be addressed should be a priority for any organization hoping to deliver that Google experience behind the firewall.
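As one illustration of how the content normalization issue might be approached, the sketch below fills in a missing or placeholder title and an empty description before a document is sent for indexing. It is a minimal example under stated assumptions: the field names ("path", "title", "description", "body"), the list of placeholder titles, and the 160-character description length are all hypothetical and do not come from any Google or GSA API.

```python
import os
import re

# Titles that carry no information; this list is an illustrative assumption.
PLACEHOLDER_TITLES = {"", "untitled", "document1", "new document"}


def normalize_metadata(doc, max_description=160):
    """Return a copy of doc with a usable title and description.

    doc is a plain dict with hypothetical keys:
    "path", "title", "description", and "body".
    """
    normalized = dict(doc)
    raw_body = doc.get("body") or ""

    title = (doc.get("title") or "").strip()
    if title.lower() in PLACEHOLDER_TITLES:
        # Prefer the first non-empty line of the body as the title.
        first_line = next(
            (line.strip() for line in raw_body.splitlines() if line.strip()),
            "",
        )
        # Otherwise fall back to a readable form of the file name.
        stem = os.path.splitext(os.path.basename(doc.get("path") or ""))[0]
        title = first_line or stem.replace("_", " ").replace("-", " ")

    description = (doc.get("description") or "").strip()
    if not description:
        # Collapse whitespace and use the opening of the body text.
        flat_body = re.sub(r"\s+", " ", raw_body).strip()
        description = flat_body[:max_description].rstrip()

    normalized["title"] = title
    normalized["description"] = description
    return normalized


# Hypothetical documents, used only to exercise the function.
with_body = normalize_metadata({
    "path": "/shares/hr/travel_policy_2011.doc",
    "title": "Document1",
    "description": "",
    "body": "Travel Policy\n\nEmployees must book travel through the approved agent.",
})
without_body = normalize_metadata({"path": "/shares/hr/expense-form.xls", "title": None})
```

A real deployment would likely need source-specific rules for each repository, which is one reason the effort calls for a plan rather than a one-off fix.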
Search Technologies provides services and add-on software tools, using officially supported Google APIs and frameworks, to address data connectivity, content normalization, and pre-processing.
Combine our services and tools with the core capabilities of the GSA, and you'll have everything you need to recreate the Google experience on your intranet.
As a growing number of Google Search Appliance customers will testify, a well-tuned enterprise search system is transformative for business productivity.