The Making of Aspider – A Search Engine Independent Web Crawler for Better Corporate Wide Search
Imagine searching for a piece of information in a trillion-volume encyclopedia.
That might be the case if the Internet had grown without any web crawlers. In fact, we can attribute the way it grew, in part, to this crucial software, just as skyscrapers grew taller when steel frames became possible.
The job of an Internet web crawler is to fetch millions of web pages safely, and as fast as possible, so that they can reach a search engine index. To stay safe, crawlers have to be very careful with each Internet server (or host), sometimes requesting a page only once every few seconds. So how do they crawl millions of pages while requesting one every few seconds? The answer is simple: they process multiple requests for different hosts at the same time.
The scenario in the world of intranets is quite different from those Internet-scale crawls, since only a handful of sites need to be crawled, involving just a few hosts. If you crawl an intranet site with a web crawler designed for large-scale crawls (thousands or millions of hosts), using its default configuration, the crawl might seem to run too slowly.
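To make the idea concrete, here is a minimal sketch of how per-host politeness and cross-host parallelism can coexist. It is not taken from Heritrix or any other crawler; the function name `polite_schedule`, the `delay_seconds` parameter, and the example hosts are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def polite_schedule(urls, delay_seconds=2.0):
    """Plan fetch start times so each host sees one request per delay window.

    Hypothetical helper: different hosts are planned independently, so their
    requests can share the same start time, while requests to the same host
    are spaced delay_seconds apart.
    """
    next_slot = defaultdict(float)  # host -> earliest allowed start time
    plan = []
    for url in urls:
        host = urlsplit(url).netloc
        plan.append((next_slot[host], url))
        next_slot[host] += delay_seconds
    # Stable sort: URLs with the same start time keep their input order.
    plan.sort(key=lambda item: item[0])
    return plan


if __name__ == "__main__":
    for start, url in polite_schedule([
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/1",
        "http://c.example/1",
    ]):
        print(start, url)
```

With many hosts, most requests land in the first time slot. With only one or two hosts, as on an intranet, the plan stretches out in time, which is why the same politeness settings can feel slow there.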
Web Crawler Use Cases
At Search Technologies, we have seen this problem while using the Heritrix web crawler from within our Aspire Content Processing Framework – a search engine independent, powerful framework specifically designed for unstructured data. We tried modifying the “politeness policies” but saw no significant improvement. We also modified the Heritrix source code in an attempt to improve performance, but this made crawls potentially unstable because of an inherent architectural incompatibility.
Security posed another challenge, since Heritrix supports only Basic and Digest (HTTP-based authentication) and Cookie-based authentication (where you are redirected to a login page). Some of our customers have NTLM/Kerberos-based sites. Modifying the Heritrix source code was risky: NTLM is a proprietary protocol, so our changes would not be included in the main official project, and we had to keep maintaining our own version of the crawler.
Aspider: A Web Crawler Solution
When crawling intranet sites, the following advantages exist:
- Crawls can be scheduled to run at a low load time (for example, from 7 p.m. to 5 a.m.)
- The web crawler server can be located very close to the host server, even in the same data center, so the crawler can request more than one page every few seconds.
Considering these advantages, we concluded that we needed another way of performing web crawls to meet our customers’ needs. Voila! Our own web crawler project named Aspider (the Aspire Web Crawler) was born.
Our goals for the Aspider Web Crawler are to:
- Improve the performance for crawls involving a few hosts (intranet sites)
- Maintain Heritrix’s link extraction rates
- Use the Aspire 3.x connector framework features, including built-in distributed processing, built-in incremental snapshots, failover stability, and flexible crawl controls (for pausing and resuming a crawl even between system restarts).
- Support default Heritrix authentication (Basic, Digest, and Cookie-based). We also support NTLM/Kerberos in a more natural way.
Owning the crawler also allows us to provide better technical support to our customers when a crawl fails, mainly because no third-party code is controlling the crawls.
How Did We Meet These Goals?
Goal 1: Better performance
We used our Aspire 3.x connector framework to dramatically increase crawl performance (learn more about our latest Aspire 3.1 release). At first, however, the result was neither strictly safe (it made too many requests in a very short time) nor configurable. To increase reliability, we added a highly configurable Throttle controller, so that crawling rates can be adjusted as needed.
Throttle control is a way of limiting the rate of requests the crawler makes so that web servers (and system administrators) don’t get overwhelmed! To offer a fully distributed Throttle controller, we allow only one Aspire node to make requests to the same host at a time; the node releases the host as soon as the maximum number of URLs allowed per minute is reached. Then, the same or another Aspire node can process the URLs for the next minute. This is a simple, yet effective way to honor throttle limits no matter how many Aspire nodes are running the crawl.
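The throttle rule described above can be sketched in a few lines. This is an illustrative, single-process approximation and not Aspider's actual code: the class `HostThrottle`, its method `try_acquire`, and the node names are assumptions, and the in-memory dictionary stands in for whatever shared state a real cluster would coordinate across nodes.

```python
import threading


class HostThrottle:
    """Hypothetical per-host throttle: one owning node and a per-minute URL budget."""

    def __init__(self, max_urls_per_minute):
        self.max_urls_per_minute = max_urls_per_minute
        self._lock = threading.Lock()
        self._hosts = {}  # host -> {"window": int, "count": int, "owner": str or None}

    def try_acquire(self, host, node, now):
        """Return True if `node` may fetch one URL from `host` at time `now` (seconds)."""
        window = int(now // 60)  # index of the current one-minute window
        with self._lock:
            state = self._hosts.get(host)
            if state is None or state["window"] != window:
                # A new minute started: reset the budget and free the host.
                state = {"window": window, "count": 0, "owner": None}
                self._hosts[host] = state
            if state["count"] >= self.max_urls_per_minute:
                return False  # budget for this minute is used up
            if state["owner"] is None:
                state["owner"] = node  # first node to ask takes the host
            elif state["owner"] != node:
                return False  # another node holds the host this minute
            state["count"] += 1
            if state["count"] >= self.max_urls_per_minute:
                state["owner"] = None  # limit reached: release the host
            return True


if __name__ == "__main__":
    throttle = HostThrottle(max_urls_per_minute=2)
    print(throttle.try_acquire("intranet.example", "node-1", now=0))   # node-1 takes the host
    print(throttle.try_acquire("intranet.example", "node-2", now=1))   # refused, host is held
    print(throttle.try_acquire("intranet.example", "node-2", now=61))  # next minute, any node
```

One simplification to note: in this sketch a node that takes a host but stops early blocks other nodes until the minute ends. A production controller would need a timeout or explicit release for that case.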
Goal 2: Maintain Heritrix link extraction rates
Goal 3: Use our new connector framework
This was the easiest part since our Aspire 3.x connector framework did all of the work; we just implemented what was needed.
- High availability and failover using ZooKeeper for sync control
- Elastic scalability, so you can hot-plug more Aspire nodes into the cluster and they will automatically join the ongoing crawl
- Automatic snapshot-based incremental crawls
- Flexible crawl control over pause/resume/stop - you can pause and shut down the Aspire servers during the server's rush hours and resume at low traffic hours.
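As an illustration of the snapshot-based incremental idea in the list above, the sketch below compares two snapshots to decide which pages were added, updated, or deleted since the previous crawl. This is one common approach and is assumed here for illustration; it does not describe Aspire's internal snapshot format. The function names `take_snapshot` and `diff_snapshots` and the use of a SHA-256 content hash are hypothetical choices.

```python
import hashlib


def take_snapshot(pages):
    """Map each URL to a hash of its content (hypothetical snapshot format)."""
    return {
        url: hashlib.sha256(body.encode("utf-8")).hexdigest()
        for url, body in pages.items()
    }


def diff_snapshots(previous, current):
    """Classify URLs as added, updated, or deleted between two snapshots."""
    prev_urls, cur_urls = set(previous), set(current)
    return {
        "added": sorted(cur_urls - prev_urls),
        "updated": sorted(
            url for url in cur_urls & prev_urls if current[url] != previous[url]
        ),
        "deleted": sorted(prev_urls - cur_urls),
    }


if __name__ == "__main__":
    before = take_snapshot({"http://intranet.example/a": "one",
                            "http://intranet.example/b": "two"})
    after = take_snapshot({"http://intranet.example/a": "one",
                           "http://intranet.example/b": "two, edited",
                           "http://intranet.example/c": "three"})
    print(diff_snapshots(before, after))
```

Only the URLs reported as added, updated, or deleted would need to be sent on to the search engine index; unchanged pages can be skipped.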
Goal 4: Keep Basic, Digest, Cookie-based and NTLM/Kerberos authentication
We used the Apache HTTP Library, which includes implementations of all of them.
Goal 5: Provide better support
Since this is our own code, we know the architecture and design thoroughly. Now we can help our customers address crawl issues faster.
Web crawling has been a crucial feature for Search Technologies ever since the release of Aspire 1.0. Our ability to provide better features and support has a big impact on any corporate-wide search project. Aspider is the result of our paramount concern that we exceed our customers’ expectations.
As a side note, we released our previous Heritrix Connector and Modified Heritrix Engine (with NTLM support and other features) as open source. Check it out on GitHub and contact us if you have additional questions or comments.