6 Reasons Why Big Data Projects Need Search Engines
They Go Together Like Peanut Butter and Jelly!
One common comment we hear from our customers is “My projects don’t involve search, at least not yet.” But in fact, many seemingly “non-search-based” applications that handle large volumes of data could benefit greatly from using a search engine.
Doug Cutting, the creator of Hadoop and Lucene, once said:
“You know, people today think that search and big data are separate but in two or three years, everyone will wonder why we ever thought that.”
This is my feeling as well, and it has been for years now. Search engines and big data go so well together that you really should never have one without the other.
And here are six reasons why.
Reason #1: You Need to Find Stuff
Any organization that has data will need to find things (rows, cells, files) inside the data. Enter the search engine. No matter what data you have, guaranteed, you will need to search it. You should load that data into a search engine and make it searchable.
Search engines are fast. You can search for anything and get the results back in milliseconds.
Maybe one of the problems is that people think “Search Engine” = “Text Search.” But that is not the case! Search engines can search over structured content (e.g., tables of data), and for lookup-style queries they are typically much faster and more flexible than relational databases. You can also search on portions of fields, such as addresses, names, job descriptions, patient notes, etc. A “LIKE” query with a leading wildcard generally cannot use an ordinary database index and has to scan every row, while a search engine looks each term up in an index built ahead of time. The results are far more relevant, and they come back orders of magnitude faster.
Really, for any sort of search for any sort of data, a search engine is best.
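To make the difference concrete, here is a minimal sketch of the two approaches in plain Python. It is a toy, not how any particular search engine is implemented: the sample records and the function names (`build_index`, `search`, `like_scan`) are made up for illustration. The point is only that the indexed lookup touches the few entries for the query terms, while the LIKE-style scan has to read every record.

```python
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()

def build_index(records):
    # Inverted index: token -> set of record ids containing that token.
    index = defaultdict(set)
    for rec_id, text in records.items():
        for token in tokenize(text):
            index[token].add(rec_id)
    return index

def search(index, query):
    # Look up each query token and intersect the matching id sets.
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

def like_scan(records, needle):
    # Equivalent of LIKE '%needle%': every record is read and compared.
    needle = needle.lower()
    return {rid for rid, text in records.items() if needle in text.lower()}

# Hypothetical sample data.
records = {
    1: "12 Main Street, Springfield",
    2: "Patient reports mild headache",
    3: "48 Main Road, Shelbyville",
}
index = build_index(records)

print(search(index, "main street"))
print(like_scan(records, "main"))
```

Both functions find the same records here. What changes with scale is the cost: the scan grows with the number of records, while the index lookup grows only with the number of matches for each term.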
For example, we have customers who use search in big data projects involving:
- Patient records
- Financial transactions
- Log data
- Lab notebooks
- Tractor data
- Weather data
- Passenger data
The list goes on and on. Most of these are not unstructured documents; they are structured databases. Why do they use a search engine? Because it is fast, easy to use, and more flexible.
Reason #2: You Have Lots of Data, as in Big Data
Now we get to the crux of the argument. Search engines are scalable. They can handle tons of data - billions and billions of records - quickly and easily. Search engines are more scalable than just about any other type of access system (I’m looking at you, relational databases!).
Everything in Big Data today is all about “distributed” this, “sharded” that, i.e., dividing up jobs into pieces and spreading them over a cluster.
When I wrote my own search engine (RetrievalWare), I also created a distributed, sharded search system. When was this? 1993. Yes! Search engines were running as sharded, distributed clusters of machines long before the Cloud went mainstream and gave rise to the Big Data revolution.
Of course, the reason why search engines are so scalable has to do with two things:
- The index structure, which is easily sharded
- The search mechanism, which is easily distributed to large clusters of machines
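Those two properties can be sketched in a few lines of Python. This is an illustrative toy under simple assumptions, not the design of RetrievalWare or any real engine: documents are routed to a shard by `doc_id % n_shards`, each shard scores hits by raw term count, and threads stand in for separate machines. The class names `Shard` and `ShardedIndex` are invented for the example.

```python
import heapq
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

class Shard:
    """One slice of the index, holding only its own documents."""
    def __init__(self):
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = Counter(text.lower().split())

    def search(self, term, k):
        # Score = how often the term occurs; return this shard's top k.
        term = term.lower()
        hits = [(counts[term], doc_id)
                for doc_id, counts in self.docs.items()
                if counts[term] > 0]
        return heapq.nlargest(k, hits)

class ShardedIndex:
    def __init__(self, n_shards):
        self.shards = [Shard() for _ in range(n_shards)]

    def add(self, doc_id, text):
        # Routing rule (an assumption for this sketch): modulo on the id.
        self.shards[doc_id % len(self.shards)].add(doc_id, text)

    def search(self, term, k=3):
        # Scatter the query to every shard in parallel, then merge.
        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            partials = pool.map(lambda shard: shard.search(term, k), self.shards)
            merged = [hit for part in partials for hit in part]
        return [doc_id for score, doc_id in heapq.nlargest(k, merged)]

cluster = ShardedIndex(n_shards=3)
cluster.add(0, "error error disk")
cluster.add(1, "disk ok")
cluster.add(2, "error network")
cluster.add(3, "error error error timeout")
cluster.add(4, "ok")

print(cluster.search("error", k=2))
```

Each shard only ever returns its own top `k`, so the merge step stays small no matter how many documents the cluster holds. That is the property that lets this pattern spread across many machines.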
One needs no more proof than watching Google.com search tens of billions of records in a fraction of a second to see how powerful this is.
Reason #3: You Need Analytics
Do you need to analyze data for business intelligence or business analytics? Most likely, every organization will say “yes.” And so, use a search engine! Search engines can analyze data. Lots of it. Very fast.
What I’m talking about (mostly) are simple things like histograms, counts, sums, averages, etc. The advantage of a search engine is that it can do this over hundreds of millions (even billions!) of records in a second or two, which can be 1000x faster than the alternatives.
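As a rough illustration of what such a query does, the sketch below filters records through an index and then computes a count, sum, and average per bucket, which is the general shape of a faceted aggregation. It is plain Python over four made-up transactions; the field names and the `facet` function are assumptions for the example, not the API of any search product.

```python
from collections import Counter, defaultdict

# Hypothetical sample data.
transactions = [
    {"id": 1, "region": "east", "type": "card", "amount": 20.0},
    {"id": 2, "region": "west", "type": "card", "amount": 35.0},
    {"id": 3, "region": "east", "type": "wire", "amount": 500.0},
    {"id": 4, "region": "east", "type": "card", "amount": 40.0},
]
rows_by_id = {row["id"]: row for row in transactions}

def build_field_index(rows, fields):
    # (field, value) -> set of ids, so filters become set intersections.
    index = defaultdict(set)
    for row in rows:
        for field in fields:
            index[(field, row[field])].add(row["id"])
    return index

def facet(rows_by_id, index, filters, facet_field, metric_field):
    # 1. Narrow to the matching ids using the index.
    matching = set(rows_by_id)
    for field, value in filters.items():
        matching &= index.get((field, value), set())
    # 2. Aggregate the metric per facet bucket.
    counts = Counter()
    sums = defaultdict(float)
    for rid in matching:
        row = rows_by_id[rid]
        counts[row[facet_field]] += 1
        sums[row[facet_field]] += row[metric_field]
    return {bucket: {"count": counts[bucket],
                     "sum": sums[bucket],
                     "avg": sums[bucket] / counts[bucket]}
            for bucket in counts}

index = build_field_index(transactions, ["region", "type"])
summary = facet(rows_by_id, index, {"region": "east"}, "type", "amount")
print(summary)
```

A dashboard bar chart of "amount by transaction type for the east region" is exactly this result rendered as bars.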
OLAP is dead. Use a search engine.
Do you remember “online analytical processing” (OLAP)? Multi-dimensional hypercubes? Star or Snowflake schemas? There is no reason to use RDBMS technologies for any of that anymore. Search engines have made OLAP for business intelligence and business analytics obsolete.
What does that mean? Search engines can execute searches for dashboards, business reports, exploratory analysis, online responsive analysis, and self-service analytics much faster and in a much more user-friendly manner than any other technology. Search engines coupled with visualization tools like ZoomData and Banana can provide all (or most) of the graphs, charts, bubble diagrams, etc. that you will ever need.
Online analytics is hungry for aggregated data. Search engines can provide that data. FAST.
Reason #4: Your Data Does Not Fit into Tables
While it is true that search engines handle tables (in many ways) better than relational databases, how do you normalize a video? A contract? The genome? Instrument voltage readings?
Take Cloudera CDH, for example, on which we have built many big data applications. Many of our engagements go like this:
Customer: “We’re putting everything into Impala!”
Customer: “Umm… we have this weird type of multi-valued data that doesn’t fit well. Can search handle it?”
Customer: “You know, we also have these five text fields. I guess they would be better in the search engine too, right?”
Customer: “We discovered that we have to do 99 joins and it’s really slow. Can we just do multi-valued fields in a search engine?”
Customer: “You know, maybe it just makes sense to put everything into Cloudera Search? We’ll still use Impala for preprocessing.”
And that’s how it usually goes.
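The multi-valued case in that conversation is easy to picture in code. In a relational design, a list of values per record usually becomes a child table plus a join; in a search index, the list simply lives on the document and every value is indexed. The sketch below is a hedged, in-memory illustration with made-up records and invented helper names (`index_multivalued`, `match_all`), not Cloudera Search or Impala code.

```python
from collections import defaultdict

# Hypothetical records: "tags" holds any number of values per record.
records = [
    {"id": "r1", "title": "Trial A", "tags": ["oncology", "phase-2", "adult"]},
    {"id": "r2", "title": "Trial B", "tags": ["cardiology", "adult"]},
    {"id": "r3", "title": "Trial C", "tags": ["oncology", "pediatric"]},
]

def index_multivalued(docs, field):
    # Every value in the list points back to its document; no child table.
    postings = defaultdict(set)
    for doc in docs:
        for value in doc.get(field, []):
            postings[value].add(doc["id"])
    return postings

def match_all(postings, values):
    # Documents that carry all of the requested values.
    sets = [postings.get(value, set()) for value in values]
    return set.intersection(*sets) if sets else set()

postings = index_multivalued(records, "tags")
print(match_all(postings, ["oncology", "adult"]))
```

Adding a fourth or fortieth value to a record changes nothing about the structure, which is why this kind of data tends to migrate toward the search engine.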
Reason #5: You Don’t Know What’s in Your Data
It’s the mantra of the data lake: “Load all your raw data into a data lake and we’ll figure out what to do with it later!”
So what do you do? You spin up a dozen teams, point them at all of your business systems, and say “Go fetch data!” Now, after a year or so, you have a data lake full of data. There are hundreds of millions of files and folders and billions of records spread throughout your cluster from thousands of systems.
So... Now what?
Point a search engine at it. Search can index pretty much anything. Of course, a search engine will do a better job if it knows what’s inside, but if it doesn’t? Well, no biggie. A search engine will just index it and you can do any sort of keyword search on it.
The first thing to do after loading up a data lake with data is to find the files and folders that contain interesting things. Since a search engine does not need a predefined schema (it can index any random bag of tokens, unlike a relational database), it can help sift through your billions of files and folders to find those that contain useful data. Then, you can start processing for real.
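Here is a small sketch of that "bag of tokens" idea, assuming nothing about file formats. A CSV export, a JSON reading, and a log line are all tokenized the same way, and the index maps each token to the paths that contain it. The paths and contents are invented for the example, and an in-memory dictionary stands in for the data lake.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[A-Za-z0-9]+")

# Hypothetical lake contents: path -> raw text, formats deliberately mixed.
lake = {
    "/raw/crm/export_0413.csv": "cust_id,name,city\n1007,Ana Lopez,Denver\n",
    "/raw/sensors/tractor-17.json": '{"engine_temp": 92.5, "operator": "Lopez"}',
    "/raw/logs/app.log": "2016-03-02 ERROR payment timeout for cust 1007",
}

def index_lake(files):
    # No schema: every run of letters or digits becomes a searchable token.
    index = defaultdict(set)
    for path, content in files.items():
        for token in TOKEN.findall(content.lower()):
            index[token].add(path)
    return index

def find(index, word):
    # Which files mention this word at all?
    return sorted(index.get(word.lower(), set()))

index = index_lake(lake)
print(find(index, "lopez"))
print(find(index, "1007"))
```

A search for a customer number turns up both the CRM export and the application log, which is the kind of discovery that tells you which datasets are worth processing further.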
So yes, use a search engine just to find useful data in your data lake, especially when you have a massive lake and you don’t know what’s in there. It’s a good place to start.
Reason #6: It’s Easier Than You Think
It sounds hard – sending everything to a search engine – but it’s really not that hard with the technologies we have today. Why? Because search engines today are tightly integrated with big data.
A common example from our projects is Cloudera Search, the search engine running directly on the Cloudera platform. The indexes are stored in HDFS and can be managed through Cloudera Manager. Further, you can use HUE (the Hadoop User Experience) to do searches, dashboards, and cool analytics.
There are also many other tools that help with this process. Search Technologies’ Aspire platform can now index HDFS files and publish them to Cloudera Search or other search engines. And there are many tools and components (including our own Aspire system) for reading through tables and other structured formats (such as JSON Lines), parsing them, and indexing them into search engines.
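The parsing step is usually the simple part. As a rough sketch (this is plain Python using only the standard library, not Aspire or any Cloudera API), reading a JSON Lines stream into documents and grouping them into batches for an indexer might look like the following. The sample records, the function names, and the batch size are all assumptions for illustration.

```python
import io
import json

def read_json_lines(stream):
    # One JSON object per line; blank or malformed lines are skipped.
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(doc, dict):
            yield doc

def batches(docs, size):
    # Group documents so they can be sent to an indexer in bulk.
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Hypothetical input; a real job would read from a file or HDFS instead.
sample = io.StringIO(
    '{"id": 1, "station": "KDEN", "temp_c": 4.5}\n'
    '\n'
    'not json\n'
    '{"id": 2, "station": "KSFO", "temp_c": 13.0}\n'
    '{"id": 3, "station": "KJFK", "temp_c": 7.2}\n'
)
docs = list(read_json_lines(sample))

for batch in batches(docs, size=2):
    print(len(batch), [doc["id"] for doc in batch])
```

Whether bad lines should be skipped, logged, or treated as fatal is a project decision; skipping silently is only the simplest choice for a sketch.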
So It’s Obvious Now - If You Have Big Data, You Need Search!
Search engines and big data go together like peanut butter and jelly! They are the perfect combination.
Incorporating search into your big data projects can help solve all sorts of problems. So, consider implementing search early on in your project planning. Indeed, we’ve come across many use cases that underscore this. Check out some examples below and connect with us to see how search can fit into your big data projects.
- "In the Trenches with Search and Big Data" v-blog series
- Top 5 big data analytics use cases that involve search