This article is the third in a multi-part blog series, Search for Big Data and Big Data for Search, by Paul Nelson, Chief Architect at Search Technologies.
Realization #3: Search Beats SQL
I’m old enough that I learned about network and hierarchical databases as an undergraduate. Oh sure, relational databases existed, but there was still considerable debate about which database structure “would win.” I remember how, as late as 1986, an engineer who had written a book on hierarchical databases confidently assured me that relational databases were “just a fad.” “The world is naturally hierarchical,” he said.
When I first learned Quel (the RDBMS language of Ingres), I was in love. Here was a language that used set-mathematics to achieve a level of expressiveness I had not previously known was possible. It was a revelation and I was hooked. Eventually, of course, I was forced to switch to SQL, and my love faded. Using SQL has always felt like an arranged marriage (arranged by IBM and Oracle, at gunpoint), whereas Quel will forever be my true love.
But who am I to argue with success? SQL is 40 years old this year and still going strong.
I have come to realize that Search beats SQL. It's hard to describe just how radical a statement this is for me. It's rather like saying that screen sharing beats a Boeing 747.
In other words, I had never before pitted SQL against search. For me, these have always been two fundamentally different tools, which have fundamentally different purposes and goals. One is text search – unstructured. The other is set mathematics – structured. They always lived happily together, each within its own domain.
But SQL has become stretched to the breaking point, and is being used for functions which are better suited to Search. Nowhere is this more true than in data warehouse applications, and OLAP. And this is what search can do for Big Data. It can make Big Data accessible to the world.
But why? Why search over SQL? Well first, because...
- It's fast
Search returns results in seconds, or fractions of a second. Data warehouse SQL queries take minutes. Or hours. Or days.
Speed delivers agility. Speed means there is less of a penalty for making mistakes. Speed equals play, and fun. Speed opens up data analytics to everyone, not just database programmers. This is all the more true because...
- It's familiar
Everybody knows and uses search, every day. Even my 85-year-old mother uses search and (until recently) so did my 96-year-old uncle (okay, admittedly he wrote the book on Prolog, back in the day, but still...).
The percentage of people who use search dwarfs the percentage of people who use SQL by many orders of magnitude.
Sure, you say, but that’s just simple text search – navigating to web pages. That can’t compare to SQL, with all of its cool set-mathematics expressivity. No way!
And, until recently, I would have agreed with you. But not anymore, because...
- It's powerful
What's happened, especially with open source, is that engineers (like all my friends at Search Technologies) are getting inside the search engine, and making it do things it never could before. What sorts of things? Well, all of those things that we could (previously) have only done with SQL, like:
- Histograms: In SQL these would require group-by clauses with aggregations. In search, you choose a field and ask for it to be faceted.
- Multi-dimensional pivot-tables: Essentially, histograms (aka facets) across multiple fields, which produce tables of data. Heat maps, anyone?
- Group-by clauses: On high-cardinality fields, search can now return a list of groups rather than individual records. In other words, you search over records, but return people. Or companies. This is insanely powerful. On low-cardinality fields, search can return multiple result sets, for example the best results for each type of document, or each content source.
- Geographic distance computations & range filters: These have been in search for a while now.
- Math: More and more search engines allow for arbitrary mathematical functions to be computed on all results. Some search engines even provide special operators for computing standard descriptive statistics, such as average, std-dev, min, or max, for specified fields.
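To make the SQL-to-search mapping above concrete, here is a minimal sketch in plain Python, using hypothetical documents and field names. It simulates what a search engine computes for facets (histograms), multi-field pivots, and descriptive statistics; real engines do this over the index, in a distributed fashion.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical result set: each document is a dict of indexed fields.
docs = [
    {"type": "invoice", "state": "CA", "amount": 120.0},
    {"type": "invoice", "state": "NY", "amount": 80.0},
    {"type": "report",  "state": "CA", "amount": 40.0},
    {"type": "invoice", "state": "CA", "amount": 60.0},
]

def facet(docs, field):
    """Histogram over one field -- what SQL needs GROUP BY + COUNT(*) for."""
    return Counter(d[field] for d in docs)

def pivot(docs, row_field, col_field):
    """Two-dimensional facet: counts keyed by (row, column) pairs."""
    return Counter((d[row_field], d[col_field]) for d in docs)

def stats(docs, field):
    """Descriptive statistics over a numeric field of the result set."""
    values = [d[field] for d in docs]
    return {"min": min(values), "max": max(values),
            "avg": mean(values), "stdev": stdev(values)}

state_histogram = facet(docs, "state")        # {'CA': 3, 'NY': 1}
type_by_state = pivot(docs, "type", "state")  # counts per (type, state) cell
amount_stats = stats(docs, "amount")
```

The point is not the ten lines of Python, but that a search engine returns these summaries alongside the matching documents, in the same sub-second query.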
And not just on small data sets, but with really big ones, because...
- It's scalable
Billions of records can be searched in a fraction of a second. This happens 35,000 times a second on Google.com. Of course, Google has tens of thousands of servers, but with cloud hardware, these barriers are starting to break down.
Search is scalable because there are no joins. Joins are painful and scale-killing because they require the matching of records from multiple tables, a process that cannot be easily distributed across massive networks of machines.
In search, documents are “pre-joined” when they are processed for indexing. Yes, this means putting more effort into how data is represented in the search engine (constructing “360° views” of entities, for example), but not as much effort as you might think, because...
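A hypothetical sketch of that pre-joining step: instead of joining an orders table to a customers table at query time, the join is resolved once when each document is indexed, so every search document is self-contained. (Table and field names here are invented for illustration.)

```python
# Hypothetical relational tables, as plain Python dicts.
customers = {1: {"name": "Acme Corp", "region": "EMEA"},
             2: {"name": "Globex",    "region": "APAC"}}

orders = [
    {"order_id": 100, "customer_id": 1, "total": 250.0},
    {"order_id": 101, "customer_id": 2, "total": 90.0},
]

def to_search_doc(order):
    """Resolve the customer join once, at indexing time, so the search
    document carries everything it needs and no query-time join exists."""
    customer = customers[order["customer_id"]]
    return {
        "order_id": order["order_id"],
        "total": order["total"],
        "customer_name": customer["name"],
        "customer_region": customer["region"],
    }

index = [to_search_doc(o) for o in orders]

# A "360-degree view" query is now a flat filter, trivially distributable:
emea_orders = [d for d in index if d["customer_region"] == "EMEA"]
```

Because each document is flat, the query can be fanned out across any number of shards and machines, with no cross-machine record matching.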
- It's schema free
In a relational database, you create columns with CREATE statements. And then you fill those columns with INSERT statements. And if you have any sort of common data (a state name, for example) you create a new table.
All those columns! All those tables!
Months are spent creating all these tables and all these columns. Hundreds of tables and thousands of columns is normal. I see E-R designs all the time with diagrams that can only be printed on multiple newspaper-sized sheets of paper.
But search engines have never really required a schema. Wait, you say, early search engines (like mine, RetrievalWare, and FAST ESP) required a fixed set of fields, didn’t they? Isn’t that proof that search engines require schemas?
But the truth is that these were decisions to optimize disk space (see realization #1 above) and were never really requirements of search itself. All of the new search engines (Lucene, Solr, Elasticsearch, MarkLogic, Attivio, etc.) can handle any set of fields at any time, and have been able to for years.
[Note that Solr does have a schema.xml file, but within it you can declare dynamic fields. Further, Search Technologies has extended Solr to allow for true XML fields, which allow a single field to hold and search over any sort of XML content.]
How important is it to be schema free? Based on my experience, I would say the average billion-dollar company has at least 100 sources of data that could be combined into a comprehensive enterprise search and analytics system. That’s 100 sources x 25 tables or sub-sources x (say) 5 fields each = 12,500 fields to map. Search Technologies (a relatively small company) already has at least 14 different sources that I know of.
How can one possibly map all of that data into one massive schema?
Answer: You don’t. Simply output the records in XML and JSON (most systems provide this out of the box), and then load that data into a search engine. Then map fields as needed (see realization #1 above, about agility).
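A sketch of that workflow, with hypothetical sources and field names: each source emits JSON with whatever fields it happens to have, and the "mapping" is just a small per-source normalization step added as each source is onboarded, never a global schema designed up front.

```python
import json

# Two hypothetical sources, each with its own field names -- no shared schema.
source_a = '{"empName": "Ada", "dept": "Engineering"}'
source_b = '{"full_name": "Grace", "division": "Research"}'

# Per-source field renames, added incrementally, only where needed.
mappings = {
    "a": {"empName": "name", "dept": "group"},
    "b": {"full_name": "name", "division": "group"},
}

def normalize(raw_json, source):
    """Rename the fields we care about; pass unknown fields through as-is."""
    record = json.loads(raw_json)
    return {mappings[source].get(k, k): v for k, v in record.items()}

index = [normalize(source_a, "a"), normalize(source_b, "b")]
```

Fields you never map simply flow into the index under their original names, which is exactly why a 12,500-field up-front mapping exercise never happens.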
And finally, before I forget...
- It's secure
To handle enterprise applications, search engines (and the surrounding ecosystem, such as Search Technologies’ Aspire) have had to handle complex security constraints, and multi-model security, in a way that relational databases have never had to consider. For further information, see my security blog article.
This means that you can load in data from external sources, and the ecosystem will automatically include the ACLs for that data. Further, search engines are smart about users and group memberships, and will know (for example) that a user can be a member of LDAP groups, SharePoint groups, Jive groups, etc, and will only allow them to search and do analytics, on documents to which they are permitted access.
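A minimal sketch of that document-level security trimming, with invented group names: each indexed document carries the ACL captured from its source system, and every query is filtered against the union of the user's group memberships across LDAP, SharePoint, Jive, and so on.

```python
# Each document stores the ACL captured when it was crawled from its source.
docs = [
    {"id": 1, "title": "Org chart",     "acl": {"ldap:staff"}},
    {"id": 2, "title": "Board minutes", "acl": {"sp:executives"}},
    {"id": 3, "title": "Wiki home",     "acl": {"ldap:staff", "jive:everyone"}},
]

def search(user_groups, docs):
    """Security trimming: a document matches only if the user belongs to
    at least one group on its ACL (set intersection is non-empty)."""
    return [d for d in docs if d["acl"] & user_groups]

visible = search({"ldap:staff"}, docs)  # org chart and wiki, never the minutes
```

The same filter applies to analytics, so facet counts and statistics are computed only over documents the user is permitted to see.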
It’s clear that Search has numerous advantages over SQL – so what’s missing? Well, lots of things:
- We need more, better user interfaces.
There are lots more tools based on SQL at the moment. For example, Tableau. Search has some catching up to do (although Kibana is an interesting start).
- The user interfaces need lots more functionality.
For example, work groups, saved and shared queries, saved and shared analytics, plotting, printing, saved and shared folders of documents, read/unread flags, etc.
Here is where I think we can steal some lessons from government intelligence agencies, which are further ahead in terms of analytics, and intelligence gathering functionality.
- Standard query languages
SQL is (nominally) a standard. It is (mostly) the same across vendors.
Search query languages, in contrast, are all over the map. The functions listed above are handled in completely different ways by different engines. The industry is crying out for a standardized search / NoSQL query language, with analytic capabilities.
Maybe it’s time to create one.
We’ll get there. After all, SQL has a 40-year head start.
What does this all mean?
I have recently become fond of saying that there have been three computer revolutions in my lifetime:
- The Personal Computer. Yes, I wrote a simple program using toggle switches on my friend’s brand-new Altair 8800 computer. It added two numbers together. I also have a Kaypro luggable, an IBM-PC and one of the first Macintosh computers in my basement.
- The Internet: My business partner and I started our first search engine company in 1989, before the internet, which I personally peg to 1995 (when Netscape went public). That year was a weird experience for me. It was like going outside one day, and discovering that the sky had turned green, whereas just yesterday it was blue. Nothing was the same. My search engine, RetrievalWare (which lasted until 2007 before being sold to FAST), never really recovered from that sea change. And...
- Big Data: As with all revolutionary changes, it seems that, magically, all of the necessary tools gather together, just when they’re needed. Big Data requires massive distributed storage, elastic computing resources, distributed processing frameworks, flexible data representation, large-scale statistical analysis, and data mining algorithms...
I think that my overall message from this blog is to expect big changes in search over the next five years.
While the internet made search engines a household word (as Ziggy and Dilbert illustrate), the truth is that search engines have not really changed much since the early 1990s, when everyone switched to relevancy ranking, and distributed parallel processing.
Sure, the algorithms are more efficient, and we have facets, and some fancy analytics, but we’re still using the same old TF-IDF, and the general architecture (content processing --> tokenization --> index --> distributed search) has remained about the same for 20+ years.
So, it’s high time we shook things up a bit, and I’m convinced that embracing the frameworks provided by Big Data will be the catalyst to a flowering of capabilities which will have as much impact on the technology as it will on the way we do business, and on society as a whole.
Let’s grab the bull by the horns, and make it work.