Security: Early Binding versus Late Binding
This is the second in our series of "Graduate Level" articles defining a no-compromise approach to implementing document-level security in search applications. The previous article defined the goals and the challenges.
Search engine security is primarily concerned with filtering documents from the search results. This is called either “security filtering” or “security trimming” of the results. At the highest level, there are two ways this can occur: 1) by modifying the query (early binding), or 2) by sifting through the results (late binding).
Spoiler Alert: Early binding is (by far) the better architecture.
Option 1: Modify the Query (Early Binding)
The first (and best) method for filtering results is to modify the search engine query to add a filter that limits the search to only those documents to which the user has access. This method is called either “Early Binding” or “Pre Trimming”. The architecture looks like this:
Figure: Document Level Security by Modifying the Query: Early Binding
As an example, suppose that “johndoe” is a member of the “Developers” and “VirginiaEmployees” groups. If this is the case, then the following query:
george washington
would be modified as follows:
(george washington) AND allowACLs:(johndoe OR Developers OR VirginiaEmployees)
In the above query, only documents for which the “allowACLs” field contains either John Doe’s user name (johndoe) or any of his groups will be returned in the results.
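The query rewrite described above can be sketched in a few lines of Python. This is only an illustration: the function name `build_secured_query` and its parameters are our own hypothetical names, and a production implementation would also need to escape query syntax in the user's input and in the principal names, which this sketch does not do.

```python
def build_secured_query(user_query, username, groups, acl_field="allowACLs"):
    """AND the user's query with a Boolean filter over the ACL field.

    Hypothetical helper: performs no escaping of special query characters.
    """
    # The user name and all group names are treated as "principals".
    principals = [username, *groups]
    acl_clause = " OR ".join(principals)
    return f"({user_query}) AND {acl_field}:({acl_clause})"


secured = build_secured_query(
    "george washington", "johndoe", ["Developers", "VirginiaEmployees"]
)
print(secured)
# (george washington) AND allowACLs:(johndoe OR Developers OR VirginiaEmployees)
```

Wrapping the user's query in parentheses keeps the ACL clause from being absorbed into the user's own Boolean operators.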
Advantages:
- Easy to Implement: The security filter becomes a simple Boolean query over ACL fields
  - Search engines are very good at executing Boolean queries
- Accurate Counts: Because the query itself is modified, the search engine will automatically compute correct counts for:
  - Total number of documents: only those documents to which the user has read access
  - Facet counts (note: “facets” and “navigators” are the same thing in search)
- High Performance: If implemented correctly, early binding has minimal impact on search performance.
Disadvantages:
- Requires Indexing the ACL: There is a lot of hard work just getting the access control list out of the original content source (in the correct format) and writing it into the search engine
- Very Large Queries: Some search engines don’t like to execute very large queries of hundreds or even thousands of terms (i.e. a long list of all of the groups of which a user is a member)
- ACL Changes Must be Re-Indexed: A change in an ACL will take longer to be reflected in the search results, because the document must be re-indexed before the changes take effect.
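To illustrate why early binding yields accurate counts, here is a minimal sketch that uses a small in-memory list as a stand-in for a search index. The `INDEX` data, the `type` facet field, and the `secured_search` function are all hypothetical. Because the ACL filter is part of the match condition, the total and the facet counts are computed only over documents the user can read.

```python
from collections import Counter

# Hypothetical in-memory "index": each document carries its indexed ACL.
INDEX = [
    {"id": 1, "text": "george washington biography", "type": "pdf",
     "allowACLs": {"Developers"}},
    {"id": 2, "text": "george washington letters", "type": "html",
     "allowACLs": {"Executives"}},
    {"id": 3, "text": "washington state travel guide", "type": "pdf",
     "allowACLs": {"johndoe"}},
    {"id": 4, "text": "george washington carver", "type": "html",
     "allowACLs": {"VirginiaEmployees", "Executives"}},
]


def secured_search(terms, principals):
    """Match every term AND require at least one shared ACL principal."""
    hits = [
        doc for doc in INDEX
        if all(term in doc["text"].split() for term in terms)
        and doc["allowACLs"] & principals
    ]
    # Counts are derived from the already-filtered hits, so they are exact.
    facets = Counter(doc["type"] for doc in hits)
    return hits, len(hits), facets


hits, total, facets = secured_search(
    ["george", "washington"],
    {"johndoe", "Developers", "VirginiaEmployees"},
)
print([doc["id"] for doc in hits], total, dict(facets))
```

Document 2 matches the search terms but is excluded by the ACL filter, so it never contributes to the total or to the facet counts.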
Option 2: Filter the Results (Late Binding)
With this option, the results are individually checked as they are produced by the search engine. This is called “late binding”, “post trimming”, or “the last minute check”.
Figure: Document Level Security by Filtering the Results: Late Binding
There are two alternative approaches to filtering the results:
- Fetch the document’s ACLs from the original content source
  - Check them against the user’s username and group membership
- Ask the original content source: “Does this user have access to this document?”
  - This can be implemented with an HTTP HEAD request or a GET request for 0 bytes
Advantages:
- No Need to Index the ACL: In many cases, simply asking a content source “Does this user have access to this document?” may be easier to implement
- ACL Changes Immediately Reflected in Search Results: ACLs are not indexed; therefore, documents do not need to be re-indexed when ACLs change
- Can Handle Any Complex Security Model: Since access is checked document-by-document, late binding can be programmed to handle any arbitrarily complex security model.
Disadvantages:
- Extremely Slow: Documents from the search results need to be individually checked.
  - This can mean having to check thousands or millions of documents
  - Since all checks are made against the original content source, search performance will be very slow
- Inaccurate Counts: To improve performance, total document counts and facet counts are often estimated
  - Typically, late-binding systems will only check enough documents to fill out a page of results
  - This means that the total document count (thousands or millions) and the facet counts are often estimates, based on statistical distributions, and are not fully accurate
- Paging Problems: Since the total document count is an estimate, so is the total number of pages
  - This will often cause problems with paging (incorrect page counts, page counts that change as you go from page to page, inability to jump to page N, etc.)
- Instability: The algorithm depends on all content sources being up and available
  - If a content source server goes down, then results from that source cannot be returned
  - With many content sources, overall system stability is compromised
- Infrastructure Burden: Since every search requires communication with the original content source, that content source must have enough resource capacity to handle these requests
  - In practice, this means scaling up your content source servers to handle additional load
  - Such inter-system resource dependencies make scaling the system to high user numbers extremely difficult.
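The late-binding behavior described above can be sketched as follows. This is a simplified illustration, not any particular engine's algorithm: `late_binding_page` is a hypothetical function, and the `can_read` callback stands in for a round trip to the original content source (for example, an HTTP HEAD request).

```python
def late_binding_page(ranked_doc_ids, can_read, page_size):
    """Check results one by one until a single page is filled.

    Returns the page plus the number of access checks performed.
    """
    page, checks = [], 0
    for doc_id in ranked_doc_ids:
        checks += 1                 # one content-source round trip per result
        if can_read(doc_id):
            page.append(doc_id)
            if len(page) == page_size:
                break               # stop early: the rest is never checked
    return page, checks


# Hypothetical data: 1,000 ranked hits, of which the user may read every 10th.
ranked = list(range(1000))
page, checks = late_binding_page(ranked, lambda d: d % 10 == 0, page_size=5)
print(page, checks)
# [0, 10, 20, 30, 40] 41
```

In this run, 41 access checks were needed to produce five results, and the remaining 959 hits were never checked. That is why the total document count, the facet counts, and the page count can only be estimated.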
Option 3: Both
Some search engines provide both an early-binding (“pre trimming”) option and a late-binding (“last minute check”) option.
Early Binding versus Late Binding – Analysis
There has been much debate throughout the search engine world about “early binding versus late binding”, to little purpose. From our perspective, early binding is the only scalable architecture.
The arguments for early binding are compelling:
- High Performance: Documents are filtered out as early in the process as possible.
- Searches Have Little or No Impact on Infrastructure: Early binding will not require you to add servers to your content source systems
- No Compromises: All document counts, paging, and facet counts will be perfect.
There are, generally, only two arguments against early binding:
1. ACL changes require a re-index:
The argument here is that a user will continue to see the document in the search results after an ACL change that removes access rights, until the document is re-indexed. While this is true, the situation is mitigated by the following facts:
- Security is usually checked one last time when the user requests the original document
  - This is because the original document is usually served up by the original content source, which will re-authenticate and do a final check to see if the user has access
- Late binding systems use ACL caching to improve performance
  - But this caching comes at a price: ACL changes will not be seen in the search results until the caches are refreshed
  - This is the exact same latency problem as for early binding
- Indexing on modern systems is much faster than it used to be
  - Index updates for important collections can usually be optimized to quickly re-index documents when needed
- Some new systems perform fast ACL updates on the index
  - And in some new systems, ACLs can be updated without having to re-index the entire document.
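The caching point above can be made concrete with a small sketch. The `AclCache` class, its time-to-live (`ttl`) policy, and the simulated clock are assumptions made for illustration; real late-binding systems may cache differently. The sketch shows a revoked permission still being honored until the cached entry expires.

```python
class AclCache:
    """Caches access decisions for `ttl` seconds (hypothetical helper)."""

    def __init__(self, fetch_access, ttl, clock):
        self._fetch = fetch_access  # callable: (user, doc_id) -> bool
        self._ttl = ttl
        self._clock = clock         # callable returning the time in seconds
        self._entries = {}

    def can_read(self, user, doc_id):
        key = (user, doc_id)
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]        # possibly stale answer
        allowed = self._fetch(user, doc_id)
        self._entries[key] = (allowed, now)
        return allowed


# Simulated content source and clock, so the example is deterministic.
source_acl = {("johndoe", "doc1"): True}
now = [0]
cache = AclCache(lambda u, d: source_acl.get((u, d), False),
                 ttl=300, clock=lambda: now[0])

before = cache.can_read("johndoe", "doc1")  # fetched from the source
source_acl[("johndoe", "doc1")] = False     # access is revoked at the source
now[0] = 60
stale = cache.can_read("johndoe", "doc1")   # still served from the cache
now[0] = 301
fresh = cache.can_read("johndoe", "doc1")   # entry expired, re-fetched
print(before, stale, fresh)
# True True False
```

For the lifetime of the cache entry, the late-binding system shows the same stale result that an early-binding system would show before re-indexing.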
2. It’s hard to get the ACL from the content source.
This is a situation of “you get what you pay for.”
While it is true that fetching ACLs from content sources requires more software development effort, the advantages in performance and results quality more than compensate for the extra effort.
At Search Technologies, we are solving this problem by creating increasingly sophisticated connector development frameworks, which allow us to solve these problems in as uniform and reliable a fashion as possible.
Again, it is clear to us that early binding is the only scalable and dependable option for large enterprise search systems with document level security. It is the only one we recommend and the only one we implement at customer sites.
In the next article, we'll describe a recommended reference architecture for document-level security in enterprise search systems.