Everything You Ever Wanted to Know about Search Engine Security
(but were afraid to ask)
I’ve found that search engine security is one of the most poorly understood aspects of search engines. While a great deal has been written about relevancy and scalability, almost nothing has been written about search engine security architectures. This is unfortunate, because there is definitely a right way and a wrong way to do it. Over and over again, I see search engine vendors and content source vendors implement a variety of bad ideas before they finally arrive at the correct solution.
This blog post is another in my series of “Graduate Level Courses” on search engine architectures (the first was on relevancy ranking). So we'll get into the technical details of security architectures, methods, and trade-offs.
So buckle up.
The Goal: No Compromises
Just to be clear, our goal will be to create a search engine that behaves in all respects just like a non-secured engine. The only difference will be that each user only sees a subset of the documents available – the ones to which each user has read access.
This is an important point, because most search engines have limitations, shortcuts, and inaccuracies when running queries with document-level security.
Specifically, our “no compromises” approach will have the following requirements:
- Only documents that the user can read will be in the search results
  - It is not enough to just block out the metadata. If the user does not have read access to the document, it must not show up in the search results at all
- The total count of documents found for the query must be shown
  - This number must be an exact number, and not an estimate
  - The counts must only include documents to which the user has read access
- All facet counts must be shown
  - Further, these numbers must be exact, not estimates
  - The counts must only include documents to which the user has read access. (Note: “Facets” are the same as “Navigators”)
- Performance: Searches are fast
  - Perhaps not as fast as in a non-secured environment but no more than 10% slower
- Stability: Searches will work even when the original content servers are down
  - For example, you can still search for documents in content source X even if the original server X is down for maintenance
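To make the first three requirements concrete, here is a minimal Python sketch. It is not how any particular engine implements security trimming; the `Doc` record, its flat `readers` set, and the `secure_search` function are all hypothetical names made up for illustration. The point is only that the hit list, the total, and the facet counts are all computed from the same filtered set, using data already held in the index rather than a call back to the content server.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    source: str           # used here as the facet value (assumption)
    readers: frozenset    # users and groups with READ access (hypothetical flat ACL)


def secure_search(matches, principals):
    """Return (visible docs, exact total, exact facet counts) for one user.

    `matches` is every document that matched the query; `principals` is the
    user's name plus all of the user's groups.
    """
    # Drop unreadable documents entirely; hiding only their metadata is not enough.
    visible = [d for d in matches if d.readers & principals]
    # Total and facets are counted after filtering, so they are exact and
    # never reveal the existence of documents the user cannot read.
    facets = Counter(d.source for d in visible)
    return visible, len(visible), facets


matches = [
    Doc("a", "SharePoint", frozenset({"Accounting"})),
    Doc("b", "SharePoint", frozenset({"Developers"})),
    Doc("c", "Documentum", frozenset({"Developers", "alice"})),
]
visible, total, facets = secure_search(matches, frozenset({"alice", "Accounting"}))
print(total, dict(facets))
```

Filtering every match one by one like this would not meet the performance requirement on a large index; the sketch shows what must be true of the output, not how to compute it quickly.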
The Problem: Many Security Models
The problem with secure document search is that it is an example of a “broad problem” (see my blog on Types of Hard). In other words, there are many different content sources, and therefore many different access methods and security models. (This is probably why search engines have such a hard time with security: it’s annoyingly full of special cases.)
So let’s first review some content source security models to get an idea of the scope of the problem.
Access Control Lists
Most content sources are based on Access Control Lists (ACLs). For the purpose of search, these are lists of users and groups that have access to read a document.
Note: All we care about is READ rights. Search is not concerned with any other access rights (write, delete, append, etc.).
In practice, roles are an intermediate step to determine if a user or group is allowed to read a document. For example, the “Accounting” group may be assigned to the “Reviewers” role which may be allowed to read all documents in a database.
A user can be made a member of multiple named groups. When the group is given access to a document, all of its members automatically receive access. Of all aspects of security, group membership is the most dynamic. People are hired and fired all the time. Therefore, group membership is forever changing.
Further, there is no cross-database consistency in groups. Most content systems will maintain their own set of groups. This means that you could be a member of the “SharePoint:Developers” group in a SharePoint system and a member of the group “Documentum:Developers” in a Documentum system.
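One way to keep same-named groups from different systems apart is to qualify every group with its content source, as in the “SharePoint:Developers” example. The sketch below assumes a hypothetical `memberships` mapping from source name to group names; how that mapping is actually obtained depends on each content system and is not shown.

```python
def qualified_principals(memberships):
    """Turn {source: [group, ...]} into a set of 'source:group' strings."""
    return {
        f"{source}:{group}"
        for source, groups in memberships.items()
        for group in groups
    }


# A user who is a "Developers" member in two unrelated systems (sample data).
principals = qualified_principals({
    "SharePoint": ["Developers"],
    "Documentum": ["Developers", "Reviewers"],
})
print(sorted(principals))
```

With qualified names, membership in “Developers” on one system can never be mistaken for membership in “Developers” on another.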
In addition to being allowed access to documents, individuals and groups can be explicitly denied access. Typically, deny takes precedence over allow. In other words, if the same user is both on the deny and the allow list for a document, that user will be denied access to the document.
This will also be the case if the user is on the allow list, but one of the user’s groups is on the deny list. The user will still be denied access (for this reason, deny is less frequently used).
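The allow/deny rules above can be sketched in a few lines of Python. The `Acl` class and `can_read` function are hypothetical names for illustration, and the sketch assumes the typical case described above, where deny takes precedence over allow; a given content source may behave differently.

```python
from dataclasses import dataclass, field


@dataclass
class Acl:
    allow: set = field(default_factory=set)   # users and groups granted READ
    deny: set = field(default_factory=set)    # users and groups explicitly denied


def can_read(acl, user, groups):
    """Evaluate READ access with deny taking precedence over allow."""
    principals = {user} | groups
    # A deny on the user or on any of the user's groups wins outright.
    if acl.deny & principals:
        return False
    # Otherwise the user, or at least one group, must be on the allow list.
    return bool(acl.allow & principals)


acl = Acl(allow={"alice", "Accounting"}, deny={"Contractors"})
print(can_read(acl, "alice", {"Contractors"}))  # False: allowed user, denied group
print(can_read(acl, "bob", {"Accounting"}))     # True: allowed through a group
print(can_read(acl, "carol", {"Developers"}))   # False: not on the allow list
```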
Hierarchical (Folder) Security
Many systems organize their documents into some sort of hierarchy (folders, directories, spaces, databases, cabinets, etc.). Often, security will be hierarchical as well. For example, you must have access to the parent container (such as the database or space) before you are allowed to browse for documents (and then you must have access to the document as well).
Hierarchical security means that access can no longer be defined as a simple list of users and groups. For example, if “Developers” are allowed access to Database X and then “Virginia Employees” are given access to document X.Y (document Y within database X), then only “Developers” who are also “Virginia Employees” will be allowed to read document X.Y. In other words, it is the intersection of the users in “Developers” with the users in “Virginia Employees” that have read access to document X.Y.
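The intersection can be worked through with a small Python sketch. The group memberships are made-up sample data, and `readers` is a hypothetical helper that assumes the simple model described here: a user needs access at every level of the hierarchy, and each level is a plain list of allowed groups.

```python
# Made-up group memberships for illustration.
members = {
    "Developers": {"ann", "bob", "carl"},
    "Virginia Employees": {"bob", "carl", "dee"},
}


def readers(chain):
    """Users who pass every level of a non-empty container chain.

    `chain` lists the allowed groups at each level, outermost first.
    Within one level any listed group is enough (union); across levels
    the user must pass all of them (intersection).
    """
    levels = [set().union(*(members[g] for g in groups)) for groups in chain]
    return set.intersection(*levels)


# Database X allows "Developers"; document X.Y allows "Virginia Employees".
chain_xy = [["Developers"], ["Virginia Employees"]]
print(sorted(readers(chain_xy)))
```

Here only bob and carl can read document X.Y: ann is a developer but not a Virginia employee, and dee is a Virginia employee who cannot get into Database X.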
Over the years, we have seen a wide variety of other security oddities:
- Public Flag: Some systems will use a special flag to mark publicly available documents
- Required Groups: These are groups of which the user must be a member before that user can read the document. Required groups have the same effect as hierarchical ACLs (see above), but are specified on the document itself
- Owner: Many systems have the notion of a document owner who is always allowed access; the owner's access takes precedence over all other constraints.
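These three oddities can be layered on top of a plain allow list, as in the Python sketch below. The `DocSecurity` fields and the `can_read_doc` function are hypothetical, and the order of the checks is partly an assumption: the owner check comes first because the owner takes precedence over everything else, but letting the public flag bypass required groups is a guess that a real content source may not share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocSecurity:
    owner: str
    is_public: bool = False
    required_groups: frozenset = frozenset()  # user must belong to ALL of these
    allow: frozenset = frozenset()            # users and groups granted READ


def can_read_doc(sec, user, groups):
    """Evaluate READ access with owner, public flag, and required groups."""
    if user == sec.owner:
        return True                     # owner overrides all other constraints
    if sec.is_public:
        return True                     # assumption: public bypasses the rest
    if not sec.required_groups <= groups:
        return False                    # missing at least one required group
    return bool(sec.allow & ({user} | groups))


sec = DocSecurity(
    owner="olivia",
    required_groups=frozenset({"Virginia Employees"}),
    allow=frozenset({"Developers"}),
)
print(can_read_doc(sec, "olivia", frozenset()))                                 # True
print(can_read_doc(sec, "bob", frozenset({"Developers"})))                      # False
print(can_read_doc(sec, "bob", frozenset({"Developers", "Virginia Employees"})))  # True
```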
In the next section, we'll compare the early binding and late binding methods of security trimming.