
A Reference Architecture for Document-Level Security in Search Systems

This is the third part of a "Graduate Level" series on document-level security in enterprise search. The previous article discussed early-binding vs. late-binding models.

The Reference Architecture
The recommended reference architecture for document-level security with search engines is shown in the diagram below.


Note that all work for document-level security can be implemented entirely external to the search engine. 

What’s required from the content source? 

1. APIs for fetching all documents with metadata and Access Control Lists (ACLs).

  • This includes all ACLs and metadata necessary to implement the security model enforced by the content source including:
    • Parent ACLs (if any)
    • Document ACLs
    • Other security metadata such as "is public" flags

2. APIs for fetching user/group membership

  • This can either be done as a single export of all group memberships (generally preferred), or on a user-by-user basis

What’s required from the search engine?

  • Ability to index fielded data separate from the document text
  • Ability to execute Boolean queries over ACL fields, including:
    • OR() - an absolute requirement
    • NOT() will be required if the content source uses deny ACLs
    • AND() will be useful (but not required) if the content source has hierarchical ACLs
  • Ability to execute very large queries (hundreds of terms)

The remainder of this document discusses each section of the reference architecture in more detail. 

The Algorithm 
Like most search engine algorithms, document-level security has two parts: indexing and query.

The Algorithm, Part 1: Indexing Path 
Indexing is entirely managed by the connector:
1. Scan the content source for all documents to be indexed
2. For each document:

  • Fetch the document content and extract text from it
  • Fetch other document metadata (title, author, last modified date, etc.)
  • Fetch Access Control Lists (ACLs) and other security metadata
  • Convert ACLs and security metadata into the following security fields:
    • "isPublic" = true if the document is available to all (all other fields are ignored if this is true)
    • “denyACLs” for users and groups specifically denied read access (takes precedence over all other ACL fields)
    • “allowACLs” for users and groups who are allowed to read the document (includes inherited ACLs, if any)
    • “parentACLs” for users and groups from a parent container (such as the database or space) that must also be checked before allowing access

3. All documents are then submitted to the indexer.

  • ACL fields are indexed as standard search engine fields.
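The ACL conversion in step 2 can be sketched in a few lines of Python. This is a minimal illustration, not a real connector: the `SourceDocument` class and its attribute names are hypothetical stand-ins for whatever the content source's API returns, and only the four output field names (isPublic, denyACLs, allowACLs, parentACLs) come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class SourceDocument:
    """Hypothetical record, standing in for what a content-source API returns."""
    doc_id: str
    text: str
    is_public: bool = False
    allow: list[str] = field(default_factory=list)            # ACLs set on the document
    inherited_allow: list[str] = field(default_factory=list)  # ACLs inherited by the document
    deny: list[str] = field(default_factory=list)             # explicit deny entries
    parent_allow: list[str] = field(default_factory=list)     # ACLs of the parent container


def to_index_record(doc: SourceDocument) -> dict:
    """Map source ACLs onto the four security fields used by the query filter."""
    record = {"id": doc.doc_id, "text": doc.text, "isPublic": doc.is_public}
    if doc.is_public:
        # All other security fields are ignored for public documents.
        return record
    record["denyACLs"] = sorted(set(doc.deny))
    # allowACLs includes inherited ACLs, if any.
    record["allowACLs"] = sorted(set(doc.allow) | set(doc.inherited_allow))
    record["parentACLs"] = sorted(set(doc.parent_allow))
    return record
```

The resulting dictionary would then be submitted to the indexer as ordinary fielded data. How multi-valued fields are represented depends on the search engine.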

Note that the connector may also need to handle incremental indexing, name mangling, ACL encoding and other processing, depending on the requirements of the content source and the search engine. See my “indexing ACLs” article for more details. 

The Algorithm, Part 2: Query Path 
Early binding requires modifying the query before the search engine executes it. The modification can happen either inside or outside the search engine server. The algorithm is as follows:

1. Authenticate the query request

  • This is often performed by the web application that manages the user interface.
  • There are multiple technologies available for authentication, including:
    • Basic authentication: Prompt for a username & password
    • Single sign-on (SSO) using NTLM, NTLMv2, Kerberos, or other tools such as SiteMinder
  • The result will be the username of the user who is submitting the request.

2. Gather all groups of which the user is a member (called “Group Expansion”), including:

  • Groups from LDAP or Active Directory
  • Groups from every content source indexed into the search engine
  • Nested groups (where groups are members of other groups)

A group cache is usually required to perform this with speed and reliability.
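As a rough sketch of group expansion with a cache, the Python below walks nested groups breadth-first and keeps each user's result for a fixed time. The `GroupCache` class, the `direct_members` mapping and the five-minute default are all assumptions made for illustration; a real deployment would populate the mapping from LDAP, Active Directory and the content sources.

```python
import time
from collections import deque


class GroupCache:
    """Hypothetical in-memory cache of expanded group memberships."""

    def __init__(self, direct_members: dict[str, set[str]], ttl_seconds: float = 300.0):
        # Maps a user or group name to the groups it directly belongs to.
        self._direct = direct_members
        self._ttl = ttl_seconds
        self._cache: dict[str, tuple[float, frozenset[str]]] = {}

    def expand(self, user: str) -> frozenset[str]:
        """Return every group the user belongs to, directly or through nesting."""
        now = time.monotonic()
        hit = self._cache.get(user)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        seen: set[str] = set()
        queue = deque(self._direct.get(user, ()))
        while queue:
            group = queue.popleft()
            if group in seen:
                continue  # already visited; also guards against membership cycles
            seen.add(group)
            queue.extend(self._direct.get(group, ()))
        result = frozenset(seen)
        self._cache[user] = (now, result)
        return result
```

The time-to-live is a trade-off: a longer value makes queries faster, but a user removed from a group keeps access until the cached entry expires.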

3. Modify the query.

  • The query from the original user is modified to include a clause that filters out all documents for which the user does not have read access
  • The Boolean expression for the security filter clause will be:

isPublic:true OR 
    ( parentACLs:( user OR group1 OR group2 . . . ) AND 
      allowACLs:( user OR group1 OR group2 . . . ) AND NOT 
      denyACLs:( user OR group1 OR group2 . . . ) )

4. Execute the query

  • Once the query is correctly modified, it can be executed like any other search query.

5. Return the results

  • The standard search engine results are returned, including the total document count, facets (with facet values and counts) and search results with metadata fields.
  • Because the results have been filtered by the security filter, only documents to which the user has read access are returned and included in the counts.
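The query modification in step 3 can be sketched as plain string assembly. The Python below is an illustration under assumptions: it uses a generic Lucene-like syntax, the helper names `quote`, `build_security_filter` and `secure_query` are hypothetical, and the quoting rule would need to match the target engine. Many engines also accept the filter as a separate parameter, which avoids splicing it into the user's query text.

```python
def quote(term: str) -> str:
    """Quote a user or group name so it is treated as a single term (assumed syntax)."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_security_filter(user: str, groups) -> str:
    """Build the security clause from the user name and the expanded groups."""
    principals = " OR ".join(quote(p) for p in [user, *sorted(groups)])
    return (
        f"isPublic:true OR ( parentACLs:( {principals} ) AND "
        f"allowACLs:( {principals} ) AND NOT denyACLs:( {principals} ) )"
    )


def secure_query(user_query: str, user: str, groups) -> str:
    """Combine the user's query with the security filter."""
    return f"( {user_query} ) AND ( {build_security_filter(user, groups)} )"
```

Because the principal list appears three times, a user with a few hundred groups produces a query with many hundreds of terms, which is why the search engine must be able to execute very large queries.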

It is critical that all of these steps run in a secure area on the server, either in the user interface server or inside the search engine server itself. Otherwise, a user could tamper with the HTTP request URL (or other data) to give themselves more access rights than they would normally have.

So, we are well past the halfway point in our Graduate Course on document-level security in enterprise search. The next article in this series will address the indexing of Access Control Lists.

