Group Expansion & Query Modification
This article is part of my "Graduate Level" course on implementing document-level security for enterprise search systems. If you'd like to start from the beginning, then here is the first article on search engine security.
For document level security with early-binding, every query executed must be modified so only documents that the user is allowed to read will be returned by the search engine. This involves creating a query which looks like this:
(user's original query<user’s>) AND
( parentACLs: ( PUBLIC:ALL OR user OR group1 OR group2 . . . ) AND
allowACLs: ( PUBLIC:ALL OR user OR group1 OR group2 . . . ) AND NOT
denyACLs:( user OR group1 OR group2 . . . )
To implement this query, a list of all groups to which the user is a member must be gathered from all content sources (click the diagram below to view a high resolution version).
This involves connecting to each content source to fetch the groups to which the user is a member. This is required because the ACLs indexed with the document contain group names. If the user is a member of a group mentioned in an ACL, that group will need to be included in the OR expression (see above) when the security filter is constructed.
Why not expand groups as documents are indexed? Indexing a document with all of the members of the group (instead of just the group name) would make searches faster and would eliminate the need to determine the user’s group membership when executing the query.
However, group expansion at index time causes a requirement to re-index the document whenever group membership changes. In large companies, group membership can change dozens of times a day as people are hired and fired, or move between projects. Thus, expanding groups at index time would necessitate constantly re-indexing the entire database, all of the time.
So, determining the group membership for a user at query-time is a good trade-off, between query performance (and complexity) versus index performance.
While it is true that queries are slower because of the addition of group-related terms, since they are simple Boolean queries of mostly OR’ed terms, the execution of the filter expression can be made to be very efficient. Further, document indexes only change when the ACL itself changes (and not when group memberships changes), meaning that index updates are minimized.
A cache of group memberships must be maintained, in order to quickly compute the user’s group memberships whenever a query is submitted. There are two types of cache:
Caches which are built as queries are being executed
- This typically means that the first query can be quite slow, as all group memberships (for both the user and the user’s nested groups) must be computed
- Cache warming can be used to reduce the severity of this problem
Caches which are pre-built on startup
This is done by downloading all group memberships for all groups, ahead of time. This technique has several advantages:
- Group expansion can occur very quickly (it is a simple lookup into a pre-built database)
- Group expansion places no additional burden on content source servers
- Group expansion can occur even when content servers are offline
In general, the second technique is the preferred method. However, it does increase the delay between a group membership being revoked, and the affected user being denied search access to documents associated with that group membership.
Typically, group caches are periodically refreshed in the background. The frequency in which they are refreshed will depend on: How much of a burden the refresh puts on the content source infrastructure, how up-to-date the group membership has to be, and who are the most frequent users of search. Most systems will also implement a “refresh now” button, invoking and immediate update, for use in exceptional situations.
If I am a member of the “Architects” group, and the “Architects” group is a member of the “Technical” group, then I am also a member of the “Technical” group.
This method of defining groups within other groups is called “nested groups.” A recursive algorithm is used to expand these, and resolve a full list of each user's group memberships.
Since most systems today manage users through a central LDAP or Windows-Active Directory, the issue of multiple usernames is much less of a problem than it used to be.
However, if there is an old legacy system with its own separate database of usernames (and potentially a separate authentication mechanism as well), then some sort of link between the standard directory server username and the legacy system username will need to be created.
This is done on a case-by-case basis, and may include:
- String manipulation
- If the names can be linked by a formula (for example, “email@example.com”) then the name link can be implemented using a script as part of the group expansion pipeline
- A special user interface
- The user can be prompted for their username and password to the legacy system. If successful, a link between their legacy username and the standard corporate directory username is stored in a database
- Someone can manually map all of the usernames
The end result is a link between the legacy username and the corporate directory entry for the user. This is often stored in the corporate directory itself (i.e. in the LDAP or Active Directory server) and then used for group expansion when the user submits a search.
Special Cases: Group Name Mangling, Intersection ACLs, Complex Security Models.
Remember that the user and group names in the search security filter must match up exactly and in every respect with what was indexed into the ACL fields. Therefore, any transformation made to ACLs during indexing must be reflected in group expansion at query-time. This includes:
- Group Name Mangling: The content source name or URL must be added to every group so that it will be unique across all groups from all content sources.
- Intersection ACLs: If the search engine does not support the parentACLs field, then intersection ACLs must be computed (based on the user’s group memberships) and added to the security filter
- ACL Identifiers: If complex security models are implemented by indexing ACL IDs, then the user must be matched to the security constraints for every ACL ID, and matching IDs must be added to the security filter
- ACL Encoding: If the search engine has encoded the user and group names using Base32 or MD5, then these names must also be encoded during group expansion, before being added to the security filter.
Optimizing for Content Sources
If a content source does not require denyACLs or parentACLs, then the security filter expression can be modified to be more efficient.
Suppose, for example, that content source “A” has all three types of ACLs (allow, deny, and parent) whereas content source “B” has only allowACLs. The resulting search expression might look like this:
(user's original query) AND
( parentACLs:( PUBLIC:ALL OR user OR A:group1 OR A:group2 . . . ) AND
allowACLs:( PUBLIC:ALL OR user OR A:group1 OR A:group2 . . . OR B:group1 OR B:group2 . . . ) AND NOT
denyACLs:( user OR A:group1 OR A:group2 . . . )
Notice how the B groups are only included in the allowACLs field.
In practice, such an optimization will not improve search performance by very much. Tokens which are never indexed into a specific field will be quickly rejected by the search engine and automatically removed from the search expression.
However, removing these items from the expression will reduce the total size (number of characters) in the query string. This may be important if the search engine puts limits on the overall query size. And if there are lots of groups (hundreds) then it could reduce lookup time as well.
In order to implement this optimization, the content sources will need to report to group expansion if they contain any parentACLs or denyACLs. This can then be used by group expansion to compute the optimized query filter.
So, we made it!
If you've been able to follow each of my six Search Chronicles on implementing document-level security, then consider yourself a a Graduate!
I strongly believe that the approach detailed in these articles is a reference architecture for search engine security, providing high performance and stability, and at the same time making no compromises on security or search functionality.
To conclude this series, I'd like to share my thoughts and hopes for the future. What we need is an industry-wide security standard.
<< Previous article Industry-wide Standards >>