Capturing, Manipulating & Indexing Document ACLs
By Paul Nelson, Chief Architect at Search Technologies
This is the fourth in my "Graduate Level" series of articles on document-level security in search engines. Previous articles describe:
- The Goals and Challenges of Security in Search Engines
- Early-binding vs. Late-binding models
- A Reference Architecture for Security
Indexing Document ACLs
To implement document-level security in the best possible way, every document needs to be indexed along with one or more Access Control Lists (ACLs).
ACLs are lists of users and groups that are allowed (or denied) access to the document. These lists will be used by query filters to ensure that search results only contain documents to which the user has READ access.
A complete implementation will index the following fields into the search engine:
- allowACLs: A list of users and groups to be allowed access to the document
- denyACLs: A list of users and groups specifically denied access to the document
  - Deny ACLs take priority over allow ACLs
- parentACLs: A list of users and groups allowed access to the primary container that contains the document
  - Only users who have access to both the container and the document will be allowed to view the document
- Any document with the “isPublic” flag set will be visible to all users
  - This takes precedence over all other ACLs.
The Venn Diagram below shows how the above access control list fields interact.
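The same interaction can be written as a small predicate. The sketch below is illustrative only: the class and function names are hypothetical, and it assumes that an empty parentACLs field means the container imposes no restriction, which the field descriptions above do not state.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDoc:
    """Security fields stored with one document (names follow the fields above)."""
    allow_acls: set = field(default_factory=set)
    deny_acls: set = field(default_factory=set)
    parent_acls: set = field(default_factory=set)
    is_public: bool = False


def can_read(doc: IndexedDoc, principals: set) -> bool:
    """Return True if a user holding `principals` (user name plus groups) may see `doc`."""
    if doc.is_public:                      # isPublic takes precedence over all other ACLs
        return True
    if doc.deny_acls & principals:         # deny takes priority over allow
        return False
    if not (doc.allow_acls & principals):  # the user must be explicitly allowed
        return False
    # Assumption: an empty parentACLs field means "no container restriction".
    return not doc.parent_acls or bool(doc.parent_acls & principals)
```

In a real deployment this logic would normally be expressed as a query filter inside the search engine rather than evaluated document by document.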
When indexing ACLs, the connector will also need to implement a variety of other transformations to ensure that all parts of the security architecture work in harmony. These additional transformations are described in the following sub-sections.
Group Name Mangling
The “Developer” group inside a SharePoint site will be different from the “Developer” group inside a Jive system. Therefore, to avoid any chance of these two groups being confused, all group names should be mangled with the content source name or URL.
For example: “SPSiteX:Developer” and “JiveSpaceY:Developer” would be the full group names as indexed into the search engine.
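A minimal sketch of the mangling step, assuming a colon separator as in the example above. The function names are hypothetical, and a real connector would also need a source identifier that cannot itself be confused with part of a group name.

```python
def mangle(source: str, name: str) -> str:
    """Prefix a user or group name with the content source it came from."""
    return f"{source}:{name}"


def mangle_acl(source: str, names: list) -> list:
    """Mangle every entry of an ACL read from one content source."""
    return [mangle(source, name) for name in names]
```

The same mangling has to be applied to the user's groups at query time, otherwise the indexed names and the filter terms will never match.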
Flatten Inherited ACLs
Some sources, including NTFS, have inherited ACLs. In such systems, the ACLs on parent containers (parent folders, in the case of NTFS) will be automatically propagated down to sub-folders. These ACLs can then be overridden by ACLs from sub-folders or sub-documents.
Unlike the “parentACLs” described above (where the parent ACL provides a further restriction on the access), inherited ACLs are simply combined with the explicit ACLs on the document to provide the complete description of all of the users and groups that have access to a document.
Viewed as a Venn Diagram, the set of users with access to the document is as follows.
Handling inherited ACLs is the job of the connector, which must read all of the inherited ACLs on the parent folder (and will likely need to cache them to improve performance), and then combine the inherited ACLs with the explicit ACLs on all of the nested documents in the folder.
Note that this situation will not use the parentACLs field described above. Instead, all of the inherited ACLs are simply added to the “allowACLs” field.
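As a rough sketch of the flattening step, the connector can resolve each folder's effective ACL once, cache it, and union it with the explicit ACL of every document in that folder. The in-memory `EXPLICIT_ACLS` table and the function names are hypothetical stand-ins for calls to the content source, and the sketch deliberately ignores overrides (a sub-folder that breaks inheritance).

```python
from functools import lru_cache
from pathlib import PurePosixPath

# Hypothetical explicit ACLs as read from the content source, keyed by path.
EXPLICIT_ACLS = {
    "/": {"Administrators"},
    "/projects": {"Developers"},
    "/projects/plan.docx": {"Executives"},
}


@lru_cache(maxsize=None)
def folder_acl(folder: str) -> frozenset:
    """Explicit ACL of `folder` combined with everything inherited from its ancestors."""
    own = frozenset(EXPLICIT_ACLS.get(folder, ()))
    parent = str(PurePosixPath(folder).parent)
    if parent == folder:  # reached the root, nothing further to inherit
        return own
    return own | folder_acl(parent)


def flattened_allow_acls(doc_path: str) -> set:
    """Value for the allowACLs field: the document's explicit ACL plus inherited ACLs."""
    own = set(EXPLICIT_ACLS.get(doc_path, ()))
    return own | folder_acl(str(PurePosixPath(doc_path).parent))
```

The cache means that a folder holding thousands of documents has its ancestor chain resolved only once per crawl.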
Incremental Indexing of ACL Changes
A common problem with many content sources is that a change to a document’s ACL does not count as a change to the document. In other words, it does not change the document’s “last modified date”. This is also a frequent problem with hierarchical ACLs and inherited ACLs. It is very rare to find a content source in which a change made to a higher level ACL (parent folder, database, space, etc.) causes the document’s last-modified date to change.
For this reason, most connectors use a snapshot method for identifying incremental updates to the database. Where document-level security applies, snapshots will hold the document ID, last modified date, and the ACL, so that any change in the ACL can be detected when comparing the new snapshot to the old snapshot. Indexing is triggered by differences between the old and the new.
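The snapshot comparison can be sketched as follows. The tuple layout and function names are assumptions for illustration; a production connector would persist snapshots to disk rather than hold them in dictionaries.

```python
def take_snapshot(docs):
    """Build {doc_id: (last_modified, acl)} from (doc_id, last_modified, acl) tuples."""
    return {doc_id: (modified, frozenset(acl)) for doc_id, modified, acl in docs}


def diff_snapshots(old: dict, new: dict):
    """Return (ids to index or re-index, ids to delete from the index)."""
    # A document is re-indexed if it is new, or if its date OR its ACL changed.
    to_index = sorted(d for d, state in new.items() if old.get(d) != state)
    to_delete = sorted(d for d in old if d not in new)
    return to_index, to_delete
```

Because the ACL is part of the compared state, a permission change is detected even when the last modified date stays the same. For inherited ACLs, the snapshot should hold the flattened ACL so that a change on a parent folder also shows up.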
Many ACLs contain group names that may be long and contain spaces and punctuation. For example, “SharePoint:Virginia Employees”.
Some search engines may not be able to index this item as a single token and may, in fact, tokenize it into multiple words and index those words individually. If this were to happen, then phrase searches would be required for every user name or group name in the ACL, which would dramatically reduce performance.
To solve this problem, user and group names can be encoded as a stream of Base32 encoded characters. This would convert “SharePoint:Virginia Employees” into “KNUGC4TFKBXWS3TUHJLGS4THNFXGSYJAIVWXA3DPPFSWK4YA”. Because this has no punctuation or white space, most search engines will index it as a single token.
An alternative is to convert all user names and group names into an MD5 representation. For example, convert “SharePoint:Virginia Employees” into “88dd43e132fd8814f9e8271fbd747409”. This generally results in smaller tokens. The downside is that the encoding is not reversible, which complicates debugging.
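Both encodings are available in the Python standard library. The sketch below assumes UTF-8 as the byte encoding and strips the trailing "=" padding from the Base32 output so that the token contains no punctuation at all; the function names are hypothetical. The exact token depends on how the name is converted to bytes, so the output of a particular implementation may differ in its final characters from the examples above.

```python
import base64
import hashlib


def base32_token(name: str) -> str:
    """Reversible encoding: letters A-Z and digits 2-7 only."""
    encoded = base64.b32encode(name.encode("utf-8")).decode("ascii")
    return encoded.rstrip("=")  # assumption: drop the padding, it is punctuation


def base32_decode(token: str) -> str:
    """Recover the original name, which is useful when debugging the index."""
    padded = token + "=" * (-len(token) % 8)  # restore the padding before decoding
    return base64.b32decode(padded).decode("utf-8")


def md5_token(name: str) -> str:
    """Shorter, fixed-length token (32 hex characters), but not reversible."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()
```

Whichever encoding is chosen, it must be applied identically when indexing ACLs and when building the user's security filter.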
Non-Standard Security Models
In our experience, a survey of how content source security models are implemented in practice would find roughly the following:
- 85% use only “allowACLs”
  - In other words, most document ACLs comprise a simple list of those users and groups that are allowed to view the document
- 10% also use hierarchical ACLs
  - Some key content sources use hierarchical ACLs, such as Lotus Notes and Atlassian Confluence
- 4.9% use “denyACLs”
- 0.1% use other, more complex models
  - For example, Documentum has “required groups” and “required set” access control lists. These are very rarely used
  - Another example is inherited Deny ACLs in NTFS. Again, these are very rarely used
So how does one handle the 0.1%?
One method is to consider the entire security object for the document as a single, indivisible object. It is then indexed in the “allowACLs” field as a single identifier – instead of as multiple lists of users and groups. This identifier can come from the original content source (this is the case for Documentum), or it can be computed based on an MD5 signature of the original ACL information – essentially collapsing all users, groups, flags, and other security metadata into a single token.
In a sense, this ACL identifier is a "virtual group". A user is added to this virtual group during group expansion, if that user meets all of the criteria required by the security constraints as a whole.
The process works as follows:
- Document X has a complex set of security constraints --> CONSTRAINTS_Y
- Create an MD5 from CONSTRAINTS_Y --> MD5_Y
- Index MD5_Y in the allowACLs field for the document
- Set MD5_Y and CONSTRAINTS_Y aside in DATABASE_Z for later
- At query time, during group expansion, compare the user (and her group membership and other metadata) to all constraints stored in DATABASE_Z. This includes CONSTRAINTS_Y
  - Since this algorithm is comparing all of the user’s information against all of the information in CONSTRAINTS_Y, it can be arbitrarily complex (it does not need to be coded as a search engine filter query)
- If the user satisfies CONSTRAINTS_Y, add MD5_Y to the security filter (within the search request) against the allowACLs field
- Results returned by the search engine will include those for which the user meets the complex security constraints CONSTRAINTS_Y.
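The steps above can be sketched in a few lines. Everything here is illustrative: `DATABASE_Z` is a plain dictionary standing in for a real store, and the `satisfies` rule (every "required" group plus at least one "any_of" group) is an invented example of a complex constraint, not the model of any particular content source.

```python
import hashlib
import json

DATABASE_Z = {}  # hypothetical store mapping MD5_Y -> CONSTRAINTS_Y


def register_constraints(constraints: dict) -> str:
    """Index time: collapse a security object into one token and set it aside."""
    canonical = json.dumps(constraints, sort_keys=True)  # stable text form for hashing
    token = hashlib.md5(canonical.encode("utf-8")).hexdigest()
    DATABASE_Z[token] = constraints
    return token  # this value is indexed in the document's allowACLs field


def satisfies(user_groups: set, constraints: dict) -> bool:
    """Hypothetical rule: all 'required' groups and at least one 'any_of' group."""
    required = set(constraints.get("required", []))
    any_of = set(constraints.get("any_of", []))
    return required <= user_groups and (not any_of or bool(any_of & user_groups))


def virtual_groups(user_groups: set) -> list:
    """Query time: tokens to add to the user's security filter on allowACLs."""
    return sorted(t for t, c in DATABASE_Z.items() if satisfies(user_groups, c))
```

Documents that share the same constraints produce the same token, so the number of entries in the store grows with the number of distinct security objects, not with the number of documents.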
Note that this technique will only work when the number of these “specialty” constraints is relatively small. In the above algorithm, if there are thousands of different MD5_Y values to which a single user can be matched, then the algorithm will create queries that are too large to be executed quickly. If this is the case, then the only way to handle the complex structure will be to modify the index structure (additional security fields) and/or the search engine itself.
When the Search Engine cannot support Parent ACLs
All search engines that claim to have document-level security support the “allowACLs” field, and most also support “denyACLs.” Unfortunately, very few search engines support “parentACLs.”
This, by itself, is not a problem as long as the engine can support large queries containing hundreds of terms. With such engines, the security architecture can be supported outside the engine.
For those engines that only support a single ACLs field, and do not allow for large fielded queries, we can deploy a technique called intersection ACLs.
To illustrate, consider the following scenario:
- Confluence is the content source
- The Confluence “SPACE_A” contains the following access control list:
  - [Groups] Developers, QA. Typically, these groups would be indexed into the parentACLs field
- There exists a document (Document_A) inside of SPACE_A that contains the following access control list:
  - [Groups] Virginia Employees, Executives. Typically, these groups would be indexed into the allowACLs field.
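In the absence of a parentACLs field, the connector could, instead, index all possible combinations of the groups as a set of “intersection ACLs”. These would be put into the “allowACLs” field. For Document_A, they would be “Developers+Executives”, “Developers+Virginia_Employees”, “Executives+QA” and “QA+Virginia_Employees”, where: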
- Each item (such as “Developers+Virginia_Employees”) is treated as a single entry inside the ACL
- These are “intersection ACLs” which are only matched if the user is a member of both of the groups specified in the pair
- Group pairs are normalized to reduce the total number of possible pairs (remove duplicates)
  - For example, since “Executives+QA” specifies the same constraint as “QA+Executives”, the pairs are normalized by sorting the groups within the pair alphabetically
At query time, users will be automatically expanded to include all intersection ACLs to which they belong. For example, a QA engineer who works in Virginia will have “QA+Virginia_Employees” added to their security filter, along with all of the other groups created by the normal group expansion process.
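A sketch of both halves of the technique, using the SPACE_A scenario above. The function names are hypothetical, and replacing spaces with underscores simply mirrors the “Virginia_Employees” form used in the example.

```python
from itertools import combinations_with_replacement, product


def pair_token(a: str, b: str) -> str:
    """Normalized token for 'member of both a and b' (sorted so order is irrelevant)."""
    first, second = sorted((a.replace(" ", "_"), b.replace(" ", "_")))
    return f"{first}+{second}"


def intersection_acls(parent_groups, doc_groups) -> list:
    """Index time: every container-group/document-group pair, for the allowACLs field."""
    return sorted({pair_token(p, d) for p, d in product(parent_groups, doc_groups)})


def user_intersections(user_groups, known_tokens) -> list:
    """Query time: the known intersection ACLs to which this user belongs."""
    candidates = {pair_token(a, b)
                  for a, b in combinations_with_replacement(sorted(user_groups), 2)}
    return sorted(candidates & set(known_tokens))


tokens = intersection_acls(["Developers", "QA"], ["Virginia Employees", "Executives"])
print(tokens)
print(user_intersections({"QA", "Virginia Employees"}, tokens))
```

Restricting the user's candidates to tokens that actually exist in the index keeps the security filter small, since the number of possible pairs grows quadratically with the number of groups a user belongs to.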
As with all of these solutions, intersection ACLs will only work if the number of intersections to which a single user can belong is limited to hundreds of possibilities (not thousands). This is usually the case.
The next part of our Graduate Level course in document-level security for search engines will look at group expansion and query modifications.