EU’s GDPR Compliance: 10 Things You Probably Aren’t Considering, But Should
Incorporating Search and Big Data into Your GDPR Compliance Strategy
The GDPR, the EU regulation aiming to strengthen and unify the protection of personal data for EU citizens, will take effect on May 25, 2018. You can learn more about the regulation here. There’s no doubt that data, including the types governed by the GDPR, brings immense insights to support business decisions and outcomes. So, what does this new regulation mean for businesses?
It’s worth noting that the GDPR will affect not only EU-based companies but also any company that handles EU citizens’ personal data, regardless of where that company is located. Furthermore, given the uncertainty around Brexit, the GDPR website notes that “If you process data about individuals in the context of selling goods or services to citizens in other EU countries then you will need to comply with the GDPR, irrespective as to whether or not you [sic] the UK retains the GDPR post-Brexit. If your activities are limited to the UK, then the position (after the initial exit period) is much less clear.”
With the GDPR being rolled out soon, do you have a GDPR compliance strategy? For many organizations, the sheer volume of data they hold makes the compliance effort daunting. For instance, how do you locate the right data for compliance? How do you track consent? How do you monitor and detect non-compliance incidents?
Here are 10 things you may be missing but probably should consider doing, and why.
1. Unstructured Data Acquisition / Ingestion
The GDPR applies to all personal data, whether it sits in structured or unstructured content. Consider doing the following:
- Expand ingestion to your organization’s unstructured content sources.
- Ensure that a person mentioned in any unstructured content source (e.g. e-mail exchange, customer support comments, survey results, focus group results, contracts, invoices, agreements, including scanned items, etc.) is included in the data discovery process.
2. Schema-Free Search
As discussed in my previous blog post, search engines are more scalable than relational databases and can search high-volume, high-variety structured content (e.g. tables of data) more effectively. So, consider:
- The total number of content sources will be vast. The same is true for the variety of content sources – many different tables, many different schemas. Search engines can ingest all that data without having to pre-process or normalize the schemas.
- Our advice is “Ingest anything, search it all for personally identifiable information (PII). Don’t go through cumbersome, slow, and expensive ETL processes before you start exploring your data.”
- After all, a name is still a name, no matter what field. It will not matter if the person's name is stored in “person,” “name,” “client,” “customer,” “user,” or simply “X.” It will still violate the GDPR if you can’t track it.
- Search engines can easily search across all fields, and this schema-free approach helps break down barriers to PII discovery.
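To make the "a name is still a name, no matter what field" idea concrete, here is a minimal Python sketch. It is only an illustration of scanning every field regardless of its name, not how a search engine is implemented; the function `find_pii_fields`, the sample records, and the simplified email pattern are all hypothetical.

```python
import re

# Simplified, hypothetical email pattern for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_pii_fields(record, known_names):
    """Return (field, value) pairs that look like PII, ignoring field names."""
    hits = []
    for field, value in record.items():
        text = str(value)
        has_email = EMAIL_RE.search(text) is not None
        has_name = any(name.lower() in text.lower() for name in known_names)
        if has_email or has_name:
            hits.append((field, text))
    return hits


# Three records with three different "schemas" (all sample data is made up).
records = [
    {"client": "Jane Doe", "amount": 120},
    {"X": "jane.doe@example.com", "note": "renewal"},
    {"user": "bob", "comment": "Spoke with Jane Doe about invoice"},
]
known_names = ["Jane Doe"]

results = [find_pii_fields(record, known_names) for record in records]
for record_hits in results:
    print(record_hits)
```

The same person is found under “client,” “X,” and “comment” without any schema mapping up front, which is the point of the schema-free approach.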
3. Natural Language Processing (NLP)
To handle unstructured and semi-structured data, it’s useful to think about how you can leverage NLP. Using NLP techniques and tools, we can extract PII from both structured and unstructured content. For instance:
- Simple identifiers: emails, URLs, National IDs, phone numbers, etc.
- Complex identifiers: names, addresses, locations, companies
Once “we structure the unstructured,” cleansing and normalization can come next. Our content processing methods offer:
- Ambiguity resolution – we are experienced with multiple techniques to resolve ambiguity and have developed a framework for adding as many as the client requires.
- A combination of statistical and dictionary/database-based approaches
- Very high precision – this works well with “divide and conquer” approaches
- High speed, scaling to hundreds of millions of patterns, entities, and variations
At Accenture, we are evolving “multi-model” approaches to identifying PII:
- With Machine Learning: how the identifier occurs in context
- With Machine Learning: the makeup of the identifier itself (character patterns)
- With Machine Learning: combinations of tags which strongly indicate PII
- Using dictionary-based approaches
- Using spell-checking approaches (with spell checkers trained based on company databases)
- Using pattern-based approaches (scalable to tens of thousands of patterns)
- Using regular-expression based approaches
And, most importantly, all of these approaches together. Often the presence of PII cannot be determined by a single approach, only by multiple approaches working together.
It’s not enough just to identify the presence of PII; you will also need to identify exactly which person it refers to.
But how? This is very tricky, especially when there are a lot of “John Smith”s and “Zhang Wei”s in the world. Even worse, these names or other IDs are often incorrect or misspelled.
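As a rough illustration of several approaches working together, the Python sketch below runs a regular-expression detector and a dictionary detector over the same text and pools their findings. It is a deliberately simple stand-in (no machine learning), and the patterns, the `NAME_DICTIONARY` set, and the function names are assumptions made up for this example.

```python
import re

# Simplified, hypothetical patterns; production patterns need far more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Hypothetical dictionary, e.g. loaded from a customer database.
NAME_DICTIONARY = {"jane doe", "zhang wei"}


def regex_detector(text):
    """Find simple identifiers by pattern."""
    found = [("email", m.group()) for m in EMAIL_RE.finditer(text)]
    found += [("phone", m.group()) for m in PHONE_RE.finditer(text)]
    return found


def dictionary_detector(text):
    """Find complex identifiers by lookup against known names."""
    lowered = text.lower()
    return [("name", name) for name in sorted(NAME_DICTIONARY) if name in lowered]


def detect_pii(text, detectors=(regex_detector, dictionary_detector)):
    """Pool the findings of every detector."""
    hits = []
    for detector in detectors:
        hits.extend(detector(text))
    return hits


sample = "Zhang Wei (zhang.wei@example.com) called from +44 20 7946 0958."
hits = detect_pii(sample)
print(hits)
```

Neither detector alone finds all three identifiers in the sample sentence; additional detectors (spell-checking, character-pattern models, and so on) could be added to the `detectors` tuple in the same way.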
We (the Content Analytics Group at Accenture) can help using our matching technology (originally developed for recruiting companies and product companies).
With matching, we use both structured and unstructured signals with machine learning to match person records from across multiple sources. This matching method has many advantages:
- It is flexible to a wide range of information: names, name character patterns, IDs, ID character patterns, dates, locations, descriptions, purchase history, writings (e.g. e-mails, communications), etc.
- It is tolerant of error: exact matches are not required.
- It aggregates information across many signals to find the most likely matches.
These features are necessary because the GDPR will require that you accurately and completely identify a person’s information across your entire enterprise so that you can fully remove that person from your company’s databases when they want to be forgotten.
Matching on name + birth date is not enough (if it ever was). A matching algorithm which incorporates all available signals will be required.
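Here is a minimal sketch of error-tolerant, multi-signal matching, using Python’s standard `difflib.SequenceMatcher` for fuzzy string similarity. It is not our matching technology: the fixed `WEIGHTS`, the chosen fields, and the sample records are hypothetical, and a real system would learn its weights from data and use many more signals.

```python
from difflib import SequenceMatcher

# Hypothetical signal weights; a real system would learn these.
WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}


def similarity(a, b):
    """Fuzzy similarity in [0, 1]; exact matches are not required."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b):
    """Aggregate the per-field similarities into one weighted score."""
    return sum(
        weight * similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in WEIGHTS.items()
    )


def best_match(query, candidates):
    """Return the candidate record most likely to be the same person."""
    return max(candidates, key=lambda candidate: match_score(query, candidate))


# The query has a misspelled name; both candidates share the same name.
query = {"name": "Jon Smith", "email": "john.smith@example.com", "city": "Leeds"}
candidates = [
    {"id": 1, "name": "John Smith", "email": "j.smith@other.example", "city": "Austin"},
    {"id": 2, "name": "John Smith", "email": "john.smith@example.com", "city": "Leeds"},
]

winner = best_match(query, candidates)
print(winner["id"])
```

The name alone cannot separate the two “John Smith” records; the email and city signals do, even though the query name is misspelled.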
5. Document-Level Security
Security is crucial, so consider building and/or tuning a scalable search application to have fine-grained document-level security controls. This will ensure that only the right individuals can access the documents intended for them.
You will be spending time on your database inventory, ingestion, and discovery. Be careful to maintain document-level security controls throughout the process.
We can also help ingest ACLs from underlying content sources, which can go a long way toward determining risk. For example, a document created by a UK employee to which only two people have access carries a much lower risk than a document with public access. Ingesting ACLs from underlying content sources can help determine levels of risk for your PII-sensitive information.
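The ACL-based risk idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Document` class, the `PUBLIC` marker, the risk labels, and the `small_group` threshold are hypothetical and would need to reflect your own access model.

```python
from dataclasses import dataclass, field

PUBLIC = "*"  # hypothetical marker meaning "anyone can read this"


@dataclass
class Document:
    doc_id: str
    contains_pii: bool
    acl: set = field(default_factory=set)  # principals allowed to read


def exposure_risk(doc, small_group=5):
    """Rank PII exposure by how widely the document is readable."""
    if not doc.contains_pii:
        return "none"
    if PUBLIC in doc.acl:
        return "high"
    if len(doc.acl) <= small_group:
        return "low"
    return "medium"


documents = [
    Document("memo", True, {"alice", "bob"}),
    Document("faq", True, {PUBLIC}),
    Document("report", True, {f"user{i}" for i in range(20)}),
    Document("menu", False, {PUBLIC}),
]

risks = {doc.doc_id: exposure_risk(doc) for doc in documents}
print(risks)
```

A ranking like this helps prioritize remediation: widely readable documents containing PII come first.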
6. Encrypted Search Engine Indexes
It’s also worth paying more attention to your indexing process. From our client projects, we’ve developed index design approaches that encrypt the entire index with external keys without loss of performance or functionality. For example, even if the entire index is downloaded to someone else’s computer, the encryption makes the index useless without the proper access rights.
Encrypted search engine indexes can be an important safeguard when storing and searching PII-sensitive data.
7. Real-Time Detection
To detect non-compliance incidents in real time, consider approaches like our NLP and entity extraction methods. These are libraries that can run on streaming data, supported by technologies like:
- Apache Spark Streaming (builds scalable fault-tolerant streaming applications)
- Elasticsearch Percolate (matches new documents to a database of queries)
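The “match new documents to a database of queries” idea can be sketched in plain Python. This is only a conceptual illustration of reverse search over a stream; it does not use the Elasticsearch or Spark APIs, and the `RULES` table, the simplified patterns, and the sample messages are hypothetical.

```python
import re

# Hypothetical registered "queries": rule name -> simplified pattern.
RULES = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Simplified shape of a UK National Insurance number (illustrative only).
    "uk_nino": re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b"),
}


def percolate(document):
    """Return the names of all stored rules that the document matches."""
    return sorted(name for name, pattern in RULES.items() if pattern.search(document))


def detect(stream):
    """Yield (document, matched_rules) for each document that triggers a rule."""
    for document in stream:
        matched = percolate(document)
        if matched:
            yield document, matched


incoming = [
    "Meeting moved to 3pm",
    "Contact me at jane.doe@example.com",
    "NI number QQ123456C on file",
]

alerts = list(detect(incoming))
for document, matched in alerts:
    print(matched, "->", document)
```

Because `detect` is a generator, it processes each message as it arrives rather than waiting for a batch, which is the property that matters for real-time detection.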
8. Data Classification
The following data classification techniques can give you some options to think about:
- Use machine learning and NLP techniques to determine the type of communication and whether it falls under the GDPR guidelines
- Use both “global classification” (looking at the communication as a whole, typically with predictive analytics) and “local classification” (examining individual sentences or data extracts for local indicators, using natural language processing)
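The difference between global and local classification can be shown with a toy Python sketch. It swaps in simple keyword matching for the predictive analytics and NLP described above, so treat it purely as an illustration of the two levels; the vocabularies, the threshold, and the function names are hypothetical.

```python
import re

# Hypothetical vocabularies; a real system would use trained models instead.
GLOBAL_TERMS = {"customer", "account", "invoice", "complaint"}
LOCAL_PATTERN = re.compile(r"\b(date of birth|passport|home address)\b", re.IGNORECASE)


def global_score(text):
    """Score the communication as a whole by its overall vocabulary."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & GLOBAL_TERMS) / len(GLOBAL_TERMS)


def local_hits(text):
    """Return the individual sentences that contain a local indicator."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sentence for sentence in sentences if LOCAL_PATTERN.search(sentence)]


def classify(text, threshold=0.5):
    """Combine the whole-document view with the sentence-level view."""
    return {"global": global_score(text) >= threshold, "local": local_hits(text)}


message = (
    "The customer raised a complaint about an invoice. "
    "Her passport number is on file. Thanks."
)
result = classify(message)
print(result)
```

The global view flags the message as customer-related correspondence, while the local view pinpoints the one sentence that mentions a passport.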
9. Lineage and Provenance
Depending on your organization’s approach to data provenance, you can look at different supporting tools and techniques. As an example, our Aspire Content Processing framework’s ingestion approach always maintains the source IDs, source locations, and original hierarchy tree of where the original documents were located.
(Side note: Aspire is also excellent for high-volume unstructured content ingestion. Read about how it helped acquire over 1 petabyte of unstructured data in our recent project.)
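To illustrate what keeping source IDs, source locations, and the original hierarchy alongside each document might look like, here is a small Python sketch. It is not Aspire’s API or data model; the `Provenance` and `IngestedDocument` classes, the helper `make_provenance`, and the sample path are hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Provenance:
    source_id: str        # which system the document came from
    source_location: str  # where it lived in that system
    hierarchy: tuple      # folder chain above the document


def make_provenance(source_id, source_location):
    """Derive the folder hierarchy from the original location."""
    parts = PurePosixPath(source_location).parts
    folders = tuple(part for part in parts[:-1] if part != "/")
    return Provenance(source_id, source_location, folders)


@dataclass
class IngestedDocument:
    text: str
    provenance: Provenance


doc = IngestedDocument(
    text="Contract signed by Jane Doe",
    provenance=make_provenance("fileshare-01", "/legal/contracts/2017/doe.pdf"),
)
print(doc.provenance)
```

Carrying this record with every ingested document means that when PII is found, you can trace it back to the exact system and location that must be updated.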
10. Deletion, Redaction, and Replacement
Consider the following throughout your data management and compliance process:
- NLP techniques can identify the parts of documents that contain PII and need to be redacted.
- Aspire connectors maintain connections and APIs to many legacy systems. These connections can be leveraged, where appropriate, to reach back into those systems to delete/redact records inside of legacy systems.
- Aspire could also be used to pull the incremental updates and audit the changes to make sure that deletions and redactions were made properly and completely.
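The redact-then-audit cycle described above can be sketched in Python as follows. This is a simplified illustration that works on a single string rather than on a legacy system; the `redact` and `audit` functions, the placeholder text, and the simplified email pattern are hypothetical.

```python
import re

# Simplified, hypothetical email pattern for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text, names, placeholder="[REDACTED]"):
    """Replace emails and known names with a placeholder."""
    redacted = EMAIL_RE.sub(placeholder, text)
    for name in names:
        redacted = re.sub(re.escape(name), placeholder, redacted, flags=re.IGNORECASE)
    return redacted


def audit(text, names):
    """Return True only if no known identifier survives in the text."""
    lowered = text.lower()
    email_left = EMAIL_RE.search(text) is not None
    name_left = any(name.lower() in lowered for name in names)
    return not email_left and not name_left


original = "Jane Doe (jane.doe@example.com) requested erasure."
names = ["Jane Doe"]

cleaned = redact(original, names)
print(cleaned)
print(audit(cleaned, names))
```

Running the audit as a separate step after redaction mirrors the idea of pulling incremental updates to confirm that deletions and redactions were actually applied.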
When you start thinking about data processing and considering a holistic, search-centric view of all your data sources, these 10 things can help you chart a strategy for achieving, monitoring, and maintaining GDPR compliance. Beyond the GDPR-related use cases, these approaches can also support many other compliance, fraud, and risk applications.
Contact us to see how you can implement these techniques and tools for maintaining GDPR compliance.