The Rise of the GDPR Data Lake

The EU's new data privacy regulation, the GDPR (General Data Protection Regulation), is driving the need for a Data Lake to hold the volumes and types of customer data covered by the regulation.

What is GDPR?

GDPR is the new EU data privacy regulation that comes into force in May 2018. It requires organisations to better manage customer data, granting EU citizens fundamental rights over access to the data held about them and over the portability of that data. More detail on the regulation, and what's required to address it, is available in a recent blog post: ‘GDPR – The Next Major Data Privacy Challenge’

It’s significant for a number of reasons including:

  • Citizens' right to rapid notification of a breach
  • Citizens' consent to the processing of their data and to its utilisation
  • Citizens' right to the portability of their data to other providers
  • Citizens' right to make a Subject Access Request (SAR)
  • Citizens' right to be forgotten (where appropriate)
  • The size of a potential fine – up to 4% of annual worldwide turnover
  • It applies to organisations that hold EU citizens' data even if they are outside the EU

Most major organisations across the EU are investigating what this new regulation means for them, with some already planning strategies and solutions for how to address it.

So where does a Data Lake fit in?

Before we get to that, let’s start from where many organisations will begin this data journey.

Most Financial Services organisations have Customer data scattered across their enterprise. It's stored in many places and used for many different purposes: in raw and aggregated forms, in different types and formats, and with varying data quality. It flows in from multiple sources, goes out to multiple destinations, and is used by a wide range of business professionals for a wide range of tasks.

And that's just what we know it's used for. The data also proliferates around and across organisations as they mine it to better understand the needs of their Customers, to provide better service and create more opportunity.

So firstly, we have the task of actually finding all the Customer data. A recent blog article, ‘GDPR – Where to start?’, covers this. Let's assume we've done that.

Once we've identified where all the Customer data is, we need to understand how that data is related. Given that relevant Customer data could be held in a number of different formats, as well as in multiple places, we're going to need a number of capabilities to relate the data.

Some of the things we'll need to be able to do include:

  • Access a range of types of data including structured, semi-structured and unstructured
  • Access from or store in, a central location
  • Apply analytical techniques to create content which can be identified as belonging to a specific Customer
  • Quality assess and remediate data to ensure we have the best data available
  • Match records together, based upon defined criteria, so we can understand whether these records relate to the same Customer or not
  • Create a linkage between these records so we can build up a view of all the records belonging to the same Customer
  • Store this linkage data, and the actual data if required, in a ‘Catalogue’ of Customer data in a central location
  • Provide analytical tools on top of the linkage and actual data to provide direct or indirect answers to GDPR specific questions, such as providing data for a Subject Access Request (SAR)
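The matching and linking steps above can be sketched in a few lines of Python. This is a minimal, illustrative example only: the record fields, the match rule (same email or similar name) and the similarity threshold are all assumptions for demonstration, not a prescribed GDPR matching standard. Real-world matching would use far richer rules and probabilistic techniques.

```python
from difflib import SequenceMatcher

# Hypothetical customer records from two source systems; the field
# names and ids are illustrative assumptions, not a required schema.
records = [
    {"id": "CRM-001", "name": "Jane A. Smith", "email": "j.smith@example.com"},
    {"id": "BILL-17", "name": "Jane Smith",    "email": "j.smith@example.com"},
    {"id": "CRM-002", "name": "Robert Jones",  "email": "rob.jones@example.com"},
]

def similarity(a, b):
    """Crude name-similarity score between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(r1, r2, threshold=0.8):
    """Match rule (an assumption): identical email, or very similar names."""
    return r1["email"] == r2["email"] or similarity(r1["name"], r2["name"]) >= threshold

# Union-find structure turns pairwise matches into customer-level clusters,
# creating the linkage between records described above.
parent = {r["id"]: r["id"] for r in records}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for i, r1 in enumerate(records):
    for r2 in records[i + 1:]:
        if is_match(r1, r2):
            union(r1["id"], r2["id"])

# Linkage table: source record id -> customer cluster id, ready to be
# stored in the central 'Catalogue' of Customer data.
linkage = {r["id"]: find(r["id"]) for r in records}
print(linkage)
```

Here the two Jane Smith records share an email address, so they end up in the same cluster, while Robert Jones remains separate; the linkage table is exactly the kind of record-to-customer mapping the Catalogue would hold.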

We haven't covered some of the more technical, but very important, aspects of this data management process, but the list above gives an idea of what we're trying to achieve. Activities such as properly loading and profiling the data are assumed to be carried out as part of any operational data management processing.

The diagram below shows a very simplified representation of how the GDPR Data Lake would fit into an operational data management process.

[Diagram: the GDPR Data Lake within an operational data management process]

In the diagram above, the GDPR Data Lake is drawn larger than the other components. That's to recognise that it's probably some form of Big Data-based capability.

Being able to match and link data, regardless of where it has come from, is an important aspect of GDPR as it will form the basis of many answers to the questions the regulation will pose. Some examples of how this data might be used include:

  • Where to go for data to fulfil a Subject Access Request (SAR)
  • What data the recorded ‘consent’ applies to, and where that data is
  • Where the fact that data has been deleted or masked is recorded for analysis
  • Evidence of a proper data management process around GDPR principles
  • Source data to support portability to other providers
  • A source of data for self-service capabilities around GDPR
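A toy sketch shows how the first two uses above might be answered from the catalogue. The catalogue and consent structures here are illustrative assumptions about how the linkage data could be organised, not a defined product or schema.

```python
# Hypothetical catalogue: customer cluster id -> (source system, record id)
# pairs, built from the matching and linking step.
catalogue = {
    "CUST-0001": [("CRM", "CRM-001"), ("Billing", "BILL-17")],
    "CUST-0002": [("CRM", "CRM-002")],
}

# Hypothetical consent store: cluster id -> purposes consented to.
consent = {
    "CUST-0001": {"marketing", "service"},
    "CUST-0002": {"service"},
}

def subject_access_request(customer_id):
    """Where to go for data to fulfil a SAR: every source record we hold."""
    return catalogue.get(customer_id, [])

def has_consented(customer_id, purpose):
    """Does the customer's recorded consent cover this processing purpose?"""
    return purpose in consent.get(customer_id, set())

print(subject_access_request("CUST-0001"))
print(has_consented("CUST-0002", "marketing"))
```

The point is that once linkage data sits in one central place, SAR and consent questions become simple lookups rather than enterprise-wide searches.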

So I’ve called this the ‘GDPR Data Lake’, as it has all the hallmarks of a typical Data Lake but with a focus around GDPR.

Isn’t it like a Marketing Data Lake?

Okay, that's fair enough. Yes, it looks a lot like a Marketing Data Lake. If I were building a Marketing Data Lake, I'd be doing many of the same activities and with much of the same data.

There are some slight differences though:

  • Rather than holding Marketing Preference data, we're holding ‘Consent’ data
  • We're going to hold lineage data about what's happened to the source data, as evidence that we have the right records matched and linked
  • If we've masked data, we record where, how and why
  • If data has been deleted, we record the history of that fact as evidence
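The evidence-keeping points above amount to an audit trail. The sketch below shows one minimal way to record masking and deletion events; the field names and event values are assumptions for illustration, not a standard GDPR evidence schema.

```python
from datetime import datetime, timezone

# Illustrative audit trail for masking and deletion events.
audit_log = []

def record_event(customer_id, source, action, reason):
    """Append evidence of what happened to a customer's data, and why."""
    audit_log.append({
        "customer_id": customer_id,
        "source": source,
        "action": action,    # e.g. "masked" or "deleted"
        "reason": reason,    # e.g. "right-to-be-forgotten request"
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

record_event("CUST-0001", "CRM", "masked", "consent withdrawn")
record_event("CUST-0001", "Billing", "deleted", "right-to-be-forgotten request")

# Evidence trail for a single customer, e.g. to demonstrate compliance.
history = [e for e in audit_log if e["customer_id"] == "CUST-0001"]
print(len(history))
```

Even when the underlying data has been deleted, the log of the deletion survives, which is what provides the evidence the regulation asks for.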

This isn’t an exhaustive list but you can see that with a few slight changes the principles behind a Marketing Data Lake and a GDPR Data Lake are very similar.

This is really important as many organisations have, or are building, Marketing Data Lakes with the appropriate infrastructure to support them. In this scenario, it means that one of the core capabilities required for GDPR can be constructed from something that the organisation either already has or is building.

With the pattern for a Data Lake already well established, reconfiguring this approach to meet the needs of GDPR should reduce the time, cost and effort of delivering this important capability.