How to Build a Cloud Data Lake You Can Trust
When I was at Informatica World 2018 in May, I had the chance to talk to a lot of our customers. Whether I was speaking with IT leaders, business leaders, or both, one of the most common challenges that came up was how to build a cloud-based data lake the right way. In short, the problem was, “How do I stop a data lake from turning into a data swamp?”
As enterprise IT has geared up to grapple with big data in the last few years, many have embraced the idea of a cloud data lake: a theoretically endless expanse into which you can pump all the data your business is accumulating—without having to think about the data right now. That option is a huge win over traditional data warehouses, which require up-front thinking to make sure the data is fully governed and trusted, fit for a specific purpose. The data lake is more like that one drawer in the kitchen where you shove everything you haven’t figured out a place for yet.
And that’s how you build a cloud-based data swamp. It’s the reason for the common lament about data scientists: These multi-Ph.D. math wizards are in high demand, and have the potential to deliver one transformative business insight after another. Yet they end up spending 70 percent of their very expensive time just cleaning up messy data before they can even get started.
Pay now or pay later
The illusion of the cloud data lake is that you don’t have to get bogged down in data governance. But as that frustrated data scientist finds out later, you do have to deal with the management and governance of your data. You just have to decide whether to do it now or later.
My advice to our customers is always to do at least some of that governance up front, because it’ll save you a lot of time later. It’s impossible, of course, to completely prepare data for uses you haven’t even thought of yet. But establishing and automating a baseline level of governance as data goes into your lake ensures it’s more trustworthy when you draw on it for a specific initiative.
Therefore, IT leaders who understand the technology and the business users who understand the data need to collaborate to establish a minimum level of quality and governance. Your starting point is the metadata, so any user of the data can understand its source, how the data is defined, whether it has an owner in your data governance organization, and so on. One crucial example: basic governance also enables information security and identity governance. If an employee isn’t authorized to see certain data in the application it was created in—such as HR data or customer credit card data—then they shouldn’t be able to access it in your data lake, either. That’s a function of data lineage: Where did this data come from, and what security and governance rules apply?
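To make the lineage-based access idea concrete, here is a minimal sketch in Python. The names (`CatalogEntry`, `can_access`) and fields are hypothetical illustrations, not any product’s API; the point is that the catalog records where an asset came from and who may read it, so the lake can mirror the source application’s entitlements.

```python
# Hypothetical sketch: mirroring source-system entitlements in the lake
# via catalog metadata. All names and fields here are illustrative.
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Minimal metadata record for a data asset in the lake."""
    name: str
    source_system: str                                   # lineage: where it originated
    owner: str                                           # accountable data steward
    restricted_roles: set = field(default_factory=set)   # roles allowed to read


def can_access(entry: CatalogEntry, user_roles: set) -> bool:
    """If the source restricted the asset, the user needs a permitted role;
    unrestricted assets are open to everyone."""
    if not entry.restricted_roles:
        return True
    return bool(entry.restricted_roles & user_roles)


payroll = CatalogEntry(
    name="hr.payroll",
    source_system="HR application",
    owner="hr-data-steward",
    restricted_roles={"hr_analyst"},
)

print(can_access(payroll, {"marketing_analyst"}))  # False: not entitled at the source
print(can_access(payroll, {"hr_analyst"}))         # True: entitlement carries over
```

The design choice worth noting is that the check depends only on metadata, so it can be enforced uniformly at the lake level regardless of which source system the data came from.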
While you’re not trying to bog down data ingestion with too much up-front governance, you do want to make the data readily useful. Analysts and data scientists should be able to do self-serve data prep: quickly find data assets and see a full view of lineage and relationships, including data domains, users and usage, and other related assets.
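As a toy illustration of that self-serve discovery, the sketch below filters a catalog by data domain and surfaces each asset’s upstream sources. The catalog structure and field names are assumptions for the example, not a real catalog schema.

```python
# Hypothetical sketch of self-serve discovery: filter catalog entries by
# data domain and show their upstream lineage. Field names are illustrative.
catalog = [
    {"name": "crm.accounts", "domain": "customer",   "upstream": ["CRM app"]},
    {"name": "web.clicks",   "domain": "behavioral", "upstream": ["clickstream"]},
    {"name": "crm.contacts", "domain": "customer",   "upstream": ["CRM app", "marketing app"]},
]


def find_assets(domain: str):
    """Return (name, upstream sources) for every asset in the given domain."""
    return [(e["name"], e["upstream"]) for e in catalog if e["domain"] == domain]


for name, upstream in find_assets("customer"):
    print(f"{name} <- {', '.join(upstream)}")
```

Even this trivial lookup shows why the metadata has to be captured at ingestion time: without the `domain` and `upstream` tags, an analyst has nothing to search against.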
Furthermore, a common metadata foundation will prepare your data lake for innovative uses of machine learning and artificial intelligence, which need context to make use of your data. We’ve been telling customers that metadata is a critical component of any data strategy, because metadata is what determines the difference between a data lake and a data swamp. And if you take that idea a step further, I would argue that your metadata repository (a.k.a. your data catalog) will soon become the most valuable set of information your company maintains.
Many layers of truth
A traditional data warehouse is built to be a very structured single source of truth. That’s why you do all the work around data quality up front—to create an absolutely reliable database. You can’t do that with a data lake, but a solid understanding of your metadata will give you data that’s easier to work with, more reliable, and more immediately actionable.
That’s not to say there’s a single solution for the data in your lake. In fact, we find that the best approach is to seek varying stages of quality. You may have reason to have data that is so undergoverned that it’s “use at your own risk” in terms of analytics investigations—a place to play around. But no insights can be trusted as actionable until you repeat the experiment with a better-governed set of data. And then you can have other levels of data that are better governed, even approaching the “single source of truth” you’d find in a completely structured data warehouse.
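The tiered-quality idea above can be sketched as a simple promotion workflow: an asset starts in a “use at your own risk” zone and only moves to a better-governed tier after its quality checks pass. The tier names and `promote` function are hypothetical, chosen just to illustrate the progression the paragraph describes.

```python
# Hypothetical sketch: tagging assets with a governance tier so users know
# how much to trust them. Tier names are illustrative, not a standard.
from enum import Enum


class Tier(Enum):
    RAW = "use at your own risk"          # exploratory sandbox
    CURATED = "validated and governed"    # repeatable, trustworthy analysis
    CERTIFIED = "single source of truth"  # warehouse-grade reliability


def promote(asset: dict, checks_passed: bool) -> dict:
    """Move an asset up one tier, but only if its quality checks passed."""
    order = [Tier.RAW, Tier.CURATED, Tier.CERTIFIED]
    i = order.index(asset["tier"])
    if checks_passed and i < len(order) - 1:
        return {**asset, "tier": order[i + 1]}
    return asset


asset = {"name": "sales.orders", "tier": Tier.RAW}
asset = promote(asset, checks_passed=True)
print(asset["tier"])  # Tier.CURATED: insights here can now be trusted further
```

The one-tier-at-a-time rule mirrors the article’s point: an exploratory finding isn’t actionable until the experiment is repeated against a better-governed copy of the data.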
For more on how to build a cloud data lake that you can believe in, read our white paper, “The CDO’s Guide to Intelligent Data Lake Management.”