To Sink or Swim in your Data Lake

Everyone I spoke to at Informatica World 2015 last month wants to build a Data Lake but is afraid that the Lake will end up as a Swamp. Based on where we are heading, I don’t think using words like Lakes, Swamps, Reservoirs, or names of other large bodies of water is going to work well. They help with visualizing the solution, but upon further discussion they end up creating more confusion.

Enterprises latch on to these words, which in turn encourages vendors to use the moniker on competing solutions. One such case is the Capgemini/Pivotal collaboration on the “business data lake.” What they built is a series of Big Data platforms that can consume any type of data and provide secure access to data inside the platform. I don’t see what’s new. To me, it seems like dressing up existing concepts with a name that is more appealing.

Imagine I coin a term: “Data Vitamin.” Data Vitamins keep the data healthy (usable) and slow down aging (extensible storage). Obviously, this is a very stupid idea to begin with, but I’m so much in love with it that I refuse to be flexible with these foundation principles. In a year or two, you will see Data Vitamin A, Data Vitamin C, Data Vitamin B2, etc. pop up. In essence, these do pretty much the same thing as “Data Vitamin,” with perhaps some minor variations. The same applies to Data Lakes, Data Swamps, Data Reservoirs, etc.

So, let’s take a fresh look at things.

It’s a Goodwill store

At a Goodwill store, you know that the stuff on display is gently used. In the store, you will find clothes, art, books, musical instruments, electronics, etc.

Similarly, data that has delivered its value to its source system doesn’t need to be exclusive anymore. In other words, it’s time for the data to go work for the community. This sounds simple, and it is not just another data backup strategy.

Let’s take a CRM as an example of a source system. For how long do you think a customer log is held in the system? For 1 to 2 years? It’s very unlikely that it stays in the system forever. To find your answer, look at the data archival strategy for the system. Statistically, there is only a 28% chance that you will look at your archived data. This means that the source system has sucked all the value from that customer log.

The CRM system may have no further use for the data, but other systems in the community may find value in it.

This question comes up every time I equate a Data Lake to a Goodwill store: “Do I put all my data into the Lake even if it contains Personally Identifiable Information (PII)?” I answer this question with another question – How many times have you donated 22-carat gold jewelry to the Goodwill store? It may happen, but only by accident. The source system must treat PII carefully and ensure its safety.

Expect the best, plan for the worst

On the other hand, not everything you donate will sell; some could be deemed unfit for sale and will be discarded.

Apply some basic rules, not because you are investing millions of dollars to build the Lake, but because you are building a habitat. And stop looking at these rules as data governance.

I find it very frustrating when I hear that data quality, de-duplication, data optimization, and other processes are included in the Lake under the pretense of managing data. These processes belong to the upstream and downstream systems and don’t necessarily need to be in the Lake.

So, what are the rules? For this, you need metadata. Each entry must be able to answer the following questions: why, who, what, when, where, and how? Now, don’t get started on building logic to create metadata in the Lake. That’s reinventing the wheel and quite unnecessary. The upstream system that supplies the data incorporates this information during delivery.
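To make the six questions concrete, here is a minimal sketch of what a delivery record could look like. The class name `DeliveryMetadata`, the helper `missing_answers`, and the meaning I give each field are my own assumptions for illustration; the only thing taken from the rule above is that every entry answers why, who, what, when, where, and how, and that the source system fills these in.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class DeliveryMetadata:
    """Hypothetical record the source system attaches when it delivers data."""
    why: str        # purpose the data served in the source system
    who: str        # supplying system or owner
    what: str       # short description of the content
    when: datetime  # delivery timestamp
    where: str      # where the data originated
    how: str        # how it was produced or delivered


def missing_answers(meta: DeliveryMetadata) -> List[str]:
    """Return the names of the questions the supplier left unanswered."""
    return [name for name, value in vars(meta).items() if value in ("", None)]


entry = DeliveryMetadata(
    why="customer support history",
    who="crm",
    what="customer log",
    when=datetime(2015, 6, 1, tzinfo=timezone.utc),
    where="crm archive",
    how="nightly batch",
)
print(missing_answers(entry))
```

The Lake only checks that the answers are present. It does not derive them, which keeps the metadata logic with the supplier.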

Another question I get is – “Will this metadata stop my Lake from becoming a swamp?” My answer is – No. The purpose of associating metadata is to improve usability. If you feel the need to add a few more attributes, be my guest.

Data must die

Consider the size of Twitter data, assuming you’re tracking specific hashtags:

1 tweet = 2 KB

25 tweets/min = 50 KB per minute

1 hour = 3 MB… 1 day = 72 MB… 1 week = 504 MB… 1 year = 26 GB
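If you want to plug in your own numbers, the arithmetic above can be sketched in a few lines. The 2 KB per tweet and 25 tweets per minute are the assumptions from the figures above; the constant and function names are mine, and I use decimal units (1 MB = 1,000 KB).

```python
TWEET_KB = 2            # assumed average size of one tweet, in KB
TWEETS_PER_MINUTE = 25  # assumed rate for a single hashtag


def volume_kb(minutes: int) -> int:
    """Storage needed, in KB, for the given number of minutes of tweets."""
    return TWEET_KB * TWEETS_PER_MINUTE * minutes


MINUTES_PER_DAY = 60 * 24

per_hour_mb = volume_kb(60) / 1_000
per_day_mb = volume_kb(MINUTES_PER_DAY) / 1_000
per_week_mb = volume_kb(MINUTES_PER_DAY * 7) / 1_000
per_year_gb = volume_kb(MINUTES_PER_DAY * 365) / 1_000_000

print(per_hour_mb, per_day_mb, per_week_mb, round(per_year_gb, 2))
```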

All this is for just one hashtag. Twitter’s public streaming API accounts for about 10% of the full stream of data. Now imagine the storage implications if you are consuming the full stream.

Have you ever wondered why financial websites operate publicly with a 15-minute delay? It’s because the data is already historical and can’t be used to establish trades; it’s already old. There is nothing wrong with using all the data available to process. The question is – Will you gain anything from it?

Now, let’s return to the CRM example. Although the data generated here is less than 10% of Twitter’s, it grows over time. When the performance of the CRM system starts to decline, we find ways to archive old data. Why should the Lake be any different? The source system should set a date, using a metadata attribute, that tells the Lake when the data can be discarded. Now it’s up to the Lake to discard the data based on its demand. I like this rule – The Lake extends the life of the data by “x” each time the data is consumed. Now, don’t confuse data being consumed with data appearing in query search results; they are not the same thing.
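Here is a rough sketch of that rule, under my own assumptions: `LakeEntry`, `consume`, `purge`, and the 30-day value for “x” are all hypothetical names and numbers, and I chose to add “x” to the existing discard date rather than to the moment of consumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

EXTENSION = timedelta(days=30)  # hypothetical value for "x"


@dataclass
class LakeEntry:
    name: str
    discard_after: datetime  # set by the source system at delivery


def consume(entry: LakeEntry) -> None:
    """Record an actual consumption: push the discard date out by x.

    Merely showing up in search results should not call this.
    """
    entry.discard_after += EXTENSION


def purge(entries: List[LakeEntry], now: datetime) -> List[LakeEntry]:
    """Return only the entries whose discard date has not passed yet."""
    return [e for e in entries if e.discard_after > now]


popular = LakeEntry("crm-log-2013", datetime(2015, 5, 15))
ignored = LakeEntry("crm-log-2012", datetime(2015, 5, 15))
consume(popular)  # someone actually used it, so it lives 30 days longer
survivors = purge([popular, ignored], now=datetime(2015, 6, 1))
print([e.name for e in survivors])
```

Data that nobody consumes simply ages out on the date its source chose, while data in demand keeps earning itself more time.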

Another question I get is – “Doesn’t this contradict the foundation rule of Big Data – Store everything, delete nothing?” My answer is – It depends. To keep the Lake from overflowing or turning into a swamp, you must define the boundaries and contents of the Lake.

Rate your data

Data comes in a wide range of flavors, from raw to mastered. The challenge is to be able to measure it all on the same scale. Most users are more interested in the relevance than in the quantity of the data in hand. Using metadata, the source system and the user can rate the data, thereby adding to its value.

Should we consider adding metadata attributes that quantify the quality of the data as perceived by the source? I say yes – these attributes will inform users of what processes have run on the data. Is it cleansed? Is it normalized? Is the address standardized? We don’t need to know the exact rules that ran, only the value of the data as perceived by the source.

For example, if it’s a raw data entry in the CRM system, perhaps it has a potential value of 30. It’s not useless, but as it is cleansed, de-duplicated, address-verified, and mastered, the value increases to 50, 75, 80, and 90 respectively. You may send this data to the Lake at any step, or at every step. The user doesn’t need to agree with this rating but can use it for reference.

Rating becomes very complicated because it represents only one dimension, i.e., the value as perceived by the source. A source may rate the data it generates as 100, while one user may evaluate its value as 30, and another may see it as 55.

One strategy could be to collect all these ratings and average them, which still keeps it 1-dimensional. The other strategy could be to track the ratings from sources only. A user, upon consuming the data, posts the report (data) back to the Lake and rates it based on its value. A relationship between the report and the source data is created, allowing for a 2-dimensional rating system.
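One way to read the second strategy is sketched below. This is only my interpretation: `SourceData`, `Report`, and `two_dimensional_rating` are hypothetical names, and I assume each report carries the rating given by the user who posted it, while the source data keeps the rating given by its source. Averaging the report ratings for the second dimension is just one possible choice.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class Report:
    name: str
    rating: int  # value of the report as rated by the user who posted it


@dataclass
class SourceData:
    name: str
    source_rating: int  # value as perceived by the source system
    reports: List[Report] = field(default_factory=list)  # derived reports


def two_dimensional_rating(data: SourceData) -> Tuple[int, Optional[float]]:
    """Return (source rating, average rating of the reports built on it)."""
    if not data.reports:
        return (data.source_rating, None)
    return (data.source_rating, mean(r.rating for r in data.reports))


crm_log = SourceData("mastered customer log", source_rating=90)
crm_log.reports.append(Report("churn analysis", rating=30))
crm_log.reports.append(Report("upsell candidates", rating=55))
print(two_dimensional_rating(crm_log))
```

The source's view and the consumers' view stay separate instead of being blended into a single number.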

Now, you can imagine the amount of data and metadata that will flow into the Lake. It doesn’t compare with the volume of data flowing through Twitter, but for your organization it’s sizable.

Get your badge

I hesitate, but this is important. The Lake is for everyone but not for “everyone.” What I mean by this is – apart from aging and rating the data, you must consider securing it. Obviously, you have planned for a perimeter defense and various militarized zones to access the data, but what about the content in the data itself? Defense agencies do this all the time: they implement protection at the level of the individual data element.

Your business, however, may not be able to afford this, as it eats into performance. What it can do is implement peer-to-peer data visibility. This is a way for the supplier to determine who can view its data. Think of this as a Scout badge: if the user has the supplier’s badge, they can view all of that supplier’s data; without it, none at all. Although this creates a process for the user to acquire a badge, the process remains external to the Lake management.
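The badge check itself can be very small. In this sketch the entries are plain dictionaries with a `supplier` key, and a user's badges are a set of supplier names; both shapes, and the function name `visible_entries`, are assumptions of mine. How a user earns a badge is deliberately left out, since that process lives outside the Lake.

```python
from typing import Dict, List, Set


def visible_entries(
    user_badges: Set[str], entries: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """All-or-nothing per supplier: keep an entry only if the user holds
    the badge of the supplier that delivered it."""
    return [e for e in entries if e["supplier"] in user_badges]


lake = [
    {"supplier": "crm", "name": "customer log"},
    {"supplier": "billing", "name": "invoice history"},
    {"supplier": "crm", "name": "support tickets"},
]

marketing_badges = {"crm"}
print([e["name"] for e in visible_entries(marketing_badges, lake)])
```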

In summary, don’t get so hung up on the terms you use that they become detrimental to growth and averse to change. Call it whatever you please as long as you are getting what you want. Sometimes, data from a Swamp may be what your business needs more than “clean data coming out of a Faucet.”

Twitter @bigdatabeat

The author has previously published this article on Computer World.