Data Streams, Data Lakes, Data Reservoirs, and Other Large Data Bodies
A data lake is a simple concept: it is a catchment area for data entering the organization. In the past, most businesses didn’t need to organize such a data store because almost all data was internal. It traveled via traditional ETL mechanisms from transactional systems to a data warehouse and then was sprayed around the business as required.
When a good deal of data comes from external sources, or even from internal sources like log files, which never previously made it into the data warehouse, there is a need for an “operational data store.” This has definitely become the premier application for Hadoop and it makes perfect sense to me that such technology be used for a data catchment area. The neat thing about Hadoop for this application is that:
- It scales out “as far as the eye can see,” so it is unlikely to be overwhelmed by the data volumes, even when they grow beyond the petabyte level.
- It stores data as it arrives and lets you impose structure when you read it, which means that you don’t need to expend much effort in modeling data when you decide to accommodate a new data source. You just choose a key for the data and define the metadata at leisure.
- The cost of the software and the storage is very low.
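The idea of landing data under a key first and defining its metadata at leisure can be sketched in a few lines. This is a toy, in-memory illustration only: `CatchmentStore` and its methods `land`, `describe`, and `read` are hypothetical names invented for this sketch, not part of Hadoop or any of its components.

```python
import json
from typing import Any, Callable, Dict, List


class CatchmentStore:
    """Toy in-memory stand-in for a data catchment area (hypothetical, not a Hadoop API)."""

    def __init__(self) -> None:
        self._raw: Dict[str, List[str]] = {}                 # raw payloads, grouped by source key
        self._parsers: Dict[str, Callable[[str], Any]] = {}  # metadata, registered later

    def land(self, source: str, payload: str) -> None:
        """Accept data from a source without modeling it first."""
        self._raw.setdefault(source, []).append(payload)

    def describe(self, source: str, parser: Callable[[str], Any]) -> None:
        """Attach structure to a source after the data has already arrived."""
        self._parsers[source] = parser

    def read(self, source: str) -> List[Any]:
        """Apply whatever structure is known at read time; otherwise return raw text."""
        parser = self._parsers.get(source, lambda raw: raw)
        return [parser(payload) for payload in self._raw.get(source, [])]


store = CatchmentStore()
store.land("web_logs", '{"path": "/home", "status": 200}')
store.land("web_logs", '{"path": "/missing", "status": 404}')

raw_view = store.read("web_logs")       # no metadata yet: plain strings come back
store.describe("web_logs", json.loads)  # metadata defined at leisure
parsed_view = store.read("web_logs")    # the same data, now read as dictionaries
```

Nothing about the stored data changes between the two reads; only the description applied on the way out does, which is why a new source costs so little to accommodate.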
So let’s imagine that we need a data catchment area because we have decided to collect data from log files, mobile devices, social networks, public data sources, or whatever. Let’s also imagine that we have implemented Hadoop and some of its useful components and have begun to collect data.
Is it reasonable to describe this as a data lake?
A Hadoop implementation should not be a set of servers randomly placed at the confluence of various data flows. The placement needs to be carefully considered and if the implementation is to resemble a “data lake” in any way, then it must be a well-engineered man-made lake. Since the data doesn’t just sit there until it evaporates but eventually flows to various applications, we should think of this as a “data reservoir” rather than a “data lake.”
There is no point in arranging all that data neatly along the aisles when it arrives, because at that point we may not know what we want to do with it. We should organize the data once we know.
Another reason we should think of this as more like a reservoir than a lake is that we might like to purify the data a little before sending it down the pipes to the applications or users that want it.
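What “purifying” means will depend on the data, but a minimal sketch might look like the following. It assumes, purely for illustration, that records arrive as JSON text with a `user` field; the function name `purify` and the three rules it applies (drop unreadable records, normalize the user name, drop duplicates) are hypothetical choices, not a prescribed method.

```python
import json
from typing import Dict, Iterable, Iterator


def purify(raw_records: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Filter and normalize raw records before they flow downstream (illustrative rules only)."""
    seen = set()
    for raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unreadable record: leave it in the reservoir
        if not isinstance(record, dict) or "user" not in record:
            continue  # missing the field this sketch assumes every record has
        record["user"] = str(record["user"]).strip().lower()  # normalize the user name
        fingerprint = json.dumps(record, sort_keys=True)
        if fingerprint in seen:
            continue  # duplicate once normalized
        seen.add(fingerprint)
        yield record


reservoir = [
    '{"user": " Alice ", "event": "login"}',
    '{"user": "alice", "event": "login"}',
    'not json at all',
    '{"event": "orphan"}',
]
clean = list(purify(reservoir))
# clean == [{"user": "alice", "event": "login"}]
```

The raw records stay in the reservoir untouched; only the copy sent down the pipes is cleaned, so a different application can later apply different rules to the same data.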