Data Lakes: Data Nirvana or Utopic Dream?
Are data lakes a good thing?
This was the debate going back and forth at the recent Data Summit, held in New York. Interestingly, the roster of speakers, representing a range of industry experts, was sharply divided on the value of data lakes to enterprises. Some saw data lakes (central repositories of raw data that is simply collected, then structured and processed later when an application needs it) as risky business, while others regarded them as the logical way to make the most of the big data tsunami.
In a keynote panel discussion early in the conference, Miles Kehoe, search evangelist at Avalon Consulting, and Anne Buff, business solutions manager at the SAS Institute, expressed caution about data lakes. Buff, for one, said data lakes were great technology tools, but didn’t make sense for the business. “I argue vehemently against it,” she said. “Not because it isn’t valuable. From an analytics standpoint it’s a great playground or sandbox because it’s this utopia of putting our data in one place and make it naked, make it raw so we could do whatever we want with it. But that’s the biggest risk you could imagine. Let’s put every piece of data we ever had in our company in one place, and tell everybody about it.”
The problem, Buff continued, was data security and privacy. The only insurance against this risk is for an organization to have a “good data governance program where people respect data,” as well as to certify individuals. However, she added, such best practices are not very common in enterprises. She referred to the notion of secure data lakes as a “utopic belief that we can get all data in one place.”
Kehoe agreed with Buff, comparing the idea of the data lake to the “x” drive that was part of earlier PC networks. He cautioned that organizations may not have enough control over the content of the data being stored within a data lake. “You’re putting stuff there, and you don’t know what it is,” he said. “You may have things that expose you to sexual harassment lawsuits, for example. Can you imagine people copying their files, and shoving it up to a file share somewhere, so it’s publicly available, with no security.”
Some experts say the idea of having such an x drive, or “a big dumb disk,” is fine. “If that’s how it takes to get data there, by all means, put it on that dumb disk,” said David Mariani, CEO of AtScale.
Mariani spoke on a panel later that day, which I moderated, alongside Wendy Gradek, senior manager with EMC, and Andy Schroepfer, chief strategy officer at Hosting, both of whom expressed great support for the data lake concept.
Data lakes help address the greatest challenge for many enterprises today: disparate data sources and the inertia they create, said Gradek. “I don’t know how many times I’ve been told the information we need is six months out, or it’s about a year out. That’s not going to work for the business; their goals are very much weekly driven, especially in sales, where if you don’t make your numbers, and you don’t have visibility into your data, you’re running blind.” The key to resolving this is supporting disparate data sources in a single enterprise location, she continued. “We need it to be in a central repository in its original state, so when we have those questions we can go to it and apply the logic as close to query time as possible and get what we need quickly.”
Mariani agreed, noting how he “came to a realization that data movement is evil. Data is like water. It’s very expensive and difficult to move once it lands someplace.” Today’s data volumes, he said, have “grown beyond our ability to pre-process it or to pre-structure it, to build structures to answer questions today.”
Schroepfer stated that it’s better to have data in one place, “as opposed to distributed data sitting on different people’s desktops, sitting in different people’s Excel spreadsheets. To me, that’s far worse than having a centralized store where you can lock it down and provide access. It’s as good and as clean as you want it to be.”
(Disclosure: the author is a contributor to Database Trends & Applications, published by Information Today, Inc., host of the Data Summit mentioned above.)