Data Lakes Can Have Hidden Undercurrents

Enterprises need to dip their toes cautiously into data lakes – not because the technology isn’t sufficient to support these vast data stores, but because the people and organizations that use them may not be ready.

That’s one of the takeaways from a panel I moderated at the recent Data Summit, held in New York, which explored the issues and opportunities around data lakes. The panelists – Anne Buff, business solutions manager for best practices at SAS Institute; Abhik Roy, database engineer for Experian; and Tassos Sarbanes, mathematician and data scientist for investment banking at Credit Suisse – discussed what needs to be considered in terms of data lake governance, regulatory compliance, security, and access.

In terms of technology, most enterprises are ready today to adopt or build data lakes, panelists said. “We’re heading in the right direction,” said Sarbanes. “The integration of all various services, processes and toolsets out there really fits the bill well. We hear every day from the open source community about new tools that make it all possible. The sky’s the limit; we’re going to see more coming down the pike from the open source community.”

Buff, who has long been opposed to the data lake concept, agreed that the issue with data lakes isn’t their technology but rather “the people using it.” “To me a data lake is like that coffee can that lots of guys keep in the garage. That has every extra bolt or screw or nail, from every single project, because it could be used some time. For the guys who understand when a screw or bolt should be used, those are okay to have access to that coffee can.”

Buff said that approaches such as data lakes may rise and fall, but the important thing is that organizations nurture or hire individuals who are able to evolve with organizational requirements. The key, she said, is not to look for specific skillsets, but “answering the question of what your primary mission is. You should be hiring them on the terms that they are really on board with what they’re trying to achieve as far as the company is concerned, so as the technology shifts you need to shift with it.” Data lakes “are just a blip on the radar,” she said, pointing out that new technologies are under development that could employ DNA frameworks to store data in a fraction of the space now required, making large data stores unnecessary.

The question is, then, will data lakes still have value several years down the road, as the outside internet and semantic web take hold, offering online data storage and analytics resources? Or is some data – such as sensor feeds – so transitory that it needs to be captured and maintained? Sarbanes said data lakes are necessary, as “organizations need to have their own place to secure [and] safely keep the data.” In his industry, he said, there is “no place for a bank or financial organization to set accounts out in the internet – they would be out of business the next day.” There may be a place for data out in the cloud eventually, but it will take some time, he added.

Buff said there’s a place for the architecture, but noted that “the biggest red flag is organizations who want to get all their data in one place [so] they could do things. The idea that collocated data comes integrated is absolutely false.” She added that data lakes will work if “it’s the appropriate place for data-capable and well-trained people to work and discover new possibilities with data.” However, “it does not mean if you have a data lake, you should open it up to everybody. If they are not trained, very bad things happen.”

Roy urged that business requirements be the first consideration, above and beyond any technology issues. “We need to ask ourselves, what’s the value of the data lake? Then the problem is to try to get very lightly modeled data into that environment. Now is there an opportunity to work on that, or to build models for a specific case, with data, very cheaply?”

Along with governance and skills, ontology is a key requirement for data lake environments, Sarbanes said. “Ontology is going to play a major role within your data lake or data lakes. It’s extremely important, since anyone in compliance going into a data lake to search for something and run analytics will need to look at a data dictionary.”