The growth of big data drives many things, including the use of cloud-based resources, the growth of non-traditional databases, and, of course, the growth of data integration. What’s typically not as well understood are the required patterns of data integration, or, the ongoing need for better and more innovative data cleansing tools.
Indeed, while writing Big Data@Work: Dispelling the Myths, Uncovering the Opportunities, Tom Davenport observed data scientists at work. During his talk at VentureBeat’s DataBeat conference, Davenport said data scientists would need better data integration and data cleansing tools before they’d be able to keep up with the demand within organizations.
But Davenport is not alone. Most who deploy big data systems see the need for data integration and data cleansing tools. In most instances, not having those tools in place hindered progress.
I would agree with Davenport, in that the number one impediment to moving to any type of big data is how to clean and move data. Addressing that aspect of big data is Job One for enterprise IT.
The fact is, just implementing Hadoop-based databases won’t make a big data system work. Indeed, the data must come from existing operational data stores, and leverage all types of interfaces and database models. The fundamental need to translate the data structure and content to effectively move from one data store (or stores, typically) to the big data systems has more complexities than most enterprises understand.
The path forward may require more steps than originally anticipated, and perhaps the whole big data thing was sold as something that’s much easier than it actually is. My role for the last few years is to be the guy who lets enterprises know that data integration and data cleansing are core components to the process of building and deploying big data systems. You may as well learn to deal with it early in the process.
The good news is that data integration is not a new concept, and the technology is more than mature. What’s more, data cleansing tools can now be a part of the data integration technology offerings, and actually clean the data as it moves from place to place, and do so in near real-time.
So, doing big data anytime soon? Now is the time to define your big data strategy, in terms of the new technology you’ll be dragging into the enterprise. It’s also time to expand or change the use of data integration and perhaps the enabling technology that is built or designed around the use of big data.
I hate to sound like broken record, but somebody has to say this stuff.