I recently had the pleasure of participating in a big data panel at the Pacific Crest investors’ conference (the replay is available here). I was joined on the panel by Hortonworks, MapR, Datastax and Microsoft. There is clearly a lot of interest in the world of big data and how the market is evolving. I came away from the panel with four fundamental thoughts:
Big data is so much more than just Hadoop. It includes:
- The vastly growing volume of traditional data we have processed for 40 years, and the need to provide traditional data integration.
- The cloud, and the need to provide cloud data integration.
- The hugely expanding volumes of new types of data such as social data, and the need to provide social data integration.
- The enormous volumes of machine data, and the need to provide machine data integration.
- The use of innovative new analytical capabilities such as Hadoop, and the need to provide Hadoop data integration.
The world of Hadoop is diverging
- As evidenced by the strong debate on the panel between Hortonworks and MapR over the role of proprietary enhancements in their respective distributions.
Big Data is all about data integration
- The words “data” and “integration” were reiterated by each panelist, reinforcing the same message again and again.
Big Data is about augmenting existing data
- Microsoft summarized the situation perfectly by stating that the new enterprise will have both structured data and unstructured data.
- Enterprises will need the accuracy of order processing correlated with the probabilistic analysis of sentiment.
- Enterprises will require a single view of the customer, correlated with relationships across the social network, to truly have a complete view of their clients.
Hadoop is a great example of pure innovation. Developed by a new breed of organizations to provide a low-cost and scalable analytics engine, it is proving to be extremely enticing to enterprises that are looking at new ways of analyzing data. We are in the early stages of this brave new world – recent research from the 451 Group and Jaspersoft suggests that just 18% of enterprises are using Hadoop HDFS at this point, although 68% are already using machine-generated content (web logs, sensor data) as the source for their big data projects, and 46% are using human-generated text (social media, blogs).
As with all new innovations, people will begin to evolve the use-cases beyond those originally intended. With new programming frameworks, such as MapReduce, it is now possible to hand-code all sorts of processing requirements. However, hand-coding is still hand-coding, and it brings with it all the complexity, cost and down-stream headaches of maintenance and integration hairballs that we have grown accustomed to in the current world of IT.
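To illustrate what that hand-coding looks like, here is a minimal sketch of the classic MapReduce word count, written in Python purely for readability (a real Hadoop job would be written against the Java MapReduce API or Hadoop Streaming; the function names and the in-process shuffle here are my own illustrative assumptions, not Hadoop code):

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data is big", "data integration"]
intermediate = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(intermediate))
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'integration': 1}
```

Even this toy example shows the pattern: every new processing requirement means another mapper, another reducer, and another job to test and maintain by hand – exactly the integration hairball described above.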
However, the good news is that the innovation of Hadoop for analytics is being matched by the innovation of Informatica in removing hand-coding. Informatica is committed to helping enterprises learn the lessons of the last 40 years and deliver a code-less environment for ETL into Hadoop, and ETL on Hadoop. We are already seeing great examples of enterprises augmenting traditional structured data analysis with new types of machine data, including U.S. Xpress, as well as adoption of complex parsing within Hadoop using HParser, such as at eHarmony.
In my next blog I will look in more detail at the innovation of Hadoop as a new analytics engine and the thriving role of code-less ETL both into and within Hadoop.