Big Data: The Emperor may have their Clothes on but…
Today, I am going to share what few others have so far been willing to share in public regarding big data. Before doing so, I need to first bring you up to speed on what I have already written on the topic. I shared previously that the term big data probably is not very useful or descriptive. Thomas Davenport, the author of Competing on Analytics, has found as well that “over 80 percent of the executives surveyed thought the term was overstated, confusing, or misleading”. At the same time, the CIOs that I have been talking to have suggested that I tell our reps never to open a meeting with the big data topic, but instead to talk about the opportunity to relate the volumes of structured data to even larger volumes of unstructured data.
I believe that the objective of these discussions increasingly should be to discover how it is possible to solve even bigger and meatier business problems. It is important, as I said recently, to take a systems view of big data, and this includes recognizing that business stakeholders will not use any data unless it is trustworthy, regardless of cost. Having made these points previously, I would like to bring to the forefront another set of issues that businesses should consider before beginning a big data implementation. I have come to my point of view here by listening to the big data vanguard. Many of these early adopters have told me that they jumped onto the big data bandwagon because they heard big data would be cheaper than traditional business intelligence implementations.
However, these enterprises soon discovered that they couldn’t lure away from their Silicon Valley jobs those practiced in the fine arts of Hadoop and MapReduce. They found as well that hand-coding approaches and the primitive data collection tools provided by the Hadoop vendors were not ready for prime time and did not by themselves save cost. These early pioneers found that they needed a way to automate the movement and modification of data for analysis. What they determined was needed was an automated, non-script-based way to pull and transform the data that populates their Hadoop or other big data systems. This included real-time solutions like HANA and Vertica.
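To make that contrast concrete, here is a minimal Python sketch of the idea behind non-script-based transformation: a generic engine interprets a declarative mapping specification, rather than a developer hand-coding a one-off script per data feed. The mapping format and field names here are illustrative assumptions, not any vendor's actual tooling.

```python
# Illustrative sketch: a declarative, mapping-driven transform engine.
# The mapping spec and all field names are hypothetical examples.

MAPPING = {
    "customer_id": {"source": "cust_no", "transform": str.strip},
    "email":       {"source": "email_addr", "transform": str.lower},
    "revenue":     {"source": "rev_usd", "transform": float},
}

def apply_mapping(record, mapping):
    """Transform one raw source record into the target schema."""
    out = {}
    for target_field, rule in mapping.items():
        raw_value = record[rule["source"]]
        out[target_field] = rule["transform"](raw_value)
    return out

raw = {"cust_no": " 1001 ", "email_addr": "Ada@Example.COM", "rev_usd": "250.75"}
clean = apply_mapping(raw, MAPPING)
print(clean)
# {'customer_id': '1001', 'email': 'ada@example.com', 'revenue': 250.75}
```

The point of the design is that adding a new data feed means adding a new mapping entry, not writing and maintaining new code.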
A new architecture for business intelligence
But as I have looked further into the needs of these early adopters, it became clear that they needed an architecture that could truly manage their end-to-end business intelligence requirements. They needed an architecture that would handle their entire data use lifecycle: the collection, inspection, connection, perfection, and protection of data.
Architecture requires a Data Lake
Obviously, I have already been discussing what could be called the collection phase. But to be clear, big data should be just one element of a larger collection scheme. No one is suggesting, for example, that existing business intelligence systems be replaced wholesale with the so-called newer approaches. Given this, business architecture needs to start by establishing a data lake approach that overarches the new data storage approaches and effectively sits side by side with existing business intelligence assets.
Data discovery starts by testing data relationships
Once new forms of data are collected using Hadoop or other forms of big data storage within an overarching data lake, users and analysts need to inspect the data collected as a whole and surface interrelationships with new and existing forms of data. What is needed, in addition to data movement, is a lake approach for deploying data and evaluating data relationships. Today, this involves enabling business intelligence users to self-serve. One CIO who heard about this lit up and said, “This is like orchestration. Users can assemble data and put it together, and do it from different sources at different times. It doesn’t just have to be a preconceived process.”
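One simple way an analyst might test a candidate data relationship before committing to a full integration is to measure how strongly two sources overlap on a shared key. The following plain Python sketch is illustrative only; the datasets and field names are invented for the example.

```python
# Illustrative sketch: score a candidate join key between two data sources
# by measuring key overlap. Datasets and field names are hypothetical.

crm_records = [
    {"email": "ada@example.com", "segment": "enterprise"},
    {"email": "bob@example.com", "segment": "smb"},
    {"email": "cara@example.com", "segment": "smb"},
]
weblog_records = [
    {"email": "ada@example.com", "page": "/pricing"},
    {"email": "bob@example.com", "page": "/docs"},
    {"email": "dan@example.com", "page": "/home"},
]

def join_overlap(left, right, key):
    """Fraction of left-side keys that also appear on the right side."""
    left_keys = {r[key] for r in left}
    right_keys = {r[key] for r in right}
    return len(left_keys & right_keys) / len(left_keys)

score = join_overlap(crm_records, weblog_records, "email")
print(f"email join overlap: {score:.2f}")  # 2 of 3 CRM keys match
```

A low score tells the analyst early, and cheaply, that a proposed relationship between two sources is weaker than assumed.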
Data Enrichment enables business decision making
Historically, users needed to know what data they wanted to analyze prior to building a business intelligence system. An advantage of Hadoop plus an overarching data lake is that you can put data in place prior to knowing whether the data has an interesting business use case. Once data is captured, tooling is needed to evaluate and connect data and to test the strength of potential data relationships. This includes enabling business users to evaluate the types of analytics that could potentially have value to them. I shared recently just how important it is to visualize data in a way that fits culturally and derives the most potential business value.
Once data has been evaluated and relevant data relationships have been determined, it is important to have a way to siphon off the data that has been judged to have potential business interest and treat it as you always have. This includes adding meaningful additional structure and relationships to the data and fixing the quality of the data that needs to be related and created within an analytic, and it can include things like data mastering. In this data perfection stage, data relationships are extended and data quality and consistency are improved. For finance, it can mean integrating and then consolidating data for a total view of the financial picture. For marketing, it can involve creating an integrated customer record that fuses existing customer master data with external customer datasets to improve cross-sell and customer service. With this accomplished, it becomes an analysis decision and a cost decision whether the data continues to be housed in Hadoop or is managed in an existing traditional data warehouse structure.
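As a rough illustration of the marketing example above, here is a minimal Python sketch of fusing master customer records with an external dataset. The field names and match key are assumptions for illustration; real data mastering involves far more sophisticated matching and survivorship logic.

```python
# Illustrative sketch of record fusion: enrich master customer records
# with attributes from an external dataset, matching on a shared key.
# Field names and the match key are hypothetical.

master = [
    {"email": "ada@example.com", "name": "Ada Lovelace", "lifetime_value": 1200},
    {"email": "bob@example.com", "name": "Bob Byrne", "lifetime_value": 300},
]
external = [
    {"email": "ada@example.com", "industry": "aerospace"},
    {"email": "eve@example.com", "industry": "retail"},
]

def fuse(master_records, external_records, key):
    """Left-join external attributes onto the master records."""
    lookup = {r[key]: r for r in external_records}
    fused = []
    for rec in master_records:
        enrichment = lookup.get(rec[key], {})
        merged = {**rec, **{k: v for k, v in enrichment.items() if k != key}}
        fused.append(merged)
    return fused

golden = fuse(master, external, "email")
print(golden[0])
```

Note that records with no external match pass through unchanged; whether unmatched external records should also be kept is exactly the kind of mastering policy decision the perfection stage has to settle.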
Valuable data needs to be protected
Once data that can be used for business decision making has been created, we need to take the final step of protecting data that often costs millions of dollars to create, refine, and analyze. Recently, I was with a CIO and asked about the various hacks that have captured so much media attention. This CIO said that the CIOs at the companies that had been hacked were not stupid; rather, it is simply hard to justify the business value of protecting the data that has been created. It seems clear to me, at least, that we need to protect data as an asset, as well as the external access to it, given the brand and business impacts of being hacked.
It seems clear that we need an architecture that is built to last and delivers sustained value to the business. So here is the cycle again: collection, inspection, connection, perfection, and protection of data. Each step matters to big data, but also to the data architecture that big data is adding onto.
Author Twitter: @MylesSuer