Data Prep: Getting Value Out of Big Data Goes Beyond Cleansing
Many organizations experience frustration and disappointment when they first dive into “big data.” The idea of getting industry-shaking insights from the massive troves of data now available is compelling, but you have to have the right data, and you have to treat it right, to get business-changing results.
ZDNet recently tackled the issue in an article titled “Big Data’s Biggest Problem: It’s Too Hard to Get the Data In.” The article approaches the issue from a high level, and looks at the challenge of getting the data in, and making some initial sense of it. It’s a good executive-level introduction, but I think that it’s worth digging deeper to truly understand the issues around data preparation.
Ingesting the data is not the hard part. That can be done quite handily. The challenge is preparing the data for use, across the wide variety of current and future needs, in an efficient and effective way. It goes beyond just “cleansing” the data.
The art of data preparation—for both structured and unstructured data
As the ZDNet article puts it, the most common problem organizations are looking to address is to deliver actionable insights through “the marriage of structured data (your company’s proprietary information) with unstructured data (public sources such as social media streams and government feeds).”
But we can’t just assume that structured data is perfect and ready to go. It may need some work. First you have to capture and preserve the data structure as you move it into your Hadoop environment, probably with Hive. Next, you’ll probably need to bring along your metadata, particularly business metadata around critical terms, semantic meaning, and business context. If you don’t know what the data means, you’re pretty much dead in the water.
Finally, you might want to do some data cleansing. But what exactly does that entail?
- Finding and fixing any missing values.
- Finding and fixing inconsistent formats.
- Removing and resolving duplicate data.
- Doing some data enhancement by adding data or calculating new values from existing ones.
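To make those four steps concrete, here is a minimal sketch in plain Python. The record layout, the field names, the two accepted date formats, and the choice to fill a missing unit count with 0 are all assumptions made for illustration; a real pipeline would pick its own rules.

```python
from datetime import datetime

# Hypothetical customer records; every field name here is an assumption.
records = [
    {"id": "1", "email": "Ann@Example.com ", "signup": "2014-03-01", "units": "3", "unit_price": "9.50"},
    {"id": "2", "email": "bob@example.com", "signup": "03/05/2014", "units": "", "unit_price": "4.00"},
    {"id": "3", "email": "ann@example.com", "signup": "2014-03-01", "units": "3", "unit_price": "9.50"},
]

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # assumed input formats


def normalize_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def cleanse(rows):
    seen = set()
    cleaned = []
    for row in rows:
        # Fix inconsistent formats: trim whitespace and lowercase the email.
        email = row["email"].strip().lower()
        # Remove duplicates, keyed on the normalized email (an assumption).
        if email in seen:
            continue
        seen.add(email)
        # Fill missing values: an empty unit count becomes 0 (an assumption).
        units = int(row["units"]) if row["units"].strip() else 0
        price = float(row["unit_price"])
        cleaned.append({
            "id": row["id"],
            "email": email,
            "signup": normalize_date(row["signup"]),
            # Enhancement: calculate a new value from existing ones.
            "revenue": round(units * price, 2),
            "units": units,
        })
    return cleaned


cleaned = cleanse(records)
```

Note that the duplicate is only detected after the email format has been normalized, which is one reason the order of cleansing steps matters.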
And that’s just your structured data. Now, let’s talk about the unstructured data. You can just dump it “as is” into a Hadoop cluster. The problem, as ZDNet notes, is that data professionals are spending “50% to 90% of their time cleaning up raw data and preparing to input it into the company’s data platforms.” Better to automate more of the data prep so it’s ready for your professionals to work with. What do you need to do to accomplish this?
First you need to parse the data for its structure. You may be able to buy parsers that understand the specific data format you are working with, or you can get parsing tools that you “teach” to recognize the patterns in the data. Both of these are orders of magnitude more productive than doing it by hand.
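As a rough illustration of “teaching” a tool the pattern in the data, the sketch below uses a regular expression to pull fields out of raw text lines. The line layout, the field names, and the `parse_lines` helper are hypothetical; commercial parsers handle far more formats than this.

```python
import re

# Assumed layout: "<timestamp> <LEVEL> user=<id> action=<name>"
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+user=(?P<user>\w+)\s+action=(?P<action>\w+)"
)


def parse_lines(lines):
    """Split raw lines into structured records and rejected lines."""
    parsed, rejected = [], []
    for line in lines:
        match = LINE_PATTERN.fullmatch(line.strip())
        if match:
            parsed.append(match.groupdict())
        else:
            # Keep what did not match so it can be reviewed later.
            rejected.append(line)
    return parsed, rejected


raw = [
    "2014-06-01T10:15:00 INFO user=u42 action=login",
    "garbled ### line",
    "2014-06-01T10:16:30 WARN user=u42 action=checkout",
]
parsed, rejected = parse_lines(raw)
```

Keeping the rejected lines, rather than dropping them silently, gives you a running measure of how well the pattern fits the data.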
The goal here is to relate different data sets together to find insights. One approach is data tagging. Tagging data before handing it off to Hadoop for analysis can save an enormous amount of time. For example, data from your many sales and marketing automation tools can be tagged with a campaign code to help make it clear who did what and when.
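Here is one way the campaign-code idea might look in practice. The source names, campaign codes, and the `tag_events` helper are invented for this sketch; the point is only that a shared tag lets events from different tools be grouped together.

```python
# Hypothetical mapping from a source system to a campaign code.
CAMPAIGN_BY_SOURCE = {
    "email_tool": "SPRING14-EMAIL",
    "webinar_tool": "SPRING14-WEBINAR",
}


def tag_events(events, campaign_by_source, default="UNTAGGED"):
    """Return copies of the events with a campaign_code field added."""
    return [
        {**event, "campaign_code": campaign_by_source.get(event["source"], default)}
        for event in events
    ]


events = [
    {"source": "email_tool", "contact": "ann@example.com", "when": "2014-03-02"},
    {"source": "webinar_tool", "contact": "ann@example.com", "when": "2014-03-09"},
    {"source": "trade_show", "contact": "bob@example.com", "when": "2014-03-11"},
]
tagged = tag_events(events, CAMPAIGN_BY_SOURCE)

# Group by tag to see who did what and when, per campaign.
by_campaign = {}
for event in tagged:
    by_campaign.setdefault(event["campaign_code"], []).append(
        (event["contact"], event["when"])
    )
```

Events from a source with no known campaign get an explicit `UNTAGGED` code instead of being dropped, so gaps in the tagging stay visible.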
Then you need to cleanse the data. Not all data needs to be perfect. In fact, “pretty good” data might be enough. You need data at two levels of quality:
- Pretty good data that lets you test basic hypotheses, such as looking for a cause-and-effect between a marketing activity and sales.
- Great data that fuels critical business decisions and customer relationships. (You don’t want to contact a customer without first understanding their history with and value to your business. Customers expect that.)
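One simple way to separate the two levels is to score each record and route it by threshold. The sketch below uses field completeness as the score; the required fields, the thresholds, and the tier names are assumptions for illustration, and real quality rules would usually check much more than completeness.

```python
REQUIRED_FIELDS = ("customer_id", "email", "last_purchase")  # assumed fields


def completeness(record):
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for field in REQUIRED_FIELDS if record.get(field))
    return filled / len(REQUIRED_FIELDS)


def quality_tier(record, great=1.0, pretty_good=0.6):
    """Classify a record by completeness; thresholds are illustrative."""
    score = completeness(record)
    if score >= great:
        return "great"
    if score >= pretty_good:
        return "pretty_good"
    return "needs_work"


full = {"customer_id": "c1", "email": "ann@example.com", "last_purchase": "2014-02-20"}
partial = {"customer_id": "c2", "email": "bob@example.com", "last_purchase": ""}
sparse = {"customer_id": "c3"}

tiers = [quality_tier(r) for r in (full, partial, sparse)]
```

Records in the “pretty good” tier can feed exploratory analysis right away, while only the “great” tier would be used for customer-facing decisions.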
The trick is to process the data “just enough” to make it useful when loading into a big data environment (Hadoop, NoSQL, columnar, etc.). You need to be able to discover, access, and join it with other data. Later, if you’re looking to operationalize a useful analysis, it may be worth doing more data prep work at that time.
It all starts with strategy
The management of your data—from ingestion to prep, transformation to storage—has to flow from a comprehensive strategy. We’ve been addressing data prep as a frequently neglected part of that strategy, but it’s important to remember the whole picture:
- Make sure you have an overall data management strategy—one that aligns with the business strategy.
- Make sure that your data strategy and tools are the same regardless of the types of data you are using. This is critical for productivity and to avoid data silos.
- Look for a data management platform that can bring all of these capabilities to the big data world: data integration, data quality and governance, and data security.
The right strategy should lead you to the right tools, platforms and personnel, and help you get more of the business-enhancing results big data promises.
For more on how Informatica helps you get real value from all your data, visit our Big Data Ready Page.