People are obsessed with data. Data captured from our smartphones. Internet data showing how we shop and search — and what marketers do with that data. Big Data, which I loosely define as people throwing every conceivable data point into a giant Hadoop cluster with the hope of figuring out what it all means.
Too bad all that attention stems from fear, uncertainty and doubt about the data that defines us. I blame the technology industry, which, in the immortal words of “Cool Hand Luke,” has had a “failure to communicate.” For decades we’ve talked the language of IT and left it up to our direct customers to explain the proper care-and-feeding of data to their business users. Small wonder it’s way too hard for regular people to understand what we, as an industry, are doing. After all, how can we expect others to explain the do’s and don’ts of data management when we haven’t clearly explained it ourselves?
I say we need to start talking about the ABC’s of handling data in a way that’s easy for anyone to understand. I’m convinced we can because — if you think about it — everything you learned about data you learned in kindergarten: It has to be clean, safe and connected. Here’s what I mean:
Data cleanliness has always been important, but assumes real urgency with the move toward Big Data. I blame Hadoop, the underlying technology that makes Big Data possible. On the plus side, Hadoop gives companies a cost-effective way to store, process and analyze petabytes of nearly every imaginable data type. And that’s the problem as companies go through the enormous time suck of cataloging and organizing vast stores of data. Put bluntly, big data can be a swamp.
The question is, how to make it potable. This isn’t always easy, but it’s always, always necessary. It begins, naturally, by ensuring the data is accurate, de-duped and complete.
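To make those three checks concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `clean_records` function, the `email` and `name` fields, and the choice of email as the duplicate key are hypothetical, and the accuracy test is deliberately crude. Real data quality work involves far richer rules.

```python
def clean_records(records, required=("email", "name")):
    """Return records that are complete, plausibly accurate, and de-duped.

    Hypothetical illustration: field names and rules are assumptions.
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Complete: every required field is present and non-blank.
        if any(not str(rec.get(field) or "").strip() for field in required):
            continue
        email = rec["email"].strip().lower()
        # Accurate (crudely): the email must at least look like an address.
        if "@" not in email:
            continue
        # De-duped: keep only the first record per normalized email.
        if email in seen:
            continue
        seen.add(email)
        cleaned.append({**rec, "email": email, "name": rec["name"].strip()})
    return cleaned
```

Normalizing the key before comparing matters: “Ann@Example.com ” and “ann@example.com” are the same customer, and a naive comparison would keep both.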
Now comes the truly difficult part: Knowing where that data originated, where it’s been, how it’s related to other data and its lineage. That data provenance is absolutely vital in our hyper-connected world where one company’s data interacts with data from suppliers, partners, and customers. Someone else’s dirty data, regardless of origin, can ruin reputations and drive down sales faster than you can say “Target breach.” In fact, we now know that hackers entered Target’s point-of-sale terminals through a supplier’s project management and electronic billing system. We won’t know for a while the full extent of the damage. We do know the hack affected one-third of the entire U.S. population. Which brings us to:
Obviously, being safe means keeping data out of the hands of criminals. But it doesn’t stop there. That’s because today’s technologies make it oh so easy to misuse the data we have at our disposal. If we’re really determined to keep data safe, we have to think long and hard about responsibility and governance. We have to constantly question the data we use, and how we use it. Questions like:
- How much of our data should be accessible, and by whom?
- Do we really need to include personal information, like social security numbers or medical data, in our Hadoop clusters?
- When do we go the extra step of making that data anonymous?
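On the last question, one common option is to replace a sensitive identifier with a keyed hash before the data ever lands in the cluster. The sketch below uses Python’s standard `hmac` and `hashlib` modules; the `pseudonymize_ssn` function and the key handling are hypothetical. Note the hedge: this is pseudonymization, not full anonymization, because anyone holding the secret key can recompute and match tokens.

```python
import hashlib
import hmac


def pseudonymize_ssn(ssn: str, secret_key: bytes) -> str:
    """Replace an SSN with a stable, keyed token (hypothetical helper).

    The same SSN and key always give the same token, so records can
    still be joined without storing the raw number.
    """
    # Strip formatting so "123-45-6789" and "123 45 6789" match.
    digits = "".join(ch for ch in ssn if ch.isdigit())
    if len(digits) != 9:
        raise ValueError("expected a 9-digit social security number")
    # HMAC-SHA256 keyed with a secret kept outside the cluster.
    return hmac.new(secret_key, digits.encode("utf-8"), hashlib.sha256).hexdigest()
```

A keyed hash is used rather than a plain one because there are only a billion possible nine-digit numbers, which makes an unkeyed hash easy to reverse by brute force.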
And as I think about it, I realize that everything we learned in kindergarten boils down to the ethics of data: How, for example, do we know if we’re using data for good or for evil?
That question is especially relevant for marketers, who have a tendency to use data to scare people, for crass commercialism, or to violate our privacy just because technology makes it possible. Use data ethically, and we can help change how it is used.
In fact, I believe that the ethics of data is such an important topic that I’ve decided to make it the title of my new blog.
Stay tuned for more musings on The Ethics of Data.
Data Warehouse Optimization (DWO) is becoming a popular term that describes how an organization optimizes its data storage and processing for cost and performance while data volumes continue to grow from an ever-increasing variety of data sources.
Data warehouses are reaching their capacity much too quickly as the demand for more data and more types of data forces IT organizations into very costly upgrades. Further compounding the problem, many organizations don’t have a strategy for managing the lifecycle of their data. It is not uncommon for much of the data in a data warehouse to be unused or infrequently used, or for too much compute capacity to be consumed by extract-load-transform (ELT) processing. This is sometimes the result of business requests for one-off reports that are no longer used, or of staging raw data in the data warehouse. Some examples:
- A large global bank’s data warehouse was exploding with 200TB of data, forcing the bank to consider an upgrade that would cost $20 million. The bank discovered that much of the data was no longer being used and could be archived to lower-cost storage, thereby avoiding the upgrade and saving millions. This same bank continues to retire data monthly, resulting in ongoing savings of $2-3 million annually.
- A large healthcare insurance company discovered that fewer than 2% of its ELT scripts were consuming 65% of its data warehouse CPU capacity. This company is now looking at Hadoop as a staging platform to offload the storage of raw data and ELT processing, freeing up its data warehouse to support the hundreds of concurrent business users.
- A global media & entertainment company saw its data increase by 20x per year and the associated costs increase 3x within 6 months as it on-boarded more data, such as web clickstream data from thousands of web sites and in-game telemetry data.
In this era of big data, not all data is created equal. Most raw data, whether it originates from machine log files, social media, or years of original transaction data, is considered to be of lower value, at least until it has been prepared and refined for analysis. This raw data should be staged in Hadoop to reduce storage and data preparation costs, while the data warehouse capacity should be reserved for refined, curated and frequently used datasets. Therefore, it’s time to consider optimizing your data warehouse environment to lower costs, increase capacity, optimize performance, and establish an infrastructure that can support growing data volumes from a variety of data sources. Informatica has a complete solution available for data warehouse optimization.
The first step in the optimization process, as illustrated in Figure 1 below, is to identify inactive and infrequently used data and ELT performance bottlenecks in the data warehouse. Step 2 is to offload the data and ELT processing identified in step 1 to Hadoop. PowerCenter customers have the advantage of Vibe, which allows them to map once and deploy anywhere, so that ELT processing executed through PowerCenter pushdown capabilities can be converted to ETL processing on Hadoop as part of a simple configuration step during deployment.
Most raw data, such as original transaction data, log files (e.g. Internet clickstream), social media, sensor device, and machine data, should be staged in Hadoop, as noted in step 3. Informatica provides near-universal connectivity to all types of data so that you can load data directly into Hadoop. You can even replicate entire schemas and files into Hadoop, capture just the changes, and stream millions of transactions per second, such as machine data, into Hadoop.
The Informatica PowerCenter Big Data Edition makes every PowerCenter developer a Hadoop developer without having to learn Hadoop, so that all ETL, data integration and data quality can be executed natively on Hadoop using readily available resource skills while increasing productivity up to 5x over hand-coding. Informatica also provides data discovery and profiling tools on Hadoop to help data science teams collaborate and understand their data.
The final step is to move the resulting high-value and frequently used data sets prepared and refined on Hadoop into the data warehouse that supports your enterprise BI and analytics applications.
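As a rough illustration of what the first step involves, the Python sketch below flags tables that have not been touched recently and finds the few ELT scripts responsible for most of the CPU time. It is not how any Informatica product works. The function names, the 90-day idle threshold, and the input dictionaries (last-access dates and CPU seconds, which a real warehouse would expose through its own query logs) are all assumptions made for the example.

```python
from datetime import date, timedelta


def find_inactive_tables(last_access, today, max_idle_days=90):
    """Return names of tables not accessed within max_idle_days.

    last_access maps table name -> date of most recent query (assumed input).
    """
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(name for name, last in last_access.items() if last < cutoff)


def top_cpu_consumers(cpu_seconds, share=0.65):
    """Return the heaviest scripts that together reach `share` of total CPU.

    cpu_seconds maps script name -> CPU seconds consumed (assumed input).
    """
    total = sum(cpu_seconds.values())
    picked, running = [], 0.0
    # Walk scripts from heaviest to lightest until the share is covered.
    for name, secs in sorted(cpu_seconds.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        picked.append(name)
        running += secs
    return picked
```

The tables returned by the first function are candidates for archiving or offloading, and the scripts returned by the second are the ones most worth moving off the warehouse first.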
To get started, Informatica has teamed up with Cloudera to deliver a reference architecture for data warehouse optimization so organizations can lower infrastructure and operational costs, optimize performance and scalability, and ensure enterprise-ready deployments that meet business SLAs. To learn more, please join the webinar A Big Data Reference Architecture for Data Warehouse Optimization on Tuesday November 19 at 8:00am PST.
Figure 1: Process steps for Data Warehouse Optimization