Great Data for Great Analytics – Evolving Best Practices for Data Management
By Philip Russom, TDWI, Research Director for Data Management.
I recently broadcast a really interesting Webinar with David Lyle, a vice president of product strategy at Informatica Corporation. David and I had a “fire-side chat” where we discussed one of the most pressing questions in data management today, namely: How can we prepare great data for great analytics, while still leveraging older best practices in data management? Please allow me to summarize our discussion.
Both old and new requirements are driving organizations toward analytics. David and I started the Webinar by talking about prominent trends:
- Wringing value from big data – The consensus today says that advanced analytics is the primary path to business value from big data and other types of new data, such as data from sensors, devices, machinery, logs, and social media.
- Getting more value from traditional enterprise data – Analytics continues to reveal customer segments, sales opportunities, and threats for risk, fraud, and security.
- Competing on analytics – The modern business is run by the numbers – not just gut feel – to study markets, refine differentiation, and identify competitive advantages.
The rise of analytics is a bit confusing for some data people. As experienced data professionals do more work with advanced forms of analytics (enabled by data mining, clustering, text mining, statistical analysis, etc.) they can’t help but notice that the requirements for preparing analytic data are similar-but-different as compared to their other projects, such as ETL for a data warehouse that feeds standard reports.
Analytics and reporting are two different practices. In the Webinar, David and I talked about how the two involve pretty much the same data management practices, but it different orders and priorities:
- Reporting is mostly about entities and facts you know well, represented by highly polished data that you know well. Squeaky clean report data demands elaborate data processing (for ETL, quality, metadata, master data, and so on). This is especially true of reports that demand numeric precision (about financials or inventory) or will be published outside the organization (regulatory or partner reports).
- Advanced analytics, in general, enables the discovery of new facts you didn’t know, based on the exploration and analysis of data that’s probably new to you. Preparing raw source data for analytics is simple, though at high levels of scale. With big data and other new data, preparation may be as simple as collocating large datasets on Hadoop or another platform suited to data exploration. When using modern tools, users can further prepare the data as they explore it, by profiling, modeling, aggregating, and standardizing data on the fly.
Operationalizing analytics brings reporting and analysis together in a unified process. For example, once an epiphany is discovered through analytics (e.g., the root cause of a new form of customer churn), that discovery should become a repeatable BI deliverable (e.g., metrics and KPIs that enable managers to track the new form of churn in dashboards). In these situations, the best practices of data management apply to a lesser degree (perhaps on the fly) during the early analytic steps of the process, but then are applied fully during the operationalization steps.
Architectural ramifications ensue from the growing diversity of data and workloads for analytics, reporting, multi-structured data, real time, and so on. For example, modern data warehouse environments (DWEs) include multiple tools and data platforms, from traditional relational databases to appliances and columnar databases to Hadoop and other NoSQL platforms. Some are on premises and others are on clouds. On the down side, this results in high complexity, with data strewn across multiple platforms. On the upside, users get great data for great analytics by moving data to a platform within the DWE that’s optimized for a particular data type, analytic workload, or price point, or data management best practice.
For example, a number of data architecture uses cases have emerged successfully in recent years, largely to assure great data for great analytics:
- Leveraging new data warehouse platform types gives analytics the high performance it needs. Toward this end, TDWI has seen many users successfully adopt new platforms based on appliances, columnar data stores, and a variety of in-memory functions.
- Offloading data and its processing to Hadoop frees up capacity on EDWs. And it also gives unstructured and multi-structured data types a platform that is better suited to their management and processing, all at a favorable cost point.
- Virtualizing data assets yields greater agility and simpler data management. Multi-platform data architectures too often entail a lot of data movement among the platforms. But this can be mitigated by federated and virtual data management practices, as well as by emerging practices for data lakes and enterprise data hubs.
If you’d like to hear more of my discussion with Informatica’s David Lyle, please replay the Webinar from the Informatica archive.