Tag Archives: schema
Data warehousing systems remain the de facto standard for high-performance reporting and business intelligence, and there is no sign that will change soon. But Hadoop now offers an opportunity to lower costs by moving infrequently used data and data preparation workloads off the data warehouse, and to process entirely new sources of data coming from the explosion of industrial and personal devices. This is motivating interest in new concepts like the “data lake” as adjunct environments to traditional data warehousing systems.
Now, let’s be real. Between the evolutionary opportunity of preparing data more cost effectively and the revolutionary opportunity of analyzing new sources of data, the latter just sounds cooler. This revolutionary opportunity is what has spurred the growth of new roles like data scientists and new tools for self-service visualization. In the revolutionary world of pervasive analytics, data scientists have the ability to use Hadoop as a low cost and transient sandbox for data. Data scientists can perform exploratory data analysis by quickly dumping data from a variety of sources into a schema-on-read platform and by iterating dumps as new data comes in. SQL-on-Hadoop technologies like Cloudera Impala, Hortonworks Stinger, Apache Drill, and Pivotal HAWQ enable agile and iterative SQL-like queries on datasets, while new analysis tools like Tableau enable self-service visualization. We are merely in the early phases of the revolutionary opportunity of big data.
But while the revolutionary opportunity is exciting, there’s an equally compelling opportunity for enterprises to modernize their existing data environment. Enterprises cannot rely on an iterative dump methodology for managing operational data pipelines. Unmanaged “data swamps” are simply impractical for business operations. For an operational data pipeline, the Hadoop environment must be a clean, consistent, and compliant system of record for serving analytical systems. Loading enterprise data into Hadoop instead of a relational data warehouse does not eliminate the need to prepare it.
Now I have a secret to share with you: nearly every enterprise adopting Hadoop today to modernize their data environment has processes, standards, tools, and people dedicated to data profiling, data cleansing, data refinement, data enrichment, and data validation. In the world of enterprise big data, schemas and metadata still matter.
I’ll share some examples with you. I attended a customer panel at Strata + Hadoop World in October. One of the participants was the analytics program lead at a large software company whose team was responsible for data preparation. He described how they ingest data from heterogeneous data sources by mandating a standardized schema for everything that lands in the Hadoop data lake. Once the data lands, his team profiles, cleans, refines, enriches, and validates the data so that business analysts have access to high quality information. Another data executive described how inbound data teams are required to convert data into Avro before storing it in the data lake. (Avro is an emerging data format alongside other new formats like ORC, Parquet, and JSON.) A data engineer from one of the largest consumer internet companies in the world described the schema review committee that had been set up to govern changes to their data schemas. The final participant was an enterprise architect from one of the world’s largest telecom providers, who described how their data schema was critical for maintaining compliance with privacy requirements, since data had to be masked before it could be made available to analysts.
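To make the panel's practices a little more concrete, here is a minimal sketch of two of them: checking inbound records against a mandated schema, and masking sensitive fields before analysts see the data. Everything in it is an assumption for illustration. The field names, the `STANDARD_SCHEMA` and `MASKED_FIELDS` constants, and the `validate` and `mask` helpers are hypothetical and are not taken from any of the companies described. Truncated hashing simply stands in for whatever masking technique a real compliance team would choose.

```python
import hashlib

# Hypothetical standardized schema: field name -> required Python type.
STANDARD_SCHEMA = {"customer_id": str, "event_type": str, "amount": float}

# Hypothetical set of fields that must be masked before analysts see them.
MASKED_FIELDS = {"customer_id"}


def validate(record, schema=STANDARD_SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def mask(record, masked_fields=MASKED_FIELDS):
    """Return a copy with sensitive fields replaced by a truncated SHA-256 digest."""
    masked = dict(record)
    for field in masked_fields:
        if field in masked:
            digest = hashlib.sha256(str(masked[field]).encode("utf-8")).hexdigest()
            masked[field] = digest[:12]
    return masked
```

A record would only land in the lake if `validate` returns an empty list, and analysts would only ever see the output of `mask`. An unsalted hash is used here purely to keep the sketch short; it should not be read as adequate protection for real personal data.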
Let me be clear: these companies are not just bringing CRM and ERP data into Hadoop. These organizations are ingesting patient sensor data, log files, event data, and clickstream data, and in every case, data preparation was the first task at hand.
I recently talked to a large financial services customer who proposed a unique architecture for their Hadoop deployment. They wanted to empower line-of-business users to be creative in discovering revolutionary opportunities while also evolving their existing data environment. They decided to allow lines of business to set up sandbox data lakes on local Hadoop clusters for use by small teams of data scientists. Then, once a subset of data was profiled, cleansed, refined, enriched, and validated, it would be loaded into a larger Hadoop cluster functioning as an enterprise information lake. Unlike the sandbox data lakes, the enterprise information lake was clean, consistent, and compliant. Data stewards of the enterprise information lake could govern metadata and ensure data lineage tracking from source systems to sandbox to enterprise information lakes to destination systems. Enterprise information lakes balance the quality of a data warehouse with the cost-effective scalability of Hadoop.
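The promotion rule in that architecture can be sketched as a simple gate: a dataset moves from a sandbox to the enterprise information lake only after all five preparation steps are recorded, and each move is appended to its lineage. This is a rough illustration under stated assumptions, not the customer's actual design. The dictionary layout, the `REQUIRED_STEPS` tuple, and the `missing_steps` and `promote` functions are all hypothetical.

```python
# The five preparation steps named in the text, in the order given there.
REQUIRED_STEPS = ("profiled", "cleansed", "refined", "enriched", "validated")


def missing_steps(completed_steps):
    """Return the required steps that have not been completed, in order."""
    return [step for step in REQUIRED_STEPS if step not in completed_steps]


def promote(dataset, enterprise_lake):
    """Return a copy of the dataset with the target lake appended to its lineage.

    Raises ValueError if any preparation step is still missing.
    """
    missing = missing_steps(dataset["completed_steps"])
    if missing:
        raise ValueError(
            "cannot promote " + dataset["name"] + ": missing " + ", ".join(missing)
        )
    return {**dataset, "lineage": dataset["lineage"] + [enterprise_lake]}


# Hypothetical dataset record as a data steward might track it.
claims = {
    "name": "claims",
    "completed_steps": set(REQUIRED_STEPS),
    "lineage": ["source:claims_db", "sandbox:team_a"],
}
promoted = promote(claims, "enterprise_information_lake")
```

Keeping lineage as an ordered list mirrors the source-to-sandbox-to-enterprise path described above; a real deployment would more likely record this in a metadata catalog than in the dataset record itself.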
Building enterprise information lakes out of data lakes is simple and fast with tools that can port data pipeline mappings from traditional architectures to Hadoop. With visual development interfaces and native execution on Hadoop, enterprises can accelerate their adoption of Hadoop for operational data pipelines.
No one described the opportunity of enterprise information lakes better at Strata + Hadoop World than a data executive from a large healthcare provider who said, “While big data is exciting, equally exciting is complete data…we are data rich and information poor today.” Schemas and metadata matter more than ever, and with the help of leading data integration and preparation tools like Informatica, enterprises have a path to unleashing information riches. To learn more, check out this Big Data Workbook.
The ability to create abstract schemas that are mapped to back-end physical databases provides a huge advantage for those enterprises looking to get their data under control. However, given the power of data virtualization, there are a few things that those in charge of data integration should know. Here are a few quick tips.
Tip 1: Start with a new schema that is decoupled from the data sources. (more…)
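One way to picture a schema that is decoupled from its data sources is a logical schema whose fields are mapped to physical columns at query time. The sketch below is only an illustration of that idea. The `LOGICAL_CUSTOMER` mapping, the source and column names, and the `resolve` function are hypothetical and do not describe any particular data virtualization product.

```python
# Hypothetical logical schema: logical field -> (physical source, physical column).
LOGICAL_CUSTOMER = {
    "customer_id": ("crm_db", "CUST_NO"),
    "full_name": ("crm_db", "NM"),
    "lifetime_value": ("warehouse", "LTV_USD"),
}


def resolve(logical_schema, physical_rows):
    """Build one logical record from rows keyed by physical source name."""
    record = {}
    for logical_field, (source, column) in logical_schema.items():
        record[logical_field] = physical_rows[source][column]
    return record


# Example physical rows, one per back-end source.
rows = {
    "crm_db": {"CUST_NO": "42", "NM": "Ada Lovelace"},
    "warehouse": {"LTV_USD": 1800.0},
}
customer = resolve(LOGICAL_CUSTOMER, rows)
```

Because consumers only see the logical field names, a back-end column can be renamed or moved to another source by editing the mapping, without touching the code that reads `customer`.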
Last month, Informatica and EMC announced a strategic partnership at EMC’s annual user conference in Boston. This is a significant new relationship for both companies, which in itself is interesting. You would have thought that the company responsible for storing more data than just about anybody in the world and the company responsible for moving more data than anybody in the world would have come together many years ago. So why now? What’s different?
Virtualization changes everything. Customers have moved beyond virtualizing their infrastructure and their operating systems and are now trying to apply the same principles to their data. Whether we’re moving the data to the processing, or the processing to the data, it’s clear that where data physically lives has become increasingly irrelevant. Customers want data as a service and they don’t want to be hung up on the artificial boundaries created by applications, databases, schemas, or physical devices. (more…)