The reality in data warehousing is that the primary focus is on delivery. The data warehouse team is tasked with extracting, transforming, integrating, and loading data into the warehouse within increasingly tight timeframes. Twenty years ago, monthly data warehouse loads were common. Ten years ago, weekly loads became the norm. Five years ago, daily loads were called for. Nowadays, near-real-time analytics demands the data warehouse be loaded more frequently than once a day.
The quality of the data in the warehouse determines whether it’s considered a trusted source, but data warehousing faces a paradox similar to “which came first, the chicken or the egg?” For the data warehouse, the question is “which comes first, delivery or quality?” Since users can’t complain about the quality of data that hasn’t been delivered yet, delivery always comes first in data warehousing.
And with transaction volumes reportedly growing 50–60% each year, data warehouse teams are struggling to keep up with the flow of data, which, in the era of big data, also includes fast-moving, large volumes of variously structured data (e.g., social interactions and sensor readings). Although new prefixes for bytes (giga, tera, peta, exa, zetta, yotta) measure an increase in space, new prefixes for seconds (milli, micro, nano, pico, femto, atto) measure a decrease in time. More space is being created to deliver more data within the same, or smaller, timeframes. Space isn’t the final frontier; time is.
In order to deliver more data in less time, something has to give, and often that something is data quality. When delivery is job one, quality is job none. Obviously, this is an ineffective strategy, because either data quality issues are discovered upon delivery, reducing business users’ trust in the data warehouse, or, even worse, no one notices and poor data quality becomes a ticking time bomb that will eventually explode and wreak havoc on the organization’s daily business activities.
Overemphasizing data delivery sets the data warehouse team up for being reactive, not proactive, regarding data quality. How does your organization approach the delivery versus quality paradox in data warehousing?
You can listen to my conversation about overcoming data quality challenges in data warehousing with Reuben Vandeventer from CNO Insurance and Sean Crowley from Informatica.
Blogger-in-Chief, Obsessive Compulsive Data Quality
Jim Harris is a recognized thought leader with over 20 years of enterprise data management experience, specializing in data quality and data governance. As Blogger-in-Chief at Obsessive Compulsive Data Quality, Jim offers an independent, vendor-neutral perspective, and hosts the popular audio podcast OCDQ Radio, syndicated on iTunes and Stitcher SmartRadio. Jim is an independent consultant and freelance writer for hire, as well as a regular contributor to Information-Management.com and DataRoundtable.com.