Little Kids, Little Problems, Big Kids, Big Problems
I was talking with a colleague recently about our kids, and we were sharing stories about the challenges of raising them. While Jim is starting out on the journey (his kids are 1 and 3), I am nearing the end of it (I hope), with two young adults – 18 and 21. During our conversation, I was reminded of the saying "Little kids, little problems; big kids, big problems" and the different parenting styles we adopt as our kids grow up. This got me thinking: are there similarities between little data quality (is there such a thing?) and big data quality? Do we need a different approach for each?
With big data projects becoming more prevalent, they are having a bigger impact on many organizations' performance. And like growing kids, the bigger the data, the bigger the potential for data quality problems. Consider a one percent error rate in 1 million records: that's 10,000 bad records. Now consider 1 billion records: that's 10 million bad records. That's a real big problem. These big data quality problems present themselves in many ways:
Eroded Confidence – users question the veracity of the data and avoid systems that rely on the output of their big data environment until they can be reassured of the quality of the data
Increased Inefficiencies – duplication of effort and reworking the data mean less time for testing assumptions, gaining insights, and innovating
Flawed Decisions – a bad decision not only impacts an organization's bottom line but can also cause harm at an individual level, for example, in health care decisions or in driverless cars making choices based on bad data
Back to my question: as with parenting small and big kids, do we need a different approach for big data quality? The answer is yes. Unlike 'small data', the volume, complexity, and speed of data entering your big data environment make the job of cleaning it all impractical, and the returns may be marginal. So what approach can you take to big data quality?
Decide what to clean – you may not want to clean certain elements because they would lose their meaning, or you may decide the data is good enough to spot general trends
Automate the process – provide business users and data scientists with prebuilt data quality rules, and apply artificial intelligence where possible, so they can understand the nature of the data, identify problems, and take remedial action
Standardize and re-use – deploy data quality services so common data quality rules can be managed centrally, optimized for specific data domains and shared across the organization
Continuously monitor – as new data flows in, profile and measure its quality, providing business and IT with a clear understanding of any potential issues so they can respond accordingly
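To make the last three points concrete, here is a minimal sketch in Python of what prebuilt, reusable data quality rules and batch profiling might look like. The rule names, record fields, and thresholds are illustrative assumptions, not part of any particular product or the approach described above:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    """A reusable data quality rule: a name plus a pass/fail check."""
    name: str
    check: Callable[[dict], bool]  # returns True if the record passes

# Prebuilt rules, managed in one place and shared across teams
# ("standardize and re-use"); the fields here are hypothetical.
RULES = [
    Rule("email_present", lambda r: bool(r.get("email"))),
    Rule("age_in_range", lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120),
]

def profile(records: list[dict], rules: list[Rule] = RULES) -> dict[str, float]:
    """Profile a batch of incoming records: failure rate per rule.

    Run on each new batch, this gives business and IT a running
    measure of quality ("continuously monitor") without requiring
    that every record be cleaned.
    """
    failures = {rule.name: 0 for rule in rules}
    for record in records:
        for rule in rules:
            if not rule.check(record):
                failures[rule.name] += 1
    total = len(records)
    return {name: count / total for name, count in failures.items()}

batch = [
    {"email": "a@example.com", "age": 34},
    {"email": "", "age": 34},            # fails email_present
    {"email": "b@example.com", "age": 150},  # fails age_in_range
    {"email": "c@example.com", "age": 28},
]
rates = profile(batch)  # {'email_present': 0.25, 'age_in_range': 0.25}
```

The point of the sketch is the shape of the approach: rules live in one shared list rather than being rewritten per project, and monitoring reports failure rates so people can decide which elements are worth cleaning.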
Taking the right approach to data quality in a big data environment ensures that the data meets the right level of quality for the context in which it is used. And like taking the right approach to parenting big kids, you can be confident in their choices and the decisions they make in life.