Asset In, Garbage Out: Measuring Data Degradation
Republished, by popular demand
Following up on the discussion I started on GovernYourData.com (thanks to all who provided great feedback), here’s my full proposal on this topic:
We all know about the “Garbage In/Garbage Out” reality that data quality and data governance practitioners have been fighting against for decades. If you don’t trust data when it’s initially captured, how can you trust it when it’s time to consume or analyze it? But I’m also looking at the tougher problem of data degradation: the data comes into your environment just fine, but any number of actions, events, or inactions can turn that “good” data “bad”.
So far I’ve been able to hypothesize eight root causes of data degradation. I’d really love your feedback on both the validity and the completeness of these categories. I’ve reused similar examples across several of them to keep things simple.
In no particular order, these include data degradation due to:
- Age/point-in-time data. The data may have been 100% accurate when captured, but things change over time. If the data doesn’t carry context about when it was relevant, it becomes useless. There are innumerable examples here; some include:
- Contact information. Postal addresses, phone numbers, emails, IP addresses – they can all frequently change over time for any given contact.
- Individual’s age (vs. the best-practice capture method: date of birth)
- Organization revenue or headcount. $1 billion in revenue? Great! Last year? 10 years ago?
- “Active” status for former customer. Inefficient if the customer defected to a competitor. Potentially embarrassing and damaging if the customer passed away.
- Hierarchy/relationship data. Subsidiaries get spun off. Products within a product family get discontinued. Household members move out, get divorced, off to college, etc.
- Records that were once unique can have duplicates created later.
- Change in information structure. Formerly free-form text now captured in structured or hierarchical form (e.g., XML).
- Did the parsing, entity extraction, or natural language processing you leveraged – whether automated or manual – lead to inadvertent misclassification of data? (e.g., first name inadvertently put in the last name field and vice versa; hire date put into the birth date column).
- In this scenario, the quality of the free-form text wasn’t necessarily high, but because it was unstructured there was little implicit expectation of quality or usefulness. The act of parsing and extracting that information into structured data fields implies that each field (and its accompanying metadata) should carry relevant data.
- Metadata and data modeling lapses.
- The right data is put in the wrong context (e.g., first name put in the Last Name field)
- Data classifications not updated to reflect changing taxonomies and metadata standards
- Similar data modeled and stored in different ways (an ID stored as numeric in one database and as text in another; a date stored as text in one and as a date type in another, etc.)
- Ungoverned human and system processes that update or change data.
- Upstream applications allow non-validated updates to previously validated data (e.g., a product registration website accepts a lower-confidence postal address, overwriting the high-quality bill-to address captured via an order)
- Data migration during an application consolidation or upgrade doesn’t properly map source data to target schema (e.g., First name put in Last Name field)
- Undocumented or non-communicated changes to upstream source data schema, capture rules, or policies impact the usefulness of downstream processes (e.g., an opt-out Yes/No flag for marketing communications is added to the upstream CRM system, but the downstream marketing campaign data warehouse isn’t informed)
- Automated business rules, transformations or updates effectively “alter” the data from a trusted state. In other words, DI, DQ and MDM rules gone bad! (E.g., a match/merge rule incorrectly creates a false positive linking two distinct patient records; the Legal name of “Parkway Associates” changes to “Pkwy Associates” through improper use of reference data standardization logic)
- Change in DQ perceptions. The data is the same, but tolerance for prior DQ thresholds changes. Examples:
- A 20% duplicate rate was fine yesterday when marketing was the only team using the data, but now finance requires we reduce it to 10% for reporting to the street
- Email used to be optional, but now it’s required
- Proliferation (ungoverned copies). The more copies of data made, the less likely its quality and security can be monitored and controlled across all repositories.
- Broken relationships. Data is accurate, but related data – transactions, interactions, master data, reference data – that provides context/relevance is archived, purged, or updated. Examples:
- Orphan customers or contacts not linked to any transactions
- Active SKU not linked to any product hierarchy
- Human intervention – mischief/incompetence/innocent mistakes
- Deliberate data entry errors or updates by customers, employees, or partners intended to deceive, mask illicit activity, or defraud. Customers may also provide bad data to protect their privacy or avoid spam.
- Accidental data entry errors due to swivel-chair integration, poor training, poor app or website user experience, etc.
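To make a couple of these degradation modes concrete, here’s a minimal sketch of how they could be measured in code. The record layout, field names, and thresholds below are purely illustrative assumptions, not a prescribed implementation: one check flags age/point-in-time staleness (records not re-verified within a window), and the other computes a duplicate rate to compare against a DQ threshold like the 20%-vs-10% example above.

```python
from datetime import date, timedelta

# Hypothetical customer records; all field names are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "captured_on": date(2015, 3, 1)},
    {"id": 2, "email": "a@example.com", "captured_on": date(2024, 6, 1)},
    {"id": 3, "email": "b@example.com", "captured_on": date(2024, 7, 1)},
]

def stale_records(records, max_age_days=365, today=date(2025, 1, 1)):
    """Flag records whose data hasn't been re-verified recently
    (the age/point-in-time degradation mode)."""
    cutoff = today - timedelta(days=max_age_days)
    return [r["id"] for r in records if r["captured_on"] < cutoff]

def duplicate_rate(records, key="email"):
    """Share of records whose key value appears more than once
    (a simple metric to test against a DQ threshold)."""
    counts = {}
    for r in records:
        counts[r[key]] = counts.get(r[key], 0) + 1
    dupes = sum(1 for r in records if counts[r[key]] > 1)
    return dupes / len(records)

print(stale_records(records))   # record 1 was captured long before the cutoff
print(duplicate_rate(records))  # 2 of 3 records share an email address
```

In practice the duplicate check would use fuzzy matching rather than exact key equality, but even a crude metric like this lets a governance program track whether “good” data is drifting toward a threshold someone actually cares about.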
Okay, fun laundry list and all, but what’s the value of this exercise? Those of us who have worked in the data quality profession know that so much prioritization and budget goes to cleaning up data after it has soured, with too little invested in ensuring it remains valuable and trustworthy in the first place. That’s where an effective Data Governance program can help – especially if it is permitted to focus on all the dependent processes that sustain your data lifecycle.
I’m looking forward to hearing your thoughts on this – and if you’ve been able to effectively mitigate any of these scenarios.