Once you finish your initial assessment, you need to summarize what is often a very long list of potential issues. Grouping the issues makes the presentation of the results more meaningful. For example, you can group items by table: X issues found, affecting Y percent of the records in each table. Sometimes I group them by the following characteristics:
- Is all the requisite information available?
  - Are all the address fields populated?
- Are data values missing or in an unusable state?
  - Are the phone numbers populated?
  - Do all the inpatient claims contain an admission date?
- Are there expectations that data values meet specified formats? If so, do all the values conform to those formats?
  - Does the state field contain only two-character codes (CA), or does it include longer abbreviations (Calif.) or full names (California)?
- Do distinct data instances provide conflicting information about the same underlying data object?
  - Does the address1 field contain only street address information, or is the city or state also in the field?
- Are values consistent across data sets?
  - Does one system use two-character state codes and another use the full name?
- Do interdependent attributes always appropriately reflect their expected consistency?
  - If the country code is US, then the currency code should be USD.
- Do data objects correctly represent the real-world values they are expected to model?
  - Are there transaction dates before the company was founded or dates in the future?
- Are there multiple, unnecessary representations of the same data objects within the data set?
  - Are pencil, #2 pencil, and lead pencil the same item with different item numbers?
  - Are there multiple patient records for the same patient for the same hospital stay?
  - Are there medical procedures with the same description but multiple procedure codes?
- What data is missing important relationship linkages?
  - Are there products in the orders that are not in the product catalog?
  - Are there procedure codes in the patient record that are not in the authorized procedure table?
  - Is the oil well in one system identified by the Well_ID and by the American Petroleum Institute (API) number in another system?
Read the “Characteristics of Data” white paper for a more detailed discussion of this topic.
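Several of the questions above translate directly into simple rule checks. The following is a minimal sketch in Python, assuming each record arrives as a dictionary. The field names (`address1`, `phone`, `state`, `country`, `currency`, `txn_date`, `product_id`), the founding date, the sample catalog, and the characteristic labels are all hypothetical placeholders; the real ones should come from what you and the business agree on.

```python
import re
from datetime import date

# Hypothetical reference values; replace with ones agreed on with the business.
FOUNDED = date(1995, 1, 1)
CATALOG = {"P-100", "P-200"}
STATE_CODE = re.compile(r"[A-Z]{2}")


def check_record(rec, today):
    """Return the list of characteristics this record violates."""
    issues = []
    # Completeness: a required field is absent or blank.
    for field in ("address1", "phone"):
        if not (rec.get(field) or "").strip():
            issues.append(f"completeness:{field}")
    # Conformity: a populated state must be a two-character code such as CA.
    state = rec.get("state") or ""
    if state and not STATE_CODE.fullmatch(state):
        issues.append("conformity:state")
    # Consistency: interdependent attributes must agree.
    if rec.get("country") == "US" and rec.get("currency") != "USD":
        issues.append("consistency:currency")
    # Accuracy: transaction dates must fall between the founding date and today.
    txn = rec.get("txn_date")
    if txn is not None and not (FOUNDED <= txn <= today):
        issues.append("accuracy:txn_date")
    # Integrity: every ordered product must exist in the catalog.
    pid = rec.get("product_id")
    if pid is not None and pid not in CATALOG:
        issues.append("integrity:product_id")
    return issues


# A made-up order record that trips every check except the address one.
sample = {
    "address1": "12 Main St", "phone": "", "state": "Calif.",
    "country": "US", "currency": "EUR",
    "txn_date": date(1990, 6, 1), "product_id": "P-999",
}
print(check_record(sample, today=date(2024, 1, 1)))
```

Tagging each issue with its characteristic at detection time makes it easy to group and count the results afterward.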
Putting the anomalies in categories will help you build the ROI later. For example, filling in a missing address will cost more than correcting an inaccurate one or reformatting and standardizing others. However, sometimes your audience will only care about the count of bad phone numbers or duplicate customers. Lastly, it is important that the business agrees on these characteristics, because the business is the ultimate authority on what constitutes good data quality.
Before you begin the assessment, you should have a clear idea of who uses the data and for what purpose. This will help you determine which fields are important. Does the data comply with the characteristics you and the business deem important? Are the phone numbers complete? Are the addresses accurate? Are there duplicate records?
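The duplicate-record question lends itself to a quick first pass. As a rough sketch, assuming customer records are dictionaries with hypothetical `id`, `name`, and `phone` fields, one simple approach (certainly not the only one) is to normalize the fields and group the records that end up with the same key. Exact matching on normalized keys will miss fuzzier duplicates, so treat the output as candidates for review rather than confirmed duplicates.

```python
import re
from collections import defaultdict


def normalize(rec):
    """Build a comparison key: lowercase name without punctuation, digits-only phone."""
    name = re.sub(r"[^a-z0-9 ]", "", rec["name"].lower())
    name = " ".join(name.split())  # collapse repeated whitespace
    phone = re.sub(r"\D", "", rec["phone"])
    return (name, phone)


def duplicate_candidates(records):
    """Return lists of record ids that share the same normalized key."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize(rec)].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up customer records for illustration.
customers = [
    {"id": 1, "name": "Ann O'Neil", "phone": "(555) 010-2000"},
    {"id": 2, "name": "ann oneil", "phone": "555-010-2000"},
    {"id": 3, "name": "Bob Ray", "phone": "555-010-3000"},
]
print(duplicate_candidates(customers))
```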
Complete characterization of the data, scorecard building, standardization, and cleansing will occur after you get the budget for your data quality project. In this phase, you are identifying what is not correct and what kinds of work will be needed to repair the data. However, a scorecard of the initial data quality is sometimes built during the assessment. A summary of data quality is something executives can grasp quickly.
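An initial scorecard can be as simple as a count of issues and the share of records affected, per table and per characteristic. The sketch below assumes the assessment produced a list of hypothetical `(table, record_id, characteristic)` tuples plus a row count for each table. These names and the sample numbers are illustrative only and do not come from any specific tool.

```python
from collections import defaultdict


def scorecard(issues, row_counts):
    """Summarize issues per table: issue count, distinct records affected, percent."""
    per_table = defaultdict(
        lambda: {"issues": 0, "records": set(), "by_type": defaultdict(int)}
    )
    for table, record_id, characteristic in issues:
        entry = per_table[table]
        entry["issues"] += 1
        entry["records"].add(record_id)  # a record with several issues counts once
        entry["by_type"][characteristic] += 1

    summary = {}
    for table, entry in per_table.items():
        affected = len(entry["records"])
        summary[table] = {
            "issues": entry["issues"],
            "records_affected": affected,
            "pct_records": round(100 * affected / row_counts[table], 1),
            "by_type": dict(entry["by_type"]),
        }
    return summary


# Made-up assessment output for illustration.
issues = [
    ("customer", 1, "completeness"),
    ("customer", 1, "conformity"),
    ("customer", 7, "duplication"),
    ("orders", 42, "integrity"),
]
row_counts = {"customer": 200, "orders": 1000}
for table, row in scorecard(issues, row_counts).items():
    print(table, row)
```

Counting distinct records separately from issues keeps the "percent of records affected" figure from being inflated when one record fails several checks.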
Read this recent eBook by David Loshin of Knowledge Integrity on Selling the Value of Data Quality to the Business.