The Easy Button
Basic data profiling, while significantly automating the process, is still a manually intensive effort to perform the detailed analysis. For example, normally you have to go through and select the table or tables that you want to participate in the analysis; configure the profile; and then run the profile, etc. At Informatica, we are going beyond basic data profiling other ways. The first is what I jokingly refer to as the Staples™ easy button. A new feature built on our advanced profiling is called Enterprise Discovery. This feature allows you to point at a schema or schemas and selectively run column profiling, primary key profiling, foreign key profiling and data domain discovery. So with a few clicks of the mouse you can run all your profiling requirements in one step against hundreds or thousands of tables. (more…)
This blog discusses going beyond basic data profiling. But for those of you who don’t know what basic data profiling is, let me summarize quickly. Basic data profiling is what I call three-dimensional analysis. I discuss this in some depth in my book, “Three Dimensional Analysis Data Profiling Techniques.”
Basic profiling includes column profiling, table profiling, and cross table profiling. Column profiling is automated discovery of the true metadata of your data. This is the process of identifying the accurate data type and precision, minimum value, maximum value, the number of nulls, percent null and more on a column by column basis. Table profiling attempts to infer a primary key based upon the data. You can also infer functional dependencies within each table. Cross table profiling is about finding primary key / foreign key relationships between the tables, as well as overlap analysis. This technology has been around since the late 90s. It boggles my mind that many data related projects still do not perform basic profiling before they embark on the project. But that, as they say, is another story. (more…)
So now that you understand the terminology and concepts let’s talk about business problems that can be addressed with this technology.
Inability to get to single view of customer because of matching issues
In the examples above, you can see where it can be a challenge getting the correct customer records into a single cluster. If you do not get all the same customer records together properly, you may not be treating particular customers appropriately. One example is not identifying your top customers because they are represented by multiple account numbers. Worse can be treating a very good customer poorly because you think they had only had one small transaction with you but in reality he just did not log in or use his frequent shopper card. This poor service could jeopardize the entire account. (more…)
Let’s take a look at some typical customer records that we’re about to bring into a master data management system. In table 1 you see a couple of customer records from different systems.
A good friend of mine’s husband is a sergeant on the Chicago police force. Recenlty a crime was committed and a witness insisted that the perpetrator was a woman with blond hair about five nine weighing 160 pounds. She was wearing a gray pinstriped business suit with an Armani scarf and carrying a Gucci handbag. (more…)
Finally, you need to create a business case and present the finding of the data quality checkup. There are two levels of presentation that typically take place after the data quality assessment. The first is a technical presentation to IT giving all the details of completeness, conformity, consistency, accuracy, duplication, and integrity characteristics of the data. IT needs to understand the types of issues in order to figure out what needs to be repaired and have an idea what can be fixed and what it might cost.
The more important presentation is what impact these issues are having on the business. Does the lack of accuracy in the data affect the accuracy of business decisions? How does the completeness of the data affect insurance ratings, loan applications, or well drilling decisions? Are your customer’s committing a crime? (more…)
Once you identify the many data anomalies, you need to work with the business to quantify the business impact. If you can’t determine the impact on the business, it either has no impact or you are talking to the wrong people. If you can’t determine the impact, you might as well stop right there or find another area to look at:
No impact = No reason to fix it = No money (more…)
Once you finish your initial assessment, you need to summarize a very long list of potential issues you discovered. This is where you should group the issues to make the presentation of the results more meaningful. For example, you can group items by table—these are X number of issues found affecting Y percent of records per table. Sometimes I group them into the following types of characteristics:
- Is all the requisite information available?
- Are all the address fields populated?
- Are data values missing or in an unusable state?
- Are the phone numbers populated?
- Do all the inpatient claims contain an admission date? (more…)
Now comes the fun part, inspecting the data. For this step, automated data profiling will help you identify actual problems with the data as they relate to business client expectations. Here are just a few possible issues:
- Are the phone numbers empty?
- Are the admission dates missing in inpatient hospital claims?
- Are there car loans with durations greater than 10 years?
- Do shipping records lack corresponding billing records?
- Do product descriptions differ only slightly?
- Are you delivering products to many different customers with the same address?
- What business rules are being violated? (more…)
Let’s look at the steps in more detail for building a business case for data quality using the bottom-up approach. Where do you start? You need to find a sponsor—someone who instinctively knows there is a problem and wants help in quantifying it. Marketing knows it has duplicate customer records and wants to get a better handle on them. You should look at these systems or business processes that work with the customer data. You must assess how the data in these systems is used within marketing. For example, what is the data used for, what critical decisions are made based on this data, and how many people use it to make decisions? The more users or the more critical the decision, the more likely this data is a candidate for evaluation. Also look at more than the initial decision support system and data. Look at any systems that get data from the decision support system. Data flow diagrams are always helpful in assessing this but usually difficult to find. (more…)