Let’s take a look at some typical customer records that we’re about to bring into a master data management system. Table 1 shows several customer records from different systems.
As you can see, there are different variations of the customer’s name. The address shows similar variation: in some records the building number is expressed as a digit, while in others it is spelled out as a word. In one record the entire address sits in a single field, while in others it is broken out into street address, city, state, and zip.
We can also see that the phone number field appears in different formats and is missing from two records. You will notice that in some records the date of birth is missing, as is the gender. Normally I would clean up this data before attempting any matching. However, just to show how powerful Informatica’s matching capabilities are, I will attempt to match these records without any data cleansing whatsoever.
In table 2, I ran a quick match on the Name and Addr1 fields. The two columns on the left show a cluster number and a score. Records with the same cluster number are records the software considers a match, and the score expresses the probability of that match. So in the first two records, J. W. Namath and Joe Willie Namath fall into the same cluster: we consider them a match because they are both at the address of one Broadway. The first record in a cluster always has a score of 1; a score of 1 in a subsequent record would indicate a perfect match with the first record in the cluster.
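To make the scoring idea concrete, here is a minimal sketch of cluster scoring using Python’s standard-library `difflib` as a stand-in for Informatica’s proprietary matcher. The records and the averaging of field similarities are my own illustrative assumptions, not Informatica’s actual algorithm:

```python
from difflib import SequenceMatcher

def match_score(a: str, b: str) -> float:
    """Similarity between two field values, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical records echoing the first cluster in table 2: (Name, Addr1).
records = [
    ("J. W. Namath", "1 Broadway"),
    ("Joe Willie Namath", "One Broadway"),
]

anchor = records[0]
scores = []
for name, addr in records:
    # Average the Name and Addr1 similarities against the cluster's first
    # record, which by convention scores 1.0 against itself.
    s = (match_score(anchor[0], name) + match_score(anchor[1], addr)) / 2
    scores.append(round(s, 2))

print(scores)
```

Note that the uncleansed “1 Broadway” versus “One Broadway” still yields a partial similarity rather than a miss, which is the point of fuzzy matching on raw data.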
In the second cluster, Joe Namath, JJ Namath and Joseph Namath are considered a match. However, if you look closer at the records you can see that they have different birth dates. I happen to know that in this example Jennifer Josephine Namath and Joseph Namath are twins and their father is Joe Namath. So I need to separate those records, as they are not the same person.
What I will do is add date of birth to the match criteria. As you can see in table 3, I have successfully broken the match between the father and the children. However, since the children are twins they were born on the same day, so they still match each other. To break the match between the twins, I will add the gender column and test my results.
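The effect of tightening the match criteria can be sketched as follows. Here names are compared fuzzily while date of birth and gender must match exactly; the records, the 0.6 threshold, and the use of `difflib` are all illustrative assumptions, not the product’s actual rules:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records: a duplicate of the father, the father, and the twins.
records = [
    {"name": "Joe Namath",    "dob": "1943-05-31", "gender": "M"},
    {"name": "Joe W. Namath", "dob": "1943-05-31", "gender": "M"},
    {"name": "Joseph Namath", "dob": "1970-02-14", "gender": "M"},
    {"name": "JJ Namath",     "dob": "1970-02-14", "gender": "F"},
]

def is_match(a, b, threshold=0.6):
    # Names may vary fuzzily, but the exact-match columns must agree.
    name_ok = SequenceMatcher(
        None, a["name"].lower(), b["name"].lower()
    ).ratio() >= threshold
    return name_ok and a["dob"] == b["dob"] and a["gender"] == b["gender"]

pairs = [(a["name"], b["name"])
         for a, b in combinations(records, 2) if is_match(a, b)]
print(pairs)
```

Date of birth separates the father from the twins, and gender separates the twins from each other, leaving only the genuine duplicate pair.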
As we can see in table 4, the match between the brother and sister has now been broken. However, we are still missing a match, because Joe Namath got married and moved from New York to Brooklyn. So now we have to devise a way to catch those matches as well.
What other attributes do we have to work with? The only one we have not used yet that could potentially add matches is the phone number. When the phone company introduced number portability, they did not realize they had created a wonderful attribute for identifying people who have moved.
We now run a match on the name and phone number fields. This match brings together three records across different addresses, as illustrated in table 5. However, because the phone number is not fully populated, we are still missing matches. The interesting fact to note is that some of the matches we got before are no longer matches here. So what we can do is associate the matches from the name-and-address match with those from the name-and-phone match.
By using a simple transformation in Informatica’s data quality product, you can see in table 6 that all of the Joe Namath records that belong in the same cluster have been identified by a new association ID with this technique.
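Under the hood, associating two match passes amounts to computing connected components across both sets of match pairs. The source does not describe Informatica’s implementation, but the idea can be sketched with a small union-find over hypothetical record IDs:

```python
# Associate clusters from two independent match passes (name+address and
# name+phone) so that records linked by either pass share one association ID.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Hypothetical matched pairs: the phone pass links record 5 into the 1-2
# cluster and record 6 into the 3-4 cluster.
name_addr_matches = [(1, 2), (3, 4)]
name_phone_matches = [(2, 5), (4, 6)]

uf = UnionFind()
for a, b in name_addr_matches + name_phone_matches:
    uf.union(a, b)

# Records sharing a root share an association ID.
groups = {}
for rec in sorted(uf.parent):
    groups.setdefault(uf.find(rec), []).append(rec)
print(sorted(groups.values()))
```

Any record reachable through either match pass lands in the same association group, which is exactly the behavior the association ID in table 6 captures.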
There is one additional concept we need to discuss: removing duplicates, or de-duping. Duplicates can exist for a number of reasons. The two most common are a lack of duplicate checks at the time of record insertion, and the bringing together of files from multiple systems. You will hear multiple terms used to describe the record that persists from a group of identified duplicates: survivor record, master record, golden record, and best version of the truth are some of the common ones. The process of reducing a group of identified duplicates to one record is called consolidation.

As we have already seen, we can match customer records either within a single system or across different systems, using different match criteria within the same process. When two records are identical, the duplicate can simply be removed (assuming you are allowed to, either by law or by company standards). When data is different or missing in one or more fields, resolving the duplicate is a little more complex. The cross-system case is the one we have already been dealing with: bringing records from multiple systems together with the intent of creating a single golden record, or master record.
There are two methods of consolidating the identified duplicate records into a single master record. The first method is programmatically determining which fields from each of the records to use in the master record. One example of a consolidation rule would be to take the most frequent non-blank value and insert it into the master record. If you look at the state column in table 6 for association group 2, you will see ‘NY’, ‘New York’ and null; the abbreviation ‘NY’ is the most frequent non-blank value for that group. Another example rule might be to take the longest value. Depending on which of these rules was used, you would end up with either ‘NY’ or ‘New York’ in the state column of the master record for association group 2.
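Both rules are easy to express programmatically. This sketch uses hypothetical state values for the group (with ‘NY’ appearing twice so the frequency rule has a clear winner); the function names are my own, not Informatica’s:

```python
from collections import Counter

# Hypothetical state values collected from one association group.
state_values = ["NY", "NY", "New York", None]

def most_frequent_nonblank(values):
    """Consolidation rule 1: pick the value that occurs most often,
    ignoring blanks/nulls."""
    counts = Counter(v for v in values if v)
    return counts.most_common(1)[0][0]

def longest_value(values):
    """Consolidation rule 2: pick the longest non-blank value."""
    return max((v for v in values if v), key=len)

print(most_frequent_nonblank(state_values))  # "NY"
print(longest_value(state_values))           # "New York"
```

The choice between rules is a business decision: the frequency rule favors the convention most systems already use, while the longest-value rule favors the most descriptive representation.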
The other method of consolidation is a clerical review, also referred to as the manual method. Here the cluster is sent to a staging database, where a business analyst selectively picks the best values from each cluster to build a master record. So rather than a program making the selection, someone visually inspects the clusters and manually selects the best values to put into the master record.
And actually, you can combine the two methods. For example, when the match score is .90 or greater, you can programmatically determine the master record. However, if the score is between .80 and .90, you can send those clusters to a staging database and have the business analyst manually review them. If the analyst determines the records represent the same customer, she can selectively pick which value from each record in the cluster becomes part of the master record.
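The hybrid rule reduces to simple threshold routing. The .90 and .80 cut-offs come from the example above; the function name and the “no match” handling of lower scores are my own assumptions:

```python
def route_cluster(score: float) -> str:
    """Route a match cluster based on its score:
    high-confidence clusters consolidate automatically, borderline
    clusters go to a business analyst, and the rest are non-matches."""
    if score >= 0.90:
        return "auto-consolidate"
    if score >= 0.80:
        return "clerical review"
    return "no match"

print([route_cluster(s) for s in (0.95, 0.85, 0.70)])
```

Tuning these thresholds is a trade-off between analyst workload and the risk of consolidating records that are not actually the same customer.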
In figure 2, note the bold green arrows: we have manually selected field values from different records to build the final record at the bottom.