Matching for Management: 20 Common Data Errors and Variation

A good friend of mine’s husband is a sergeant on the Chicago police force. Recently a crime was committed, and a witness insisted that the perpetrator was a woman with blond hair, about five foot nine, weighing 160 pounds. She was wearing a gray pinstriped business suit with an Armani scarf and carrying a Gucci handbag.

So what does this sergeant have to do? Start looking at the women of Chicago. He only needs the women. Actually, he would start with women with blond hair (but judging from my daughter’s constant changes of hair color, he might skip that attribute). So he might start with women in a certain height range and a certain weight group. He would bring those women in to the station for questioning.

As it turns out, when they finally arrested the woman at her son’s soccer game, she had brown hair, was 5’5″ tall and weighed 120 pounds. She was wearing an Oklahoma University sweatshirt, jeans and sneakers. When the original witness saw her, she said, “Yes, that’s the same woman.” It turns out that at the time of the crime she had been wearing four-inch heels, and the pantsuit made her look bigger.

So what can we learn from this episode that has to do with matching? The first thing to understand is that each attribute the witness described can be used to match the suspect. The second is that not all of those attributes were accurate. That is why, when we talk about matching later on, we will use the term fuzzy matching. Fuzzy matching means treating two values as a possible match even when they are not identical. When you look at an address, for example, any number of errors can make the address in one system differ from the same address in another system. Figure 1 shows a number of the common errors that can happen.


Figure 1 – 20 Common Data Errors
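To make fuzzy matching a little more concrete, here is a minimal sketch in Python. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever comparison a real matching tool would use. The `ABBREVIATIONS` table and the `normalize` and `similarity` functions are my own illustrative names, not part of any product, and the sample addresses are made up.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table (an assumption for this sketch);
# a production table would be far larger.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "north": "n", "suite": "ste"}


def normalize(address: str) -> str:
    """Lowercase, replace punctuation with spaces, and map common words to one spelling."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in address.lower()
    )
    return " ".join(ABBREVIATIONS.get(word, word) for word in cleaned.split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two addresses."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


if __name__ == "__main__":
    reference = "123 North Main Street"
    for other in ("123 N. Main St", "123 N Mian St.", "77 Lake Shore Drive"):
        print(f"{other!r}: {similarity(reference, other):.2f}")
```

The abbreviated address scores as identical once both sides are normalized, the transposed "Mian" still scores high, and the unrelated address scores low. Where to draw the line between "match" and "no match" is a judgment call that depends on the data.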

The next concept that we will discuss is grouping (sometimes referred to as blocking). Before the sergeant could interview anyone, he needed to get the list of suspects down to a manageable number. Otherwise he would spend weeks interviewing potential suspects. In the same way, grouping narrows down the number of records in the database that will actually be compared, so that we are not spending days or weeks comparing every record against every other record. This matters most when you have hundreds of millions of customer or product records.
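Here is a rough sketch of grouping in Python. The blocking key (postal code plus the first letter of the last name) and the field names `id`, `last_name` and `zip` are assumptions chosen for illustration; a real system would pick keys that suit its own data.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    """Hypothetical key: postal code plus the first letter of the last name."""
    return f"{record['zip']}|{record['last_name'][:1].upper()}"


def candidate_pairs(records: list[dict]) -> list[tuple[int, int]]:
    """Return pairs of record ids that fall into the same group."""
    groups = defaultdict(list)
    for record in records:
        groups[blocking_key(record)].append(record["id"])
    pairs = []
    for ids in groups.values():
        # Only records inside the same group are ever compared.
        pairs.extend(combinations(sorted(ids), 2))
    return pairs


# Made-up sample records.
records = [
    {"id": 1, "last_name": "Smith", "zip": "60601"},
    {"id": 2, "last_name": "Smyth", "zip": "60601"},
    {"id": 3, "last_name": "Jones", "zip": "60601"},
    {"id": 4, "last_name": "Smith", "zip": "73019"},
]

if __name__ == "__main__":
    print(candidate_pairs(records))
```

Comparing every record against every other record would mean six comparisons for these four records; with this key only one pair is left to compare. The trade-off is the same one the sergeant faced: if the grouping attribute is wrong, as the witness's height and weight were, the true match can fall outside the group and never be compared at all.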

Now, when the sergeant started interviewing individuals from the suspect group, he was only interviewing those who had the potential to match the description of the perpetrator. After each interview the sergeant would go over his notes and make a mental calculation of the probability that this suspect was the perpetrator. In some cases the suspect would have an alibi, which would disqualify that person from further consideration. This is the next concept: comparing and scoring the records in the group selected from the database. This concept is called matching.
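The comparing and scoring step might look something like the following Python sketch. The field names, the `WEIGHTS` values, and the use of a conflicting birth date as the "alibi" are all assumptions made for illustration, and `difflib.SequenceMatcher` again stands in for a real comparison routine.

```python
from difflib import SequenceMatcher

# Hypothetical weights (an assumption); they sum to 1.0 so the score
# stays between 0.0 and 1.0.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}


def field_similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 for a single field."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(left: dict, right: dict) -> float:
    """Weighted score for a pair of records, or 0.0 if the pair is disqualified."""
    # The "alibi": if both birth dates are known and they differ,
    # the pair is dropped regardless of how similar the other fields are.
    left_dob, right_dob = left.get("birth_date"), right.get("birth_date")
    if left_dob and right_dob and left_dob != right_dob:
        return 0.0
    return sum(
        weight * field_similarity(left[field], right[field])
        for field, weight in WEIGHTS.items()
    )


# Made-up sample records.
a = {"name": "Jane Doe", "address": "123 Main St", "phone": "3125550100",
     "birth_date": "1980-01-01"}
b = {"name": "Jane Do", "address": "123 Main St", "phone": "3125550100",
     "birth_date": "1980-01-01"}
c = dict(a, birth_date="1975-06-30")

if __name__ == "__main__":
    print(f"a vs b: {match_score(a, b):.2f}")
    print(f"a vs c: {match_score(a, c):.2f}")
```

Records `a` and `b` differ by one letter in the name and still score high, while `a` and `c` are identical except for the birth date and are disqualified outright. In practice the weights and the disqualifying rules would be tuned to the data rather than picked by hand like this.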
