Data Integration - Informatica

Informatica Data Quality

Matching: Determinism and Probability in a new context

Chris McCauley

I wanted to follow up on my comments about technical versus business metadata and had intended to make this article about metadata, but instead I'm going to kill two birds with one stone and talk about approaches to matching. Why? Well, recently I found myself getting dragged into a very old argument about Probabilistic versus Deterministic matching. Normally I'm very happy to get into any techie argument, and the more pointless the better, but this one has been done to death, so I feigned ignorance and escaped with a new idea for a blog entry.

The argument was about whether a deterministic or probabilistic approach to matching is better. After six years in data quality technology, I'm happy enough to argue for either approach and given a strict choice between "A" and "B" can make a convincing case for option "C."
The remainder of this blog entry could be summarized as follows:

Probabilistic matching has some very interesting technical advantages over Deterministic matching, but each on its own is insufficient to cope with the messy reality of real-world matching problems involving heuristics or business rules. Only a hybrid that combines either or both approaches with a human-friendly rules management system will really cut the mustard.

Now for the more long-winded answer …

Both approaches are trying to achieve the same thing, i.e. finding whether or not two pieces of text match each other to some acceptable level. In the data quality world we are usually interested in matching relatively short sequences of text: a person's name, an address, or a product or asset description. Very often we are trying to determine if the two pieces of text refer to the same thing - the same person, the same product or the same asset.

Both the Deterministic and Probabilistic approaches use deterministic (note the lower-case "d") algorithms to evaluate the similarity between text. What do we mean by deterministic? Simply that given the same inputs, the same result is reached. Generally it's considered a good thing that 2 + 2 (always) = 4 and not, on occasion, a suffusion of yellow.

Both approaches use various algorithms to measure the similarity between two pieces of text. Different algorithms define similarity in different ways and have different strengths and weaknesses. Some just execute more quickly than others or are better at one particular problem. For example, the Jaro Distance algorithm is more tolerant of simple character transpositions than many other algorithms and will forgive simple spelling mistakes. A robust matching process will use several algorithms, possibly targeted at different parts of the text (e.g. one for the name components, one for the address components), and then combine them in some way to determine if the texts match.
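To make that a little more concrete, here is a minimal sketch in Python of the Jaro similarity, followed by one possible way of combining per-field scores. The Jaro function follows the standard textbook definition; everything around it (the `combined_score` helper, the field names and the 0.6/0.4 weights) is an assumption made up for illustration, not a description of how any particular product does it.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for nothing in common."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters only count as matching if they are the same and
    # no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order are transpositions.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


# Hypothetical weights: how much each field contributes to the overall score.
DEFAULT_WEIGHTS = {"name": 0.6, "address": 0.4}


def combined_score(rec_a: dict, rec_b: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted average of per-field Jaro scores (an illustrative choice)."""
    return sum(
        weight * jaro(rec_a[field].upper(), rec_b[field].upper())
        for field, weight in weights.items()
    )


print(round(jaro("MARTHA", "MARHTA"), 3))  # 0.944: a single transposition costs little
```

Note that the same two records always produce the same score, which is exactly the lower-case "d" determinism described above. A real matching process would probably use a different algorithm per field rather than Jaro everywhere.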

So why is one approach called Deterministic and the other Probabilistic if they are so similar? The answer is that Probabilistic systems take into account not just the text itself but also some features of the overall search space. These additional inputs represent probability measurements derived from an analysis of, for example, all of the names in your customer database.

Consider a simple example such as house-holding: "Murphy" is a common family name in Ireland but "Morrata" isn't. Given the context, it's pretty obvious that two people in Ireland with the family name Morrata are more likely to be related than two people with the more popular name Murphy. Knowing something about the overall search space provides insights that are not so obvious if we restrict ourselves to the inputs only.
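One common way to turn that intuition into numbers is a frequency-based agreement weight in the style of Fellegi-Sunter record linkage: agreeing on a rare value earns more evidence than agreeing on a common one. The sketch below is only an illustration under stated assumptions. The surname counts are invented, the `agreement_weights` helper is hypothetical, and `m` (the chance that two records for the same household really do agree on surname) is simply assumed to be 0.95.

```python
import math
from collections import Counter


def agreement_weights(surnames: list, m: float = 0.95) -> dict:
    """Weight of evidence, per surname, when two records agree on it.

    weight = log2(m / u), where u is approximated by the surname's
    relative frequency in the search space (the chance of agreeing by accident).
    """
    counts = Counter(surnames)
    total = len(surnames)
    return {
        name: math.log2(m / (count / total))
        for name, count in counts.items()
    }


# Invented sample standing in for "all of the names in your customer database".
surnames = ["Murphy"] * 50 + ["Kelly"] * 30 + ["Walsh"] * 18 + ["Morrata"] * 2

weights = agreement_weights(surnames)
for name in ("Murphy", "Morrata"):
    print(name, round(weights[name], 2))
```

With these made-up counts, agreement on "Morrata" carries several times the weight of agreement on "Murphy". The weights depend entirely on the sample they were computed from, so they would need to be recomputed whenever the underlying data changes significantly.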

In defence of Deterministic systems, the analysis of a large search space is not trivial and is only valid if the search space is relatively static or if regular "retraining" is performed. Finally, the results obtained from a Probabilistic system may well be the same as those from a Deterministic system.

You can imagine the years of fun that some folks must have had debating the benefits of one approach over the other. Yet in reality, neither a purely Deterministic nor a purely Probabilistic system is sufficient for solving complex matching problems: the degree to which the user can influence the matching process by adding business rules, product support for Unicode and the ability to match on a variety of data types are all more important. Still it is good to argue sometimes.
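As a rough sketch of what "adding business rules" might look like, the example below lets simple rules override a similarity score. It is an assumption-laden illustration, not any product's design: the rule functions, the record fields (`tax_id`, `birth_year`) and the 0.85 threshold are all hypothetical, and the similarity measure is Python's standard-library `difflib.SequenceMatcher`, used here purely as a stand-in.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in text similarity between 0.0 and 1.0 (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Each rule returns True (force a match), False (forbid a match)
# or None (no opinion, so fall through to the next rule or the score).
def same_tax_id(a: dict, b: dict):
    if a.get("tax_id") and a.get("tax_id") == b.get("tax_id"):
        return True
    return None


def different_birth_year(a: dict, b: dict):
    year_a, year_b = a.get("birth_year"), b.get("birth_year")
    if year_a and year_b and year_a != year_b:
        return False
    return None


RULES = [same_tax_id, different_birth_year]  # evaluated in order


def is_match(a: dict, b: dict, threshold: float = 0.85) -> bool:
    for rule in RULES:
        verdict = rule(a, b)
        if verdict is not None:
            return verdict
    return similarity(a["name"], b["name"]) >= threshold


father = {"name": "John Murphy", "birth_year": 1950}
son = {"name": "John Murphy", "birth_year": 1980}
print(is_match(father, son))  # False: identical names, but the rule says otherwise
```

The point of the sketch is that the rules, not the scoring algorithm, decide the father-and-son case, and that they sit in a plain list a business user could plausibly review.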

On a final point, it's interesting to note that the perceived strength of Probabilistic systems is that they take into consideration more of the context in which the match is being performed than Deterministic systems do. The ability to appreciate the business context in which data quality processes operate will distinguish future products in this space and is an area of active research at Informatica.
