Even in “good” data there is a lot of garbage. Take a person’s name, for example. John could also be spelled Jon or Von (I have a high school sports trophy to prove it). Schmidt could become Schmitt or Smith. In Hungarian my name is Janos Kovacs. Human beings entering data make errors in spelling, phonetics, and keypunching. We also have to deal with variations in compound and account names, abbreviations, nicknames, prefixes and suffixes, foreign names, and missing elements. As long as humans are involved in entering data, there will be a significant amount of garbage in any database. So how do we turn this gibberish into gems of information?
We need an intelligent system that can match data from different databases or tables despite the occasional (or frequent) errors caused by typos, language or cultural differences, and transcription mistakes. The secret sauce behind Informatica Identity Resolution (IIR) is a fuzzy-matching algorithm that has a number of powerful features.
IIR has a comprehensive multinational matching capability and is used by customs and immigration agencies, border security, investigation, and intelligence departments across the world. Commercial organizations use IIR for marketing, fraud prevention, and credit scoring in countries such as India, China, Indonesia, Japan, Russia, and Israel, not to mention the U.S.
Fuzzy matching is a resource-intensive process that normally requires some compromise in precision or completeness when dealing with large data volumes (more than 500 million records). IIR’s approach is to use intelligent fuzzy keys to quickly select match candidates before the more intensive similarity-scoring step. This lets the solution scale horizontally and vertically, achieve extremely high throughput (tens of thousands of searches per second), and handle very high data volumes (up to 1 billion records is common). Several credit bureaus, intelligence agencies, and commercial organizations in populous countries that handle extremely high volumes use the IIR matching engine at the core of their operations.
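To make the two-step idea concrete, here is a minimal Python sketch of "cheap key first, expensive scoring second." It is not IIR's actual algorithm, which the post does not describe in detail. As stand-ins I assume a simplified Soundex-style code for the fuzzy key and the standard library's `difflib.SequenceMatcher` for similarity scoring; the function names and the 0.6 threshold are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Letter-to-digit table from classic Soundex (a stand-in for a real fuzzy key).
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit


def fuzzy_key(name):
    """Simplified Soundex-style key: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = code
    return (key + "000")[:4]


def build_index(names):
    """Step 0: group stored names by fuzzy key so lookups avoid a full scan."""
    index = defaultdict(list)
    for name in names:
        index[fuzzy_key(name)].append(name)
    return index


def search(index, query, threshold=0.6):
    """Step 1: fetch candidates sharing the key. Step 2: score only those."""
    candidates = index.get(fuzzy_key(query), [])
    scored = []
    for cand in candidates:
        score = SequenceMatcher(None, query.lower(), cand.lower()).ratio()
        if score >= threshold:  # hypothetical cutoff, for illustration only
            scored.append((round(score, 2), cand))
    return sorted(scored, reverse=True)


index = build_index(["Schmidt", "Schmitt", "Smith", "Jones", "John", "Jon"])
for score, name in search(index, "Schmid"):
    print(name, score)
```

The point of the design is that the slow comparison runs only against the handful of records sharing a key, not the whole table. A single phonetic key like this one is also easy to defeat (John and Von land in different buckets), which is one reason a production engine would need far richer keys than this sketch.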
Thirty years of experience with all sorts of high-impact, high-risk matching requirements have shaped IIR into one of the most precise solutions on the market. Instead of relying on a single technique, IIR uses multiple methods (probabilistic, deterministic, linguistic, empirical, heuristic, phonetic, and others) to ensure that all relevant matches are found while minimizing the number of false positives. IIR offers a great degree of flexibility, allowing users to simply choose among different weights, purposes, and tolerances (or customize them). An important feature of the IIR matching approach is that it is completely independent of other data quality processes.
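As a rough illustration of combining several methods under adjustable weights and tolerances, the Python sketch below blends three signals into one score: a deterministic exact match, a linguistic nickname lookup, and a heuristic string similarity. Everything in it is an assumption made for illustration: the nickname table, the weights, and the tolerance names and thresholds are all hypothetical and are not taken from IIR.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real system would need a far larger one.
NICKNAMES = {"bill": "william", "will": "william", "bob": "robert", "jon": "john"}

# Hypothetical weights and tolerance thresholds, chosen only for illustration.
WEIGHTS = {"exact": 0.15, "nickname": 0.25, "similarity": 0.6}
TOLERANCES = {"conservative": 0.9, "typical": 0.7, "loose": 0.5}


def normalize(name):
    """Lowercase and keep letters only, so case and punctuation are ignored."""
    return "".join(c for c in name.lower() if c.isalpha())


def canonical(name):
    """Map a known nickname to its formal form; otherwise return it unchanged."""
    n = normalize(name)
    return NICKNAMES.get(n, n)


def match_score(a, b):
    """Weighted blend of three independent signals, in the range 0 to 1."""
    na, nb = normalize(a), normalize(b)
    signals = {
        "exact": 1.0 if na == nb else 0.0,                          # deterministic
        "nickname": 1.0 if canonical(a) == canonical(b) else 0.0,   # linguistic
        "similarity": SequenceMatcher(None, na, nb).ratio(),        # heuristic
    }
    return sum(WEIGHTS[k] * v for k, v in signals.items())


def is_match(a, b, tolerance="typical"):
    """Accept the pair if its score clears the chosen tolerance threshold."""
    return match_score(a, b) >= TOLERANCES[tolerance]


for pair in [("Jon", "John"), ("Bill", "William"), ("John", "Mary")]:
    print(pair, round(match_score(*pair), 2),
          {t: is_match(*pair, tolerance=t) for t in TOLERANCES})
```

Loosening the tolerance trades false negatives for false positives: a pair such as Bill and William is rejected at the "typical" setting but accepted at "loose", which is the kind of knob the paragraph above describes.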
In short, it is indeed possible for a rule-based system, using time-tested algorithms, to turn data errors (garbage) into valuable business information. Stay tuned for my next blog post on real-life use cases for IIR technology.