Data matching is a core element in many deployments of data quality tools and master data management solutions.
Most data matching implementations revolve around matching names and addresses. The classic business goal of a data matching activity is to remove duplicates and thus avoid sending the same material two or more times to the same real-world individual, whether as a private person or as a business contact.
Hierarchical Data Matching
As more organizations look into master data management, they realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending direct mail, doesn’t necessarily constitute a duplicate in another business function, and vice versa. Duplicates come in hierarchies.
One example is a household. You probably don’t want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity. Financial risk management looks at the legal entity as a whole, but different sales or purchase processes may require very different views.
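The household example can be sketched as two match keys over the same records, one per level of the hierarchy. This is a minimal illustration, not a production rule set: the field names, the sample records and the exact-equality keys are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical, already standardized party records.
records = [
    {"id": 1, "first": "margaret", "last": "smith", "address": "1 main street"},
    {"id": 2, "first": "john", "last": "smith", "address": "1 main street"},
    {"id": 3, "first": "margaret", "last": "smith", "address": "1 main street"},
]

def group_by(records, key_fn):
    """Group record ids by the match key that key_fn computes."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record["id"])
    return dict(groups)

# Household level: same surname at the same address (direct mail view).
households = group_by(records, lambda r: (r["last"], r["address"]))

# Individual level: same full name at the same address (1-to-1 dialogue view).
individuals = group_by(records, lambda r: (r["first"], r["last"], r["address"]))

print(households)
print(individuals)
```

Records 1 and 3 are duplicates at the individual level, while all three records belong to one household. Which grouping counts as "the" duplicate depends on the business function asking.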
Exploiting Rich External Reference Data
Is “Margaret Smith” at “1 Main Street” and “Peggy Smith” at “1 Main Str.” the same real-world individual? We traditionally use nickname synonyms to establish a match. We may use address validation to check if the standardized addresses are the same. We may also use rich location data to estimate the probability of a match, which is higher if “1 Main Street” is a single-family house and not a nursing home or university campus.
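A rough sketch of these three steps might look as follows. The synonym table, the abbreviation table and above all the probability figures per location type are illustrative placeholders, not measured values; a real implementation would rely on an address validation service and reference data rather than hard-coded dictionaries.

```python
# Hypothetical lookup tables, kept tiny for illustration.
NICKNAMES = {"peggy": "margaret", "maggie": "margaret", "bob": "robert"}
ADDRESS_ABBREVIATIONS = {"str": "street", "st": "street", "rd": "road"}

# Assumed chance that an identical name at an identical address is the same
# person, by type of location. The numbers are made up for the example.
SAME_PERSON_PRIOR = {"single_family": 0.95, "nursing_home": 0.60, "campus": 0.50}

def canonical_name(name):
    """Lowercase the name and replace known nicknames with the formal name."""
    return " ".join(NICKNAMES.get(t, t) for t in name.lower().split())

def canonical_address(address):
    """Lowercase, drop punctuation and expand known abbreviations."""
    cleaned = address.lower().replace(".", "").replace(",", "")
    return " ".join(ADDRESS_ABBREVIATIONS.get(t, t) for t in cleaned.split())

def match_probability(a, b, location_type):
    """Return 0.0 unless name and address agree, else the location prior."""
    if canonical_name(a["name"]) != canonical_name(b["name"]):
        return 0.0
    if canonical_address(a["address"]) != canonical_address(b["address"]):
        return 0.0
    return SAME_PERSON_PRIOR.get(location_type, 0.5)

margaret = {"name": "Margaret Smith", "address": "1 Main Street"}
peggy = {"name": "Peggy Smith", "address": "1 Main Str."}
print(match_probability(margaret, peggy, "single_family"))
print(match_probability(margaret, peggy, "nursing_home"))
```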
Business directories have been used for a long time for matching with legal entities, so instead of matching names and addresses you are matching legal entity identifiers. Catching a legal entity identifier in the registration process will help a lot in future matching.
For individual persons, using a national identity number in the same way is in many circumstances a no-go for privacy reasons. What then works best is capturing a correctly spelled name and a standardized address at data entry, by searching available external sources with data matching techniques.
Matching Bigger Data
Datasets grow bigger every day and so do the datasets we use in matching. With traditional means, data matching typically hits a threshold where the speed drops below a useful limit because too many fuzzy comparisons are made across too many different structures of data.
When we have datasets with international party master data, not only does the volume grow, but the variation in the structures of names, addresses and other data forces you to make many different sorts of queries. Sometimes segmentation of datasets and in-memory matching doesn’t solve this challenge.
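Segmentation, often called blocking, can be sketched like this: records are only compared fuzzily within a segment that shares a key. The key chosen here (postal code plus first letter of the name), the sample records and the 0.85 threshold are assumptions for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records.
records = [
    {"id": 1, "name": "Margaret Smith", "postal_code": "1000"},
    {"id": 2, "name": "Margarat Smith", "postal_code": "1000"},
    {"id": 3, "name": "John Doe", "postal_code": "2000"},
    {"id": 4, "name": "Jon Doe", "postal_code": "2000"},
]

def blocking_key(record):
    """Assumed segment key: postal code plus first letter of the name."""
    return (record["postal_code"], record["name"][:1].lower())

def candidate_pairs(records):
    """Yield only the record pairs that fall into the same segment."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)

def similar(a, b, threshold=0.85):
    """Fuzzy name comparison using the standard library's SequenceMatcher."""
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold

all_pairs = len(list(combinations(records, 2)))
pairs = list(candidate_pairs(records))
matches = [(a["id"], b["id"]) for a, b in pairs if similar(a, b)]
print(all_pairs, len(pairs), matches)
```

The saving comes from never comparing records across segments. The weakness is also visible: a single key assumes that every record has a postal code and a name in the same structure, which is exactly what international data tends not to offer.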
So here we will see a marriage of the new technologies for big data and techniques for data matching.
Multi-Channel Data Matching
Party master data come from many channels today, and the content, structure and quality of the data vary across the channels.
Social networks have profiles belonging to the same real-world individuals about whom we are mastering data in the traditional systems of record. As more organizations embrace social business and the practice matures, we will need to match master data from the systems of social engagement with master data from the good old channels, which won’t go away.
The techniques for doing that go beyond similarity between names and addresses. We need to consider a range of attributes in various structures. We may even need to look into facial recognition.
Multi-Domain Data Matching
Data matching isn’t limited to names and addresses and other party master data attributes.
Matching product master data such as descriptions and a wealth of other attributes is an important part of many product master data management implementations. Product master data are shared between manufacturers, distributors and retailers in huge volumes today. The risks and costs of having duplicates may be huge, so we will certainly see data matching evolving in the product domain as well.
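As a small illustration of matching on descriptions rather than names and addresses, the sketch below compares product descriptions by the overlap of their word tokens (Jaccard similarity). The technique and the sample descriptions are chosen for the example only; real product matching would also weigh structured attributes such as brand, dimensions and identifiers.

```python
import re

def tokens(description):
    """Split a description into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", description.lower()))

def jaccard(a, b):
    """Share of tokens that two descriptions have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

# Hypothetical descriptions of the same product from two trading partners.
print(jaccard("Cordless Drill 18V, blue", "Drill cordless 18v blue"))
# An unrelated product shares no tokens.
print(jaccard("Cordless Drill 18V", "Garden Hose 20m"))
```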