In my recent blog posts, we have looked at ways that master data management can become an integral component of the enterprise architecture, and I would be remiss if I did not look at how MDM dovetails with an emerging data management imperative: big data and big data analytics. Fortunately, identity resolution and MDM have the potential both to contribute to performance improvement and to enable efficient entity extraction and recognition.
Identity resolution that enables searching and matching of unique entity information uses a master entity data index containing just the right set of identifying data attributes to enable matching of a pair of records representing the same real-world entity (i.e., correctly resolving an identity) while differentiating between pairs of records that do not represent the same entity. This master index can be used in a variety of ways, the most prevalent being support for consumer applications that use the identity resolution functionality. However, an alternate scenario looks at ways that the master data environment supports the growing volumes of data from different sources that are the current focus of big data analytics. Here are three specific examples:
- Optimization of data access and reduction in latency: When master indexes contain an inverted link that maps a recognized identity to the original sources, they can be used to plan and optimize federated data queries. Once a specific entity’s identity has been resolved, the inverted map information managed within the master data environment can be used to formulate targeted queries that access data about that entity from the original sources.
- Entity extraction and identification: The master indexes and identity resolution methods can be embedded within stream processing applications to extract entity information and then determine whether that data matches a known entity.
- Enhanced cross-domain relationships: As the number of entities grows, there will also be a need for a scalable framework for representing and managing the profiles and relationships. Because n entities admit up to n(n-1)/2 pairwise relationships, the number of relationships may grow in proportion to the square of the number of entities. This data set therefore has the potential to expand rapidly and broadly, requiring a scalable environment for management and storage.
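To make the first of these examples more concrete, here is a minimal sketch of a master index that carries an inverted map from a resolved identity back to its source records. Everything in it is a hypothetical illustration: the `MasterIndex` class, the `crm` and `billing` source names, the `customers` table, and the crude name normalization are assumptions for the sake of the example, not the design of any particular MDM product.

```python
from collections import defaultdict


class MasterIndex:
    """Toy master index: identifying attributes -> master ID -> source records."""

    def __init__(self):
        # (normalized name, birth date) -> master entity ID
        self._identity = {}
        # master entity ID -> list of (source system, local key): the inverted map
        self._sources = defaultdict(list)

    @staticmethod
    def _key(name, birth_date):
        # Crude normalization (lowercase, collapse whitespace). Real identity
        # resolution tolerates far more variation than this.
        return (" ".join(name.lower().split()), birth_date)

    def register(self, master_id, name, birth_date, source, local_key):
        self._identity[self._key(name, birth_date)] = master_id
        self._sources[master_id].append((source, local_key))

    def resolve(self, name, birth_date):
        # Returns the master ID, or None if the identity is unknown.
        return self._identity.get(self._key(name, birth_date))

    def federated_queries(self, master_id):
        # One targeted query per source known to hold this entity.
        # Illustrative strings only; real code should use parameterized queries.
        return [
            f"SELECT * FROM {source}.customers WHERE id = '{local_key}'"
            for source, local_key in self._sources.get(master_id, [])
        ]


index = MasterIndex()
index.register("M-001", "Ann Lee", "1980-02-14", "crm", "C-17")
index.register("M-001", "ann  lee", "1980-02-14", "billing", "B-904")

master_id = index.resolve("ANN LEE", "1980-02-14")
queries = index.federated_queries(master_id)
```

The point of the sketch is that, once the identity is resolved, only the sources known to hold that entity are queried, and each is queried by its own local key rather than by a broad search.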
From one perspective, MDM can be used to improve system performance. From another, then, we should consider ways that new techniques can improve MDM performance. Some ideas include implementing the master repository using an in-memory processing system, using columnar databases to manage the master index and improve access speed while decreasing footprint, or even implementing the master repository and the corresponding identity resolution methods on top of a distributed and parallel system such as Hadoop.
This last item is quite intriguing, since identity matching is nicely suited to the kind of parallel and distributed computing capabilities provided by Hadoop in general, and MapReduce in particular. In essence, most matching algorithms first resolve the provided identifying data to subsets of records that are possible matches, and then scan through those subsets to find more precise matches. These subsets can be distributed across a network of processing nodes. Each search is mapped to a set of subsets and then farmed out to the nodes containing those subsets, where the precise scans can be performed in parallel. This not only speeds each individual search, it also allows many searches to be performed at the same time, leading to increased throughput with shorter response times.
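The two-stage approach described above can be sketched on a single machine. This is only an illustration under stated assumptions: the blocking rule (first letter of the last name token), the use of `difflib.SequenceMatcher` as the precise comparison, the 0.8 threshold, and the thread pool standing in for a cluster of nodes are all choices made for the example. It is not Hadoop or MapReduce code, and it maps each search to a single subset rather than to several.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def block_key(name):
    # Hypothetical blocking rule: first letter of the last name token.
    return name.strip().lower().split()[-1][0]


def build_blocks(records):
    # Stage 0: partition the master records into candidate subsets ("blocks").
    blocks = defaultdict(list)
    for record_id, name in records:
        blocks[block_key(name)].append((record_id, name))
    return blocks


def scan_block(block, query, threshold):
    # Stage 2: the precise, more expensive comparison, confined to one block.
    hits = []
    for record_id, name in block:
        score = SequenceMatcher(None, query.lower(), name.lower()).ratio()
        if score >= threshold:
            hits.append((record_id, round(score, 2)))
    return hits


def search(blocks, query, threshold=0.8):
    # Stage 1: map the search to its candidate block, then scan only that block.
    return scan_block(blocks.get(block_key(query), []), query, threshold)


def search_many(blocks, queries, workers=4):
    # Many searches in flight at once; threads stand in for processing nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: search(blocks, q), queries))


records = [
    (1, "John Smith"),
    (2, "Jon Smith"),
    (3, "Jane Smyth"),
    (4, "Mary Jones"),
    (5, "John Snow"),
]
blocks = build_blocks(records)
results = search_many(blocks, ["John Smith", "Mary Jones"])
```

In a distributed setting, each block would live on a node and the scan would run where the data is. The threads here only show the shape of the computation and should not be read as a performance claim.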
And we can go beyond system optimization. Integrating data quality assurance within the MDM environment can improve data trustworthiness, while providing the ability to dynamically mask sensitive data prior to delivery allows data to be used without violating data protection policies. Instituting methods for increasing the volume and speed of data utilization while ensuring a level of trust in a scalable way paves the way for broader use of master data.
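As a rough illustration of masking at delivery time, the sketch below masks selected fields unless the consumer is entitled to see them. The field names, the `SENSITIVE_FIELDS` policy, and the "show the last four characters" rule are hypothetical. A real deployment would take these from its own data protection policies.

```python
SENSITIVE_FIELDS = {"ssn", "birth_date"}  # hypothetical policy


def mask_value(value, visible=4):
    # Keep only the last `visible` characters; mask short values entirely.
    text = str(value)
    if len(text) <= visible:
        return "*" * len(text)
    return "*" * (len(text) - visible) + text[-visible:]


def deliver(record, consumer_may_see_sensitive):
    # Masking is applied to the delivered copy; the master record is unchanged.
    if consumer_may_see_sensitive:
        return dict(record)
    return {
        field: mask_value(value) if field in SENSITIVE_FIELDS else value
        for field, value in record.items()
    }


master_record = {"name": "Ann Lee", "ssn": "123-45-6789", "birth_date": "1980-02-14"}
masked = deliver(master_record, consumer_may_see_sensitive=False)
```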
To hear more details on the topic of this blog series, you can listen to the replay of this Webinar: “Experts Share How to Launch an MDM Program Quickly & Successfully.”