Collaborative Data Discovery with Enterprise Data Catalog
Harry Potter fans will remember the scene from “Harry Potter and the Deathly Hallows” where, on entering the Lestrange vault, our heroes find the treasures multiplying because of the Gemino curse (a doubling charm). The objects in the vault proliferate rapidly, each duplicate spawning further duplicates ad infinitum, nearly crushing them.
If you work in data management, that scene probably hits close to home. As the number of data consumers has grown, so has the duplication of data assets across the organization. The multiple copies and views that were a problem in the database world have become a bigger problem in the data lake world, where users copy and create new datasets without any controls.
In such a scenario, automatic cataloging of your data assets alone is not enough, because it results in an inventory full of duplicate and similarly named objects. That inventory becomes a Lestrange vault for your data consumers (data analysts, data scientists and line-of-business users), who cannot identify the authoritative sources of data when they search for them.
The Magic of Collaboration
One way to stop this duplication is to impose rigorous controls that prevent data consumers from creating duplicate data assets, stopping the curse at its root. In the real world, however, such draconian, creativity-crushing regulations rarely work. Instead, we need a combination of machine intelligence and human collaboration to manage this proliferation better.
This is where the new capabilities of Informatica’s Enterprise Data Catalog (EDC) come in. They use human collaboration to bring to the forefront otherwise deeply siloed knowledge about the trustworthiness and usefulness of datasets. This can save data consumers weeks, sometimes months, of effort in finding and using the right dataset. Here is how the magic happens:
With EDC v10.2.2, subject matter experts (SMEs), data stewards and data owners can certify datasets and add context such as data usage and constraints. EDC’s machine-learning-based semantic search then guides users to these certified datasets among all the similarly named datasets in the organization. The process works much like Google AdWords: while certifying a dataset, the SME attaches keywords to it, and EDC ensures that the certified dataset appears at the top of the search results when those keywords are used in a search query.
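To make the AdWords analogy concrete, here is a minimal Python sketch of keyword-boosted search. It is an illustration only, not EDC's actual ranking logic or API: the Dataset class, the score and search functions, and the boost value are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    certified: bool = False
    # Keywords an SME attaches while certifying (hypothetical field).
    keywords: set[str] = field(default_factory=set)


def score(dataset: Dataset, query: str, boost: float = 10.0) -> float:
    """Count query terms found in the name; add a boost for certified keyword hits."""
    terms = set(query.lower().split())
    name_tokens = set(dataset.name.lower().replace("_", " ").split())
    base = float(len(terms & name_tokens))
    if dataset.certified and terms & {k.lower() for k in dataset.keywords}:
        base += boost  # assumed boost size, purely illustrative
    return base


def search(catalog: list[Dataset], query: str) -> list[Dataset]:
    """Return datasets that match the query, highest score first."""
    scored = [(score(d, query), d) for d in catalog]
    hits = [(s, d) for s, d in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in hits]


catalog = [
    Dataset("customer_copy_2"),
    Dataset("customer_tmp"),
    Dataset("crm_customer_master", certified=True, keywords={"customer"}),
]
results = search(catalog, "customer")
print([d.name for d in results])
```

In this toy catalog, all three datasets match the word "customer" by name, but only the certified one carries the SME keyword, so it is the one that lands first.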
Reviews and Ratings
Data consumers can now also review and rate datasets, just as shoppers review products on Amazon. EDC pushes highly rated datasets to the top of the search results. This means you don’t need an army of data curators and stewards to certify data assets before consumers can discover them; the review and rating system ensures that the datasets users find most useful show up at the top of the search results. Reviews are also a good place for users to signal what others can expect from a dataset in terms of data quality, applicability and reliability.
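One way to picture rating-driven ranking is the short Python sketch below. It is an assumption-laden illustration rather than a description of how EDC computes its results: the function names, the sample datasets and the smoothing parameters (prior_mean, prior_weight) are hypothetical. The smoothing is a design choice of this sketch, added so that a dataset with a single five-star review does not outrank one with many consistently good reviews.

```python
def smoothed_rating(ratings: list[int], prior_mean: float = 3.0,
                    prior_weight: int = 5) -> float:
    """Average the ratings, pulled toward prior_mean when there are few reviews."""
    return (prior_mean * prior_weight + sum(ratings)) / (prior_weight + len(ratings))


def rank_by_rating(reviews: dict[str, list[int]]) -> list[str]:
    """Return dataset names ordered from highest to lowest smoothed rating."""
    return sorted(reviews, key=lambda name: smoothed_rating(reviews[name]),
                  reverse=True)


# Hypothetical star ratings (1 to 5) left by data consumers.
reviews = {
    "sales_orders_scratch": [5],
    "sales_orders_curated": [5, 4, 5, 4, 5, 5, 4, 5],
    "sales_orders_old": [2, 3, 2],
}
ranking = rank_by_rating(reviews)
print(ranking)
```

With these sample ratings, the well-reviewed curated dataset ranks above the scratch copy that has only one review, and the poorly rated legacy dataset comes last.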
Questions and Answers
Additionally, users can use a new question-and-answer platform where subject matter experts answer data consumers’ most common questions. Today, these discussions about data assets take place over email and phone calls. Most of that information is lost, and the subject matter experts have to answer the same questions again and again. By centralizing these discussions, EDC makes the information available to all data consumers.
With the above new collaboration capabilities, Informatica’s Enterprise Data Catalog ensures that your data consumers can find the right data even as it multiplies across your data lake and other data sources. Mischief managed!