Next Gen Analytics Strategies: Data Cataloging is the First Step
In the first blog post of this series on Next Gen Analytics Strategies, I discussed the importance of investing in a data management platform with an enterprise unified metadata foundation powered by AI, and described how the Informatica Intelligent Data Platform uses the CLAIRE™ engine to provide such a foundation. In this blog post I’ll explain the importance of a data catalog.
As I mentioned in the blog post that summarized Next Gen Analytics Strategies, before tackling any project it’s always wise to first take inventory of what’s available. The same is true for an analytics project, where the first step is to find relevant data that can be trusted. An intelligent data catalog powered by AI/ML can discover and classify data, provide powerful search capabilities, and recommend data that is fit for use in an analytics project. A good analogy for the data catalog is the Hubble space telescope. Before we put a telescope into space, our view of the stars, planets, and galaxies was rather limited. The Hubble telescope gave us a much more complete view of the universe. A data catalog can do the same for enterprise data, and the information it provides is extremely useful for self-service analytics, data governance, and IT impact analysis.
Just like a powerful space telescope that scans the universe, Informatica’s Enterprise Data Catalog (EDC) scans and collects metadata from enterprise systems, including many types of databases, applications, and tools. It then automatically builds a metadata and relationship graph, exposed via REST APIs, so that end users and developers can query the metadata from other applications or integrations. Informatica EDC provides detailed lineage down to the attribute and column level, and even supports scripting languages like BTEQ and PL/SQL, so that analysts can explore the provenance of data to see if it can be trusted. Using CLAIRE, EDC discovers and classifies data, providing users with an intuitive search experience that even recognizes synonyms. You can search on business keywords and filter on out-of-the-box or custom facets to find just the data you’re looking for.
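To make the idea of column-level lineage concrete, here is a minimal sketch of how provenance can be traced through a relationship graph. It is not EDC's REST API or data model: the `LINEAGE` dictionary, the column names, and the `upstream_columns` helper are all hypothetical and exist only to illustrate the concept.

```python
from collections import deque

# Hypothetical column-level lineage graph (NOT EDC's data model).
# Each key is a column; each value lists the columns it is derived from.
LINEAGE = {
    "report.revenue_by_region.total": [
        "warehouse.sales.amount",
        "warehouse.sales.region_id",
    ],
    "warehouse.sales.amount": ["staging.orders.order_amount"],
    "warehouse.sales.region_id": ["staging.orders.region_code"],
    "staging.orders.order_amount": [],
    "staging.orders.region_code": [],
}


def upstream_columns(column, lineage):
    """Return every column that `column` depends on, nearest sources first."""
    seen = []
    queue = deque(lineage.get(column, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


if __name__ == "__main__":
    for source in upstream_columns("report.revenue_by_region.total", LINEAGE):
        print(source)
```

An analyst asking "where does this report total come from?" is effectively running this kind of breadth-first walk back to the staging columns.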
EDC uses CLAIRE to automatically discover data domains such as name, phone, and email, as well as entities like purchase orders that can span data sets. Within EDC, CLAIRE automatically tags data by learning from previous user actions and is intelligent enough to recommend similar data sets that might be of interest to you. Using both supervised and unsupervised machine learning, EDC can even identify and classify entities within unstructured data such as the text that often appears in Microsoft Office files and PDF documents. While EDC does not yet do facial recognition, it can do something similar for columns in data sets: using CLAIRE’s AI/ML clustering algorithms, EDC can identify similar data. Another important capability of a data catalog is that it should facilitate collaboration. EDC lets you manage business context by editing Wikipages for data assets and provides extensive team collaboration that includes user endorsements, commentary, and data steward certifications. And by the way, if you’re a Tableau user, you can quickly find data by accessing EDC directly within the visualization tool.
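As a rough illustration of what "identifying similar data" across columns can mean, the sketch below compares columns by the overlap of their values using Jaccard similarity. This is a deliberately simple stand-in, not CLAIRE's clustering algorithms, whose details are not described here. The column names, sample values, and the 0.5 threshold are assumptions made up for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two value collections: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # treat two empty columns as not similar
    return len(set_a & set_b) / len(set_a | set_b)


def similar_columns(columns, threshold=0.5):
    """Return (left, right, score) for each column pair at or above threshold."""
    names = sorted(columns)
    pairs = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            score = jaccard(columns[left], columns[right])
            if score >= threshold:
                pairs.append((left, right, score))
    return pairs


# Hypothetical sample values from three columns in different systems.
COLUMNS = {
    "crm.contact_email": ["a@x.com", "b@x.com", "c@x.com"],
    "billing.email_addr": ["a@x.com", "b@x.com", "d@x.com"],
    "crm.phone": ["555-0100", "555-0101"],
}

if __name__ == "__main__":
    for left, right, score in similar_columns(COLUMNS):
        print(f"{left} ~ {right}: {score:.2f}")
```

Here the two email columns share two of four distinct values, so they are flagged as similar, while the phone column matches neither.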
So, before you start any analytics project, remember to search the data catalog to find data that you need and can trust. Oh, you don’t have a data catalog? Then I suggest you get started by trying out a free Informatica Enterprise Data Catalog trial here. In my next blog post I’ll discuss Next Gen Analytics Strategies: Optimize the data pipeline for Big Data.