Self Service BI vs Data Governance: How to be “Data Agile” with a Catalog
Business users have always had an insatiable hunger for data. As early as the 1990s they were discovering and mashing up data from different sources, aggregating it in Excel and presenting the analysis in PowerPoint, albeit with difficulty. Then technology caught up. Self-service Business Intelligence tools like Tableau, QlikView and, more recently, Amazon QuickSight made aggregating, visualizing and sharing data easier. The claim was zero to analytics expert in minutes! Overnight, business users became data heroes. They also started seeing IT, which had for years prepared data and governed its use in the enterprise, as either unnecessary or, worse, an impediment on their journey to insight.
However, as many organizations are realizing now, this new world came with newer problems:
Correlation, Causation and Coverage: With Self Service BI tools, organizations expected business users to find the causes of major past issues, predict market situations better and make better decisions today. However, the path to insight is paved with obstacles. Most analyses do not factor in all the relevant data, even when it is available in the enterprise. Even when business users knew that a particular attribute would improve an analysis, they often did not use it because they could not tell where to find it, or even whether it had been recorded and made available to them. This resulted in shortcuts and in incomplete, poorly correlated analysis, which ultimately led to incorrect decisions.
Bad Data = Bad Analysis: Business users went through all the work of discovering data sources, creating elaborate reports and painstakingly generating new insights, only to realize that the data they used was obsolete, came from the wrong source, had serious data quality issues, or was used in an unintended way without an understanding of the business context. (Should Customer Acquisition Cost be averaged daily, monthly or yearly? What customer lifetime should be assumed when calculating Customer Lifetime Value? Should I count only closed deals, or commits as well, when assessing sales force efficacy for a quarter?) Add behavioral issues like “seeing what you want to see” or “pointing data fingers at anyone other than me”, and organizations are slowly realizing that user-created reports and dashboards are too unreliable to support any meaningful decision.
(Re)running the preparation wheel: Self Service Data Preparation and BI also resulted in a large number of business users performing the same preparation, standardization and reconciliation tasks again and again on the same (raw) datasets. So while each individual's wait for access to raw data shrank, the organization as a whole paid many times over by repeating the same tasks.
How can a data catalog help?
It is clear that along with self-service BI and data preparation tools, a data “superhero” needs the following additional capabilities to be truly effective:
Search: The ability to quickly find data relevant to an analysis is essential to deriving value from data. Search should also give the user the business and usage context for the data: Who has used the same datasets in the past? For what kinds of analysis? How was the data transformed? What other datasets are related?
These users really need a “Google for enterprise data assets”.
Informatica’s Enterprise Data Catalog can automatically scan and index metadata from most data and application sources in the enterprise. Once the metadata is indexed, business metadata from the enterprise business glossary can be associated with technical assets, which lets business users search for these datasets using business terms instead of technical jargon. The system understands synonyms and known variations of the same term, so it can deliver intelligent search results. Enterprise Data Catalog also scans data movement and data preparation sources, cataloging and indexing recipes and mappings so that data transformation work can be reused.
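To make the idea of searching by business term concrete, here is a minimal Python sketch. It is not Enterprise Data Catalog's API or implementation: the `Asset` class, the `SYNONYMS` groups, the asset names and the `search` function are all hypothetical, chosen only to show how glossary terms attached to technical assets, plus a synonym list, let a user find a cryptically named table without knowing its name.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Asset:
    """A technical asset (hypothetical) tagged with business glossary terms."""
    name: str
    glossary_terms: Set[str] = field(default_factory=set)


# Hypothetical synonym groups; a real glossary would maintain these centrally.
SYNONYMS = [
    {"customer", "client", "account holder"},
    {"revenue", "sales", "turnover"},
]

# Hypothetical catalog contents, for illustration only.
CATALOG = [
    Asset("CRM.T_CUST_MSTR", {"Customer"}),
    Asset("DW.F_SLS_TXN", {"Revenue"}),
    Asset("HR.EMP_DIM", {"Employee"}),
]


def expand(term: str) -> Set[str]:
    """Return the term together with its known synonyms, lower-cased."""
    term = term.lower().strip()
    for group in SYNONYMS:
        if term in group:
            return set(group)
    return {term}


def search(assets: List[Asset], query: str) -> List[str]:
    """Return names of assets whose glossary terms match the query or a synonym."""
    wanted = expand(query)
    return [
        asset.name
        for asset in assets
        if wanted & {t.lower() for t in asset.glossary_terms}
    ]


if __name__ == "__main__":
    print(search(CATALOG, "client"))
    print(search(CATALOG, "turnover"))
```

In this sketch a search for "client" finds `CRM.T_CUST_MSTR` because "client" and "customer" sit in the same synonym group, even though neither word appears in the table name.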
Establish Trust: The ability to trust a discovered data asset is just as important. What is the data quality of the dataset? Was it used in external reporting (in which case it can reasonably be trusted)? Is this the trusted source of customer segmentation data when I am doing segmentation analysis?
Enterprise Data Catalog extracts both data quality statistics and data lineage relationships to help users judge whether a dataset is trustworthy and relevant to their analysis. It also allows data consumers, data stewards and data owners to add metadata such as comments and tags that help distinguish good data assets from time sinks.
Classifications: The ability to classify datasets for better management is essential as well. Classifications can span multiple dimensions, such as data ownership, geographical location or the semantic label of the contained data. These classifications are the first step in governing, managing and extracting value from data. However, enterprise-scale classification problems also need platforms that can scale. If the system depends on humans to classify every data asset manually, classification will take an eternity. If it performs all classifications automatically, all the human time will go into cleaning up false positives.
Enterprise Data Catalog uses machine learning combined with crowd-sourced annotations to classify datasets. The Smart Domain feature allows users to manually annotate columns with domain labels; machine learning techniques then propagate these labels to other “similar” columns. Additionally, users can create custom attributes to classify and facet data assets across multiple dimensions. Finally, self-service BI users can use these classifications when searching for data assets and to understand the context around a data asset before they use it in their analysis.
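The propagation of a human-supplied label to “similar” columns can be illustrated with a small Python sketch. This is not the algorithm Enterprise Data Catalog uses, which the article does not describe; as a stand-in it assumes that two columns are similar when their sample values share the same coarse character patterns, measured by Jaccard similarity. The column names, sample values and the `0.5` threshold are hypothetical.

```python
import re
from typing import Dict, List, Set


def shape(value: str) -> str:
    """Reduce a value to a coarse pattern: letter runs -> 'A', digit runs -> '9'."""
    pattern = re.sub(r"[A-Za-z]+", "A", value)
    return re.sub(r"[0-9]+", "9", pattern)


def profile(values: List[str]) -> Set[str]:
    """The set of distinct value patterns observed in a column sample."""
    return {shape(v) for v in values}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def propagate(
    columns: Dict[str, List[str]],
    labels: Dict[str, str],
    threshold: float = 0.5,  # assumed cut-off, not a documented default
) -> Dict[str, str]:
    """Copy each manual label to unlabeled columns that look similar enough."""
    profiles = {name: profile(vals) for name, vals in columns.items()}
    result = dict(labels)
    for name, prof in profiles.items():
        if name in labels:
            continue
        best_label, best_score = None, 0.0
        for labeled_name, label in labels.items():
            score = jaccard(prof, profiles[labeled_name])
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None and best_score >= threshold:
            result[name] = best_label
    return result


# Hypothetical column samples, for illustration only.
COLUMNS = {
    "crm.contact_email": ["ann@example.com", "bob.lee@example.org"],
    "billing.email_addr": ["carol@example.net", "dan.wu@example.com"],
    "orders.order_id": ["10023", "10024"],
    "support.phone": ["555-0100", "555-0199"],
}

if __name__ == "__main__":
    # One steward labels one column; the label spreads to look-alike columns.
    print(propagate(COLUMNS, {"crm.contact_email": "Email"}))
```

Here a single manual annotation on `crm.contact_email` is enough to suggest the same label for `billing.email_addr`, while the order ID and phone columns are left alone. Treating propagated labels as suggestions that a steward confirms or rejects is one way to keep the false-positive cleanup described above manageable.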
For BI users, Enterprise Data Catalog supports extracting reporting metadata from multiple BI platforms, including Tableau, MicroStrategy, IBM Cognos, SAP BusinessObjects, OBIEE and more.
While there is no denying that self-service is the way ahead, it needs to be balanced with the right amount of governance “buck” for the big analytics “bang”. With automatic scanning of data assets, powerful search indexes and a machine-learning-based classification platform, Enterprise Data Catalog is what Self Service BI users need to go from data heroes to superheroes.