Metadata Management that Scales: Dealing with Big Metadata

A Big Metadata Problem

“We have 20,000+ known databases in our IT environment, with no idea what is in them.” This was how a meeting with a large enterprise customer started back in 2015. The newly minted CDO of the company was taking stock, only now realizing the enormity of the problem. “Our data analysts don’t know which data resides in which database. Frankly, we don’t either.” Interesting!

“We don’t know which of these databases contain sensitive data like personally identifiable user information either. And we don’t want that data to get into the wrong hands.” And finally the kicker: “We anticipate that we have an equal number of unknown databases in there as well.”

Informatica has been in the enterprise metadata business for more than a decade. SuperGlue, the first-generation Informatica metadata management product, was released in 2003 with capabilities like graphical data lineage for ETL mappings, change impact analysis, and a relational metadata repository. Sarbanes-Oxley and Basel II were hot topics then: CEOs and CFOs had suddenly realized the importance of internal regulation of data, CIOs were struggling to produce all the documentation required for compliance, and data owners were drawing data-movement diagrams by hand. A tool like SuperGlue, which provided “data about data” and automatic data lineage, became essential to understanding and managing data.

Fast-forward a decade and many satisfied customers later, and our metadata products are still helping users optimize operations, understand and report enterprise data flows when regulators come calling, and manage enterprise business terminology.

But this meeting clearly hinted at another generational shift in the metadata market. A wide swath of enterprise users now want to, and are expected to, use data in making decisions. Where data was once confined to IT and report developers, it is now a daily staple for every kind of business user, with appetites growing exponentially.

This change is pushing IT to its limits. IT teams are being pulled from all sides of the organization with requests for more and more data. The advent of data lakes gives these users access to large amounts of enterprise data and the ability to analyze it, but no tools to actually discover and understand it first.

We at Informatica are no strangers to scale. Even in the data-warehousing world, single lineage diagrams could sometimes span 200,000 nodes and 400,000 links (necessitating a move to graph databases to render complex lineages with acceptable performance).

This, though, was something else. Managing all this data was not going to be easy.

But do we really need to manage all data in the enterprise?

In other words, is all data created equal? The answer is no.

Data in an organization can be broadly divided into the following three categories:

  1. Key Data Elements: Data used in regulatory reporting, risk management, key business processes, executive reporting, and the like falls into this category. This kind of data requires full documentation, end-to-end data lineage, and data quality checks at all touchpoints, with identified data owners and data stewards. In effect, the full data governance firehose.
  2. Self-Service BI Data: The second tier of the pyramid serves the organization’s self-service BI users, business users with varied skill sets, data scientists, and so on. These are the users firing up Tableau or Excel to figure out why millennials bought less of their product in EMEA last quarter, or whether a user’s operating system should factor into pricing decisions. The data includes raw application data, web logs, customer surveys, etc. It also requires documentation (mostly so internal users can find the right data for their analysis), data lineage (so users can trust the dataset), and preliminary data quality statistics (to indicate what additional data preparation is needed).
  3. Everything else: Copies of schemas, unstructured files, emails, archives, and (unused) legacy data all fall into this bucket. While it may be a tad obsessive to fully document every data asset here, it is important to know whether this data graveyard contains any sensitive data (customers’ credit card numbers, passwords, etc.) which, if compromised, can lead to public embarrassment, huge fines, and loss of customer trust in your business. A minimal sketch of this kind of pattern-based scan follows this list.
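
To make the idea concrete, here is a toy, illustrative sketch of pattern-based scanning over sampled column values, in the spirit of automated sensitive-data detection. The patterns, names, and sample data are hypothetical and far simpler than a production scanner.

    # Toy sketch only: classify a column by matching sampled values against
    # a few hypothetical domain patterns. A real scanner uses far richer rules.
    import re

    DOMAIN_PATTERNS = {
        "credit_card": re.compile(r"^(?:\d[ -]?){13,16}$"),
        "email":       re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
        "us_ssn":      re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    }

    def classify_column(sample_values, min_match_ratio=0.8):
        """Return the domains whose pattern matches most of the sampled values."""
        hits = {}
        non_null = [v for v in sample_values if v]
        for domain, pattern in DOMAIN_PATTERNS.items():
            matches = sum(1 for v in non_null if pattern.match(str(v).strip()))
            if non_null and matches / len(non_null) >= min_match_ratio:
                hits[domain] = matches / len(non_null)
        return hits

    # Example: flag a column found in a forgotten legacy table
    print(classify_column(["4111 1111 1111 1111", "5500-0000-0000-0004", None]))
    # -> {'credit_card': 1.0}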

Architecture of the Next-Generation Metadata Management Solution

It was clear that we needed a different architecture: one that could scale to the new metadata volumes and serve the new metadata use cases defined by this data pyramid.

We eventually built the new Live Data Map platform to address all these needs. The key elements of the architecture are:

  1. Distributed Parallel Computing Architecture: We chose HBase as the new metadata repository, Titan as the graph database that stores all data relationships, Solr for indexing enterprise metadata, and Spark for metadata processing. Compared to a relational repository, the new architecture scales linearly with additional Hadoop nodes, does far more than just technical metadata extraction, and provides cheap horsepower (commodity hardware) for new metadata requirements such as:
    • Parallel Metadata Ingestion: When automatically ingesting metadata from thousands of data sources, it is important that these jobs can run in parallel. Hadoop provides the infrastructure to run multiple metadata ingestion jobs in parallel without affecting the performance of individual jobs.
    • Running Domain Discovery on millions of data assets for automatic classification: Informatica’s Domain Discovery system looks at data patterns to identify what kind of data exists in a column or field. This classifies data automatically instead of requiring humans to perform manual annotation. Informatica Secure@Source, an enterprise sensitive-data detection and protection solution, uses the domain discovery capabilities of Live Data Map to accurately and automatically identify sensitive data across thousands of data sources (with potentially hundreds of millions of columns). Hadoop and Spark workflows scale much better for this kind of workload; a rough sketch of fanning such classification out with Spark appears after this list.
    • Sharded Indexes for unmatched search performance: Enterprise Information Catalog, another application built on the platform, provides a “Google for enterprise metadata” to self-service BI and data governance users. It uses the platform’s sharded Solr indexes to deliver speed-of-thought search performance (see the search sketch after this list).
    • Complex Data Lineage Rendering: Graph databases render complex relationship diagrams such as data lineages with hundreds of thousands of nodes and links much faster than a relational repository can (a small traversal sketch follows this list).
    • Machine Learning: Informatica’s metadata platform provides metadata-based machine learning capabilities like alternate dataset recommendations for Intelligent Data Lake and data similarity clustering for Enterprise Information Catalog. The power for running these processing-intensive jobs comes from the distributed Hadoop infrastructure; a toy clustering sketch appears after this list.
  2. Open Platform: The platform provides REST APIs that allow users and developers to work directly with the metadata service layer. Informatica applications like Intelligent Data Lake, Enterprise Data Catalog, and Secure@Source are all built on these REST APIs. Additionally, the platform can ingest metadata from most sources in the organization (both Informatica and non-Informatica), including databases, data warehouses, BI tools, ETL tools, big data stores, cloud applications and databases, and more. With this breadth of connectivity and an extensible open platform, partners and customers can use the platform in many ways, including metadata reporting, purpose-built metadata intelligence applications, and more.
  3. Flexible Models: Another benefit of using a wide-column store instead of a relational database as the metadata repository is that it allows flexible models with dynamic properties, relationships, and classifications. This makes model extensions simpler and keeps search queries performant when it is time to retrieve objects and facts from the repository. A short sketch of dynamic properties follows this list.
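
For the domain discovery bullet above, here is a minimal PySpark sketch of how column-level classification might be fanned out across a cluster. The source list, the fetch_column_sample() helper, and the classify_column() function (from the earlier sketch) are hypothetical placeholders, not Live Data Map’s actual implementation.

    # Minimal sketch: distribute column classification as a Spark job.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("domain-discovery").getOrCreate()

    # One work item per (source, table, column) triple harvested during ingestion.
    work_items = [
        ("crm_db", "customers", "card_no"),
        ("billing_db", "invoices", "email"),
        # ... potentially millions more
    ]

    def discover(item):
        source, table, column = item
        # fetch_column_sample() is a hypothetical helper that samples values;
        # classify_column() is the toy pattern matcher from the earlier sketch.
        sample = fetch_column_sample(source, table, column, limit=1000)
        return (source, table, column, classify_column(sample))

    results = (spark.sparkContext
                    .parallelize(work_items, numSlices=200)
                    .map(discover)
                    .collect())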
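
For the sharded-index bullet, here is a sketch of a catalog search against a SolrCloud collection using Solr’s standard select handler; the host, collection, and field names are hypothetical. SolrCloud fans the query out across shards and merges the ranked results.

    # Sketch: query a (hypothetical) catalog collection in SolrCloud.
    import requests

    def search_catalog(term, rows=10):
        resp = requests.get(
            "http://solr.example.com:8983/solr/asset_index/select",
            params={"q": f'name:"{term}" OR description:"{term}"',
                    "rows": rows, "wt": "json"},
            timeout=5,
        )
        resp.raise_for_status()
        return resp.json()["response"]["docs"]

    for doc in search_catalog("customer churn"):
        print(doc.get("name"), doc.get("resource_type"))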
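
For the lineage bullet, here is a conceptual sketch of walking upstream lineage as a graph traversal. A graph database such as Titan evaluates traversals like this natively (typically via Gremlin); the tiny edge list below is made up for illustration.

    # Conceptual sketch: breadth-first walk from a dataset back to its sources.
    from collections import deque

    # edges: target dataset -> datasets that feed it
    feeds = {
        "exec_dashboard": ["sales_mart"],
        "sales_mart": ["orders_stage", "customers_stage"],
        "orders_stage": ["crm.orders"],
        "customers_stage": ["crm.customers"],
    }

    def upstream_lineage(node):
        """Return (parent, child) hops from a dataset back to its ultimate sources."""
        seen, queue, lineage = {node}, deque([node]), []
        while queue:
            current = queue.popleft()
            for parent in feeds.get(current, []):
                lineage.append((parent, current))
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return lineage

    print(upstream_lineage("exec_dashboard"))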
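
For the machine learning bullet, here is a toy sketch of dataset similarity clustering over column-name “profiles.” The features and dataset names are invented; a real system would use much richer signals (value distributions, patterns, usage) than column names alone.

    # Toy sketch: cluster datasets by the similarity of their column names.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    datasets = {
        "crm.customers":    "customer_id first_name last_name email phone",
        "billing.accounts": "account_id customer_id email balance",
        "web.click_log":    "session_id url referrer user_agent ts",
        "web.page_views":   "session_id url ts",
    }

    names = list(datasets)
    vectors = TfidfVectorizer().fit_transform(datasets.values())
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

    for name, label in zip(names, labels):
        print(label, name)   # datasets sharing a cluster are similarity candidates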
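
And for the flexible-model point, here is a sketch of how a wide-column store permits dynamic properties: new column qualifiers can be written per asset without a schema change. The table, column family, and host names are hypothetical; the example uses the happybase HBase client.

    # Sketch: two assets with different property sets in the same HBase table.
    import happybase

    conn = happybase.Connection("hbase.example.com")
    assets = conn.table("catalog_assets")

    # New qualifiers can be added per row -- no ALTER TABLE needed.
    assets.put(b"rdbms://crm/orders", {
        b"attr:owner": b"sales-ops",
        b"attr:row_count": b"1840233",
        b"class:domain": b"order_data",
    })
    assets.put(b"file://lake/web_logs/2016-05", {
        b"attr:format": b"apache_combined",
        b"class:contains_pii": b"true",   # e.g. client IP addresses
    })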

The data world continues to evolve. However, many contemporary metadata solutions still use last decade’s technologies to solve today’s metadata problems. Before investing in another metadata tool for your arsenal, ask whether you are getting a knife for what is surely going to be a gunfight.
