Classify to Completion: A Data Cataloging Story

Recently I was speaking to a customer in the financial services industry who had an interesting requirement. He wanted to classify all data assets in the organization by associating the right semantic label with each one. The phrase he used was “classify to completion”. According to him, this classification was a key first step in cataloging, governing, and extracting value from the data assets in his organization. We agreed. Finally, someone who understood the importance of data domain discovery!

Data Domains


Informatica Enterprise Information Catalog has the capability of identifying the semantic label of a column by evaluating data patterns and metadata. These semantic labels are called Data Domains. Data Domains are rule-based: rules can be defined as regular expressions, reference tables, or more complex expressions written in Informatica’s mapping language.

An example of a rule based data domain is Email. A simplistic regular expression to identify email is:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b

To create a data domain for Email, you would create a new mapplet that returns TRUE every time this regular expression matches. Data Domain Discovery then checks each value in a column against the rule and associates the data domain with the column if a large proportion of the values match the pattern.
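To make that concrete, here is a minimal sketch of the idea in Python (not how Enterprise Information Catalog is actually implemented; in the product the rule would live in a mapplet). The sample values and the 80% conformance threshold are illustrative assumptions.

```python
import re

# The simplistic email rule from above; IGNORECASE added so lowercase values match.
EMAIL_PATTERN = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE)

def conformance_ratio(values, pattern):
    """Fraction of non-empty values that match the rule."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    return sum(1 for v in non_empty if pattern.search(v)) / len(non_empty)

def infer_domain(values, threshold=0.8):
    """Associate the Email domain when a large proportion of values conform.
    The 0.8 threshold is an illustrative assumption, not a product default."""
    return "Email" if conformance_ratio(values, EMAIL_PATTERN) >= threshold else None

sample = ["alice@example.com", "bob@mail.co", "n/a", "carol@corp.org", "dave@x.io"]
print(infer_domain(sample))  # -> Email (4 of 5 values conform)
```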

Enterprise Information Catalog ships with 60+ out-of-the-box data domains, and users can create their own to automatically classify data assets across the enterprise.

Problems with Rule-based Data Domains

However, there are two key problems with rule-based data domains:

  1. Scale: Consider a large enterprise. It is common to find thousands of databases with columns numbering in the hundreds of millions. To classify all of these columns to completion, we estimate that users would have to create around 20,000 data domains. The organization would need an army of people to create individual mapplets and rules for 20,000 domains. Certainly a process that cannot scale.
  2. False Positives and Negatives: Consider a data domain like “Age”. It matches any number between 1 and 120, and this broad definition results in a high number of false positives. A smarter system would instead score the match using other metadata: the column name, the other columns in the table (Age tends to occur alongside First Name, Last Name, Email ID, etc.), the value distribution (the age distribution for customers of a financial services firm is very different from that of a business targeting college students), and so on. A toy version of such scoring is sketched after this list.
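Here is a toy scoring function for the “Age” example, assuming we have access to the column name and the names of the sibling columns. The signals, weights, and helper names are my own illustrative assumptions (the distribution signal is omitted for brevity); this is not how Enterprise Information Catalog scores matches.

```python
def score_age_match(values, column_name, sibling_columns):
    """Toy confidence score for the Age domain, combining three signals.
    The 0.5/0.3/0.2 weights are arbitrary, illustrative assumptions."""
    # Signal 1: value conformance -- share of values that are integers in 1..120.
    ints = []
    for v in values:
        try:
            ints.append(int(v))
        except (TypeError, ValueError):
            pass
    conformance = sum(1 <= n <= 120 for n in ints) / len(values) if values else 0.0

    # Signal 2: the column is actually named something like "age".
    name_evidence = 1.0 if "age" in column_name.lower() else 0.0

    # Signal 3: the column sits next to person-like columns.
    person_hints = {"first_name", "last_name", "email_id"}
    context_evidence = len(person_hints & {c.lower() for c in sibling_columns}) / len(person_hints)

    return 0.5 * conformance + 0.3 * name_evidence + 0.2 * context_evidence

# A column of small integers named "qty" scores low despite perfect conformance...
print(score_age_match(["3", "12", "40"], "qty", ["sku", "price"]))            # 0.5
# ...while the same values in a customer table named "age" score much higher.
print(score_age_match(["3", "12", "40"], "age", ["first_name", "email_id"]))  # ~0.93
```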

Enter Smart Domains

This is where Smart Domains fare much better. Smart Domains do not require pre-created rules. Instead, they learn from example associations: a user directly associates a smart domain with a column, the system learns from that association, and it auto-propagates the domain to existing and new columns that are similar to the one the user tagged.

Kind of like how Facebook suggests tags when someone uploads your photo.

 


Facebook compares the new photo against existing photos that have already been tagged to provide these suggestions.

Smart Domains are the equivalent of facial recognition for data sets.
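The propagation step itself can be sketched in a few lines, assuming a column-similarity score is already available (Data Similarity, described next, is one such score). The threshold, the toy overlap measure, and the column names below are illustrative assumptions, not product behavior.

```python
def propagate_domain(example_values, domain, candidates, similarity, threshold=0.85):
    """Suggest `domain` for every untagged column whose similarity to the
    user-tagged example crosses a cutoff. The 0.85 cutoff is illustrative."""
    suggestions = [
        (name, domain, similarity(example_values, values))
        for name, values in candidates.items()
        if similarity(example_values, values) >= threshold
    ]
    # Highest-confidence suggestions first, like tag suggestions on a photo.
    return sorted(suggestions, key=lambda s: s[2], reverse=True)

def overlap(a, b):
    """Toy similarity: share of the example's values also present in the candidate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0

tagged_email_column = ["a@x.com", "b@y.com", "c@z.com"]  # the user tagged this column "Email"
untagged_columns = {
    "leads.contact_email": ["a@x.com", "b@y.com", "c@z.com", "d@w.com"],
    "orders.quantity":     ["1", "2", "3"],
}
print(propagate_domain(tagged_email_column, "Email", untagged_columns, overlap))
# -> [('leads.contact_email', 'Email', 1.0)]
```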

Data Similarity

Data Similarity is one of the key factors used for suggesting data domains*. Data Similarity computes the extent to which the data in two columns is the same. However, it would be computationally prohibitive to compare every pair of columns in an enterprise setting.

As an example, with 100M columns there are roughly 5,000 trillion column pairs to compare. If evaluating each pair took a nanosecond, the calculation would take roughly 5 million seconds, which is about 58 days!
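The back-of-the-envelope arithmetic, for anyone who wants to check it:

```python
columns = 100_000_000
pairs = columns * (columns - 1) // 2         # ~5 * 10**15 column pairs
seconds = pairs * 1e-9                       # at one comparison per nanosecond
print(f"{pairs:.2e} pairs -> {seconds / 86_400:.0f} days")  # 5.00e+15 pairs -> 58 days
```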

Instead, Data Similarity uses machine learning techniques to cluster similar columns and identify the most likely matches. This process uses the underlying big metadata platform for the clustering job.
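The product does not publish the details of this clustering, but a common approach to the problem is to bucket columns by a compact signature (for example MinHash/LSH) so that only columns landing in the same bucket are compared. The sketch below illustrates that idea with tiny signatures, a single signature band, and made-up column names; it is an assumption about the technique, not a description of Enterprise Information Catalog internals.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def minhash_signature(values, num_hashes=8):
    """Tiny MinHash-style signature of a column's value set. Real systems use
    many more hash functions and sample large columns; this is illustrative."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{v}".encode()).hexdigest(), 16) for v in values)
        for seed in range(num_hashes)
    )

def jaccard(a, b):
    """Extent to which the data in two columns is the same."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def candidate_pairs(columns):
    """Bucket columns by signature so only columns that collide in a bucket are
    compared, instead of all ~n^2/2 pairs. Real LSH bands the signature to also
    catch near-duplicates; a single band is used here for brevity."""
    buckets = defaultdict(list)
    for name, values in columns.items():
        buckets[minhash_signature(values)].append(name)
    for group in buckets.values():
        yield from combinations(sorted(group), 2)

columns = {
    "crm.customer.email":  ["a@x.com", "b@y.com", "c@z.com"],
    "mktg.leads.email_id": ["a@x.com", "b@y.com", "c@z.com"],
    "sales.orders.qty":    ["1", "2", "3"],
}
for left, right in candidate_pairs(columns):
    print(left, right, jaccard(columns[left], columns[right]))  # -> similarity 1.0
```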


Once similar columns are identified, they can be used for multiple applications, Smart Domains being one of them. Others include:

  1. Recommending related data assets: An analyst working at a telecom company might be interested in doing customer churn analysis. She might start by querying for assets containing customer activity and find a spreadsheet containing call records of customers for the current quarter. Using Data Similarity, the system can recommend:
    1. A cleaned up version of the same data (substitutable data)
    2. Another table containing call records for previous quarter (union-able data)
    3. A customer detail table that might be joined to enrich the data with customer information (join-able data). A toy classifier for these three cases is sketched after this list.
  2. Identifying Duplicates: An organization can save storage costs by getting rid of unused duplicate data assets. Data Similarity helps find these duplicates across the enterprise.
  3. Inferring Data Lineage: Lineage metadata today is extracted from sources like PowerCenter, Big Data Management, Informatica Cloud, etc. However, in many cases customers use hand-coding or processes like FTP to move data around. Using Data Similarity, the system will be able to infer such lineage relationships automatically.
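As a rough illustration of the first application, here is a toy classifier that labels the relationship between two tables as substitutable, union-able, or join-able. The tables, thresholds, and containment heuristic are all illustrative assumptions; real recommendations would draw on much richer metadata.

```python
def classify_relationship(table_a, table_b):
    """Toy labels for the three recommendation types listed above. Each table
    maps column name -> set of sample values; thresholds are illustrative."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    shared = set(table_a) & set(table_b)
    schema_overlap = len(shared) / max(len(table_a), len(table_b))

    if schema_overlap >= 0.8:
        value_overlap = sum(jaccard(table_a[c], table_b[c]) for c in shared) / len(shared)
        # Same shape and (mostly) the same rows: a cleaned-up or duplicate copy.
        if value_overlap >= 0.8:
            return "substitutable"
        # Same shape, different rows: e.g. the previous quarter's call records.
        return "union-able"

    # Different shape, but one column's values are contained in a column of the
    # other table: a join-key candidate for enriching the data.
    for col_a, vals_a in table_a.items():
        for col_b, vals_b in table_b.items():
            small, large = (vals_a, vals_b) if len(vals_a) <= len(vals_b) else (vals_b, vals_a)
            if small and len(small & large) / len(small) >= 0.9:
                return f"join-able on {col_a} ~ {col_b}"
    return "unrelated"

calls_q3 = {"msisdn": {"111", "222"}, "minutes": {"10", "42"}}
customers = {"msisdn": {"111", "222", "333"}, "first_name": {"Ada", "Bob", "Cruz"}}
print(classify_relationship(calls_q3, customers))  # -> join-able on msisdn ~ msisdn
```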

Column Level Data Similarity is available with Enterprise Information Catalog v10.1.1.

*While Data Similarity is the first factor we are using for propagating Smart Domains, we plan to expand that list in the future to include additional factors like Data Patterns, Column Metadata, Lineage etc.
