This blog discusses going beyond basic data profiling. But for those of you who don’t know what basic data profiling is, let me summarize quickly. Basic data profiling is what I call three-dimensional analysis. I discuss this in some depth in my book, “Three Dimensional Analysis Data Profiling Techniques.”
Basic profiling includes column profiling, table profiling, and cross-table profiling. Column profiling is the automated discovery of the true metadata of your data: identifying the accurate data type and precision, minimum value, maximum value, number of nulls, percent null, and more on a column-by-column basis. Table profiling attempts to infer a primary key based upon the data; you can also infer functional dependencies within each table. Cross-table profiling is about finding primary key / foreign key relationships between tables, as well as overlap analysis. This technology has been around since the late 90s. It boggles my mind that many data-related projects still do not perform basic profiling before they embark on the project. But that, as they say, is another story.
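To make the column-profiling idea concrete, here is a minimal Python sketch of the statistics described above (inferred type, min/max, null counts). It is an illustration of the concept, not how any profiling tool actually implements it, and the function name is my own:

```python
def profile_column(name, values):
    """Basic column profiling: infer type and compute min, max, and null stats."""
    is_empty = lambda v: v is None or str(v).strip() == ""
    non_null = [v for v in values if not is_empty(v)]
    nulls = len(values) - len(non_null)
    # Infer the "true" data type: numeric if every non-null value parses as a number.
    try:
        typed = [float(v) for v in non_null]
        inferred = "numeric"
    except (TypeError, ValueError):
        typed = [str(v) for v in non_null]
        inferred = "string"
    return {
        "column": name,
        "inferred_type": inferred,
        "min": min(typed) if typed else None,
        "max": max(typed) if typed else None,
        "null_count": nulls,
        "pct_null": round(100.0 * nulls / len(values), 1) if values else 0.0,
        "distinct": len(set(typed)),
    }
```

Running this over every column of a table gives you the column-level view basic profiling provides; a real tool would also infer precision, patterns, and keys.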
Data Domain Discovery
Let’s talk about some additional features that will take you beyond basic data profiling. The first feature I will discuss is something we call Data Domain Discovery. This feature will quickly scan the data in a table or tables, looking to see if the data or metadata conforms to a functional meaning. For example, is the data a phone number, social security number, credit card, address, city, state, etc.
At the basic level, this is the ability to create a data or metadata rule and automatically run it against all of the columns in a table. If you are familiar with Informatica Data Quality or PowerCenter, this is a mapplet. The mapplet has one input (it runs against each column, so that is the input) and one output (a positive or negative answer: does the data or metadata conform to the rule or not?). However, you can have any number of passive transformations in the mapplet. Many customers use reference data or patterns to validate the data.
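The one-input / one-output shape of such a rule can be sketched in plain Python. This is a hedged stand-in for the mapplet concept, not Informatica code; the rule, the regex, and the `run_rule_against_table` helper are all illustrative names of my own:

```python
import re

def us_phone_rule(value):
    """Data rule: one input value, one yes/no output. Does it look like a US phone number?"""
    if value is None:
        return False
    return re.fullmatch(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}", str(value).strip()) is not None

def run_rule_against_table(rule, table, threshold=0.8):
    """Run one rule against every column; report columns where most non-null rows conform."""
    hits = {}
    for column, values in table.items():
        non_null = [v for v in values if v is not None]
        if not non_null:
            continue
        score = sum(rule(v) for v in non_null) / len(non_null)
        if score >= threshold:
            hits[column] = round(score, 2)
    return hits
```

The threshold matters: real data is dirty, so a column is flagged when most, not all, of its values conform.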
A simple data rule could be a null rule. Basic data profiling will automatically show you the number of nulls contained in the data, but this rule goes beyond just identifying a null. You can write it to look for nulls, blanks, or zeros in a numeric field. You can also include other terms that might be used to identify an empty field, values like: not applicable, N/A, not used, empty, etc.
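A sketch of that null rule, assuming a placeholder-token list like the one described; the exact tokens would come from your own data conventions:

```python
# Illustrative "null rule": flags values that are effectively empty, not just literal NULLs.
# The token list is an assumption; extend it with the placeholders your systems actually use.
EMPTY_TOKENS = {"", "0", "null", "n/a", "na", "not applicable", "not used", "empty", "none"}

def null_rule(value):
    """True if the value is effectively empty: NULL, blank, zero, or a placeholder token."""
    if value is None:
        return True
    return str(value).strip().lower() in EMPTY_TOKENS
```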
Normally, you would find these additional anomalies by visually inspecting the value-frequency results of basic column profiling. Data domain discovery takes this a step beyond basic profiling, speeding up the process and improving accuracy by removing the human element.
Another use of this feature is identifying sensitive data. A simple example is looking for a credit card number or a Social Security number embedded in a column. You can do some of this in basic data profiling, for example looking for a nine-digit numeric, perhaps separated by dashes or some other identifiable pattern. Basic data profiling will show you the different patterns that exist across an entire column. But I’m talking about finding sensitive data buried inside your data. What about a comment field that has an entry like:
credit card number 4425 1000 3465 2164
In basic profiling, the pattern for that value would span the whole string, words and all, so it would not stand out as a credit card number. What Data Domain Discovery allows you to do is look for the 16-digit pattern, or an actual credit card number, embedded within the string, as opposed to the pattern of the column as a whole. From a business perspective, this allows you to identify sensitive data before it is inadvertently exposed to non-authorized personnel, preventing major fines. This information may be found in comment fields or in old mainframe systems’ columns where the columns were routinely reused for other purposes. The metadata describing a column may have nothing to do with what is actually contained in it.
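One common way to implement embedded-card detection is a regex scan plus a Luhn checksum to weed out random digit runs. This is a sketch of that general technique, not how any particular tool does it; the Luhn filter is my own addition, and a made-up example number (like the one above) will not necessarily pass it:

```python
import re

# 16 digits, optionally separated by spaces or dashes, embedded anywhere in free text.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")

def luhn_ok(digits):
    """Luhn checksum: filters out random 16-digit runs that are not real card numbers."""
    total = 0
    for i, d in enumerate(reversed(digits)):
        n = int(d)
        if i % 2 == 1:          # double every second digit from the right
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return total % 10 == 0

def find_embedded_cards(text):
    """Scan free text (e.g. a comment field) for embedded credit card numbers."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if len(digits) == 16 and luhn_ok(digits):
            hits.append(match.group())
    return hits
```

Scanning a string value rather than matching the whole column’s pattern is exactly what lets this catch a card number hiding inside a comment field.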
In the US there are a number of initiatives to make sure different types of information are not inadvertently exposed to those who should not see it. A couple of examples that come to mind are personally identifiable information (PII), protected health information (PHI), and cardholder data (CHD) under the Payment Card Industry Data Security Standard (PCI DSS). For accidentally exposing credit card numbers, the payment brands may, at their discretion, fine an acquiring bank $5,000 to $100,000 per month for PCI compliance violations.
In the picture above, you can see the results of Data Domain Discovery. The phone number rule found results in three tables, Customer_1, Customer and Organization. Both the data and the metadata rules returned positive results. At the same time, other rules for credit card numbers, location, social security numbers and zip codes were also run and returned positive results.
Another example of where Data Domain Discovery can come into play is when you are about to embark on a master data management project. A major task in such a project is bringing together all of the data related to a customer. Sometimes this is extremely difficult because no one in the organization really knows which data sources contain customer data. With data domain discovery, you can build rules that automatically try to identify customer-related data, for example rules that look for countries, states, zip codes, credit card numbers, account numbers, etc. This would quickly find tables that warrant inspection to determine whether the data is about customers, or whether it is vendor or employee data. The same discovery could then be reused in later master data projects for vendors and employees, and similar rules could be written to identify product data.
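Putting the pieces together, the discovery pass described above amounts to running a library of domain rules over every column of every table. A minimal sketch, assuming a handful of illustrative rules (the rule names, abbreviated state list, and `discover_domains` helper are all hypothetical):

```python
import re

# A small, illustrative rule library; a real one would cover many more domains
# and use full reference data rather than this abbreviated state list.
DOMAIN_RULES = {
    "us_state": lambda v: str(v).strip().upper() in {"CA", "NY", "TX", "FL"},
    "zip_code": lambda v: re.fullmatch(r"\d{5}(-\d{4})?", str(v).strip()) is not None,
    "ssn":      lambda v: re.fullmatch(r"\d{3}-\d{2}-\d{4}", str(v).strip()) is not None,
}

def discover_domains(tables, threshold=0.8):
    """Return (table, column, domain) triples where most non-null values match a rule."""
    findings = []
    for table_name, columns in tables.items():
        for column_name, values in columns.items():
            non_null = [v for v in values if v not in (None, "")]
            if not non_null:
                continue
            for domain, rule in DOMAIN_RULES.items():
                score = sum(bool(rule(v)) for v in non_null) / len(non_null)
                if score >= threshold:
                    findings.append((table_name, column_name, domain))
    return findings
```

The output is a candidate list, not an answer: a table flagged for zip codes and states still needs a human to decide whether it holds customer, vendor, or employee data.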