The Easy Button
Basic data profiling automates much of the analysis, but it still takes considerable manual effort. Normally you have to select the table or tables that you want to include in the analysis, configure the profile, run the profile, and so on. At Informatica, we are going beyond basic data profiling in a number of ways. The first is what I jokingly refer to as the Staples™ easy button. A new feature built on our advanced profiling is called Enterprise Discovery. This feature allows you to point at a schema or schemas and selectively run column profiling, primary key profiling, foreign key profiling and data domain discovery. So with a few clicks of the mouse you can run all your profiling requirements in one step against hundreds or thousands of tables.
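To make the terms concrete, here is a minimal sketch of what column profiling and primary key profiling compute for a single table. It is plain Python and not Informatica's implementation. The function names (profile_column, profile_table) and the sample customers table are hypothetical, chosen only for illustration.

```python
from collections import Counter

def profile_column(values):
    """Return basic profile statistics for one column (None means null)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        # A column is a candidate primary key if it has no nulls and no repeats.
        "candidate_key": len(values) > 0
        and len(non_null) == len(values)
        and len(counts) == len(values),
        "top_values": counts.most_common(3),
    }

def profile_table(rows):
    """Profile every column of a table given as a list of dicts."""
    columns = list(rows[0].keys()) if rows else []
    return {col: profile_column([r.get(col) for r in rows]) for col in columns}

# Hypothetical sample data, for illustration only.
customers = [
    {"id": 1, "state": "CA", "email": "a@example.com"},
    {"id": 2, "state": "CA", "email": None},
    {"id": 3, "state": "NY", "email": "c@example.com"},
]

report = profile_table(customers)
for column, stats in report.items():
    print(column, stats)
```

Running the same function over every table in a schema is the kind of repetitive work that a one-step discovery feature takes off your hands.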
Proactive Data Monitoring
Another way Informatica is moving beyond basic data profiling is the ability to write rules that look for multiple events and trigger actions automatically. For example, after you run Enterprise Discovery, you can have a number of rules in place that check the output of the profiling automatically and trigger events or actions to investigate or correct data quality issues. We refer to this as Proactive Data Monitoring.
This can also be viewed in another way. Once you have completed your profiling, you can set up a scorecard to monitor the data quality in your system. When a score crosses a threshold in the scorecard, you can automatically e-mail a person or persons to let them know the quality level of the data in the system is deteriorating.
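The scorecard idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the product's scorecard API: the metric names, the threshold values and the functions score_completeness and check_scorecard are all hypothetical, and a print statement stands in for the e-mail notification.

```python
def score_completeness(values):
    """Percentage of non-null values in a column."""
    if not values:
        return 100.0
    return 100.0 * sum(v is not None for v in values) / len(values)

def check_scorecard(scores, thresholds):
    """Return an alert message for every metric scoring below its threshold."""
    alerts = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        if score is not None and score < minimum:
            alerts.append(
                f"{metric}: score {score:.1f}% is below threshold {minimum:.1f}%"
            )
    return alerts

# Hypothetical column data and thresholds, for illustration only.
emails = ["a@example.com", None, "c@example.com", None]
scores = {
    "email_completeness": score_completeness(emails),
    "state_completeness": 100.0,
}
thresholds = {"email_completeness": 90.0, "state_completeness": 95.0}

alerts = check_scorecard(scores, thresholds)
for message in alerts:
    # Stand-in for e-mailing the person responsible for the data.
    print("ALERT:", message)
```

In practice the check would run on a schedule after each profiling run, so that a drop in quality is reported as soon as it shows up in the scores.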
Now I can’t seem to have a discussion about anything these days without mentioning Big Data. No matter how you define it, whether you are talking about social media, RFID, volumes of transaction data, mobile devices, etc., Big Data presents its own problems with profiling. Informatica Data Explorer has always had the ability to profile billion-row tables. Unfortunately, we are looking at tables growing by an order of magnitude or more. While this is not a new analysis feature, we do offer the ability to run profiling (and other products like PowerCenter, Informatica Data Quality, etc.) on the Hadoop platform.
Your Help Needed
As we move forward, we are looking at additional capabilities like Correlation Analysis. Is there a correlation between longevity of employment and salary? (I would stand out as an anomaly here.) Is there a correlation between discounts and quantity of product ordered? Is someone getting a large discount that is not justified by the volume of business they are doing with the company? Maybe correlation analysis can assist in root cause analysis. When I look at all of my data quality anomalies, are they associated with one or more individuals, departments or locations that enter that data?
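As a rough illustration of the discount question, the sketch below computes the Pearson correlation coefficient between order quantity and discount. Pearson correlation is one common measure and is my assumption here; the post does not say which method a Correlation Analysis feature would use. The pearson function and the sample figures are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Hypothetical order data: quantity ordered and discount percentage granted.
quantities = [10, 50, 100, 200, 400]
discounts = [0.0, 2.0, 5.0, 10.0, 20.0]

r = pearson(quantities, discounts)
print(f"correlation between quantity and discount: {r:.3f}")
```

A coefficient near 1 means discounts rise with volume. A customer with a large discount and a small order quantity would sit far from that trend, which is the kind of record worth a closer look.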
One final area of interest is the role of the data scientist. I read the following quote:
“70% of my value is an ability to pull the data, 20% of my value is using data-science methods and asking the right questions, and 10% of my value is knowing the tools,” says Catalin Ciobanu, a physicist who spent 10 years at Fermi National Accelerator Laboratory (Fermilab) and is now senior manager-BI at Carlson Wagonlit Travel.
Before you can perform the statistical analysis required to derive actionable intelligence from the data, you must first understand what data you have, how it is initially structured and whether it will support your analysis goals. Are there changes or additional functionality that can accelerate this process?
As we move beyond basic data profiling there are a number of other potential applications for profiling. I would like to hear from you about other applications or enhancements to data profiling that can assist you in meeting the requirements of your job.