Reflections Of A Former Data Analyst (Part 2) – Changing The Game For Data Plumbing
In my last blog I promised to report back on my experience using Informatica Data Quality, a software tool that helps automate the tedious data plumbing work that routinely consumes more than 80% of an analyst's time. Today, I am happy to share what I've learned over the past couple of months.
But first, let me confess something. The reason it took me so long to get here is that I dreaded trying the software. Never a savvy computer programmer, I was convinced that I would not be technical enough to master the tool and that it would turn into a lengthy learning experience. That mental barrier dragged me down for a couple of months before I finally bit the bullet and got my hands on the software. I am happy to report that my fear was unnecessary. It took me half a day to get a good handle on most features of the Analyst Tool, the component of Data Quality designed for analysts and business users. I then spent three days figuring out how to maneuver the Developer Tool, another key piece of the Data Quality offering used mostly by, you guessed it, developers and technical users. I admit that I am no master of the Developer Tool after three days of wrestling with it, but I got the basics. More importantly, the hands-on interaction with the whole suite helped me understand the logic behind the overall design, and I saw for myself how analysts and business users can easily collaborate with their IT counterparts within the Data Quality environment.
To break it all down, first comes profiling. As analysts we understand all too well the importance of profiling: it provides an anatomy of the raw data we've collected. In many cases it is a must-have first step in data preparation, especially when the raw data comes from different places in different formats. As a heavy Excel user, I used to rely on every trick available in the spreadsheet to gain visibility into my data. I would filter, sort, build pivot tables, and make charts to learn what was in my raw data. Depending on how many columns were in my data set, it could take hours, sometimes days, just to figure out whether the data I received was any good at all, and how good it was.
Switching to the Analyst Tool in Data Quality, exploring my raw data becomes a task of a few clicks, six at most if I am picky about how I want it done. Basically, I load my data, click a couple of options, and let the software do the rest. A few seconds later I can visualize the statistics of the data fields I choose to examine, and I can also measure the quality of the raw data using the Scorecard feature. No more fiddling with spreadsheets and staring at busy rows and columns. Take a look at the screenshots above and tell me which you prefer.
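For readers who think in code rather than clicks, here is roughly the gist of what a column profile and a completeness-style scorecard tell you, sketched in Python with pandas on a made-up data set. This is purely illustrative; it is my own approximation of the idea, not how Informatica implements profiling or the Scorecard.

```python
import pandas as pd

# A toy customer extract with the kinds of flaws profiling surfaces
# (hypothetical data, for illustration only).
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 103, 105],
    "state": ["CA", "ca", "NY", None, "Texas"],
    "signup_date": ["2013-01-05", "2013-02-30", "2013-03-12", None, "2013-04-01"],
})

# Column profile: distinct values, null counts, most frequent value.
for col in df.columns:
    print(col,
          "distinct:", df[col].nunique(dropna=True),
          "nulls:", int(df[col].isna().sum()),
          "top:", df[col].mode(dropna=True).iloc[0])

# A crude "scorecard": percent of non-null values per column.
completeness = df.notna().mean() * 100
print(completeness.round(1))
```

Even on this tiny example, the profile immediately exposes the duplicate customer_id, the inconsistent state spellings, and the missing values, which is exactly the kind of anatomy that used to take me hours of filtering and pivoting in Excel.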
Once profiling tells me my raw data is adequate, I still need to clean up the nonsense in it before performing any analysis; otherwise bad things can happen. We call it garbage in, garbage out. Again, to clean and standardize my data, Excel came to the rescue in the past. I would play with different functions and learn new ones, write macros, or simply do it by hand. It was tedious, but it worked as long as I was dealing with a static data set. The problem came when I needed to incorporate new data sources in a different format: many of the previously built formulas would break and become inapplicable, and I would have to start all over again. Spreadsheet tricks simply don't scale in those situations.
With the Data Quality Analyst Tool, I can use the Rule Builder to create a set of logical rules in a hierarchical manner based on my objectives, and test those rules to see the results immediately. The nice thing is that those rules are not tied to a data set's format, location, or size, so I can reuse them when new data comes in. Profiling can be done at any time, so I can re-examine my data after applying the rules, as many times as I like. Once I am satisfied with the rules, they are passed on to my peers in IT, who create executable rules based on the logic I built and run them automatically in production. No more worrying about differences in format, volume, or other discrepancies across data sets; all of that complexity is handled by the software, and all I need to do is build meaningful rules that transform the data into good enough shape for my analysis. Best part? I can do all of the above without hassling IT. Feeling empowered is awesome!
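To give a flavor of what I mean by reusable rules, here is a hedged sketch of the concept in Python with pandas, again my own illustration rather than the actual Rule Builder or the executable rules IT generates. Each rule is just a named check or transformation that doesn't care which file the data came from, so the same set can be re-run on every new batch.

```python
import pandas as pd

# Illustrative standardization and validation rules, written as plain
# functions so they can be reused on any data set with these columns.
US_STATES = {"CA", "NY", "TX"}  # truncated list, for illustration

def standardize_state(df: pd.DataFrame) -> pd.DataFrame:
    # Uppercase, trim, and map a few known long forms to abbreviations.
    mapping = {"TEXAS": "TX", "CALIFORNIA": "CA", "NEW YORK": "NY"}
    df["state"] = df["state"].str.strip().str.upper().replace(mapping)
    return df

def flag_invalid_rows(df: pd.DataFrame) -> pd.DataFrame:
    # Rule: state must be a known code and signup_date must parse as a date.
    valid_state = df["state"].isin(US_STATES)
    valid_date = pd.to_datetime(df["signup_date"], errors="coerce").notna()
    df["valid"] = valid_state & valid_date
    return df

RULES = [standardize_state, flag_invalid_rows]

def apply_rules(df: pd.DataFrame) -> pd.DataFrame:
    # The same rules run unchanged on any new batch of data.
    for rule in RULES:
        df = rule(df)
    return df
```

The point of the design is the separation of concerns: I express the business logic once, and the machinery that runs it against whatever data shows up next month is somebody else's (or the software's) problem.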
Using the right tool for the right job improves our results, saves us time, and makes our jobs much more enjoyable. For me, there is no more Excel for data cleansing after trying our Data Quality software, because now I can get more done in less time, and I am no longer stressed out by the lengthy process.
I encourage my analyst friends to try Informatica Data Quality, or at least the Analyst Tool within it. If you are like me, wary of a steep learning curve, fear no more. Besides, if Data Quality can cut your data cleansing time in half (mind you, our customers have reported higher numbers), think how many more predictive models you could build, how much more you could learn, and how much faster you could build your reports in Tableau, with more confidence.