Exploring Open Data with Data Prep
What the Heck’s in That (Open) Dataset? Here’s a No-Brainer Way to Find Out!
Open data is just plain…awesome!
If you need some inspiration or ideas for external data sources to analyze, you can start exploring open datasets for statistics about various topics ranging from agriculture to business, public safety, education, consumers, government, energy, finance, and more…
What will you discover? Insights galore! And maybe even a few “unknown unknowns” you didn’t know you were looking for!
But one of the challenges of working with an open dataset (or any “blind” dataset that gets handed to you, for that matter) is that you don’t know what the heck’s inside!
- You don’t know what types of data are included or what the value frequencies are.
- You have no idea how many rows and columns there are.
- You don’t know the format or the structure.
- You don’t know what’s being tracked, and so on…
That’s where data prep can come in handy. (More on that later.)
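If you’d rather answer the first few of those questions with a script, here is a minimal Python sketch that uses only the standard library. It counts rows and columns and, for each column, how many distinct and blank values it holds. The function name `profile_csv` and the sample rows are made up for illustration; they are not taken from the real file.

```python
import csv
import io


def profile_csv(handle):
    """Return (row_count, per-column stats) for an open CSV file object."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    distinct = {name: set() for name in columns}
    blank = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = (row.get(name) or "").strip()
            if value:
                distinct[name].add(value)
            else:
                blank[name] += 1
    summary = {
        name: {"distinct": len(distinct[name]), "blank": blank[name]}
        for name in columns
    }
    return rows, summary


# Tiny invented sample standing in for a downloaded CSV (assumption, not real data).
SAMPLE = """Product,Company,Submitted via,State
Mortgage,Bank A,Web,CA
Debt collection,Bank B,Phone,
Mortgage,Bank A,Web,TX
"""

rows, summary = profile_csv(io.StringIO(SAMPLE))
print(rows, "rows,", len(summary), "columns")
for name, stats in summary.items():
    print(f"{name}: {stats['distinct']} distinct, {stats['blank']} blank")
```

To try it on a real download, pass a file opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample. The sketch keeps every distinct value in memory, so it is meant for a quick look rather than very large files.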
I decided to go on a mini open data excursion to see where it’d take me.
I downloaded an open dataset. And I liked it…
That’s how I found out:
- The top problem for financial services
- Which bank had the most consumer complaints
- The most popular channel for complaint submissions
Here’s the no-brainer way I stumbled upon these insights within a matter of minutes!
- I went to data.gov and poked around… I clicked on the Data menu and decided to check out the Consumer Complaints Database.
This is where I was able to download the open dataset as a CSV file. (You can also download this dataset as a JSON or XML file.)
- After I downloaded the dataset, I opened it up in Excel to see what was inside…
And this is what it looked like (no surprise!).
At this point, all I knew about the contents was what I could skim from the column headers: dates, products, issues, company (names have been obscured for anonymity here), state, zip code, etc.
I definitely couldn’t tell you anything useful or interesting about the data in this raw form.
Now comes the fun part!
- I uploaded this dataset to Informatica Data Preparation (REV) and right away I was able to see some interesting things about consumer complaints in the financial services industry.
ONE click to “Unknown Unknowns”
By just clicking on the Product column, I was able to see that “Mortgage” was the top complaint category, followed by “Debt collection” and “Credit reporting.”
Being able to see the value frequencies like this saved me from fiddling with Excel to get this view. I was able to see all categories of problems (or “products” as they were labeled here), as well as how often each one occurred.
In a matter of seconds, I was able to learn which problems financial institutions deal with the most. All without having to scroll through an Excel spreadsheet or google Excel formulas to organize and reformat the data.
That’s pretty much the easiest, “most no-brainer” way to understand this dataset at a high level, if you ask me.
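If you want to cross-check a value-frequency view like this outside the tool, the same idea can be sketched in a few lines of Python with `collections.Counter`. This is only an illustration: the helper name `value_frequencies` is hypothetical, and the sample rows are invented to show the shape of the output, not the actual complaint counts.

```python
import csv
import io
from collections import Counter


def value_frequencies(handle, column):
    """Count how often each non-empty value appears in one CSV column."""
    reader = csv.DictReader(handle)
    counts = Counter()
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            counts[value] += 1
    return counts


# Invented sample rows (assumption); the real file has many more columns and rows.
SAMPLE = """Product,Submitted via
Mortgage,Web
Mortgage,Web
Mortgage,Phone
Debt collection,Web
Debt collection,Referral
Credit reporting,Web
"""

counts = value_frequencies(io.StringIO(SAMPLE), "Product")
# Print the three most frequent values, highest count first.
for value, n in counts.most_common(3):
    print(f"{value}: {n}")
```

The same function works on any other column by changing the `column` argument, and `counts.most_common()[-1]` returns one of the least frequent values, assuming the column name matches the CSV header exactly.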
Knowing this information, financial services organizations can focus on ways to improve programs and services to reduce the number of complaints related to mortgages. They could also see if there were any correlations between mortgage complaint rates and account closures or customer churn.
Want competitive insights? You could analyze the time-to-resolution rates or number of outstanding complaints to see how your company compares to the rest.
This view is a great starting point for identifying industry problems and trends, especially if your goal is to stay competitive and keep your customers happy and loyal.
In any case, this alone raises a lot of questions and gets you thinking. And that’s a good thing!
- Next, I took a look at the Company column and… I found out which bank had the most complaints!
No special formulas necessary to find this juicy insight (told you, this is a no-brainer way to learn new things about different industries and topics!).
Let us know if your bank made the list! (Mine did! Keep in mind that this dataset is continuously updated, so your results may differ depending on when you download it. I downloaded mine on 10/20/2016.)
- As I clicked around some more, I looked at the “Submitted via” column, which is how I learned that customers submitted most of their complaints through the web, while email was the least popular channel.
Hope you have as much fun exploring open datasets as I did!
If you have 5 minutes, you have time to explore this open dataset and go through this quick exercise. And I highly encourage you to try it. You’ll be surprised at how easy it is and how fast you can come up with ideas for applying these insights. Sign up for a free trial of REV and download this dataset to start exploring!
What open data do you use? How have you used open data for your analytics projects? Share your experience in the comments below!