With the rise in popularity of the elusive and expensive data scientist, it's a shame that once a data science team is assembled (at a very high recurring cost to the company, I may add), its members spend most of their time doing work they weren't really hired to do in the first place. That's right! It turns out that data scientists spend only about 20% of their time doing real analysis – the work they were actually trained to do. How is the other 80% of their time spent?
In a recent InformationWeek article – Meet The Elusive Data Scientist – Catalin Ciobanu, a physicist who spent ten years at Fermi National Accelerator Laboratory (Fermilab) and is now senior manager-BI at Carlson Wagonlit Travel, said, "70% of my value is an ability to pull the data, 20% of my value is using data-science methods and asking the right questions, and 10% of my value is knowing the tools." DJ Patil, Data Scientist in Residence at Greylock Partners (formerly Chief Data Scientist at LinkedIn), states in his book "Data Jujitsu" that "80% of the work in any data project is in cleaning the data." In a recent study that surveyed 35 data scientists across 25 companies (Kandel, et al. Enterprise Data Analysis and Visualization: An Interview Study. IEEE Visual Analytics Science and Technology (VAST), 2012), several data scientists expressed their frustration with preparing data for analysis. One reported: "I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis. Most of the time I'm lucky if I get to do any 'analysis' at all." Another observed that "most of the time once you transform the data … the insights can be scarily obvious."
The good news is that Informatica can rescue data science teams from the trouble and toil of accessing, parsing, extracting, standardizing, normalizing, transforming, cleansing, matching, and preparing data for analysis – in other words, data integration and data quality. This is what Informatica does better than anyone else, as recognized by leading industry analysts such as Gartner in its 2012 Magic Quadrant for Data Integration Tools report. Unfortunately, all too often data scientists reluctantly resort to time-consuming hand-coding to access, integrate, and prepare data for analysis. The new PowerCenter Big Data Edition provides a no-code collaborative development environment that makes it easy for both data scientists and developers to define data integration flows using an intuitive visual language. Data scientists and analysts can also use a browser-based tool to define data integration specifications in an Excel-like fashion – and automatically generate data flows. These data flows can be deployed to run on Hadoop or on traditional grid computing infrastructure – in other words, you can design once and deploy anywhere. I also recommend reading Dr. Ralph Kimball's white paper on Newly Emerging Best Practices for Big Data, where he details more than twenty big data best practices spanning four categories: data management, data architecture, data modeling, and data governance.
So let your data scientists be scientists! Use Informatica for data integration and data quality, and let your data scientists discover more valuable insights from big data. In my next blog post, I'll discuss how to turn these insights into breakthrough results.