My Thoughts on Gartner’s Thoughts About the Challenges of Hadoop for Data Integration

Just read a great Gartner report titled “Hadoop is Not a Data Integration Solution” (January 29, 2013). However, I beg to differ slightly with Gartner on this one. The title should have been “Hadoop Alone is Not a Data Integration Solution.” The report outlines all of the reasons that just deploying Hadoop by itself is often quite challenging.

Issues that we at Informatica have personally seen our customers deal with include:

  • Getting data out of legacy sources is the exact same challenge for Hadoop developers as it is for ETL developers. Same problem, just a different processing engine.
  • You still have to be able to profile and cleanse the data before using it. Like most data sources, even big data is dirty.
  • In the case of social media data, you have to be able to parse it to extract the useful information that has meaning for you before you can use it. This is similar to the profiling and cleansing problem, just a new flavor of it in the big data world.
  • You still have to combine data from multiple sources in a useful manner. It doesn’t just magically all come together.
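The tasks in the list above are exactly the work an ETL developer would recognize. As a minimal sketch (plain Python, with hypothetical field names chosen purely for illustration), here is what "cleanse, then combine from multiple sources" looks like at its simplest:

```python
# Minimal sketch of two classic data integration steps: cleanse, then join.
# Field names and sample data are hypothetical, for illustration only.

def cleanse(record):
    """Normalize a raw record: trim whitespace, standardize case,
    and drop rows that are missing the join key."""
    cust_id = (record.get("cust_id") or "").strip()
    if not cust_id:
        return None  # dirty row: no key to join on
    return {
        "cust_id": cust_id,
        "name": (record.get("name") or "").strip().title(),
    }

def join(customers, orders):
    """Combine two sources on cust_id -- it doesn't magically come together."""
    by_id = {c["cust_id"]: c for c in customers}
    return [
        {**by_id[o["cust_id"]], "amount": o["amount"]}
        for o in orders
        if o["cust_id"] in by_id
    ]

raw = [{"cust_id": " 42 ", "name": "  ada lovelace "},
       {"cust_id": "", "name": "no key"}]
customers = [c for c in (cleanse(r) for r in raw) if c]
result = join(customers, [{"cust_id": "42", "amount": 99.5}])
print(result)  # [{'cust_id': '42', 'name': 'Ada Lovelace', 'amount': 99.5}]
```

Whether this logic runs in an ETL engine or as a MapReduce job on Hadoop, the work itself is the same; only the processing engine changes.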

That is just to name a few off the top of my head.

Other market challenges that we have also directly seen:

  • “Data scientists” who know Hadoop, MapReduce, Pig, etc., and know them well, are very rare and expensive. Building a team with the skills to be successful is difficult and expensive.
  • 80% of the effort in doing data analysis on Hadoop is simply performing classic data integration tasks. Without automation of the integration tasks, the expensive people noted above only have 20% of their time, if that, to dedicate to innovation.

I just gave a talk at the Global Big Data Conference on January 28th, and my topic was the risks that could keep Hadoop from crossing the chasm into the mainstream. These are the same risks I just mentioned above. However, the second half of my talk covered what is happening that will allow Hadoop to cross the chasm.

Gartner is correct that Hadoop, by itself, is NOT a data integration platform. However, it can be made into one. Lots of companies are investing in making Hadoop-based integration easier. Informatica has done this by porting our Virtual Data Machine onto Hadoop, allowing companies to use the same integration development environment they use today to build ETL jobs, and to run those same jobs with Hadoop as the underlying engine. In fact, it is also possible to parse, profile, and cleanse data on Hadoop… TODAY!

As I noted, we aren’t the only vendor investing in solving this problem. The market in general is moving in this direction so expect to see some exciting capabilities emerging over the next six months.

We know that companies that use this kind of graphical development environment instead of hand coding see a 5x time savings. We have also seen that people who are not Java/MapReduce/Hadoop experts can develop quite sophisticated integration jobs that run in the Hadoop environment using the same skills they already have.

So technically, Gartner was not wrong. Hadoop by itself is absolutely insufficient as a standalone architecture to be competitive for the majority of companies. This is a common thread we are hearing from most companies. Once you take away the big web companies and some Wall Street firms, most companies do not want to hire an army of “data scientists” to use Hadoop for breakthrough results.

However, Gartner missed the second half of the story. By taking traditional ETL development environments and making them run on top of Hadoop, companies can streamline development, build confidence in their data and integration processes, and reach their potential by innovating without barriers.


One Response to My Thoughts on Gartner’s Thoughts About the Challenges of Hadoop for Data Integration

  1. Todd – love your response here. (You’re still great at making provocative and thought-provoking observations, I see. Disclaimer: Todd and I worked together at IBM in a past life.)

    What I love about MapReduce is that while it’s an open source technology going through an adolescent phase (fast uptake and some interesting growing pains), vendors big and small, from startups to established players, are at the same time taking the opportunity to contribute to the framework and make it more enterprise-ready.

    At Syncsort we’re seeing exactly the same thing. BTW, it’s interesting how dependent the core MapReduce technology, like most others, is on data ordering/sorting – it actually happens twice in every flow, in both the map and reduce phases.
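    To make that dependence concrete, here is a toy simulation (plain Python, not actual Hadoop code) of a MapReduce word count. The framework sorts map output by key during the shuffle, and the reduce phase relies on that ordering to group values:

```python
from itertools import groupby
from operator import itemgetter

# Toy word count showing where MapReduce depends on sorting:
# map output must be ordered by key before the reduce phase can
# group values -- this sort happens in the shuffle of every job.

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(sorted_pairs):
    # groupby only works here because the pairs arrive sorted by key
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

pairs = list(map_phase(["to be or", "not to be"]))
shuffled = sorted(pairs, key=itemgetter(0))  # stands in for the shuffle/sort
counts = dict(reduce_phase(shuffled))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

    Skip the `sorted` step and `groupby` silently produces wrong counts, which is exactly why sort performance sits on the critical path of every MapReduce flow.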

    One of the key things, however, is to not just provide an alternative to hand-coding Java, etc., but to do it in a way that doesn’t bottleneck the framework (Amdahl’s law in full effect).

    Regarding data scientists: a guest blog I wrote for Computerworld recently covers my thoughts there, which seem to be very similar to yours :)
