Just read a great Gartner report titled “Hadoop is Not a Data Integration Solution” (January 29, 2013). However, I beg to differ slightly with Gartner on this one. The title should have been “Hadoop Alone is Not a Data Integration Solution.” The report outlines all of the reasons that just deploying Hadoop by itself is often quite challenging.
Issues that we at Informatica have personally seen our customers deal with include:
- Getting data out of legacy sources is the exact same challenge for Hadoop developers as it is for ETL developers. Same problem, just a different processing engine.
- You still have to be able to profile and cleanse the data before using it. Like most data sources, even big data is dirty.
- In the case of social media data, you have to be able to parse it to extract the useful information that has meaning for you before you can use it. This is similar to the profiling and cleansing problem, just a new flavor of it in the big data world.
- You still have to combine data from multiple sources in a useful manner. It doesn’t just magically all come together.
That is just to name a few off the top of my head.
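To make those tasks concrete, here is a minimal, self-contained Python sketch with made-up in-memory records (the field names and sample data are illustrative assumptions, not from any real system). It shows the same three classic integration steps whether the engine underneath is an ETL tool or Hadoop: cleansing dirty customer records, parsing raw social-media payloads, and joining the two sources on a shared key.

```python
# Illustrative sketch of classic data integration tasks: cleanse, parse, join.
# All records and field names below are made up for demonstration.
import json
import re

# Source 1: CRM-style records with dirty fields and a duplicate row.
crm = [
    {"cust_id": "001", "email": " Alice@Example.COM "},
    {"cust_id": "002", "email": "bob@example.com"},
    {"cust_id": "002", "email": "bob@example.com"},  # exact duplicate
]

# Source 2: raw social-media payloads that must be parsed before use.
tweets = [
    json.dumps({"user": "001", "text": "Loving the new release! #happy"}),
    json.dumps({"user": "002", "text": "Support was slow today #frustrated"}),
]

def cleanse(records):
    """Normalize email casing/whitespace and drop exact duplicates."""
    seen, out = set(), []
    for r in records:
        email = r["email"].strip().lower()
        key = (r["cust_id"], email)
        if key not in seen:
            seen.add(key)
            out.append({"cust_id": r["cust_id"], "email": email})
    return out

def parse_tweets(raw):
    """Extract only the useful fields from each payload, e.g. hashtags."""
    out = []
    for line in raw:
        t = json.loads(line)
        out.append({"cust_id": t["user"],
                    "hashtags": re.findall(r"#(\w+)", t["text"])})
    return out

def join(customers, activity):
    """Combine the two sources on the shared cust_id key."""
    by_id = {c["cust_id"]: c for c in customers}
    return [{**by_id[a["cust_id"]], **a}
            for a in activity if a["cust_id"] in by_id]

combined = join(cleanse(crm), parse_tweets(tweets))
print(combined)
```

None of this is Hadoop-specific; that is exactly the point. The same logic must exist somewhere, and without tooling it gets hand-written as MapReduce or Pig jobs.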
Other market challenges that we have also directly seen:
- “Data scientists” who know Hadoop, MapReduce, Pig, etc., and know them well, are very rare and expensive. Building a team that has the skills to be successful is difficult and expensive.
- 80% of the effort in doing data analysis on Hadoop is simply performing classic data integration tasks. Without automation of the integration tasks, the expensive people noted above only have 20% of their time, if that, to dedicate to innovation.
I just gave a talk at the Global Big Data Conference on January 28th on the risks that could keep Hadoop from crossing the chasm into the mainstream: the same risks I mentioned above. The second half of my talk, however, covered what is happening that will allow Hadoop to cross the chasm.
Gartner is correct that Hadoop, by itself, is NOT a data integration platform. However, it can be made into one. Lots of companies are investing in making Hadoop-based integration easier. Informatica has done this by porting our Virtual Data Machine onto Hadoop, allowing companies to use the same integration development environment they use today to build ETL jobs, and to run those same jobs with Hadoop as the underlying engine. In fact, it is possible to parse, profile, and cleanse data on Hadoop… TODAY!
As I noted, we aren’t the only vendor investing in solving this problem. The market in general is moving in this direction, so expect to see some exciting capabilities emerging over the next six months.
We know that companies that use this kind of graphical development environment instead of hand coding see a 5x time savings. We have also seen that people who are not Java/MapReduce/Hadoop experts can develop quite sophisticated integration jobs that run in the Hadoop environment with the same skills they already have.
So technically, Gartner was not wrong. Hadoop by itself is absolutely insufficient as a standalone architecture for the majority of companies. This is a common thread we hear from most companies. Once you take away the big web companies and some Wall Street firms, most companies do not want to hire an army of “data scientists” to use Hadoop for breakthrough results.
However, Gartner missed the second half of the story. By taking traditional ETL development environments and making them run on top of Hadoop, companies can streamline development, build confidence in their data and integration processes, and reach their potential by innovating without barriers.