“The report of my death was an exaggeration.”
– Mark Twain
Ah yes, another conference, another old technology declared dead. Mainframe… dead. Any programming language other than Java… dead. 8-track tapes… OK, well, some things thankfully do die, along with the Ford Pinto in which I used to listen to the Beatles Greatest Hits Red Album over and over again on 8-track… ah yes, the good old days, but I digress.
So Hadoop World has come and gone, along with all of the folks declaring the end of days for ETL. Yes, Hadoop is going to take over the world, which is why there are so many Hadoop clusters deployed in mainstream businesses already. That last sentence was sarcasm, for those of you who can’t read between the lines.
It feels as if this article has been written before by so many others. I remember when I worked for AOL-Netscape running the shop@aol website, and one of my younger engineers spent three months writing a few thousand lines of Java code to solve a problem we had. Java was going to take over the world for all code being written everywhere. This young man had a hammer, so everything looked like a Java nail. Anyway, I had recently brought in a new director of engineering, who happened to do a code review of this young engineer’s work. That night, my more experienced director wrote 100 lines of Perl to replace the few thousand lines of Java. The point isn’t that Java was bad; it was just that, in this particular case, Perl was more efficient for the problem at hand. People still use Perl and they still use COBOL, but they also use C, C++, C# and of course Java.
The point isn’t that Perl is better than Java; it is just that different tools are better suited to different circumstances.
Similarly, ETL is not going to die a sudden death, and Hadoop won’t take over the world either. The two technologies will, however, coexist nicely with other approaches such as data virtualization, data replication and change data capture. My own take is that this will go the way of most technology revolutions that are not specifically about storage media (where each new medium does seem to kill off the previous one, like the 8-track): we will see a steady progression in which developers learn to use the different data movement and access technologies where their particular strengths and weaknesses make them the most efficient and appropriate choice.
The big breakthrough will be approaches like the one we have taken at Informatica, where we have a virtual data machine that shields the user from the complexities of the underlying physical data layer. A developer can build a mapping for moving or transforming data and then decide to deploy it as ETL jobs on Hadoop or traditional data platforms, as ELT pushdown jobs on an RDBMS or data warehouse appliance, or as data access via data virtualization: one mapping, multiple deployment models. Keep in mind that most of our customers couldn’t even spell Hadoop a few years ago, so this kind of virtual data machine future-proofs their investment by making sure that when new technologies do come along, users can take advantage of them without having to start over from scratch. So while Hadoop provides another processing “engine”, it isn’t a one-size-fits-all solution. Other approaches will still be needed.
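For readers who think in code, here is a toy sketch of the "one mapping, multiple deployment models" idea. To be clear, this is not Informatica's API or how any product implements it; the `Mapping` class, the three functions and the `amount` filter are all hypothetical names invented for this illustration, under the assumption that a mapping boils down to "pick some columns and filter some rows".

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


@dataclass
class Mapping:
    """Hypothetical engine-neutral mapping: what to move, not how to run it."""
    source: str          # logical source name
    target: str          # logical target name
    columns: List[str]   # columns to carry over
    min_amount: float    # keep rows whose "amount" is at least this value


def _transform(m: Mapping, rows: Iterable[Row]) -> Iterator[Row]:
    """Shared logic: filter on amount, then project the chosen columns."""
    for row in rows:
        if row["amount"] >= m.min_amount:
            yield {col: row[col] for col in m.columns}


def run_as_etl(m: Mapping, rows: Iterable[Row]) -> List[Row]:
    """ETL style: extract rows, transform them here, return rows ready to load."""
    return list(_transform(m, rows))


def render_as_elt_sql(m: Mapping) -> str:
    """ELT pushdown style: emit SQL so the database engine does the work."""
    cols = ", ".join(m.columns)
    return (
        f"INSERT INTO {m.target} ({cols}) "
        f"SELECT {cols} FROM {m.source} WHERE amount >= {m.min_amount}"
    )


def open_as_virtual_view(m: Mapping, rows: Iterable[Row]) -> Iterator[Row]:
    """Virtualization style: copy nothing up front, produce rows on demand."""
    return _transform(m, rows)


if __name__ == "__main__":
    mapping = Mapping("orders", "big_orders", ["id", "amount"], 100)
    orders = [
        {"id": 1, "amount": 50, "region": "east"},
        {"id": 2, "amount": 250, "region": "west"},
    ]
    print(run_as_etl(mapping, orders))
    print(render_as_elt_sql(mapping))
    print(next(open_as_virtual_view(mapping, orders)))
```

The design choice worth noticing is that the mapping itself never mentions an engine. Adding a new deployment target in this sketch means writing one more function that reads the same `Mapping`, which is the future-proofing argument in miniature.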
Once again, all of this is a very long way of saying that the rumors of the death of ETL are greatly exaggerated. Let’s not go back to stone knives and bearskins and hand-code our ETL; instead, let’s apply the lessons of the past, which teach us to use tools that maximize productivity.
For those of you interested in a more thoughtful exploration of the topic, check out the Gartner paper “Understanding the Logical Data Warehouse: The Emerging Practice” by Mark Beyer and Roxane Edjlali, published in June 2012.