Last year around this time, I wrote a blog about how the death of ETL was exaggerated. Time to revisit the topic briefly given a couple of interesting events that happened in the past few weeks.
First, a company whose senior executive had claimed that ETL and the data integration layer were dead came by to visit. It turns out that the bold executive who claimed that everything they were doing had been migrated to Hadoop is no longer with the company. In addition, what they wanted to talk to us about was how they could more effectively build out their data warehouse and pull in mainframe, that’s right, mainframe data. It seems that old data sources never die, and they don’t even just fade away either. In fact, very little of what this company was doing was actually happening on Hadoop. As I noted in my last blog post, Hadoop is a lot like teenage sex.
Second, I gave a talk at a trade show on how companies like Informatica were going to fill the ease-of-use gap on top of Hadoop by providing tooling so less skilled developers could also take advantage of Hadoop (for more on this topic, see my blog post titled “Dinner with my French Neighbor”). After my talk, a gentleman in his late 20s came up to me and told me that he used to work for Aster Data, which was subsequently bought by Teradata. He had recently left to join a new startup. He used to think that the data integration layer would die away because you could easily use something like Aster to handle both the analytics queries and the data integration. Then, after Aster was acquired by Teradata, he got to see an Informatica PowerCenter mapping that brought in a number of data sources and cleaned and integrated the data before moving it into Teradata. He told me that he hadn’t realized how complex real customer environments were and that there was no way they could have done all of that integration in Aster. This is pretty typical of people who are new to the data space or who are building out Hadoop-based startups. They don’t have to deal with legacy environments, so they have no idea how messy those environments are until they finally see them firsthand.
Third and last, someone from a startup I had talked to last year, which has a visual data preparation and analytics environment on top of Hadoop, sent me an email after Strata. I wasn’t at Strata, but he got my email address from one of my employees. He wanted to talk about partnering with us because their customers need to handle more sophisticated data integration jobs (connecting, cleansing, integrating, transforming, parsing, etc.) before their users can make use of the data. Just last year, this same company said they were competing with Informatica because, underneath their visualization layer, they had basic data integration transformation tools. As it turns out, basic wasn’t anywhere near enough, so they are back talking to us about a partnership.
The point is that just because we can now dump all of our data into Hadoop doesn’t mean it is integrated. If you take 10 legacy data sources plus internet data and sensor data and so on, and just dump them into Hadoop, that doesn’t make them integrated. It just makes them co-located. So while “ETL” in the classic sense will definitely change, the idea that there won’t be a data integration layer to simplify and manage the integration of all of the old and new sources of data is just silly. That layer will continue to exist; it just might use a variety of technologies, including Hadoop, underneath as a storage and processing engine.
Regardless, I am happy to see that more and more companies are realizing that today’s data world is actually getting more complicated, not less. The result: data fragmentation is only getting worse, so the future for data integration is only looking brighter.
I missed Strata this year, so I can only report back what I heard from my team. I was out on the road talking with customers while the gang was at Strata, talking to customers and prospective customers there. That said, the conversations they had with the cool new Hadoop companies and my own conversations were quite similar. Lots of talk about trials on Hadoop, but outside of the big internet firms, some startups focused on solving “big data” problems, and some Wall Street firms, most companies are still kicking the Hadoop tires.
Which reminds me of a picture my neighbor took of a presentation he saw on Hadoop. The presenter had a slide with a rehash of an old joke that went something like this (I am paraphrasing, as I don’t have the exact quote):
“Hadoop is a lot like teenage sex. Everyone says they do it, but most are not. And for those who are doing it, most of them aren’t very good at it yet.”
So if you haven’t gotten started on your Hadoop project, don’t worry, you aren’t as far behind as you think.
My wife invited my new neighbors over for dinner this past Saturday night. They are a French couple with a super cute 5-year-old son. Dinner was nice, and like most expats in the San Francisco Bay Area, he is in high tech. His company is a successful internet company in Europe but has had a hard time penetrating the U.S. market, which is why they moved to the Bay Area. He is starting up a satellite engineering organization in Palo Alto, and he asked me where he could find good “big data” engineers. He is having a hard time finding people.
This is a story I am hearing quite a bit from the customers I have been talking to as well. They want to start up big data teams but can’t find enough skilled engineers who understand how to develop in Pig or Hive or YARN or whatever is coming next in the Hadoop/MapReduce world.
This reminds me of when I worked in the telecom software business 20 years ago and everyone was looking at technologies like DCE and CORBA to build out distributed computing environments to solve complex problems that couldn’t be solved easily on a single computing system. If you don’t know what DCE or CORBA are/were, that’s OK. That is kind of the point. They were distributed computing development platforms that failed because they were too damn hard, and there just weren’t enough people who could understand how to use them effectively. Now, DCE and CORBA were not trying to solve the same problems as Hadoop, but the basic point still stands: they were damn hard, and the reality is that programming on a Hadoop platform is damn hard as well.
So could Hadoop fail, just like CORBA and DCE? I doubt it, for a few key reasons. First, there is a considerable amount of venture and industrial investment going into Hadoop to make it work. Not since Java has there been such a concerted effort by the industry to make a new technology successful. Second, much of that investment is going into graphical development environments and applications that use the storage and compute power of Hadoop but hide its complexity. That is what Informatica is doing with PowerCenter Big Data Edition. We are making it possible for data integration developers to parse, cleanse, transform, and integrate data using Hadoop as the underlying storage and processing engine, but the developer doesn’t have to know anything about Hadoop. The same thing is happening at the analytics layer, the data prep layer, and the visualization layer.
Bit by bit, software vendors are hiding the underlying complexity of Hadoop so organizations won’t have to hire an army of big data scientists to solve interesting problems. They will still need a few of them, but not so many that Hadoop will end up like those other technologies that most Hadoop developers have never even heard of.
Power to the elephant. And more later about my dinner guest and his super cute 5-year-old son.
I’ve spent the last few days in sunny London at Informatica UK’s flagship conference – Informatica Day “Put Information Potential to Work”.
Held at the Grange City on 8th October, the event welcomed over 200 delegates made up of customers, prospects, partners, and industry-leading experts. For the first time in the UK, Informatica’s Chairman and CEO, Sohaib Abbasi, provided the visionary keynote and used the event to launch Informatica Vibe™. Vibe is an embeddable data management engine that can access, aggregate, and manage any type of data. Vibe gives developers the power to map data once and deploy anywhere.
The day kicked off with Mike Ferguson, independent industry analyst and world-class speaker, providing a thought-provoking keynote looking at the Information Age landscape and how to deal with the data overload we all face as a consequence. Every seat in the conference room was taken, with standing room only, as Sohaib Abbasi took to the stage to deliver his keynote. He discussed the trends powering the new connected world and how organisations have an unprecedented opportunity to transform themselves by unleashing the true potential of information.
Following a quick coffee break, delegates broke into several streams – I had the privilege of leading the Next Generation Data Integration stream, a subject very close to my heart. To an enthusiastic and full room of existing customers and prospects keen to learn more, I discussed how data integration can become a core competency and how other companies have implemented a next generation data integration approach to reduce complexity and unleash the true potential of their information.
Stream two took on master data management and product information management. The audience heard customer presentations from ICON on improving clinical trial outcomes with MDM, and from Halfords, who tackled the business case for product MDM.
Stream three explored how organisations could lower the cost of managing data through database archiving, test data management, data privacy, and application retirement. And if that wasn’t enough, delegates could visit the exhibition hall throughout the day to find out about Smart Partitioning and MDM, question our specialists on mainframe connectivity, book onto the latest Informatica University courses, and demo the ever-popular Data Validation and Proactive Monitoring tools.
A couple of customer comments made me smile: “I am really happy to see the great progress for Informatica and what’s coming next” and “Very useful day – surprised how large the event was!”
The Edward Snowden affair has been out of the news for a few weeks, but I keep thinking about the trade-off being made between the use of data in the name of national security and the use of much of that same kind of data to deliver new services that people value. Whether you like what Snowden did or not, at least people have been talking about it. But the ability to search “metadata” about your phone calls is not so different from other kinds of information that people freely give up to be searched, whether they know it or not.
Take Facebook Graph Search as an example: you can find out a lot about people who have certain demographic characteristics and live in a specific region. All the information that people have given up for free on Facebook is now searchable unless you take active steps to hide, block, or remove that data. People publish wish lists of things they want to buy on Amazon and then share them with others. The big idea, of course, is to provide more targeted advertising to sell you things you may actually want. The exact opposite of the kind of broadcast advertising we are so used to from big events like the Super Bowl.
However, all of that information and the convenience it potentially brings comes at a price: the loss of control of that data when it comes to personal privacy. Now, there is a difference between private companies using this information and the government, since private companies don’t have the ability to put you in jail. So there isn’t exactly an equivalence between the two. But if you give away information for the convenience of commerce, it is also out there for people to use in ways you may not like.
Nevertheless, now that we can actually analyze the petabytes of data that are available, whether it is our phone calls, our friendship circles, our purchase patterns, or the movies we watch, the discussion and debate around the trade-off of using this information for more convenient commerce versus using that same information, and more, in the name of national security has only just begun.
I don’t mean to brag… OK, yes, I mean to brag. That is kind of like when people say “no pun intended” when they actually mean “pun intended.” So I admit it: I mean to brag. Informatica is in the Leaders quadrant of the Gartner Data Integration Magic Quadrant for the 7th year in a row. I don’t have enough fingers on my right hand to count that high! Pretty damn good.
But don’t take my word for it. If you want a FREE, yes, that’s right, a FREE copy of the Gartner Magic Quadrant report, just click here to download the Magic Quadrant report and check it out for yourself. Did I mention that it is FREE?
And do you know what else is FREE? PowerCenter Express! So after you read the Gartner Magic Quadrant report and it inspires you to try out Informatica’s market-leading data integration platform, for FREE, just click on this link to Try PowerCenter Express for FREE.
Informatica is the only vendor in the Leaders quadrant that is confident enough to let you download our product and just try it out for FREE. The other “leaders” are afraid to let you do that. Sounds like Informatica is the only real leader in the Leaders quadrant… but that is just my opinion. And you don’t have to take my word for it, because you can Try PowerCenter Express for FREE.
So enjoy the FREE Gartner report and our FREE entry level data integration product. I think that is enough FREE stuff for one day.
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
In my last blog post on the Vibe virtual data machine (VDM), I wrote about the history of Vibe. Now I will cover, at a high level, what is in the Vibe VDM and a little about how it works.
The Informatica Vibe virtual data machine is a data management engine that knows how to ingest data and then very efficiently transform, cleanse, manage, or combine it with other data. It is the core engine that drives the Informatica Platform. You can’t buy the Vibe VDM standalone; it comes with every version of Informatica PowerCenter as well as other products like our federation services, PowerCenter Big Data Edition for Hadoop, Informatica Data Quality, and the Informatica Cloud products.
The Vibe VDM works by receiving a set of instructions that describe: the data source(s) from which it will extract data; the rules and flow by which that data will be transformed, analyzed, masked, archived, matched, or cleansed; and ultimately where that data will be loaded when the processing is finished.
The instruction set is generated by creating a graphical mapping of the data flow along with the transformation and data cleansing logic that is part of that flow. The graphical mapping is then converted into code that Vibe interprets as its instruction set. One other important thing to know about Vibe is that it most often runs as a standalone engine on Linux, UNIX, or Windows. However, it also runs directly on Hadoop, and when it is used as part of the Informatica Cloud products, it is a key component of the on-premise agent that is controlled and managed by the Informatica Cloud.
Lastly, the Vibe VDM is available as an SDK that can be embedded into an application. So instead of moving data to a data integration engine for processing, you can move the engine to the data. Embedding a VDM into an application is the same idea as building an application on an application server. One way to think about Vibe is as a very use-case-specific application server built specifically for handling the data integration and data quality aspects of an application.
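To make that idea a little more concrete, here is a toy sketch in Python of what a declarative instruction set and an embedded engine that interprets it might look like. To be clear, every name and structure here is invented for illustration; this is not the actual Vibe API, just the shape of the idea:

```python
# A toy sketch of the concept, not the real Vibe API: the "mapping" is pure
# data describing source, transformation flow, and target, and a small
# embeddable engine interprets it. All names here are invented.

mapping = {
    "source": "orders",
    "transformations": [
        {"op": "trim", "field": "customer"},       # cleanse: strip whitespace
        {"op": "upper", "field": "country"},       # standardize country codes
        {"op": "drop_if_null", "field": "email"},  # filter incomplete rows
    ],
    "target": "orders_clean",
}

class MiniEngine:
    """A toy embeddable engine that interprets a mapping spec."""

    def __init__(self, catalog):
        self.catalog = catalog  # named in-memory "data sources"

    def run(self, mapping):
        rows = list(self.catalog[mapping["source"]])
        for t in mapping["transformations"]:
            f = t["field"]
            if t["op"] == "trim":
                rows = [{**r, f: r[f].strip()} for r in rows]
            elif t["op"] == "upper":
                rows = [{**r, f: r[f].upper()} for r in rows]
            elif t["op"] == "drop_if_null":
                rows = [r for r in rows if r.get(f)]
        self.catalog[mapping["target"]] = rows
        return rows

# Because the engine is an ordinary object, it can be embedded in an
# application and moved to the data, instead of moving data to the engine.
catalog = {"orders": [
    {"customer": "  alice ", "country": "us", "email": "a@example.com"},
    {"customer": " bob ",    "country": "fr", "email": ""},
]}
print(MiniEngine(catalog).run(mapping))
# -> [{'customer': 'alice', 'country': 'US', 'email': 'a@example.com'}]
```

The point of the sketch is that the mapping is pure metadata: the application embeds the engine and hands it instructions, rather than hand-coding the data flow.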
Vibe consists of a number of fundamental components (see Figure below):
Transformation Library: This is a collection of useful, prebuilt transformations that the engine calls to combine, transform, cleanse, match, and mask data. For those familiar with PowerCenter or Informatica Data Quality, this library is represented by the icons that the developer can drag and drop onto the canvas to perform actions on data.
Optimizer: The Optimizer compiles data processing logic into an internal representation to ensure effective resource usage and efficient run times based on data characteristics and execution environment configuration.
Executor: This is a run-time execution engine that orchestrates the data logic using the appropriate transformations. The engine reads/writes data from an adapter or directly streams the data from an application. The executor can physically move data or can present results via data virtualization.
Connectors: Informatica’s connectivity extensions provide data access from various data sources. This is what allows Informatica Platform users to connect to almost any data source or application for use by a variety of data movement technologies and modes, including batch, request/response, and publish/subscribe.
Vibe Software Development Kit (SDK): While not shown in the diagram above, Vibe provides APIs and extensions that allow third parties to add new connectors as well as transformations, so developers are not limited to what ships with the platform.
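For readers who think in code, here is a rough, hypothetical sketch of how these four components might relate to one another. These class names and methods are invented for illustration only; the real engine internals are of course far more sophisticated:

```python
# Hypothetical illustration of the component roles described above --
# not actual Vibe internals. All names are invented.

class TransformationLibrary:
    """Prebuilt operations the engine can call (the drag-and-drop icons)."""
    def get(self, name):
        return {"uppercase": str.upper, "trim": str.strip}[name]

class Optimizer:
    """Compiles the logical plan into a physical plan. Here it just drops
    no-op steps; a real optimizer would reorder, push down, and parallelize
    work based on data characteristics and the execution environment."""
    def compile(self, plan):
        return [step for step in plan if step != "noop"]

class Connector:
    """Provides access to a data source; here, just an in-memory list."""
    def __init__(self, records):
        self.records = records
    def read(self):
        yield from self.records

class Executor:
    """Run-time engine that orchestrates the plan using the library."""
    def __init__(self, library):
        self.library = library
    def run(self, plan, connector):
        for record in connector.read():
            for step in plan:
                record = self.library.get(step)(record)
            yield record

# Wire the pieces together.
plan = Optimizer().compile(["trim", "noop", "uppercase"])
executor = Executor(TransformationLibrary())
print(list(executor.run(plan, Connector(["  alice ", " bob "]))))
# -> ['ALICE', 'BOB']
```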
Hopefully this brief overview helps you understand a little more about what Vibe is all about. If you have questions, post them below and either I or one of the Informatica team members will respond so you can understand how Vibe is going to energize the data integration industry.
For those of you hanging out at Informatica World, this is not news. For those of you who aren’t in Vegas with us, you missed the unveiling of the world’s best entry level data integration platform. So you heard it here second, not first. Next time, if you want to hear about this kind of stuff first, you have to show up at Informatica World! <shameless plug for INFAWorld 2013 complete>
So, what is it that I am bragging about? PowerCenter Express, that’s what. This is the latest addition to the Informatica PowerCenter family of products, specifically designed for entry-level data integration and data profiling. This product will be downloadable over the Internet and installs in as little as 5 minutes. It is super simple to use but has all of the rich transformation functionality you are used to from Informatica. Also, you don’t have to install a separate profiling product; everything is self-contained. The product comes with built-in “cheat sheets” that walk you through how to use it step by step. In addition, there is complete documentation as well as video-based tutorials.
But best of all, PC Express delivers the kind of product quality you are accustomed to from Informatica. What does that mean? It means that unlike most of the entry-level data integration products available for download, PC Express just works. It doesn’t crash just because your ETL job requires more memory than you have on your machine; it gracefully caches to disk.
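I won’t pretend the following is how PC Express actually does it, but the general technique behind that kind of graceful degradation is well known: spill sorted runs to disk when a memory budget is exceeded, then stream a merge over the runs. Here is a minimal Python sketch with an invented memory budget and deliberately simple file handling:

```python
# A sketch of the general spill-to-disk technique (not PC Express
# internals): when a sort outgrows its memory budget, it writes sorted
# runs to temp files and streams a k-way merge, instead of crashing.
# Assumes each input line ends with "\n", as when reading from a file.
import heapq, os, tempfile

def external_sort(lines, max_in_memory=100_000):
    runs, buffer = [], []
    for line in lines:
        buffer.append(line)
        if len(buffer) >= max_in_memory:   # memory budget hit: spill to disk
            runs.append(_spill(sorted(buffer)))
            buffer = []
    if buffer:
        runs.append(_spill(sorted(buffer)))
    files = [open(path) for path in runs]
    try:
        yield from heapq.merge(*files)     # streaming merge of sorted runs
    finally:
        for f in files:
            f.close()
            os.unlink(f.name)

def _spill(sorted_lines):
    with tempfile.NamedTemporaryFile("w", delete=False, suffix=".run") as run:
        run.writelines(sorted_lines)
        return run.name
```

You would call it as something like `external_sort(open("big_file.txt"))` (a hypothetical file name) and iterate over the merged result without ever holding the whole dataset in memory.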
But wait, there’s more. For the first time ever, Informatica is offering a FREE version of our market leading PowerCenter product. There will be two versions of PowerCenter Express:
- PowerCenter Express Personal Edition – available for FREE for a single developer at a time
- PowerCenter Express Professional Edition – available for $8K/user per year subscription (at the time of this blog post)
And one last important point: PC Express is based on the same virtual data machine as our enterprise-class products and our cloud-based products. This means that at some later date, if you decide you need more scalability, more users, or enterprise-class features like high availability, you can easily migrate from PC Express to the other Informatica data integration product lines.
So if you are at Informatica World, you will be receiving an email outlining how you can download and try out PowerCenter Express. If you aren’t at Informatica World, maybe you have a friend who will share the secret website location where you can get a sneak peek at PowerCenter Express. If you don’t have any friends who went to Informatica World, well, you will just have to wait until the download site goes public in July. And next time you will know that you had better go to Informatica World if you want early access to cool stuff.
This blog post will be the first in a series about something you will be hearing more about in the near future from Informatica: the Vibe™ virtual data machine, or VDM for short.
So what is a virtual data machine? A virtual data machine (VDM) is an embeddable data management engine that accesses, aggregates and manages data.
Now that you understand what a VDM is, what is Vibe? Vibe is simply the branded name for Informatica’s virtual data machine.
With that out of the way, here is a little more background on the history of the Vibe Virtual Data Machine for your reading pleasure:
The History of the Virtual Data Machine
Since the founding of Informatica Corporation 20 years ago, we have always had a philosophy of separating the development of data integration from the actual run-time implementation. This is what Informatica means when we say that the Informatica® PowerCenter® data integration product is metadata driven. The term “metadata driven” means that a developer does not have to know C, C++, or Java to perform data integration. The developer operates in a graphical development environment using drag-and-drop tools to visualize how data will move from system A, then be combined with data from system B, and then ultimately be cleansed and transformed when it finally arrives at system C. At the most detailed level of the development process, you might see icons representing data sets, and lines representing relationships coming out of those data sets going into other data sets, with descriptions of how that data is transformed along the way.
Figure 1: Informatica Developer drag-and-drop graphical development environment
However, you do not see code, just the metadata describing how the data will be modified along the way. The idea is that a person who is knowledgeable about data integration concepts, but is not necessarily a software developer, can develop data integration jobs to convert raw data into high-quality information that allows organizations to put their data potential to work. The implication is that far more people are able to develop data integration jobs because through the use of graphical tools, we have “democratized” data integration development.
Over time, however, data integration has become more complicated. It has moved from just being extract, transform, and load (ETL) for batch movement of data to also include data quality, real-time data, data virtualization, and now Hadoop. In addition, the integration process can be deployed both on premise and in the cloud. As data integration has become more complex, it has forced a blended approach that often requires many or most of the capabilities and approaches just mentioned, while the mix and match of underlying technologies keeps expanding.
This entire time, Informatica has continued to separate the development environment from the underlying data movement and transformation technology. Why is this separation so important? Because as new data integration approaches come along, with new deployment models like software as a service (SaaS), new technologies such as Hadoop, and new languages such as Pig and Hive, and even yet-to-be-invented languages, existing data integration developers don’t have to learn the details of how the new technology works in order to take advantage of it. In addition, the pace at which the underlying technologies in the data integration and management market are changing keeps increasing. As this pace quickens, separating development from deployment means end users can continue to design and develop using the same interface and, under the covers, take advantage of new kinds of data movement and transformation engines to virtualize data, move it in batch, move it in real time, or integrate big data, without having to learn the details of the underlying language, system, or framework.
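Here is a small, purely illustrative sketch of why that separation pays off: the same mapping, defined once as metadata, handed to two different execution backends. The engines below are toys I made up for this post; the real ones obviously work very differently:

```python
# Invented illustration of "design once, deploy anywhere" -- not real
# Informatica code. The mapping is pure metadata; only the engine changes.

mapping = ["trim", "uppercase"]   # the logical design, engine-agnostic

class LocalEngine:
    """Runs the mapping in-process, record by record."""
    def execute(self, mapping, records):
        ops = {"trim": str.strip, "uppercase": str.upper}
        out = []
        for record in records:
            for step in mapping:
                record = ops[step](record)
            out.append(record)
        return out

class PretendClusterEngine:
    """Stands in for a distributed backend (think: translating the same
    metadata into Hive or Pig jobs); here it just simulates partitions."""
    def execute(self, mapping, records):
        partitions = [records[::2], records[1::2]]   # fake data partitioning
        local = LocalEngine()
        results = [local.execute(mapping, p) for p in partitions]
        # Like real distributed execution, output order may differ.
        return [r for part in results for r in part]

data = ["  alice ", " bob ", " carol  "]
for engine in (LocalEngine(), PretendClusterEngine()):
    print(type(engine).__name__, engine.execute(mapping, data))
```

The developer’s artifact, the mapping, never changes; only the engine underneath it does. That is the essence of the separation described above.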
Hopefully that gives you a good intro to the history of the VDM. In my next blog installment, I will write a little more about the basics of the Vibe VDM and how it works. So stay tuned, same Vibe time, same Vibe channel.