Well, it’s been a little over a week since the Strata conference so I thought I should give some perspective on what I learned. I think it was summed up at my first meeting, on the first morning of the conference. The meeting was with a financial services company that has significant experience with Hadoop. The first words out of their mouths were, “Hadoop is hard.”
Later in the conference, after a Western Union representative spoke about their Hadoop deployment, they were mobbed by end user questions and comments. The audience was thrilled to hear about an actual operational deployment: Not just a sandbox deployment, but an actual operational Hadoop deployment from a company that is over 160 years old.
The market is crossing the chasm from early adopters who love to hand code (and the macho culture of proving they can do the hard stuff) to more mainstream companies that want to use technology to solve real problems. These mainstream companies aren’t afraid to admit that it is still hard. For the early adopters, nothing is ever hard. They love hard. But the mainstream market doesn’t view it that way. They don’t want to mess around in the bowels of enabling technology. They want to use the technology to solve real problems. The comment from the financial services company represents the perspective of the vast majority of organizations. It is a sign Hadoop is hitting the mainstream market.
More proof we have moved to a new phase? Cloudera announced they were going from shipping six versions a year down to just three. I have been saying for a while that we will know that Hadoop is real when the distribution vendors stop shipping every two months and go to a more typical enterprise software release schedule. It isn’t that Hadoop engineering efforts have slowed down. It is still evolving very rapidly. It is just that real customers are telling the Hadoop suppliers that they won’t upgrade as fast because they have real business projects running and they can’t do it. So for those of you who are disappointed by the “slow down,” don’t be. To me, this is news that Hadoop is reaching critical mass.
Technology is closing the gap to allow organizations to use Hadoop as a platform without having to actually have an army of Hadoop experts. That is what Informatica does for data parsing, data integration, data quality and data lineage (recent product announcement). In fact, the number one demo at the Informatica booth at Strata was the demonstration of “end to end” data lineage for data, going from the original source all the way to how it was loaded and then transformed within Hadoop. This is purely an enterprise-class capability that becomes more interesting and important when you actually go into true production.
Informatica’s goal is to hide the complexity of Hadoop so companies can get on with the work of using the platform with the skills they already have in house. And from what I saw from all of the start-up companies that were doing similar things for data exploration and analytics and all the talk around the need for governance, we are finally hitting the early majority of the market. So, for those of you who still drop down to the underlying UNIX OS that powers a Mac, the rest of us will keep using the GUI. To the extent that there are “fit for purpose” GUIs on top of Hadoop, the technology will get used by a much larger market.
So congratulations Hadoop, you have officially crossed the chasm!
P.S. See me on theCUBE talking about a similar topic at: youtu.be/oC0_5u_0h2Q
Ah yes, the Old Mainframe. It just won’t go away. Which means there is still valuable data sitting in it. And that leads to a question that I have been asked about repeatedly in the past few weeks, about why an organization should use a tool like Informatica PowerExchange to extract data from a mainframe when you can also do it with a script that extracts the data as a flat file.
So below, thanks to Phil Line, Informatica’s Product Manager for Mainframe connectivity, are the top ten reasons to use PowerExchange over hand coding a flat file extraction.
1) Data will be “fresh” as of the time the data is needed – not already old based on when the extraction was run.
2) Any data extracted directly from files will be as the file held it; any additional processes that need to run in order to extract or transfer data to LUW could potentially alter the original formats.
3) The consuming application can get the data when it needs it; there wouldn’t be any scheduling issues between creating the extract file and then being able to use it.
4) There is less work to do if PowerExchange reads the data directly from the mainframe; data type processing as well as potential code page issues are all handled by PowerExchange.
5) Unlike any files created with ftp type processes, where problems could cut short the expected data transfer, PowerExchange/PowerCenter provide log messages so as to ensure that all data has been processed.
6) The consumer can select only the data that is needed for the consuming application; filtering can reduce the amount of data being transferred as well as any potential security exposure.
7) Any access to mainframe-based data can be secured according to the security tools in place on the mainframe; PowerExchange is fully compliant with the RACF, ACF2 & Top Secret security products.
8) Using Informatica’s PowerExchange, along with Informatica consuming tools (PowerCenter, Mercury etc.) provides a much simpler and cleaner architecture. The simpler the architecture the easier it is to find problems as well as audit the processes that are touching the data.
9) PowerExchange generally can help avoid the normal bottlenecks associated with getting data off of the mainframe: programmers are not needed to create the extract processes, new schedules don’t need to be created to ensure that the extracts run, and if changes become necessary they can be controlled by the business group consuming the data.
10) Helps control mainframe data extraction processes that are still being run even though no one uses the generated data, because the system that originally requested the data has become obsolete.
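To give a feel for what reason 4 means in practice, here is a minimal sketch of the data type and code page work a hand-coded flat file extract has to do on the receiving side. It is not how PowerExchange works internally; the record layout (an 8-byte EBCDIC name followed by a 3-byte packed decimal amount), the `cp037` code page and the function names are all assumptions made for illustration.

```python
from decimal import Decimal


def decode_text(raw: bytes, codepage: str = "cp037") -> str:
    """Decode an EBCDIC text field and strip the trailing pad spaces."""
    return raw.decode(codepage).rstrip()


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed decimal (COMP-3) field.

    Each byte holds two decimal digits, except the last byte, which holds
    one digit plus a sign nibble (0xD means negative).
    """
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)
    sign_nibble = raw[-1] & 0x0F

    value = int("".join(str(d) for d in digits))
    if sign_nibble == 0x0D:
        value = -value
    # Shift the decimal point left by `scale` places.
    return Decimal(value).scaleb(-scale)


# Hypothetical 11-byte record: name (8 bytes, EBCDIC) + amount (3 bytes, COMP-3).
record = "ACME    ".encode("cp037") + b"\x12\x34\x5D"

name = decode_text(record[:8])
amount = unpack_comp3(record[8:], scale=2)
print(name, amount)
```

Every field in every copybook needs this kind of handling, and a wrong code page or a misread sign nibble silently produces bad data rather than an error.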
While the rest of you may not get that excited about the latest and greatest Gartner Magic Quadrant report, we sure get excited about it around here. And with good reason. Once again, for the 8th year in a row, if I am not mistaken, Informatica is in the leader’s quadrant of the Gartner Magic Quadrant for Data Integration Tools and we are positioned highest in ability to execute and furthest in completeness of vision within the leaders quadrant. So I have to say, I am pretty excited about that because we believe it speaks to our vision and execution among vendors…. And that’s the fact jack.
So if you still don’t get why we are so excited, I will just quote Navin R Johnson (from the movie: The Jerk) who, upon seeing his name published in the phone book stated, “This is the kind of spontaneous publicity – your name in print – that makes people. I’m in print! Things are going to start happening to me now.” (If you want to see Navin’s reaction click here)
Well, things are already happening for Informatica. And more importantly, for our customers who are using our market leading data integration platform to accelerate their mission critical data projects whether they are on premise, in the cloud or even on Hadoop. But don’t take my word for it. Download and read the latest Gartner Magic Quadrant for Data Integration Tools report for yourself, and find out why we are once again in the leadership quadrant.
Now I am going to take the rest of the day off, to celebrate!
Disclaimer – Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
Nine years ago when I started in the data integration and quality space, data quality was all about algorithms and cleansing technology. Data went in, and the “best” solution was the one that could do the best job of fuzzy matching the data and cleaning more data than the other products. Of course, no data quality solution could clean 100% of the data so “exceptions” were dumped into a file that was left as an “exercise for the user” to deal with on their own. This usually meant using the data management product of choice, when there is nothing else…. Data went into a spreadsheet, and then users would remediate the mistakes by hand in the spreadsheet. Then someone would write an SQL query to write the corrections back into the database. In the end, managing the exceptions was a very manual process with very little to no governance to the process.
The problem with this of course is that for very many companies, data stewardship is not the person’s day job. So if they have to spend time checking to see if someone else has corrected an error in the data, or getting approval to make a data change, or spending time then consolidating all the manual changes they made and then communicating those changes to management, then they don’t have much time left to sleep, much less eat. In the end, the business of creating quality data just doesn’t get done, or doesn’t get done well. Data quality is a business issue, supported by IT, but the business facing part of the solution has been missing.
But that is about to change. Informatica already provides the most scalable data quality product for handling the automated portion of the data quality process. And now, in the latest release of Informatica Data Quality 9.6, we have created a new edition called the Data Quality Governance Edition to fully manage the exception process. This edition provides a completely governed process for managing remediation of data exceptions by business data stewards. It allows organizations to create their own customized process with different levels of review. Additionally, it makes it possible for business users to create their own data quality rules, describing the rules in plain language…. no coding necessary.
And of course, every organization wants to be able to track how they are improving. And Informatica Data Quality 9.6 includes embeddable dashboards that show the progress of how data quality is improving and impacting the business in a positive way.
Great data isn’t an accident. Great data happens by design. And for the first time, data cleansing has been combined with a holistic data stewardship process, allowing business and IT to collaborate to create quality data that supports critical business processes.
Over the last 40 years, data has become increasingly distributed. It used to all sit on storage connected to a mainframe. It used to be that the application of computing power to solve business problems was limited by the availability of CPU, memory, network and disk. Those limitations are no longer big inhibitors. Data fragmentation is now the new inhibitor to business agility. Data is now generated from distributed data sources not just within a corporation, but from business partners, from device sensors and from consumers Facebook-ing and tweeting away on the internet.
So to solve any interesting business problem in today’s fragmented data world, you now have to pull data together from a wide variety of data sources. That means business agility 100% depends on data integration agility. But how do you deliver that agility in a way that is not just fast, but reliable, and delivers high quality data?
First, to achieve data integration agility, you need to move from a traditional waterfall development process to an agile development process.
Second, if you need reliability, you have to think about how you start treating your data integration process as a critical business process. That means thinking about how you will make your integration processes highly available. It also means you need to monitor and validate your operational data integration processes on an ongoing basis. The good news is that the capabilities you need for data validation as well as operational monitoring and alerting for your data integration process are now built into Informatica’s newest PowerCenter Edition, PowerCenter Premium Edition.
Lastly, the days where you can just move data from A to B without including a data quality process are over. Great data doesn’t happen by accident, it happens by design. And that means you also have to build data quality directly into your data integration process.
Great businesses depend on great data. And great data means data that is delivered on time, with confidence and with high quality. So think about how your understanding of data integration and great data can make your career. Great businesses depend on great data and people like you who have the skills to make a difference. As a data professional, the time has never been better for you to make a contribution to the greatness of your organization. You have the opportunity to make a difference and have an impact because your skills and your understanding of data integration have never been more critical.
When I was seven years old, Danny Weiss had a birthday party where we played the telephone game. The idea is this: there are 8 people sitting around a table, and the first person tells the next person a little story. They tell the next person the story, and so on, all the way around the table. At the end of the game, you compare the original story that the first person told to the story the 8th person tells. Of course, the stories are very different and everyone giggles hysterically… we were seven years old after all.
The reason I was thinking about this story is that data integration development is about as inefficient as a seven-year-old’s birthday party. The typical process is that a business analyst, using the knowledge in their head about the business applications they are responsible for, creates a spreadsheet in Microsoft Excel that has a list of database tables and columns along with a set of business rules for how the data is to be transformed as it is moved to a target system (a data warehouse or another application). The spreadsheet, which is never checked against real data, is then passed to a developer who creates code in a separate system in order to move the data. That code is checked by a QA person, and the result is checked again by the business analyst at the end of the process. This is the first time the business analyst verifies their specification against real data.
99 times out of 100, the data in the target system doesn’t match what the business analyst was expecting. Why? Either the original specification was wrong because the business analyst had a typo or the data is inaccurate. Or the data in the original system wasn’t organized the way the analyst thought it was organized. Or the developer misinterpreted the spreadsheet. Or the business analyst simply doesn’t need this data anymore – he needs some other data. The result is lots of errors, just like the telephone game. And the only way to fix it is with rework and then more rework.
But there is a better way. What if the data analyst could validate their specification against real data and self correct on the fly before passing the specification to the developer? What if the specification were not just a specification, but a prototype that could be passed directly to the developer who wouldn’t recode it, but would just modify it to add scalability and reliability? The result is much less rework and much faster time to development. In fact, up to 5 times faster.
That is what Agile Data integration is all about. Rapid prototyping and self-validation against real data up front by the business analyst. Sharing of results in a common toolset back and forth to the developer to improve the accuracy of communication.
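As a rough illustration of what “self-validation against real data” can look like, here is a minimal sketch in which the analyst’s specification is written as executable checks and run against a sample of source rows before anything is handed to a developer. This is not the Informatica tooling; the column names, the checks and the sample rows are all hypothetical.

```python
from datetime import datetime


def is_int(value: str) -> bool:
    """True if the value parses as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def is_iso_date(value: str) -> bool:
    """True if the value is a YYYY-MM-DD date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# The "spreadsheet" expressed as executable expectations (hypothetical columns).
SPEC = {"customer_id": is_int, "signup_date": is_iso_date}


def validate(spec, rows):
    """Return (row number, column, problem) for every row that breaks the spec."""
    problems = []
    for row_no, row in enumerate(rows, start=1):
        for column, check in spec.items():
            if column not in row:
                problems.append((row_no, column, "missing column"))
            elif not check(row[column]):
                problems.append((row_no, column, f"unexpected value {row[column]!r}"))
    return problems


# A small sample pulled from the real source system (made up here).
sample = [
    {"customer_id": "1001", "signup_date": "2013-11-02"},
    {"customer_id": "10O2", "signup_date": "02/11/2013"},
]

for problem in validate(SPEC, sample):
    print(problem)
```

The second row fails both checks, which is exactly the kind of surprise that otherwise only shows up at the end of the telephone game.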
Because we believe the agile process is so important to your success, Informatica is giving all of our PowerCenter Standard Edition (and higher editions) customers agile data integration for FREE!!! That’s right, if you are a current customer of Informatica PowerCenter, we are giving you the tools you need to go from the old-fashioned, error-prone, waterfall, telephone game style of development to a modern 21st century Agile process.
• FREE rapid prototyping and data profiling for the data analyst.
• Go from prototype to production with no recoding.
• Better communication and better collaboration between analyst and developer
PowerCenter 9.6. Agile Data Integration built in. No more telephone game. It doesn’t get any better than that.
Ah, the new year is almost upon us, so here are Todd’s 2014 predictions:
1) Real companies, not just internet advertising based businesses and startups, will stop just talking about big data and actually implement real solutions focused on 2 areas:
- Data warehouse offloading where companies take all of that data that is just sitting in their enterprise data warehouses and move the data they aren’t accessing into Hadoop where they will now preprocess the data in Hadoop and put that output into the EDW where they then report on it. The big driver for this? Cost savings.
- Predictive analytics based on collecting, integrating, cleansing and analyzing large amounts of sensor data. The data has always been available, it is just that analyzing it en masse has always been problematic. The big driver for this? Operational efficiency of the systems that are being monitored as well as feedback analysis to build the next generation of efficient systems.
That said, even more companies will still be talking about big data than are actually doing it. There is still a long learning curve.
2) Organizations will start thinking about how to transition their data management infrastructure from a cost center into a profit center. As more companies identify ways they can take existing data about their customers, products, suppliers, partners etc., they will start identifying ways they can generate new revenue streams by repackaging this data into information products.
3) Data quality will continue to stink. This one is a sure thing. It never ceases to amaze me how people think that either their data isn’t bad, or that they can’t do anything about it or that thanks to big data, they don’t have to worry about data quality because of the law of large numbers. Laugh if you want at that last one on the law of large numbers, but I have heard that story at least three different times this year. Just for clarification for those of you who think that more data means that the dirty data becomes a statistical anomaly, it only means that you have the same percentage of dirty data as before…. You just have more of it.
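The arithmetic behind that last point is simple enough to sketch. The 5% dirty rate and the row counts below are made-up numbers, used only to show that volume changes the count of bad rows and not the percentage.

```python
from fractions import Fraction


def dirty_rows(total_rows: int, dirty_rate: Fraction) -> int:
    """Number of dirty rows, given a fixed dirty rate."""
    return int(total_rows * dirty_rate)


rate = Fraction(5, 100)  # assumed: 5% of incoming rows are dirty

for total in (1_000_000, 100_000_000):
    bad = dirty_rows(total, rate)
    # The share stays the same; only the absolute count grows.
    print(f"{total:>11,} rows -> {bad:>9,} dirty ({bad / total:.0%})")
```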
4) More about big data… There still won’t be enough people who understand Hadoop. There will be lots of vendors (including Informatica) creating cool new tools so average developers and even business users can integrate, cleanse and analyze data on Hadoop without having to know anything about Hadoop. The hype might die down, but this will actually be an even more exciting area.
5) More business self service will come to the data integration space. With the proliferation of data, it is impossible for IT to service all of the integration needs of the business. So the only solution will be the growth of self-service integration capabilities that let business users and shadow IT do integration on their own. While this has existed already, the big change will be that corporate IT will start offering these services to their internal customers so departments can do their own integration but within a framework that is managed and supported by corporate IT. It is the very beginning of this trend, but expect to see IT start to get more control by giving control to their business users.
6) The Pittsburgh Steelers will make it back into the playoffs….. and for a 2015 prediction, my beloved Steelers will however not make it to the Super Bowl, but my adopted home team, the Santa Clara :) 49ers, will make it to the Super Bowl and will win.
Happy New Year everyone.
Last year around this time, I wrote a blog about how the death of ETL was exaggerated. Time to revisit the topic briefly given a couple of interesting events that happened in the past few weeks.
First, one of the companies that had a senior executive who had claimed that ETL and the data integration layer were dead came by to visit. It turns out that the bold executive who claimed that everything they were doing had been migrated to Hadoop is no longer with that company. In addition, the thing they wanted to talk to us about was how they can more effectively build out the data warehouse and pull in mainframe, that’s right, mainframe data. It seems that old data sources never die, and they don’t even just fade away either. In fact, very little of what this company was doing was actually happening on Hadoop. Like I noted in my last blog, Hadoop is a lot like teenage sex.
Second, I gave a talk at a trade show on how new companies like Informatica were going to fill the ease of use gap on top of Hadoop by providing tooling so less skilled developers could also take advantage of Hadoop (for more on this topic, please check back on my blog titled “Dinner with my French Neighbor”). After my talk, a gentleman in his late 20’s came up to me and told me that he used to work for Aster Data, which was subsequently bought by Teradata. He had recently left to join a new startup. He used to think that the data integration layer would die away because you could easily use something like Aster to handle both the analytics queries and the data integration. Then after Aster was acquired by Teradata, he got to see an Informatica PowerCenter mapping that brought in a number of data sources, cleaned and integrated the data before moving it into Teradata. He told me that he hadn’t realized how complex real customer environments were and that there was no way that they could have done all of that integration in Aster. This is pretty typical of people who are new to the data space or who are building out Hadoop based startups. They don’t have to deal with legacy environments so they have no idea how messy they are until they finally see them first hand.
Third and last, someone from a startup company that I had talked to last year which has a visual data preparation and analytics environment on top of Hadoop sent me an email after Strata. I wasn’t at Strata, but he got my email address from one of my employees. He wanted to talk about partnering with us because their customers need to be able to handle more sophisticated data integration jobs (connecting, cleansing, integrating, transforming, parsing etc.) before their users can make use of the data. Only last year, this same company said that they were competing with Informatica because underneath their visualization layer, they had basic data integration transformation tools. As it turns out, basic wasn’t anywhere near enough so they are back talking to us about a partnership.
The point is that just because we can now dump all of our data into Hadoop, doesn’t mean it is integrated. If you take 10 legacy data sources plus internet data and sensor data and so on, and just dump it into Hadoop, it doesn’t make it integrated. It just makes it collocated. So while “ETL” in the classic sense will definitely change, the idea that there won’t be a data integration layer that exists to simplify and manage the integration of all of the old and new sources of data is just silly. That layer will continue to exist, it just might use a variety of technologies, including Hadoop, underneath as a storage and processing engine.
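A toy example makes the collocated versus integrated distinction concrete. The two sources, their field names and the key format below are invented for illustration; the point is only that putting records in one place does not reconcile them.

```python
# Two hypothetical sources describing the same customer in different shapes.
crm = [{"cust_id": "00042", "name": "Acme Corp"}]
billing = [{"customer": "42", "balance": 1250}]

# Collocated: everything sits in one place, but nothing lines up yet.
collocated = crm + billing


def normalize_key(raw: str) -> int:
    """Reconcile the two key formats (zero-padded string vs plain string)."""
    return int(raw.strip())


def integrate(crm_rows, billing_rows):
    """Merge both sources into one record per customer, keyed on the normalized id."""
    merged = {}
    for row in crm_rows:
        merged.setdefault(normalize_key(row["cust_id"]), {})["name"] = row["name"]
    for row in billing_rows:
        merged.setdefault(normalize_key(row["customer"]), {})["balance"] = row["balance"]
    return merged


print(len(collocated), "collocated records")
print(integrate(crm, billing))
```

Even in this tiny case someone has to know that `cust_id` and `customer` mean the same thing and that the leading zeros are padding. Multiply that by 10 legacy sources and you have the job of the integration layer.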
Regardless, I am happy to see that more and more companies are realizing that today’s data world is actually getting more complicated, not less complicated. The result: data fragmentation is only getting worse, so the future for data integration is only looking brighter.
I missed Strata this year, so I can only report back what I heard from my team. I was out on the road talking with customers while the gang was at Strata, talking to customers and prospective customers. That said, the conversations they had with cool new Hadoop companies and my conversations were quite similar. Lots of talk about trials on Hadoop, but outside of the big internet firms, some startups that are focused on solving “big data” problems and some Wall Street firms, most companies are still kicking the Hadoop tires.
Which reminds me of a picture my neighbor took of a presentation that he saw on Hadoop. The presenter had a slide with a rehash of an old joke that went something like this (I am paraphrasing here as I don’t have the exact quote):
“Hadoop is a lot like teenage sex. Everyone says they do it, but most are not. And for those who are doing it, most of them aren’t very good at it yet.”
So if you haven’t gotten started on your Hadoop project, don’t worry, you aren’t as far behind as you think.
My wife invited my new neighbors over for dinner this past Saturday night. They are a French couple with a super cute 5 year old son. Dinner was nice, and like most ex-pats in the San Francisco Bay Area, he is in high tech. His company is a successful internet company in Europe, but has had a hard time penetrating the U.S. market which is why they moved to the Bay Area. He is starting up a satellite engineering organization in Palo Alto and he asked me where he can find good “big data” engineers. He is having a hard time finding people.
This is a story that I am hearing quite a bit with customers that I have been talking to as well. They want to start up big data teams, but can’t find enough skilled engineers who understand how to develop in PIG or HIVE or YARN or whatever is coming next in the Hadoop/map reduce world.
This reminds me of when I used to work in the telecom software business 20 years ago and everyone was looking at technologies like DCE and CORBA to build out distributed computing environments to solve complex problems that couldn’t be solved easily on a single computing system. If you don’t know what DCE or CORBA are/were, that’s OK. It is kind of the point. They are distributed computing development platforms that failed because they were too damn hard and there just weren’t enough people who could understand how to use them effectively. Now DCE and CORBA were not trying to solve the same problems as Hadoop, but the basic point still stands, they were damn hard and the reality is that programming on a Hadoop platform is damn hard as well.
So could Hadoop fail, just like CORBA and DCE? I doubt it, for a few key reasons. First, there is a considerable amount of venture and industrial investment going into Hadoop to make it work. Not since Java has there been such a concerted effort by the industry to try to make a new technology successful. Second, much of that investment is in providing graphical development environments and applications that use the storage and compute power of Hadoop, but hide its complexity. That is what Informatica is doing with PowerCenter Big Data Edition. We are making it possible for data integration developers to parse, cleanse, transform and integrate data using Hadoop as the underlying storage and engine. But the developer doesn’t have to know anything about Hadoop. The same thing is happening at the analytics layer, at the data prep layer and at the visualization layer.
Bit by bit, software vendors are hiding the underlying complexity of Hadoop so organizations won’t have to hire an army of big data scientists to solve interesting problems. They will still need a few of them, but not so many that Hadoop will end up like those other technologies that most Hadoop developers have never even heard of.
Power to the elephant. And more later about my dinner guest and his super cute 5 year old son.