Nine years ago, when I started in the data integration and quality space, data quality was all about algorithms and cleansing technology. Data went in, and the “best” solution was the one that could do the best job of fuzzy matching the data and cleaning more data than the other products. Of course, no data quality solution could clean 100% of the data, so “exceptions” were dumped into a file that was left as an “exercise for the user” to deal with on their own. This usually meant using the data management product of choice when there is nothing else: the spreadsheet. Data went into a spreadsheet, and users would remediate the mistakes there by hand. Then someone would write an SQL query to write the corrections back into the database. In the end, managing the exceptions was a very manual process with little to no governance.
The problem with this, of course, is that at many companies data stewardship is not the steward’s day job. So if they have to spend time checking whether someone else has already corrected an error in the data, getting approval to make a data change, and then consolidating all the manual changes they made and communicating those changes to management, they don’t have much time left to sleep, much less eat. The business of creating quality data just doesn’t get done, or doesn’t get done well. In the end, data quality is a business issue, supported by IT, but the business-facing part of the solution has been missing.
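To make that old workflow concrete, here is a minimal sketch in Python. It is not how any particular data quality product works: the reference list, the 0.8 threshold and the function names are all assumptions made up for illustration, and the matching uses the standard library's `difflib` rather than a real cleansing engine. Values that match well enough are corrected automatically; everything else lands in an exceptions list for a human to sort out.

```python
from difflib import SequenceMatcher

# Hypothetical reference list of known-good company names.
REFERENCE = ["Informatica", "Teradata", "Microsoft"]


def best_match(value, candidates):
    """Return (candidate, score) for the candidate most similar to value."""
    scored = [
        (c, SequenceMatcher(None, value.lower(), c.lower()).ratio())
        for c in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


def cleanse(records, threshold=0.8):
    """Split records into auto-corrected values and exceptions for a steward."""
    cleaned, exceptions = [], []
    for raw in records:
        match, score = best_match(raw, REFERENCE)
        if score >= threshold:
            cleaned.append(match)
        else:
            # Not confident enough to fix automatically: route to a person.
            exceptions.append(
                {"value": raw, "closest": match, "score": round(score, 2)}
            )
    return cleaned, exceptions


cleaned, exceptions = cleanse(["Informatca", "teradata", "Acme Corp"])
print(cleaned)
print(exceptions)
```

The interesting part is the `exceptions` list. In the old world it ended up in a file, and then in a spreadsheet, with nobody tracking who fixed what.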
But that is about to change. Informatica already provides the most scalable data quality product for handling the automated portion of the data quality process. And now, in the latest release of Informatica Data Quality 9.6, we have created a new edition called the Data Quality Governance Edition to fully manage the exception process. This edition provides a completely governed process for managing remediation of data exceptions by business data stewards. It allows organizations to create their own customized process with different levels of review. Additionally, it makes it possible for business users to create their own data quality rules, describing the rules in plain language…. no coding necessary.
And of course, every organization wants to be able to track how they are improving. And Informatica Data Quality 9.6 includes embeddable dashboards that show the progress of how data quality is improving and impacting the business in a positive way.
Great data isn’t an accident. Great data happens by design. And for the first time, data cleansing has been combined with a holistic data stewardship process, allowing business and IT to collaborate to create quality data that supports critical business processes.
Over the last 40 years, data has become increasingly distributed. It used to all sit on storage connected to a mainframe. It used to be that the application of computing power to solve business problems was limited by the availability of CPU, memory, network and disk. Those limitations are no longer big inhibitors. Data fragmentation is now the new inhibitor to business agility. Data is now generated from distributed data sources not just within a corporation, but from business partners, from device sensors and from consumers Facebook-ing and tweeting away on the internet.
So to solve any interesting business problem in today’s fragmented data world, you now have to pull data together from a wide variety of data sources. That means business agility depends 100% on data integration agility. But how do you deliver that agility in a way that is not just fast, but reliable, and delivers high quality data?
First, to achieve data integration agility, you need to move from a traditional waterfall development process to an agile development process.
Second, if you need reliability, you have to think about how you start treating your data integration process as a critical business process. That means thinking about how you will make your integration processes highly available. It also means you need to monitor and validate your operational data integration processes on an ongoing basis. The good news is that the capabilities you need for data validation as well as operational monitoring and alerting for your data integration process are now built into Informatica’s newest PowerCenter Edition, PowerCenter Premium Edition.
Lastly, the days when you could just move data from A to B without including a data quality process are over. Great data doesn’t happen by accident; it happens by design. And that means you also have to build data quality directly into your data integration process.
Great businesses depend on great data. And great data means data that is delivered on time, with confidence and with high quality. So think about how your understanding of data integration and great data can make your career. Great businesses depend on great data and people like you who have the skills to make a difference. As a data professional, the time has never been better for you to make a contribution to the greatness of your organization. You have the opportunity to make a difference and have an impact because your skills and your understanding of data integration have never been more critical.
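As a rough illustration of what validating an integration process on an ongoing basis can mean, here is a minimal Python sketch that reconciles a source and a target after a load. It is not PowerCenter functionality; the function names, the `id` key and the sample rows are hypothetical, and real validation would run against databases rather than in-memory lists.

```python
import hashlib


def row_fingerprint(row):
    """Hash a row's values in column-name order so column order does not matter."""
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source_rows, target_rows, key):
    """Report keys missing from the target and keys whose contents differ."""
    source = {row[key]: row_fingerprint(row) for row in source_rows}
    target = {row[key]: row_fingerprint(row) for row in target_rows}
    missing = sorted(set(source) - set(target))
    changed = sorted(
        k for k in source.keys() & target.keys() if source[k] != target[k]
    )
    return {
        "source_count": len(source),
        "target_count": len(target),
        "missing": missing,
        "changed": changed,
    }


# Hypothetical sample data: one row never arrived, one arrived altered.
source = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}]
target = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Rob"}]
report = reconcile(source, target, key="id")
print(report)
```

A check like this only earns its keep when it runs after every load and alerts someone when `missing` or `changed` is not empty.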
When I was seven years old, Danny Weiss had a birthday party where we played the telephone game. The idea is this: there are 8 people sitting around a table, and the first person tells the next person a little story. That person tells the next person the story, and so on, all the way around the table. At the end of the game, you compare the original story that the first person told to the story the 8th person tells. Of course, the stories are very different and everyone giggles hysterically… we were seven years old after all.
The reason I was thinking about this story is that data integration development is about as inefficient as a seven-year-old’s birthday party. The typical process is that a business analyst, using the knowledge in their head about the business applications they are responsible for, creates a spreadsheet in Microsoft Excel that has a list of database tables and columns along with a set of business rules for how the data is to be transformed as it is moved to a target system (a data warehouse or another application). The spreadsheet, which is never checked against real data, is then passed to a developer who creates code in a separate system to move the data. That code is checked by a QA person, and the result is checked again by the business analyst at the end of the process. This is the first time the business analyst verifies their specification against real data.
99 times out of 100, the data in the target system doesn’t match what the business analyst was expecting. Why? Either the original specification was wrong because the business analyst had a typo or the data is inaccurate. Or the data in the original system wasn’t organized the way the analyst thought it was organized. Or the developer misinterpreted the spreadsheet. Or the business analyst simply doesn’t need this data anymore – he needs some other data. The result is lots of errors, just like the telephone game. And the only way to fix it is with rework and then more rework.
But there is a better way. What if the data analyst could validate their specification against real data and self-correct on the fly before passing the specification to the developer? What if the specification were not just a specification, but a prototype that could be passed directly to the developer, who wouldn’t recode it, but would just modify it to add scalability and reliability? The result is much less rework and much faster development. In fact, up to 5 times faster.
That is what Agile Data integration is all about. Rapid prototyping and self-validation against real data up front by the business analyst. Sharing of results in a common toolset back and forth to the developer to improve the accuracy of communication.
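Here is a minimal Python sketch of validating a specification against real data up front. It is not the Informatica toolset: the spec format, the column names and the sample rows are all hypothetical. The point is only that the analyst's rules are executable, so they can be run against a sample before anyone writes production code.

```python
# Hypothetical mapping specification: target column -> (source column, transform).
SPEC = {
    "customer_id": ("cust_no", int),
    "signup_year": ("signup_date", lambda s: int(s[:4])),
    "country": ("ctry_cd", str.upper),
}


def validate_spec(spec, sample_rows):
    """Apply every rule to every sample row and collect the problems found."""
    problems = []
    for row_number, row in enumerate(sample_rows, start=1):
        for target, (source, transform) in spec.items():
            if source not in row:
                problems.append(
                    (row_number, target, f"missing source column {source!r}")
                )
                continue
            try:
                transform(row[source])
            except (ValueError, TypeError, AttributeError) as exc:
                problems.append(
                    (row_number, target, f"cannot transform {row[source]!r}: {exc}")
                )
    return problems


# Hypothetical sample pulled from the real source system.
SAMPLE = [
    {"cust_no": "1001", "signup_date": "2013-04-02", "ctry_cd": "us"},
    {"cust_no": "A-17", "signup_date": "2012-11-30", "ctry_cd": "fr"},
    {"cust_no": "1003", "signup_date": "2011-01-15"},
]

problems = validate_spec(SPEC, SAMPLE)
for row_number, target, message in problems:
    print(f"row {row_number}, {target}: {message}")
```

Both problems here, a non-numeric customer number and a missing country column, are exactly the kind of surprise that otherwise shows up only at the end of the telephone game.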
Because we believe the agile process is so important to your success, Informatica is giving all of our PowerCenter Standard Edition (and higher editions) customers agile data integration for FREE!!! That’s right, if you are a current customer of Informatica PowerCenter, we are giving you the tools you need to go from the old-fashioned, error-prone, waterfall, telephone-game style of development to a modern 21st century Agile process.
• FREE rapid prototyping and data profiling for the data analyst.
• Go from prototype to production with no recoding.
• Better communication and better collaboration between analyst and developer.
PowerCenter 9.6. Agile Data Integration built in. No more telephone game. It doesn’t get any better than that.
Ah, the new year is almost upon us, so here are Todd’s 2014 predictions:
1) Real companies, not just internet advertising based businesses and startups, will stop just talking about big data and actually implement real solutions focused on 2 areas:
- Data warehouse offloading, where companies take all of that data that is just sitting in their enterprise data warehouses (EDW) and move the data they aren’t accessing into Hadoop. They will then preprocess the data in Hadoop and put that output into the EDW, where they report on it. The big driver for this? Cost savings.
- Predictive analytics based on collecting, integrating, cleansing and analyzing large amounts of sensor data. The data has always been available, it is just that analyzing it en masse has always been problematic. The big driver for this? Operational efficiency of the systems that are being monitored as well as feedback analysis to build the next generation of efficient systems.
That said, even more companies will still be talking about big data than are actually doing it. There is still a long learning curve.
2) Organizations will start thinking about how to transition their data management infrastructure from a cost center into a profit center. As more companies identify ways they can take existing data about their customers, products, suppliers, partners etc., they will start identifying ways they can generate new revenue streams by repackaging this data into information products.
3) Data quality will continue to stink. This one is a sure thing. It never ceases to amaze me how people think that either their data isn’t bad, or that they can’t do anything about it or that thanks to big data, they don’t have to worry about data quality because of the law of large numbers. Laugh if you want at that last one on the law of large numbers, but I have heard that story at least three different times this year. Just for clarification for those of you who think that more data means that the dirty data becomes a statistical anomaly, it only means that you have the same percentage of dirty data as before…. You just have more of it.
4) More about big data… There still won’t be enough people who understand Hadoop. There will be lots of vendors (including Informatica) creating cool new tools so average developers and even business users can integrate, cleanse and analyze data on Hadoop without having to know anything about Hadoop. The hype might die down, but this will actually be an even more exciting area.
5) More business self service will come to the data integration space. With the proliferation of data, it is impossible for IT to service all of the integration needs of the business. So the only solution will be the growth of self-service integration capabilities that let business users and shadow IT do integration on their own. While this has existed already, the big change will be that corporate IT will start offering these services to their internal customers so departments can do their own integration but within a framework that is managed and supported by corporate IT. It is the very beginning of this trend, but expect to see IT start to get more control by giving control to their business users.
6) The Pittsburgh Steelers will make it back into the playoffs… and for a 2015 prediction, my beloved Steelers will however not make it to the Super Bowl, but my adopted home team, the Santa Clara 49ers, will make it to the Super Bowl and will win.
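Back on prediction 3 for a moment: the law of large numbers point is easy to check with a little arithmetic. The sketch below is Python, and the 5% error rate is a made-up number purely for illustration. Scaling the data up by a thousand leaves the dirty percentage exactly where it was and multiplies the number of dirty rows by a thousand.

```python
def dirty_rows(total_rows, error_rate):
    """Number of dirty rows when a fixed fraction of all rows is bad."""
    return round(total_rows * error_rate)


ERROR_RATE = 0.05  # hypothetical: 5% of rows are dirty

for total in (1_000_000, 1_000_000_000):
    bad = dirty_rows(total, ERROR_RATE)
    print(f"{total:>13,} rows -> {bad:>11,} dirty rows ({bad / total:.0%})")
```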
Happy New Year everyone.
Last year around this time, I wrote a blog about how the death of ETL was exaggerated. Time to revisit the topic briefly given a couple of interesting events that happened in the past few weeks.
First, one of the companies who had a senior executive that had claimed that ETL and the data integration layer were dead came by to visit. It turns out that the bold executive who claimed that everything they were doing had been migrated to Hadoop is no longer with that company. In addition, the thing they wanted to talk to us about was how they can more effectively build out the data warehouse and pull in mainframe, that’s right, mainframe data. It seems that old data sources never die, and they don’t even just fade away either. In fact, very little of what this company was doing was actually happening on Hadoop. Like I noted in my last blog, Hadoop is a lot like teenage sex.
Second, I gave a talk at a trade show on how new companies like Informatica were going to fill the ease of use gap on top of Hadoop by providing tooling so less skilled developers could also take advantage of Hadoop (for more on this topic, please check back on my blog titled “Dinner with my French Neighbor”). After my talk, a gentleman in his late 20s came up to me and told me that he used to work for Aster Data, which was subsequently bought by Teradata. He had recently left to join a new startup. He used to think that the data integration layer would die away because you could easily use something like Aster to handle both the analytics queries and the data integration. Then after Aster was acquired by Teradata, he got to see an Informatica PowerCenter mapping that brought in a number of data sources, cleaned and integrated the data before moving it into Teradata. He told me that he hadn’t realized how complex real customer environments were and that there was no way that they could have done all of that integration in Aster. This is pretty typical of people who are new to the data space or who are building out Hadoop-based startups. They don’t have to deal with legacy environments so they have no idea how messy they are until they finally see them first hand.
Third and last, someone from a startup company that I had talked to last year, which has a visual data preparation and analytics environment on top of Hadoop, sent me an email after Strata. I wasn’t at Strata, but he got my email address from one of my employees. He wanted to talk about partnering with us because their customers need to be able to handle more sophisticated data integration jobs (connecting, cleansing, integrating, transforming, parsing etc.) before their users can make use of the data. Only last year, this same company said that they were competing with Informatica because underneath their visualization layer, they had basic data integration transformation tools. As it turns out, basic wasn’t anywhere near enough so they are back talking to us about a partnership.
The point is that just because we can now dump all of our data into Hadoop, doesn’t mean it is integrated. If you take 10 legacy data sources plus internet data and sensor data and so on, and just dump it into Hadoop, it doesn’t make it integrated. It just makes it collocated. So while “ETL” in the classic sense will definitely change, the idea that there won’t be a data integration layer that exists to simplify and manage the integration of all of the old and new sources of data is just silly. That layer will continue to exist, it just might use a variety of technologies, including Hadoop, underneath as a storage and processing engine.
Regardless, I am happy to see that more and more companies are realizing that today’s data world is actually getting more complicated, not less complicated. The result: data fragmentation is only getting worse, so the future for data integration is only looking brighter.
I missed Strata this year, so I can only report back what I heard from my team. I was out on the road talking with customers while the gang was at Strata, talking to customers and prospective customers. That said, the conversations they had with cool new Hadoop companies and my own conversations were quite similar. Lots of talk about trials on Hadoop, but outside of the big internet firms, some startups that are focused on solving “big data” problems and some Wall Street firms, most companies are still kicking the Hadoop tires.
Which reminds me of a picture my neighbor took of a presentation that he saw on Hadoop. The presenter had a slide with a rehash of an old joke that went something like this (I am paraphrasing here as I don’t have the exact quote):
“Hadoop is a lot like teenage sex. Everyone says they do it, but most are not. And for those who are doing it, most of them aren’t very good at it yet.”
So if you haven’t gotten started on your Hadoop project, don’t worry, you aren’t as far behind as you think.
My wife invited my new neighbors over for dinner this past Saturday night. They are a French couple with a super cute 5 year old son. Dinner was nice, and like most ex-pats in the San Francisco Bay Area, he is in high tech. His company is a successful internet company in Europe, but has had a hard time penetrating the U.S. market, which is why they moved to the Bay Area. He is starting up a satellite engineering organization in Palo Alto and he asked me where he can find good “big data” engineers. He is having a hard time finding people.
This is a story that I am hearing quite a bit from customers that I have been talking to as well. They want to start up big data teams, but can’t find enough skilled engineers who understand how to develop in Pig or Hive or YARN or whatever is coming next in the Hadoop/MapReduce world.
This reminds me of when I used to work in the telecom software business 20 years ago and everyone was looking at technologies like DCE and CORBA to build out distributed computing environments to solve complex problems that couldn’t be solved easily on a single computing system. If you don’t know what DCE or CORBA are/were, that’s OK. It is kind of the point. They are distributed computing development platforms that failed because they were too damn hard and there just weren’t enough people who could understand how to use them effectively. Now DCE and CORBA were not trying to solve the same problems as Hadoop, but the basic point still stands, they were damn hard and the reality is that programming on a Hadoop platform is damn hard as well.
So could Hadoop fail, just like CORBA and DCE? I doubt it, for a few key reasons. First, there is a considerable amount of venture and industrial investment going into Hadoop to make it work. Not since Java has there been such a concerted effort by the industry to try to make a new technology successful. Second, much of that investment is in providing graphical development environments and applications that use the storage and compute power of Hadoop, but hide its complexity. That is what Informatica is doing with PowerCenter Big Data Edition. We are making it possible for data integration developers to parse, cleanse, transform and integrate data using Hadoop as the underlying storage and engine. But the developer doesn’t have to know anything about Hadoop. The same thing is happening at the analytics layer, at the data prep layer and at the visualization layer.
Bit by bit, software vendors are hiding the underlying complexity of Hadoop so organizations won’t have to hire an army of big data scientists to solve interesting problems. They will still need a few of them, but not so many that Hadoop will end up like those other technologies that most Hadoop developers have never even heard of.
Power to the elephant. And more later about my dinner guest and his super cute 5 year old son.
I’ve spent the last few days in sunny London at Informatica UK’s flagship conference – Informatica Day “Put Information Potential to Work”.
The event was held at the Grange City on 8th October, and we welcomed over 200 delegates made up of customers, prospects, partners and industry leading experts. For the first time in the UK, Informatica’s Chairman and CEO Sohaib Abbasi provided the visionary keynote and used the event to launch Informatica Vibe™ in the UK. Vibe is an embeddable data management engine that can access, aggregate, and manage any type of data. Vibe gives developers the power to map data once and deploy anywhere.
The day kicked off with Mike Ferguson, independent Industry Analyst and world-class speaker, providing a thought-provoking Keynote presentation looking into the Information Age landscape and how to deal with the data overload we all face as a consequence. Every seat in the conference room was taken with standing room only as Sohaib Abbasi took to the stage to deliver his keynote. He discussed the trends powering the new connected world and how organisations have an unprecedented opportunity to transform themselves by unleashing the true potential of information.
Following a quick coffee break, delegates broke into several streams – I had the privilege of leading the Next Generation Data Integration stream, a subject very close to my heart. To an enthusiastic and full room of existing customers and prospects keen to learn more, I discussed how data integration can become a core competency and how other companies have implemented a next generation data integration approach to reduce complexity and unleash the true potential of their information.
Stream two took on master data management and product information management. The audience heard customer presentations from ICON on improving clinical trial outcomes with MDM, and from Halfords who tackled the business case for product MDM.
Stream three explored how organisations could lower their costs of managing data through database archiving, test data management, data privacy and application retirement. And if that wasn’t enough, delegates could visit the exhibition hall throughout the day to find out about Smart Partitioning, MDM, question our specialists on mainframe connectivity, book on the latest Informatica University courses and demo the ever popular Data Validation and Proactive Monitoring tools.
A couple of customer comments that made me smile, “I am really happy to see the great progress for Informatica and what’s coming next”; “Very useful day – surprised how large the event was!”
The Edward Snowden affair has been out of the news for a few weeks, but I keep thinking about the trade-off that is being made around the use of data in the name of national security vs the use of much of that same kind of data for the delivery of new services that people value. Whether you like what Snowden did or not, at least people have been talking about it. But the ability to search “metadata” about your phone calls is not so different from other kinds of information that people freely give up to be searched, whether they know it or not.
Take Facebook graph search as an example. You can find out a lot of information about people who have certain demographic characteristics who live in a specific region. All information that people have given up for free in Facebook is now searchable, unless you take active action to hide, block or remove that data. People publish their wish lists of things they want to buy on Amazon and then share them with others. The big idea is of course to provide more targeted advertising to sell you things you may actually want. The exact opposite of the kind of broadcast advertising we are so used to from big events like the Super Bowl.
However, all of that information and the convenience that it potentially brings comes with a price, which is the loss of control of that data when it comes to personal privacy. Now there is a difference between private companies using this information and the government, since private companies don’t have the ability to put you in jail. So there isn’t exactly an equivalency between the two. But if you give away information for the convenience of commerce, it is also out there for people to use in ways that you may not like.
Nevertheless, with the ability to actually analyze the petabytes of data that are now available, whether it is our phone calls, our friendship circles, our purchase patterns or the movies we watch, the discussion and debate around the tradeoff of using this information for more convenient commerce vs the use of that same information and more in the name of national security has only just begun.
I don’t mean to brag…. OK, yes, I mean to brag. That is kind of like when people say, “no pun intended” they actually mean “pun intended”. So I admit it, I mean to brag. Informatica is in the leaders quadrant of the Gartner Data Integration Magic Quadrant for the 7th year in a row. I don’t have enough fingers on my right hand to count that high! Pretty damn good.
But don’t take my word for it. If you want a FREE, yes, that’s right, a FREE copy of the Gartner Magic Quadrant report, just click here to download the Magic Quadrant Report and check out the report for yourself. Did I mention that it is FREE?
And do you know what else is FREE? PowerCenter Express! So after you read the Gartner Magic Quadrant report and it inspires you to want to try out Informatica’s market leading data integration platform, for FREE, just click on this link to Try PowerCenter Express for FREE.
Informatica is the only vendor in the leader’s quadrant that is confident enough to let you download our product and just try it out for FREE. The other “leaders” are afraid to let you do that. Sounds like Informatica is the only real leader in the leader’s quadrant… but that is just my opinion. And you don’t have to take my word for it because you can Try PowerCenter Express for FREE.
So enjoy the FREE Gartner report and our FREE entry level data integration product. I think that is enough FREE stuff for one day.
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.