Category Archives: Data Integration Platform
Unlike some of my friends, history was a subject in high school and college that I truly enjoyed. I particularly appreciated biographies of favorite historical figures because they put a human face on the past and gave it meaning and color. I also vowed at that time to navigate my life and future by the principle attributed to Harvard professor Jorge Agustín Nicolás Ruiz de Santayana y Borrás, better known as George Santayana: “Those who cannot remember the past are condemned to repeat it.”
So that’s a little ditty about my history with history.
Fast-forwarding to the present: I have carved out my career in technology, and in particular enterprise software, which affords me a great platform for talking with lots of IT and business leaders. When I do, I usually ask them, “How are you implementing advanced projects that help the business become more agile, effective, or opportunistically proactive?” They usually answer something along the lines of “this is the age and renaissance of data science and analytics” and then end up talking exclusively about their meat-and-potatoes business intelligence projects and how 300 reports now run their business.
Then, when I probe and hear their answers in more depth, I am once again reminded of THE history quote and think to myself that there’s an amusing irony at play here. Most of the Business Intelligence systems of today are designed to “remember” and report on the historical past through large data warehouses holding a gazillion transactions, along with basic but numerous shipping and billing histories and maybe assorted support records.
But when it comes right down to it, business intelligence “history” is still just that: history. Nothing is really learned and applied right when and where it counts – when it would have made all the difference had the company been able to react in time.
So, in essence, by using standalone BI systems as they are designed today, companies are indeed condemned to repeat the past, because the lessons arrive too late to act on – and the same mistakes get repeated again and again.
This means the challenge for BI is to reduce latency, measure the pertinent data, sensors, and events, and get scalable – extremely scalable, and flexible enough to handle the volume and variety of the forthcoming data onslaught.
There’s a part 2 to this story, so keep an eye out for my next blog post, History Repeats Itself (Part 2).
Last year around this time, I wrote a blog about how the death of ETL was exaggerated. Time to revisit the topic briefly, given a few interesting events that happened over the past several weeks.
First, one of the companies whose senior executive had claimed that ETL and the data integration layer were dead came by to visit. It turns out that the bold executive who claimed that everything they were doing had been migrated to Hadoop is no longer with the company. What’s more, the thing they wanted to talk to us about was how they could more effectively build out their data warehouse and pull in mainframe, that’s right, mainframe data. It seems that old data sources never die, and they don’t just fade away either. In fact, very little of what this company was doing was actually happening on Hadoop. As I noted in my last blog, Hadoop is a lot like teenage sex.
Second, I gave a talk at a trade show on how companies like Informatica are going to fill the ease-of-use gap on top of Hadoop by providing tooling so that less skilled developers can also take advantage of it (for more on this topic, please check back on my blog titled “Dinner with my French Neighbor”). After my talk, a gentleman in his late 20s came up to me and told me that he used to work for Aster Data, which was subsequently bought by Teradata, and that he had recently left to join a new startup. He used to think that the data integration layer would die away because you could easily use something like Aster to handle both the analytics queries and the data integration. Then, after Aster was acquired by Teradata, he got to see an Informatica PowerCenter mapping that brought in a number of data sources and cleansed and integrated the data before moving it into Teradata. He told me that he hadn’t realized how complex real customer environments were and that there was no way they could have done all of that integration in Aster. This is pretty typical of people who are new to the data space or who are building out Hadoop-based startups: they don’t have to deal with legacy environments, so they have no idea how messy they are until they finally see them firsthand.
Third and last, someone from a startup I had talked to last year, which has a visual data preparation and analytics environment on top of Hadoop, sent me an email after Strata. I wasn’t at Strata, but he got my email address from one of my employees. He wanted to talk about partnering with us because their customers need to handle more sophisticated data integration jobs (connecting, cleansing, integrating, transforming, parsing, etc.) before their users can make use of the data. Only last year, this same company said they were competing with Informatica because, underneath their visualization layer, they had basic data integration and transformation tools. As it turns out, basic wasn’t anywhere near enough, so they are back talking to us about a partnership.
The point is that just because we can now dump all of our data into Hadoop doesn’t mean it is integrated. If you take 10 legacy data sources plus internet data and sensor data and so on, and just dump them into Hadoop, that doesn’t make them integrated. It just makes them collocated. So while “ETL” in the classic sense will definitely change, the idea that there won’t be a data integration layer to simplify and manage the integration of all of the old and new sources of data is just silly. That layer will continue to exist; it just might use a variety of technologies, including Hadoop, underneath as a storage and processing engine.
Regardless, I am happy to see that more and more companies are realizing that today’s data world is actually getting more complicated, not less. The result: data fragmentation is only getting worse, so the future for data integration is only looking brighter.
My wife invited my new neighbors over for dinner this past Saturday night. They are a French couple with a super cute 5-year-old son. Dinner was nice, and like most expats in the San Francisco Bay Area, he is in high tech. His company is a successful internet company in Europe but has had a hard time penetrating the U.S. market, which is why they moved to the Bay Area. He is starting up a satellite engineering organization in Palo Alto, and he asked me where he could find good “big data” engineers; he is having a hard time finding people.
This is a story I am hearing quite a bit from the customers I have been talking to as well. They want to start up big data teams but can’t find enough skilled engineers who understand how to develop in Pig or Hive or YARN or whatever is coming next in the Hadoop/MapReduce world.
This reminds me of when I worked in the telecom software business 20 years ago, when everyone was looking at technologies like DCE and CORBA to build out distributed computing environments to solve complex problems that couldn’t easily be solved on a single computing system. If you don’t know what DCE or CORBA are, or were, that’s OK; that is kind of the point. They were distributed computing development platforms that failed because they were too damn hard, and there just weren’t enough people who could understand how to use them effectively. Now, DCE and CORBA were not trying to solve the same problems as Hadoop, but the basic point still stands: they were damn hard, and the reality is that programming on a Hadoop platform is damn hard as well.
So could Hadoop fail, just like CORBA and DCE? I doubt it, for a few key reasons. First, there is a considerable amount of venture and industrial investment going into Hadoop to make it work. Not since Java has there been such a concerted effort by the industry to make a new technology successful. Second, much of that investment is going into graphical development environments and applications that use the storage and compute power of Hadoop but hide its complexity. That is what Informatica is doing with PowerCenter Big Data Edition: we are making it possible for data integration developers to parse, cleanse, transform, and integrate data using Hadoop as the underlying storage and engine, without the developer having to know anything about Hadoop. The same thing is happening at the analytics layer, the data preparation layer, and the visualization layer.
Bit by bit, software vendors are hiding the underlying complexity of Hadoop so organizations won’t have to hire an army of big data scientists to solve interesting problems. They will still need a few of them, but not so many that Hadoop will end up like those other technologies that most Hadoop developers have never even heard of.
Power to the elephant. And more later about my dinner guest and his super cute 5-year-old son.
Everyone knows that Informatica is the Data Integration company that helps organizations connect their disparate software into a cohesive and synchronized enterprise information system. The value to the business is enormous and well documented in the form of use cases, ROI studies, and industry-leading loyalty and renewal rates.
Event Processing, on the other hand, is a technology that has been around for only a few years now and has yet to reach Main Street in Systems City, IT. But if you look at how event processing is being used, it’s amazing that more people haven’t heard about it. The idea at its core (pun intended) is very simple: monitor your data and events – the things that happen on a daily, hourly, even minute-by-minute basis – look for important patterns that are positive or negative indicators, and then set up your systems to take action automatically when those patterns come up – like notifying a sales rep when a pattern indicates a customer is ready to buy, or stopping a transaction because your company is about to be defrauded.
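To make that pattern-and-action loop concrete, here is a minimal sketch in plain Python (not any particular event-processing product, and with invented field names and thresholds) that watches a stream of payment events and reacts to a hypothetical fraud pattern: several high-value charges on the same card within a short window.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # look-back window for the pattern
THRESHOLD = 3                   # how many large charges within the window trigger action
HIGH_VALUE = 1000.00            # what counts as a "high-value" charge (illustrative)

recent = defaultdict(deque)     # card_id -> timestamps of recent high-value charges

def on_event(event):
    """Inspect one payment event and take action if the fraud pattern appears."""
    if event["amount"] < HIGH_VALUE:
        return
    card, ts = event["card_id"], event["timestamp"]
    charges = recent[card]
    charges.append(ts)
    # Drop timestamps that have aged out of the look-back window.
    while charges and ts - charges[0] > WINDOW:
        charges.popleft()
    if len(charges) >= THRESHOLD:
        # Hypothetical downstream actions: hold the transaction, alert a human.
        print(f"HOLD transaction on {card} for {event['amount']:.2f}")
        print(f"ALERT fraud team: {len(charges)} large charges on {card} within {WINDOW}")

# Feed a burst of hypothetical events through the detector.
now = datetime.now()
for i in range(3):
    on_event({"card_id": "card-42", "amount": 2500.0,
              "timestamp": now + timedelta(seconds=10 * i)})
```

A real deployment would run this logic continuously against live feeds and route the actions into operational systems rather than printing them, but the monitor-match-act loop itself really is that simple.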
Since this is an Informatica blog, you probably have a decent set of “muscles” in place already, so why, you ask, would you need 6-pack abs? Because 6-pack abs are a good indication of a strong muscular core and the basis of a stable, highly athletic body. The parallel holds for companies: in today’s competitive business environment, you need strength, stability, and agility to compete. And since IT systems increasingly ARE the business, if your company isn’t performing as strong, lean, and mean as possible, you can be sure your competitors will be looking to implement every advantage they can.
You may also be thinking: why would you need something like Event Processing when you already have good Business Intelligence systems in place? The reality is that it’s not easy to monitor and measure useful but sometimes hidden data, event, sensor, and social media sources, or to discern which patterns have meaning and which turn out to be false alarms. But the real difference is that BI usually reports to you after the fact, when the value of acting on the situation has diminished significantly.
So while muscles are important for standing up and running, and good quality, strong muscles are necessary for heavy lifting, it’s those 6-pack abs on top of it all that make you the lean, mean fighting machine that can identify significant threats and opportunities in your data and, in essence, better compete and win.
Marketing is changing how we leverage data. In the past, we made only rudimentary use of data to understand how marketing campaigns affect demand. Today, we focus on the customer. That shift is pushing those in marketing to get good at data, and good at data integration. Data points supporting this shift are beginning to appear, as are clear and well-defined links between data integration and marketing.
There is no better data point than Yesmail Interactive’s recent survey of 100 senior-level marketers at companies with online and offline sales models, and $10 million to more than $1 billion in revenues. My good friend, Loraine Lawson, outlined this report in a recent blog.
The resulting report, “Customer Lifecycle Engagement: Imperatives for mid-to-large companies,” (link requires sign up) shows many midsize and large B2C “marketers lack the data and technology they need for more effective segmentation.”
The report lists a few proof points:
- 86 percent of marketers say they could generate more revenue from customers if they had access to a more complete picture of customer attributes.
- 34 percent cited both poor data quality and fragmented systems as among the most significant barriers to personalized customer communications.
- On a similar note, only 46 percent were satisfied with data quality.
- 48 percent were satisfied with their web analytics integration.
- 47 percent were satisfied with their customer data integration.
- 41 percent of marketers incorporate web browsing and online behavior data in targeting criteria—although one-third said they plan to leverage this source in the future.
- Only 20 percent augment in-house customer data with third-party data at the customer level.
- Only 24 percent augment customer data at an aggregate level (such as the industry or region). Compare that to 58 percent who say they either purchase or plan to purchase third-party data to augment customer records, primarily to “validate data integrity.”
Considering this data, it’s pretty easy to draw the conclusion that those in marketing don’t have access to the customer data they need to do their jobs effectively. Thus, those in enterprise IT who support marketing should take steps to leverage the right data integration processes and technologies to give marketing access to the necessary analytical data.
The report includes a list of key recommendations, all of which center around four key strategic imperatives:
- Marketing data must shift from stagnant data silos to real-time data access.
- Marketing data must shift from campaign-centric to customer-centric.
- Marketing data must shift from non-integrated multichannel to integrated multichannel.
- Marketing must connect analytics, strategy, and the creative.
In case you have not noticed, carrying out these recommendations requires a sound focus on data integration, as well as higher-end analytical systems, which will typically leverage big data technologies. For those in marketing, the effective use of customer and other data is key to understanding their marketplace, which in turn is key to focusing marketing efforts and creating demand. The links between marketing and data integration are stronger than ever.
I don’t mean to brag… OK, yes, I mean to brag. That is kind of like when people say “no pun intended” when they actually mean “pun intended.” So I admit it, I mean to brag: Informatica is in the Leaders quadrant of the Gartner Data Integration Magic Quadrant for the 7th year in a row. I don’t have enough fingers on my right hand to count that high! Pretty damn good.
But don’t take my word for it. If you want a FREE, yes, that’s right, a FREE copy of the Gartner Magic Quadrant report, just click here to download it and check it out for yourself. Did I mention that it is FREE?
And do you know what else is FREE? PowerCenter Express! So after you read the Gartner Magic Quadrant report and it inspires you to try out Informatica’s market-leading data integration platform, for FREE, just click on this link to Try PowerCenter Express for FREE.
Informatica is the only vendor in the Leaders quadrant that is confident enough to let you download our product and just try it out for FREE. The other “leaders” are afraid to let you do that. Sounds like Informatica is the only real leader in the Leaders quadrant… but that is just my opinion. And you don’t have to take my word for it, because you can Try PowerCenter Express for FREE.
So enjoy the FREE Gartner report and our FREE entry level data integration product. I think that is enough FREE stuff for one day.
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
In the Information Age we live and work in, where it’s hard to go even one day without a Google search, where do you turn for insights that can help you solve work challenges and progress your career? This is a tough question. How can we deal with the challenges of information overload – which some have called information pollution?
In my last blog post on the Vibe Virtual Data Machine (VDM), I wrote about the history of Vibe. Now I will cover, at a high level, what is in the Vibe Virtual Data Machine and a little bit about how it works.
The Informatica Vibe virtual data machine is a data management engine that knows how to ingest data and then very efficiently transform, cleanse, manage, or combine it with other data. It is the core engine that drives the Informatica Platform. You can’t buy the Vibe VDM standalone; it comes with every version of Informatica PowerCenter as well as other products such as our federation services, PowerCenter Big Data Edition for Hadoop, Informatica Data Quality, and the Informatica Cloud products.
The Vibe VDM works by receiving a set of instructions that describe the data source(s) from which it will extract data, the rules and flow by which that data will be transformed, analyzed, masked, archived, matched, or cleansed, and ultimately where that data will be loaded when the processing is finished.
The instruction set is generated by creating a graphical mapping of the data flow, along with the transformation and data cleansing logic that is part of that flow. The graphical mapping is then converted into code that Vibe interprets as its instruction set. One other important thing to know about Vibe is that it most often runs as a standalone engine on Linux, UNIX, or Windows. However, it also runs directly on Hadoop, and when it is used as part of the Informatica Cloud set of products, it is a key component of the on-premises agent that is controlled and managed by the Informatica Cloud.
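As a purely hypothetical illustration of the idea (this is not the actual format PowerCenter or Vibe uses, and the names are invented), a declarative flow description might conceptually look like this: a source, an ordered list of named transformation steps, and a target.

```python
# Hypothetical, simplified flow description (not an actual Vibe/PowerCenter format).
customer_flow = {
    "source": {"connector": "jdbc", "table": "CRM.CUSTOMERS"},
    "steps": ["trim_names", "standardize_address", "mask_ssn"],
    "target": {"connector": "jdbc", "table": "DW.CUSTOMER_DIM"},
}
```

The point is simply that the engine consumes a description of sources, rules, and targets rather than hand-written procedural code, and that description can be generated from the graphical mapping.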
Lastly, the Vibe VDM is available as an SDK that can be embedded into an application. So instead of moving data to a data integration engine for processing, you can move the engine to the data. This concept of embedding a VDM into an application is similar to the idea of building an application on an application server. One way to think about Vibe is as a very use-case-specific application server, built specifically for handling the data integration and data quality aspects of an application.
Vibe consists of a number of fundamental components:
Transformation Library: This is a collection of useful, prebuilt transformations that the engine calls to combine, transform, cleanse, match, and mask data. For those familiar with PowerCenter or Informatica Data Quality, this library is represented by the icons that the developer can drag and drop onto the canvas to perform actions on data.
Optimizer: The Optimizer compiles data processing logic into an internal representation to ensure effective resource usage and efficient run time, based on data characteristics and the execution environment configuration.
Executor: This is a run-time execution engine that orchestrates the data logic using the appropriate transformations. The engine reads/writes data from an adapter or directly streams the data from an application. The executor can physically move data or can present results via data virtualization.
Connectors: Informatica’s connectivity extensions provide data access from various data sources. This is what allows Informatica Platform users to connect to almost any data source or application for use by a variety of data movement technologies and modes, including batch, request/response, and publish/subscribe.
Vibe Software Development Kit (SDK): Vibe also provides APIs and extensions that allow third parties to add new connectors as well as transformations, so developers are not limited to the connectors and transformations that ship with the product.
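To show how these pieces relate, here is one more hypothetical sketch (again with invented names, not actual Vibe or PowerCenter code) of the division of labor: connectors read and write rows, the transformation library supplies prebuilt operations, and the executor runs a flow description’s steps in order.

```python
# Hypothetical sketch of the division of labor between components (illustration only).

# Transformation-library role: prebuilt operations the executor calls by name.
TRANSFORMS = {
    "trim_names": lambda rows: [{**r, "name": r["name"].strip()} for r in rows],
    "standardize_address": lambda rows: [{**r, "city": r["city"].title()} for r in rows],
    "mask_ssn": lambda rows: [{**r, "ssn": "***-**-" + r["ssn"][-4:]} for r in rows],
}

def read(source):
    """Connector role: pull rows from the source (stubbed with sample data)."""
    return [{"name": " Ada Lovelace ", "city": "san jose", "ssn": "123-45-6789"}]

def write(target, rows):
    """Connector role: load rows into the target (stubbed as a print)."""
    print(f"loading {len(rows)} rows into {target['table']}: {rows}")

def execute(flow):
    """Executor role: run the flow's steps in order, calling library transforms."""
    rows = read(flow["source"])
    for step in flow["steps"]:
        rows = TRANSFORMS[step](rows)
    write(flow["target"], rows)

# Run the hypothetical customer flow from the earlier sketch.
execute({
    "source": {"connector": "jdbc", "table": "CRM.CUSTOMERS"},
    "steps": ["trim_names", "standardize_address", "mask_ssn"],
    "target": {"connector": "jdbc", "table": "DW.CUSTOMER_DIM"},
})
```

An optimizer would sit between the flow description and the executor to decide how to run the steps efficiently, and the appeal of the VDM idea is that the same description can then run wherever the engine is deployed: standalone, embedded via the SDK, or on Hadoop.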
Hopefully this brief overview helps you understand a little more about what Vibe is all about. If you have questions, post them below and either I or one of the Informatica team members will respond so you can understand how Vibe is going to energize the data integration industry.