In the last 50-60 years, we have witnessed another revolution, through the invention of computing machines and the Internet – a digital revolution. It has transformed every industry and allowed us to operate at far greater scale – processing more transactions and in more locations – than ever before. New cities emerged on the map, migrations of knowledge workers throughout the world followed, and the standard of living increased again. And digitally available information transformed how we run businesses, cities, or countries.
Forces Shaping Digital Revolution
Over the last 5-6 years, we’ve witnessed a massive increase in the volume and variety of this information. Leading forces that contributed to this increase are:
- Next generation of software technology connecting data faster from any source
- Little to no hardware cost to process and store huge amounts of data (Moore’s Law)
- A sharp increase in the number of machines and devices that are connected online and generating data
- Massive worldwide growth of people connecting online and sharing information
- Speed of Internet connectivity that’s now free in many public places
As a result, our engagement with the digital world is rising – both for personal and business purposes. Increasingly, we play games, shop, sign digital contracts, make product recommendations, respond to customer complaints, share patient data, and make real-time pricing changes to in-store products – all from a mobile device or laptop. We do so increasingly in a collaborative way, in real time, and in a very personalized fashion. Big Data, Social, Cloud, and Internet of Things are key topics dominating our conversations and thoughts around data these days. They are altering the ways we engage with each other and what we expect from each other.
This is the emergence of a new revolution or it is the next phase of our digital revolution – the democratization and ubiquity of information to create new ways of interacting with customers and dramatically speeding up market launch. Businesses will build new products and services and create new business models by exploiting this vast new resource of information.
The Quest for Great Data
But there is work to do before one can unleash the true potential captured in data. Data is no longer a by-product or a transaction record. Nor does it have an expiration date any longer. Data now flows like a river, fueling applications, business processes, and human or machine activities. New data gets created on the way and augments our understanding of the meaning behind this data. It is no longer good enough to have good data in isolated projects; rather, great data needs to become accessible to everyone and everything at a moment’s notice. This rich set of data needs to connect efficiently to information that is already present and learn from it. Such data needs to automatically rid itself of inaccurate and incomplete information. Clean, safe, and connected – this data is now ready to find us even before we discover it. It understands the context in which we are going to make use of this information and the key decisions that will follow. In the process, this data is learning about our usage, preferences, and results: what works versus what doesn’t. New data is now created that captures such inherent understanding or intelligence. It needs to flow back to the appropriate business applications or machines for future usage after fine-tuning. Such data can then tell a story about human or machine actions and results. Such data can become a coach, a mentor, a friend of sorts to guide us through critical decision points. Such data is what we would like to call great data. In order to truly capitalize on the next step of the digital revolution, we will pervasively need this great data to power our decisions and thinking.
Impacting Every Industry
By 2020, there’ll be 50 billion connected devices, 7x more than human beings on the planet. This explosion of devices will bring really big data that will increasingly be processed and stored in the cloud. More than size, this complexity will require a new way of addressing business process efficiency that renders agility, simplicity, and capacity. The impact of such transformation will spread across many industries. A McKinsey article, “The Future of Global Payments”, focuses on the digital transformation of payment systems in the banking industry and the ubiquity of data as a result. One of the key challenges for banks will be to shift from their traditional heavy reliance on siloed and proprietary data to a more open approach that encompasses a broader view of customers.
Industry executives, front line managers, and back office workers are all struggling to make the most sense of the data that’s available.
Closing Thoughts on Great Data
A “2014 PWC Global CEO Survey” showed that 81% ranked technology advances as the #1 factor to transform their businesses over the next 5 years. More data, by itself, isn’t enough for this transformation. A robust data management approach integrating machine and human data, from all sources and updated in real time, among on-premise and cloud-based systems must be put in place to accomplish this mission. Such an approach will nurture great data. This end-to-end data management platform will provide data guidance and curate one of an organization’s most valuable assets, its information. Only by making sense of what we have at our disposal will we unleash the true potential of the information that we possess. The next step in the digital revolution will be about organizations of all sizes being fueled by great data to unleash their untapped potential.
According to the article, in Hamilton County, Ohio, it’s not unusual to see kids from the same neighborhoods coming to the hospital for asthma attacks. Researchers therefore wanted to know whether it was fact or mistaken perception that an unusually high number of children in the same neighborhood were experiencing asthma attacks. The next step was to review existing data to determine the extent of the issue, and perhaps how to solve the problem altogether.
“The researchers studied 4,355 children between the ages of 1 and 16 who visited the emergency department or were hospitalized for asthma at Cincinnati Children’s between January 2009 and December 2012. They tracked those kids for 12 months to see if they returned to the ED or were readmitted for asthma.”
Not only were the researchers able to determine a sound correlation between the two data sets, but they were able to advance the research to predict which kids were at high-risk based upon where they live. Thus, some of the cause and the effects have been determined.
This came about when researchers began thinking out of the box, when it comes to dealing with traditional and non-traditional medical data. They integrated housing and census data, in this case, with that of the data from the diagnosis and treatment of the patients. These are data sets unlikely to find their way to each other, but together they have a meaning that is much more valuable than if they just stayed in their respective silos.
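To make the idea concrete, here is a minimal sketch of that kind of join in Python. Everything in it is an assumption made for illustration: the field names, the choice of a census tract as the shared key, and the records themselves are invented and are not taken from the study.

```python
from collections import defaultdict

# Invented records for illustration only; they are not data from the study.
visits = [
    {"patient_id": "p1", "tract": "T-001", "returned_within_12_months": True},
    {"patient_id": "p2", "tract": "T-001", "returned_within_12_months": False},
    {"patient_id": "p3", "tract": "T-002", "returned_within_12_months": False},
]

# Hypothetical housing data keyed by the same (assumed) census tract identifier.
housing = {
    "T-001": {"housing_code_violations": 14},
    "T-002": {"housing_code_violations": 2},
}


def join_by_tract(visits, housing):
    """Aggregate visits per tract and attach the housing record for that tract."""
    totals = defaultdict(lambda: {"patients": 0, "returned": 0})
    for visit in visits:
        counts = totals[visit["tract"]]
        counts["patients"] += 1
        counts["returned"] += int(visit["returned_within_12_months"])

    rows = []
    for tract in sorted(totals):
        counts = totals[tract]
        rows.append({
            "tract": tract,
            "return_rate": counts["returned"] / counts["patients"],
            # None marks a tract with no matching housing record.
            "housing_code_violations": housing.get(tract, {}).get("housing_code_violations"),
        })
    return rows


for row in join_by_tract(visits, housing):
    print(row)
```

The point of the sketch is only that neither data set says much alone; once both share a key, a return rate can sit next to a housing indicator for the same neighborhood.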
“Non-traditional medical data integration has begun to take place in some medical collaborative environments already. The New York-Presbyterian Regional Health Collaborative created a medical village, which ‘goes beyond the established patient-centered medical home mode.’ It not only connects an academic medical center with a large ambulatory network, medical homes, and other providers with each other, but community resources such as school-based clinics and specialty-care centers (the ones that are a part of NYP’s network).”
The fact of the matter is that data is the key to understanding what the heck is going on when cells of sick people begin to emerge. While researchers and doctors can treat the individual patients, there is not a good understanding of the larger issues that may be at play. In this case, the larger issue is poor air quality in poor neighborhoods. With that knowledge, they understand what problem needs to be corrected.
The universal sharing of data is really the larger solution here, but one that won’t be approached without a common understanding of the value, and funding. As we pass laws around the administration of health care, as well as how data is to be handled, perhaps it’s time we look at what the data actually means. This requires a massive deployment of data integration technology, and the fundamental push to share data with a central data repository, as well as with health care providers.
The white paper, “The Great Rethink: Building a Highly Responsive and Evolving Data Integration Architecture” by Claudia Imhoff and Joe McKendrick provides an interesting view of what such an architecture might look like. The paper describes how to move from ad hoc Data Integration to an Enterprise Data Architecture. The paper also describes an approach towards building architectural maturity and a next-generation enterprise data architecture that helps organizations to be more competitive.
Organizations that look to compete based on their data are searching for ways to design an architecture that:
- On-boards new data quickly
- Delivers clean and trustworthy data
- Delivers data at the speed required of the business
- Ensures that data is handled in secure way
- Is flexible enough to incorporate new data types and new technology
- Enables end user self-service
- Speeds up business value delivery for an organization
In my previous blog, Digital Strategy and Architecture, we discussed the demands that digital strategies are putting on enterprise data architecture in particular. Add to that the additional stress from business initiatives such as:
- Supporting new mobile applications
- Moving IT applications to the cloud – which significantly increases data management complexity
- Dealing with external data. One recent study estimates that a full 25% of the data being managed by the average organization is external data.
- Next-generation analytics and predictive analytics with Hadoop and NoSQL
- Integrating analytics with applications
- Event-driven architectures and projects
- The list goes on…
The point here is that most people are unlikely to be funded to build an enterprise data architecture from scratch that can meet all these needs. A pragmatic approach would be to build out your future state architecture in each new strategic business initiative that is implemented. The real challenge of being an enterprise architect is ensuring that all of the new work does indeed add up to a coherent architecture as it gets implemented.
The “Great Rethink” white paper describes a practical approach to achieving an agile and responsive future state enterprise data architecture that will support your strategic business initiatives. It also describes a high level data integration architecture and the building blocks to achieving that architecture. This is highly recommended reading.
Also, you might recall that Informatica sponsored the Informatica Architect’s Challenge this year to design an enterprise-wide data architecture of the future. The contest has closed and we have a winner. See the site for details: Informatica Architect Challenge.
This is a guest author post by Philip Howard, Research Director, Bloor Research.
I recently posted a blog about an interview style webcast I was doing with Informatica on the uses and costs associated with data integration tools.
I’m not sure that the poet John Donne was right when he said that it was strange, let alone fatal. Somewhat surprisingly, I have had a significant amount of feedback following this webinar. I say “surprisingly” because the truth is that I very rarely get direct feedback. Most of it, I assume, goes to the vendor. So, when a number of people commented to me that the research we conducted was both unique and valuable, it was a bit of a thrill. (Yes, I know, I’m easily pleased).
There were a number of questions that arose as a result of our discussions. Probably the most interesting was whether moving data into Hadoop (or some other NoSQL database) should be treated as a separate use case. We certainly didn’t include it as such in our original research. In hindsight, I’m not sure that the answer I gave at the time was fully correct. I acknowledged that you certainly need some different functionality to integrate with a Hadoop environment and that some vendors have more comprehensive capabilities than others when it comes to Hadoop. The same also applies (but with different suppliers) when it comes to integrating with, say, MongoDB or Cassandra or graph databases. However, as I pointed out in my previous blog, functionality is ephemeral. And just because a particular capability isn’t supported today doesn’t mean it won’t be supported tomorrow. So that doesn’t really affect use cases.
However, where I was inadequate in my reply was that I only referenced Hadoop as a platform for data warehousing, stating that moving data into Hadoop was not essentially different from moving it into Oracle Exadata or Teradata or HP Vertica. And that’s true. What I forgot was the use of Hadoop as an archiving platform. As it happens we didn’t have an archiving use case in our survey either. Why not? Because archiving is essentially a form of data migration – you have some information lifecycle management and access and security issues that are relevant to archiving once it is in place but that is after the fact: the process of discovering and moving the data is exactly the same as with data migration. So: my bad.
Aside from that little caveat, I quite enjoyed the whole event. Somebody or other (there’s always one!) didn’t quite get how quantifying the number of end points in a data integration scenario was a surrogate measure for complexity (something we took into account) and so I had to explain that. Of course, it’s not perfect as a metric, but the only alternative is to ask eye-of-the-beholder type questions, which aren’t very satisfactory.
Anyway, if you want to listen to the whole thing you can find it HERE:
The post is by Philip Howard, Research Director, Bloor Research.
One of the standard metrics used to support buying decisions for enterprise software is total cost of ownership. Typically, the other major metric is functionality. However functionality is ephemeral. Not only does it evolve with every new release but while particular features may be relevant to today’s project there is no guarantee that those same features will be applicable to tomorrow’s needs. A broader metric than functionality is capability: how suitable is this product for a range of different project scenarios and will it support both simple and complex environments?
Earlier this year Bloor Research published some research into the data integration market, which investigated exactly these issues: how often were tools reused, how many targets and sources were involved, for what sort of projects were products deemed suitable? And then we compared these with total cost of ownership figures that we also captured in our survey. I will be discussing the results of our research live with Kristin Kokie, who is the interim CIO of Informatica, on Guy Fawkes’ day (November 5th). I don’t promise anything explosive but it should be interesting and I hope you can join us. The discussions will be vendor neutral (mostly: I expect that Kristin has a degree of bias).
To Register for the Webinar, click Here.
Key findings from the report include:
- 65% of organizations cite data processing and integration as hampering distribution capability, with nearly half claiming their existing software and ERP is not suitable for distribution.
- Nearly two-thirds of enterprises have some form of distribution process, involving products or services.
- More than 80% of organizations have at least some problem with product or service distribution.
- More than 50% of CIOs in organizations with distribution processes believe better distribution would increase revenue and optimize business processes, with a further 38% citing reduced operating costs.
The core findings: “With better data integration comes better automation and decision making.”
This report is one of many I’ve seen over the years that come to the same conclusion. Most of those involved with the operations of the business don’t have access to key data points they need, thus they can’t automate tactical decisions, and also cannot “mine” the data, in terms of understanding the true state of the business.
The more businesses deal with building and moving products, the more data integration becomes an imperative value. As stated in this survey, as well as others, the large majority cite “data processing and integration as hampering distribution capabilities.”
Of course, these issues go well beyond Australia. Most enterprises I’ve dealt with have some gap between the need to share key business data to support business processes and decision support, and what currently exists in terms of data integration capabilities.
The focus here is on the multiple values that data integration can bring. This includes:
- The ability to track everything as it moves from manufacturing, to inventory, to distribution, and beyond. You can bind these to core business processes, such as automatic reordering of parts to make more products, to fill inventory.
- The ability to see into the past, and to see into the future. The emerging approaches to predictive analytics allow businesses to finally see into the future. Also, to see what went truly right and truly wrong in the past.
While data integration technology has been around for decades, most businesses that both manufacture and distribute products have not taken full advantage of this technology. The reasons range from perceptions around affordability, to the skills required to maintain the data integration flow. However, the truth is that you really can’t afford to ignore data integration technology any longer. It’s time to create and deploy a data integration strategy, using the right technology.
This survey is just an instance of a pattern. Data integration was considered optional in the past. With today’s emerging notions around the strategic use of data, clearly, it’s no longer an option.
Did you know Harrods introduces more than 1.7 million new products every year? This includes their own labels, as well as other brands. Recently, Peter Rush, the Harrods Solution Architect responsible for product information, spoke at Informatica’s MDM Day EMEA in London. At the event, he said there are:
“so many things we want to do: Product Information is at the heart of most of them.”
As part of the customer experience program, Harrods identified product information quality as a key asset, next to customer information management.
The Product Information Challenge Harrods was facing included the following:
- A Lack of a single Product data store
- Inappropriate Product Data objectives
- Massive scale and volume of products and brands (1.7 million new products per year)
- Concessions and Own Bought
- Localized enrichment
- Media Assets all over the estate
While discussing his product information management project, Peter gave a great and simple example. He showed the product descriptions below and asked, “Who knows which two products these are?”:
- XX 6621/74 BLK VN SS TOP 969B S
- XX37066 L/BLU PRK FLAN SH 440B MED
Then, he solved the mystery. The answer was this:
- Black V-neck sleeveless top
- Light blue parker print flannel shirt
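For readers who want to see what such a clean-up might look like in code, below is a minimal Python sketch. The abbreviation table is an assumption inferred only from Peter’s two examples; it is not Harrods’ actual vocabulary or the way their PIM system works, and a real catalogue would need a much larger, curated dictionary.

```python
# Hypothetical abbreviation table inferred only from the two examples above;
# a real product catalogue would need a far larger, curated vocabulary.
ABBREVIATIONS = {
    "BLK": "black",
    "L/BLU": "light blue",
    "VN": "V-neck",
    "SS": "sleeveless",
    "PRK": "parker print",
    "FLAN": "flannel",
    "SH": "shirt",
    "TOP": "top",
}


def expand_description(raw: str) -> str:
    """Expand known abbreviations; unknown tokens (style codes, sizes) are dropped."""
    words = [ABBREVIATIONS[token] for token in raw.split() if token in ABBREVIATIONS]
    text = " ".join(words)
    # Capitalise only the first character so "V-neck" keeps its casing.
    return text[:1].upper() + text[1:]


print(expand_description("XX 6621/74 BLK VN SS TOP 969B S"))
print(expand_description("XX37066 L/BLU PRK FLAN SH 440B MED"))
```

A lookup table like this is the simplest possible approach. It silently discards anything it does not recognise, which is exactly why enrichment of this kind usually needs business users to review the results.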
Turning vision into reality needs a joint business and IT project
Peter said, it is important to build a “flexible team to meet needs of each project stage, with representation from key business areas”. The team should include representatives from groups like: Merchandise Data, Buying Team, Web Team, IT, CRM and the Shopfloor Team. In addition to their Core Project Team, Harrods defined a Steering Committee and a group of selected Super Users.
Benefit summary: a combination of people, technology and process
At the end of the session, I was impressed by this graphic. This image sums up the essentials of product information management success. It is about the people, who are able to do the right things. It is about how technology enables automation. It is about the process which turns information into value.
Finally it is important to mention our partner Javelin Group is leading the PIM implementation at Harrods. Also Andy Hayler, analyst from The Information Difference, wrote an article for the CIO Magazine.
In 2012, Forbes published an article predicting an upcoming problem.
The Need for Scalable Enterprise Analytics
Specifically, increased exploration in Big Data opportunities would place pressure on the typical corporate infrastructure. The generic hardware used to run most tech industry enterprise applications was not designed to handle real-time data processing. As a result, the explosion of mobile usage and the proliferation of social networks were increasing the strain on the system. Most companies now faced real-time processing requirements beyond what the traditional model was designed to handle.
In the past two years, the volume of data and the speed of data growth have increased significantly. As a result, the problem has become more severe. It is now clear that enterprises can’t overcome these challenges by simply doubling or tripling their IT spending on infrastructure sprawl. Today, enterprises seek consolidated solutions that offer scalability, performance and ease of administration. The present need is for scalable enterprise analytics.
A Clear Solution Is Available
Informatica PowerCenter and Data Quality is the market leading data integration and data quality platform. This platform has now been certified by Oracle as an optimal solution for both the Oracle Exadata Database Machine and the Oracle SuperCluster.
As the high-speed on-ramp for data into Oracle Exadata, PowerCenter and Data Quality deliver up to five times faster performance on data load, query, profiling and cleansing tasks. Informatica’s data integration customers can now easily reuse data integration code, skills and resources to access and transform any data from any data source and load it into Exadata, with the highest throughput and scalability.
Customers adopting Oracle Exadata for high-volume, high-speed analytics can now be confident with Informatica PowerCenter and Data Quality. With these products, they can ingest, cleanse and transform all types of data into Exadata with the highest performance and scale required to maximize the value of their Exadata investment.
Proving the Value of Scalable Enterprise Analytics
In order to demonstrate the efficacy of their partnership, the two companies worked together on a Proof Of Value (POV) project. The goal was to prove that using PowerCenter with Exadata would improve both performance and scalability. The project involved PowerCenter and Data Quality 9.6.1 and an X4-2 Exadata Machine. Oracle 11g was considered for both standard Oracle and Exadata versions.
The first test was a 1TB load to Exadata and to standard Oracle in a typical PowerCenter use case. The second test consisted of querying a 1TB profiling warehouse database in a Data Quality use case scenario. Performance data was collected for both tests, and the scalability factor was also captured. A variant of the TPC-H dataset was used to generate the test data. The results were significantly higher than those of the prior Exadata 1TB test. In particular:
- The data query tests achieved 5x performance.
- The data load tests achieved a 3x-5x speed increase.
- Linear scalability was achieved with read/write tests on Exadata.
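If you want to run a similar comparison in your own environment, the arithmetic behind figures like “5x” and “linear scalability” is simple. The Python sketch below shows one way to compute it; the function names and the sample numbers are placeholders chosen for illustration and are not measurements from the POV.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate run is than the baseline run."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / candidate_seconds


def scaling_efficiency(base_throughput: float, base_nodes: int,
                       scaled_throughput: float, scaled_nodes: int) -> float:
    """Return 1.0 for perfectly linear scaling; lower values mean diminishing returns."""
    ideal_throughput = base_throughput * (scaled_nodes / base_nodes)
    return scaled_throughput / ideal_throughput


# Placeholder numbers, not measurements from the POV.
print(speedup(baseline_seconds=500.0, candidate_seconds=100.0))
print(scaling_efficiency(base_throughput=200.0, base_nodes=2,
                         scaled_throughput=400.0, scaled_nodes=4))
```

A speedup is only meaningful when both runs use the same data set and the same mappings, so keep the workload fixed and change only the target platform.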
What Business Benefits Could You Expect?
Informatica PowerCenter and Data Quality, along with Oracle Exadata, now provide the best-of-breed combination of software and hardware, optimized to deliver the highest possible total system performance. These comprehensive tools drive agile reporting and analytics, while empowering IT organizations to meet SLAs and quality goals like never before.
- Extend Oracle Exadata’s access to even more business critical data sources. Utilize optimized out-of-the-box Informatica connectivity to easily access hundreds of data sources, including all the major databases, on-premise and cloud applications, mainframe, social data and Hadoop.
- Get more data, more quickly into Oracle Exadata. Move higher volumes of trusted data quickly into Exadata to support timely reporting with up-to-date information (i.e. up to 5x performance improvement compared to Oracle database).
- Centralize management and improve insight into large scale data warehouses. Deliver the necessary insights to stakeholders with intuitive data lineage and a collaborative business glossary. Contribute to high quality business analytics, in a timely manner across the enterprise.
- Instantly re-direct workloads and resources to Oracle Exadata without compromising performance. Leverage existing code and programming skills to execute high-performance data integration directly on Exadata by performing push down optimization.
- Roll-out data integration projects faster and more cost-effectively. Customers can now leverage thousands of Informatica certified developers to execute existing data integration and quality transformations directly on Oracle Exadata, without any additional coding.
- Efficiently scale-up and scale-out. Customers can now maximize performance and lower the costs of data integration and quality operations of any scale by performing Informatica workload and push down optimization on Oracle Exadata.
- Save significant costs involved in administration and expansion. Customers can now easily and economically manage large-scale analytics data warehousing environments with a single point of administration and control, and consolidate a multitude of servers on one rack.
- Reduce risk. Customers can now leverage Informatica’s data integration and quality platform to overcome the typical performance and scalability limitations seen in databases and data storage systems. This will help reduce quality-of-service risks as data volumes rise.
Oracle Exadata is a well-engineered system that offers customers out-of-box scalability and performance on demand. Informatica PowerCenter and Data Quality are optimized to run on Exadata, offering customers business benefits that speed up data integration and data quality tasks like never before. Informatica’s certified, optimized, and purpose-built solutions for Oracle can help you enable more timely and trustworthy reporting. You can now benefit from Informatica’s optimized solutions for Oracle Exadata to make better business decisions by unlocking the full potential of the most current and complete enterprise data available. As shown in our test results, you can attain up to 5x performance by scaling Exadata. Informatica Data Quality customers can profile 1TB datasets, which was previously unheard of. We urge you to deploy the combined solution to solve your data integration and quality problems today while achieving high-speed business analytics in these days of big data exploration and the Internet of Things.
Listen to what Ash Kulkarni, SVP, at OOW14 has to say on how @InformaticaCORP PowerCenter and Data Quality, certified by Oracle as optimized for Exadata, can deliver up to five times faster performance on data load, query, profiling, cleansing and mastering tasks for Exadata.
When it comes to cloud-based data analytics, a recent study by Ventana Research (as found in Loraine Lawson’s recent blog post) provides a few interesting data points. The study reveals that 40 percent of respondents cited lowered costs as a top benefit, improved efficiency was a close second at 39 percent, and better communication and knowledge sharing also ranked highly at 34 percent.
Ventana Research also found that organizations cite a unique and more complex reason to avoid cloud analytics and BI. Legacy integration work can be a major hindrance, particularly when BI tools are already integrated with other applications. In other words, it’s the same old story:
You can’t make sense of data that you can’t see.
The ability to deal with existing legacy systems when moving to concepts such as big data or cloud-based analytics is critical to the success of any enterprise data analytics strategy. However, most enterprises don’t focus on data integration as much as they should, and hope that they can solve the problems using ad-hoc approaches.
These approaches rarely work as well as they should, if at all. Thus, any investment made in data analytics technology is often diminished because the BI tools or applications that leverage analytics can’t see all of the relevant data. As a result, only part of the story is told by the available data, and those who leverage data analytics don’t rely on the information, and that means failure.
What’s frustrating to me about this issue is that the problem is easily solved. Those in the enterprise charged with standing up data analytics should put a plan in place to integrate new and legacy systems. As part of that plan, there should be a common understanding around business concepts/entities of a customer, sale, inventory, etc., and all of the data related to these concepts/entities must be visible to the data analytics engines and tools. This requires a data integration strategy, and technology.
As enterprises embark on a new day of more advanced and valuable data analytics technology, largely built upon the cloud and big data, the data integration strategy should be systemic. This means mapping a path for the data from the source legacy systems to the views that the data analytics systems should include. What’s more, this data should be in real operational time because data analytics loses value as the data becomes older and out-of-date. We operate in a real-time world now.
So, the work ahead requires planning to occur at both the conceptual and physical levels to define how data analytics will work for your enterprise. This includes what you need to see, when you need to see it, and then mapping a path for the data back to the business-critical and, typically, legacy systems. Data integration should be first and foremost when planning the strategy, technology, and deployments.
Amazon Redshift, one of the fast-rising stars in the AWS ecosystem, has taken the data warehousing world by storm ever since it was introduced almost two years ago. Amazon Redshift operates completely in the cloud, and allows you to provision nodes on-demand. This model allows you to overcome many of the pains associated with traditional data warehousing techniques, such as provisioning extra server hardware, sizing and preparing databases for loading, or extensive SQL scripting.
However, when loading data into Redshift, you may find it challenging to do so in a timely manner. To reduce the time taken to load this data, you may have to spend a tremendous amount of time writing SQL optimization queries, which takes away the value proposition of using Redshift in the first place.
Informatica Cloud helps you load this data quickly into Redshift in just a few minutes. To start using Informatica Cloud, you’ll need to establish connections from Redshift and your other data source first. Here are a few easy steps to help you get started with establishing connections from a relational database such as MySQL as well as Redshift into Informatica Cloud:
- Log in to your Informatica Cloud account, go to Configure -> Connections, click “New”, and select “MySQL” for “Type”
- Select your Secure Agent and fill in the rest of the database details:
- Test your connection and then click ‘OK’ to save and exit
- Now, log in to your AWS account and go to the Redshift service page
- Go to your cluster configuration page and make a note of the cluster and cluster database properties: Number of Nodes, Endpoint, Port, Database Name, JDBC URL. You also will need:
  - The Redshift database user name and password (which are different from your AWS account credentials)
  - AWS account Access Key
  - AWS account Secret Key
- Exit the AWS console.
- Now, back in your Informatica Cloud account, go to Configure -> Connections and click “New”.
- Select “AWS Redshift (Informatica)” for “Type” and fill in the rest of the details from the information you have from above
- Test the connection and then click ‘OK’ to save and exit
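Before you fill in the Redshift connection form, it can help to gather the values from the steps above in one place and check that nothing is missing. The Python sketch below is one hypothetical way to do that. The setting names, the example endpoint, and the helper functions are assumptions made for illustration and are not part of Informatica Cloud or the AWS console; the `jdbc:redshift://host:port/database` URL shape and the default port 5439 are the standard Redshift conventions.

```python
from dataclasses import dataclass

# Hypothetical setting names covering the values collected in the steps above.
REQUIRED_SETTINGS = [
    "endpoint", "port", "database",
    "db_user", "db_password",
    "aws_access_key", "aws_secret_key",
]


@dataclass(frozen=True)
class RedshiftCluster:
    endpoint: str   # host name shown on the cluster configuration page
    port: int       # Redshift listens on 5439 by default
    database: str   # cluster database name

    def jdbc_url(self) -> str:
        """Assemble a JDBC URL in the usual jdbc:redshift://host:port/db form."""
        return f"jdbc:redshift://{self.endpoint}:{self.port}/{self.database}"


def missing_settings(settings: dict) -> list:
    """Return the names of required settings that are absent or empty."""
    return [name for name in REQUIRED_SETTINGS if not settings.get(name)]


# Example values are placeholders, not a real cluster.
cluster = RedshiftCluster(
    endpoint="example-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    database="dev",
)
print(cluster.jdbc_url())
print(missing_settings({"endpoint": cluster.endpoint, "port": cluster.port, "database": "dev"}))
```

Keep the database password and the AWS keys out of source files; read them from environment variables or a secrets store when you adapt a check like this.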
As you can see, establishing connections was extremely easy and can be done in less than 5 minutes. To learn how customers such as UBM used Informatica Cloud to deliver next-generation customer insights with Amazon Redshift, please join us on September 16 for a webinar where we’ll have product experts from Amazon and UBM explaining how your company can benefit from cloud data warehousing for petabyte-scale analytics using Amazon Redshift.