David Lyle

Informatica R&D VP in the Office of the CTO, author, speaker, evangelist on ICCs and how “Lean” can help deliver timely, trusted, and relevant data more quickly for IT’s customers.

What you need for modern Data Integration (Hint: it’s not just batch or ETL)

While today’s landscape is increasingly hybrid and cloud-based for most customers, and the lines are starting to blur between application and data integration, it is also clear that “data remains data” whether it’s in the cloud or on-premises. Every organization needs data integration to run its business.


Data Integration is not the same as batch processing. Transformations required to translate one message format to another (e.g., what an ESB does) are not equivalent to the set-based algorithms required for integration of heterogeneous data. Concatenating First_Name and Last_Name together is not data integration. I could use shell scripting and ftp to process data in parallel, but that would not be data integration either. Data integration is much more than batch processing or Extract, Transform, and Load (ETL).

So, what are the core capabilities of data integration?


“Transformation” is not multiplying Unit_Price times Quantity. Transformation is fuzzy joining heterogeneous data in parallel using in-memory caching technology. Transformation is adding address cleansing to that same data pipeline. Transformation is handling the ragged hierarchies of nested documents or handling the natural language or sentiment processing of unstructured data. And once defined, there are an increasing number of engines to execute these transformations: an RDBMS like Teradata or Oracle, a massively parallel clustering framework like Hadoop, or a grid of Informatica services. Using a metadata-driven, visual diagramming environment and Vibe that can “Map Once, Deploy Anywhere”, Informatica has separated “what” transformations are desired from “how” and “where” those transformations are executed, using an intelligent optimizer to push the work to the engine best suited to it, all without requiring Map/Reduce, BTEQ, PL/SQL, Java, Python, or other skills.


Today, data integration must be possible at any latency: real-time, event-driven, or batch. Once we’ve cleansed and conformed a logical definition of data, we would like to make this data available anytime, anywhere, limited by security policy and not by tool capability. True data integration should allow Change Data Capture, or CDC, whenever helpful. Wouldn’t we rather move 1/1000th of the amount of data by processing only changes rather than moving everything every time? Constructing integration services of changed data for any latency, using common data transformation definitions and without altering source applications or configuring database triggers, allows the most efficient data movement and change management possible.
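
To make the “move only the changes” idea concrete, here is a minimal sketch in Python. It is not how Informatica or any particular CDC product works: commercial CDC typically reads database logs, whereas this sketch swaps in a much simpler technique, comparing two snapshots by row checksum. The function names and the sample rows are hypothetical.

```python
import hashlib


def row_checksum(row: dict) -> str:
    """Stable checksum over a row's values, ordered by column name."""
    payload = "|".join(f"{col}={row[col]}" for col in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def detect_changes(previous: dict, current: dict) -> dict:
    """Compare two snapshots keyed by primary key; return only the deltas."""
    prev_sums = {key: row_checksum(row) for key, row in previous.items()}
    inserts = [key for key in current if key not in previous]
    deletes = [key for key in previous if key not in current]
    updates = [
        key
        for key in current
        if key in previous and row_checksum(current[key]) != prev_sums[key]
    ]
    return {
        "insert": sorted(inserts),
        "update": sorted(updates),
        "delete": sorted(deletes),
    }


# Hypothetical snapshots of a customer table, keyed by customer id.
yesterday = {
    1: {"name": "Acme", "level": "Gold"},
    2: {"name": "Globex", "level": "Silver"},
    3: {"name": "Initech", "level": "Bronze"},
}
today = {
    1: {"name": "Acme", "level": "Gold"},      # unchanged, not moved
    2: {"name": "Globex", "level": "Gold"},    # changed
    4: {"name": "Umbrella", "level": "Bronze"},  # new
}

changes = detect_changes(yesterday, today)
print(changes)  # {'insert': [4], 'update': [2], 'delete': [3]}
```

Only the keys in `changes` need to travel downstream; the unchanged row for customer 1 is never moved. Snapshot comparison still has to read the full source, which is one reason log-based capture is usually preferred when it is available.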

Metadata and Transactional Intelligence

The statistical knowledge of the data realities is so important to the people collaborating to solve data integration problems that Informatica put this visibility in the ‘workflow’ of SaaS administrators, business analysts, data stewards, architects, and developers. Intelligence about data domains, inference about relationships, and knowledge about lineage and impact analysis: these are capabilities that computers excel at. With intelligent data integration, users’ knowledge is augmented by metadata and data knowledge to make integration and business/IT collaboration much more agile. Knowledge of what constitutes transactions in applications, how master data is modeled, and how logs or data interchange formats are defined is crucial to reliably handling the complexities of enterprise data. Additionally, this metadata intelligence enables business users to directly access the appropriate data that they need, when they need it.

Self-Service + Governance

Thanks to the cloud, the business has many more options to directly take control of the data that they need, and modern data integration solutions need to fully support business self-service, without IT’s direct involvement. Complementary self-service data integration capabilities within cloud and hybrid IT landscapes offer the agility the business requires and the governance that IT desires.

Rather than central IT serving as the gatekeeper for all data progress, self-service integration capabilities for SaaS admins and business analysts that log, audit, and govern the behavior of line-of-business users can empower more people to participate while supporting corporate objectives of data security and adherence to regulatory requirements. In addition, with cloud apps such as Salesforce.com becoming mission-critical business app hubs for many companies, data integration needs to be enabled within the context of the cloud app itself, creating a seamless user experience for SaaS users.

A data integration solution that doesn’t support native UI generation for SaaS apps such as Salesforce.com creates costly and time-consuming overhead for IT, which must write custom code and build bespoke UIs. By simplifying data access for business users while improving governance and oversight, we can achieve faster and better data integration for the business.

Unified Data and Process Integration

Data and process integration need to work seamlessly together; it’s not just a blurring of the lines between them. A unified data and process integration solution provides the best of both worlds and needs to support long-running transactions, human workflow, business self-service and template-driven extensibility.

Managed Master Data

Managed master data can glue the transactions, events and interactions of an enterprise together to create a single view. Managed reference data describes the processes and classifications of master data and transactions across your organization. And managed metadata helps you see what you have, where you have it, when it changed, and who is responsible for it. A real data integration solution manages all of these together, allowing us to understand the interconnections to manage data complexity, change implications and quality issues.

Data integration is much more than batch processing. Data integration is more than ETL. The value of data is now headline news in the mainstream press, but the variety, complexity, and pace of change of data are greater than ever. A modern, complete data integration solution helps you harness the vast volumes of data across on-premises, cloud, social, mobile and other devices to power your business and better meet your customer needs in the most agile and efficient way possible.

Posted in Data Integration

Business and Technology Requirements that are Driving User Organizations Toward True Real-Time Analytics and Reporting

Can you imagine data being less important in the future?  While organizations keep many data success stories out of the general press for competitive reasons, there are plenty of success stories out there and they make for interesting reading, for example the New York Times article “How Companies Learn Your Secrets”.

Data volumes are increasing, and the types of data we wish to analyze are becoming more varied.  On the one hand, we need to process data faster.  On the other hand, we have more data to process.  What to do?

Ralph Kimball defined “real-time” as “faster than your current ETL architecture can deliver your data”.  In the same way, I’d define Big Data as “more than your current analytic architecture can store and process”.

Smart meters, clinical trials, call centers, complex supply chain operations, logistics, changing risk exposure: the desire to be able to visualize important business processes in an up-to-the-minute fashion to make better decisions is becoming more and more important.

So, what great thing would you attempt if you knew you could not fail?

Even with improvements in technology and architectural approaches, most of us are held back by our preconceived knowledge of historical limitations.  New technologies and approaches are allowing us to solve some old problems, but everything has limitations.  Where are the current limitations and how are those limitations changing?  Where are the current opportunities?

I find that in discussions with many organizations, the difficulty is in imagining the great things that one might try to attempt.  If you were able to ask any business question, or have any business knowledge at all, what would that be?  Within the realms where data in some form, anywhere, exists: what would you ask?  I’ve found that the organizations making the most progress in operational intelligence are the companies using the most imagination, both in terms of the business questions they are asking, and the technically new architectural approaches they are taking.

Every layer of the traditional data warehouse architecture has been affected by improved approaches in the last 10-15 years, allowing us to tackle operational, tactical, and strategic intelligence questions.  After 20 years of decision support progress and new tools, real foundations have been laid for how to architect things differently, and how to collaborate differently between business and IT.

This is the first in a series where I’ll mostly be exploring what those technically new architectural approaches are.  If I had an opportunity to re-architect my old data warehouses using newer tools and approaches, with the knowledge of successful patterns I’ve developed over the years, I’d approach decision support and operational intelligence very differently.  For instance, I’d populate my ODS differently.  I’d use CDC differently.  I’d use checksums differently.  I’d use metadata, parameterization, templates, etc., differently.  The benefit would not ONLY be the ability to have operational reporting and intelligence:  I would also be able to do much more, and do it better, faster, and cheaper, without compromise.

Posted in Big Data, Business Impact / Benefits, Business/IT Collaboration

Attention Developers! Pragmatic Topics for Your Jobs

Where is the love for Developers?   There is plenty of information for high level, abstract discussions on data and integration.  But where are the useful Developer details?

Where are the inside tips?  Where is the intermediate and advanced information not covered in classes, articles, blogs and white papers?  Is it any coincidence that Developer-oriented breakouts at Informatica World are standing-room only?  Developers are hungry for useful information we can use immediately.  The fun of being a “Developer” is the continual learning that is possible, but it can frequently be tough to find information that is deep enough to push along our knowledge.

My primary goal in this community is to help you become an even better, more valuable Developer, with additional skills that support possible architecture, data analyst, and leadership roles and help further your career.

We announced our new Potential at Work Community at Informatica World in early June.  You can read Jakki Geiger’s blog introducing the Community to learn more about the goals for this great resource.  We created six role-based communities including Developers, Application leaders, Architects, Information leaders, IT leaders and Sales and marketing leaders.  Please join the community representing your current role, as well as any other communities you are interested in.

So this Developer community is for you.  I will be gathering together some of the most interesting people, thought-provoking ideas, and innovative thinking to help you in your careers.  I will be covering advanced areas that will be useful to you whether you consider yourself a Mapping Developer, Tuner/Optimizer, Administrator and/or Operator.  Additionally, in order to be a top Developer, we will have to cover architectural implementation topics, metadata, profiling, data modeling, etc.  All the good stuff.

Be sure to sign up now and we will send along topics that are devoid of marketing fluff and filled with useful new tips.

Posted in Uncategorized

Manufacturing Best Practices Have Changed Next-Generation Data Integration

Over the course of the last century, manufacturing has improved from individual craftsmen to Ford’s assembly lines to Toyota’s Production System.  In particular, Toyota’s Production System has been generalized to be used in all other business areas, from Health Care to Retail to Financial Services to Public Sector organizations.  By incorporating the thought processes that guide the continuous improvement programs and repeatable best practices found in optimized manufacturing organizations, all kinds of business processes can eliminate waste and deliver higher customer value more quickly, more cheaply, and with lower risk and higher quality. (more…)

Posted in Data Integration

Big Data: Changing How Business and IT Interact

Most of the big data discussions have focused on the technology or on the frequently replayed business discoveries used as examples of big data’s power. Many companies are still in the experimental stages of big data, asking for guidance regarding what their benefits would be, how they can re-align themselves to take advantage, and what new processes might be helpful to make them successful with these powerful new capabilities. (more…)

Posted in Big Data

SOA’s Last Mile Part III: How to Address SOA’s Data-Centric Pitfalls Effectively

This blog post is part three of an ongoing series highlighting the importance of data in a Service-Oriented Architecture (SOA). I look forward to hearing your thoughts and input on the subject.

I’m back. It’s been a little longer than normal, longer than I would have liked. Perhaps that’s because ‘addressing SOA’s data-centric pitfalls’ isn’t easy. (Really it’s because I’ve been working on other things. But let’s get back to the topic at hand.)

One of the benefits of the SOA approach is the ability to think top-down about problems. The usual approach is to work tightly with the business to define your processes from a business perspective, leading to clearly defined services that the business understands and you can implement together.

This is wonderful and has a clarifying symmetry that Software Engineering has been trying to achieve since the days of CASE. But now, here we are in 2008 with the SOA standards defined and the tools available to potentially achieve this vision. Ah, finally, the integration hairball will be contained and life will improve immeasurably for all!

But as I talked about last time, one of the reasons that things aren’t that simple is the data-centric pitfalls. And addressing this problem is not easy if you want to take a long-term, enterprise-oriented approach.

In talking with folks who have walked down this path, struggled with data problems, and are trying to think holistically about a workable longer-term solution, three themes come up again and again: (more…)

Posted in Data Quality, Data Services, Enterprise Data Management, Governance, Risk and Compliance, Integration Competency Centers

SOA’s Last Mile Part II: SOA’s Hidden Data-Centric Pitfalls

This blog post is part two of an ongoing series highlighting the importance of data in a Service-Oriented Architecture (SOA). I look forward to hearing your thoughts and input on the subject.

Last posting, I ranted about the fact that ‘data’ is finally a topic of discussion with respect to SOA initiatives. SOA provides business services that at their deepest level interact with data. What are the data-centric pitfalls that SOA can run into?

First off, data has meaning. While an enterprise ‘meaning’ can be presented by the services to outside consumers of those services, someone has to deal with the fact that the foundational business systems may have different meanings for the underlying data. The ‘transformation’ is frequently very important and complex.

Secondly, the meaning of data can change over time as the business changes. These changes will impact the services and the ‘transformations’ mentioned above. And sometimes these changes will affect the users of the services.

Thirdly, the quality of data is not perfect. How do you deal with these imperfections?

Fourthly, the systems of record for data are not usually neatly compartmentalized. At most complex enterprises, there isn’t just one Order Management system, or one HR system. The concepts of Customer, Policy, Employee, etc., can be spread across many heterogeneous systems, with overlapping responsibilities.

I’m sure there’s a fifth, a sixth, etc. But let’s just elaborate on these four.

Let’s start with the meaning of data. For example, a business term like Customer Level defines the historical importance this customer has to a company. The different values for Customer Level may be ‘Gold’, ‘Silver’, and ‘Bronze’. Each of these values has its own business definition. For instance, a ‘Gold’ customer might be any customer that has an average balance of over $5000 with us and has been a customer for over 8 years. This definition of ‘Gold’ is part of the meaning of the data. And ‘Gold’ may not physically be stored in any database, legacy or ERP system. ‘Gold’ may be arrived at whenever someone looks up a customer. But whether stored or calculated, the instructions for how ‘Gold’ is found might be very complex. My example is only a simple one.
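
As a rough illustration of ‘Gold’ being calculated rather than stored, here is a minimal Python sketch of the example rule above. The `Customer` fields and the `is_gold` function are hypothetical names chosen for illustration; only the two thresholds come from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    # Hypothetical fields; real systems would spread these across sources.
    name: str
    average_balance: float
    years_as_customer: float


def is_gold(customer: Customer) -> bool:
    """'Gold' per the example: average balance over $5000 and a
    customer for over 8 years. Both conditions are strict."""
    return customer.average_balance > 5000 and customer.years_as_customer > 8


acme = Customer("Acme Corp.", average_balance=7500.0, years_as_customer=10)
newco = Customer("NewCo", average_balance=9000.0, years_as_customer=2)

print(is_gold(acme))   # True
print(is_gold(newco))  # False
```

Even in this toy form, the definition lives in code. If the same rule is re-implemented inside several services, every copy has to change whenever the business redefines ‘Gold’.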

Let’s continue assuming that this meaning changes over time. Let’s say the definition of a ‘Gold’ customer now involves looking at their credit status. Consumers of this information may not be affected, but the logic for delivering or transforming this value within the service logic would change. Further, let’s say that the business decided to add a new level called ‘Platinum’. This could have a ripple effect not just on the logic for how Customer Level is calculated or delivered as a whole, but more importantly it can have an impact on the consumers of services that use this Customer Level information. Once again, I’m using a simple, crude example that undersells the complexity of this problem. The reality of data, its meaning, and how often it changes is a significant challenge.

And what if the quality of the data was bad and the Customer Level could not be determined? What if the stored value of Customer Level for Acme Corp. was ‘Wood’ or non-existent? This is reality. Frequently, SOA implementations and vendor demos seem to ignore this point. Unit test and demo data is always perfect and clean. We all know that the reality of business data is not so pretty. How do you handle bad data and how do you work with the business to continually improve data quality?
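
One possible way to handle the ‘Wood’ case is to validate the stored value and report the problem instead of passing it on silently. The sketch below is an assumption about how that might look, not a prescribed approach; `VALID_LEVELS` and `check_customer_level` are hypothetical names.

```python
from typing import Optional, Tuple

# The three levels named in the example above.
VALID_LEVELS = {"Gold", "Silver", "Bronze"}


def check_customer_level(raw: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (clean_value, issue); exactly one of the two is None."""
    if raw is None or not raw.strip():
        return None, "missing"
    value = raw.strip().title()  # tolerate stray whitespace and casing
    if value not in VALID_LEVELS:
        return None, f"unknown value: {raw!r}"
    return value, None


# Hypothetical stored values, including the bad ones from the text.
stored = {"Acme Corp.": "Wood", "Globex": " gold ", "Initech": None}

for customer, raw in stored.items():
    value, issue = check_customer_level(raw)
    if issue:
        print(f"{customer}: rejected ({issue})")
    else:
        print(f"{customer}: {value}")
```

The rejected records are the interesting part: they are what a data steward would review with the business to improve quality over time.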

Finally, rarely is enterprise data neatly compartmentalized into single systems with non-overlapping business coverage. SOA can hide the location of Customer, Policy, Product, Orders, etc. from service consumers, but someone has to figure out how to make multiple, heterogeneous Order Management systems appear as one system. Someone has to create a mighty complex ‘DNS for data’ system to hide this complexity.
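
To give a feel for the ‘DNS for data’ idea, here is a deliberately tiny Python sketch in which one resolver hides two order systems behind a single lookup. Everything here is hypothetical: the `DataResolver` class, the two in-memory “systems”, and the first-match-wins rule, which a real implementation would replace with proper rules for reconciling overlapping records.

```python
from typing import Callable, Dict, List, Optional

# A lookup takes a business key and returns a record, or None if unknown.
Lookup = Callable[[str], Optional[dict]]


class DataResolver:
    """Routes entity lookups to whichever registered system holds the key."""

    def __init__(self) -> None:
        self._sources: Dict[str, List[Lookup]] = {}

    def register(self, entity: str, lookup: Lookup) -> None:
        self._sources.setdefault(entity, []).append(lookup)

    def resolve(self, entity: str, key: str) -> Optional[dict]:
        # Simplification: the first system that knows the key wins.
        for lookup in self._sources.get(entity, []):
            record = lookup(key)
            if record is not None:
                return record
        return None


# Two hypothetical Order Management systems, stubbed as dictionaries.
legacy_orders = {"A-100": {"order_id": "A-100", "total": 250.0}}
erp_orders = {"E-7": {"order_id": "E-7", "total": 99.5}}

resolver = DataResolver()
resolver.register("Order", legacy_orders.get)
resolver.register("Order", erp_orders.get)

# The consumer asks for an Order and never learns which system answered.
print(resolver.resolve("Order", "E-7"))
print(resolver.resolve("Order", "Z-1"))
```

The consumer-facing call stays the same when a third order system is added; only the registration changes.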

Many factors are making this problem worse unless data is given direct attention:

• Rising complexity of data – IT organizations are now handling more data, in more formats, from more partners, and more systems than ever before.

• Increasing business demands – Timely information fuels all the business initiatives that SOA was designed to support. SOA must be able to make data available when and how the business needs it—in batch, near real-time and real-time modes.

• Shrinking IT budgets – Every business initiative spawns a new IT project. And each IT project requires data integration. SOAs do not help IT organizations re-use data integration logic and skills across these projects to keep IT costs in check if data integration logic is buried in Java within service code.

• Proliferating data quality issues – As complexity and agility increase, data quality ‘entropy’ increases. The more data quality is ignored, the worse the problem gets.

In this posting, I’ve talked about the data-centric pitfalls in SOA without talking about the solutions. How can these data-centric pitfalls in SOA be seamlessly handled? Read all about it in the next post.

Next up “SOA’s Last Mile Part III: How to Address SOA’s Hidden Data-Centric Pitfalls Effectively”


Posted in Data Integration, Data Services

SOA’s Last Mile, Part I: Data is a Common Theme in SOA

This blog post is part one of an ongoing series highlighting the importance of data in a Service-Oriented Architecture (SOA) and in Business Process Management (BPM). I look forward to hearing your thoughts and input on the subject.

In 2005, I attended several SOA conferences and tried to discuss ‘data’ with attendees and vendors. Most people looked at me quizzically, then ignored the topic, saying that SOA will abstract away concerns about data types, formats, location, and such. While some nodded about the importance of data semantics, there was little appreciation of the fact that without some kind of ‘data abstraction layer’ for services to utilize, everyone will end up solving the same data access, cleansing, transformation, semantic translation, and integration problems again and again, this time within Java code buried within the services themselves, creating a complex, new ‘Integration Hairball’. Ouch!

But now, almost three years later, data is front and center. With new technologies, people seem to realize that this new ‘Integration Hairball’ will be created in a fraction of the time it took to create the existing, pre-SOA hairball, unless proper approaches to the ‘data problem’ are taken into account with respect to people, processes and technology around data utilized in the SOA initiatives.

Without taking the data into account from the beginning, SOA is just the next evolution of CORBA, COM, client/server, etc. Certainly, SOA can have benefits by itself, but it’s necessary to recognize that it isn’t complete without having a plan to manage the ‘data problem’.

SOA adoption has been accelerating, as can be observed from the various IT initiatives that involve using SOA approaches for delivering a more flexible and responsive IT infrastructure. According to a survey by IDG Research, 91% of CIOs are planning, evaluating or piloting SOA projects. That certainly reflects strong optimism about the ultimate benefits of SOA, even if it means investing months or years in development and testing to get there.


Figure – “Data” is the Common Theme across Many SOA Initiatives (Source: CIO2CIO Study Program, IDG Research, April 17, 2007)


According to the survey conducted by IDG Research, “data” seems to be the common theme across various SOA projects in the enterprise:

• Business flexibility and reusability of software assets and services built on a foundation of quality data

• Project scope driven by the extent to which data and applications touch the entire business

• SOA “readiness” involving more upfront work, namely testing and data quality assurance, than anticipated

SOA does represent a fundamental change in the way businesses are run, a move from rigid, tightly-coupled systems to loosely-coupled systems where applications are built more successfully using reusable components. Thus, those building the services for others to consume must know the location, format, structure, quality, usage and context of data, even if they’ll be abstracting away those complexities for others. Simply put, when it comes to SOA, data management and integration have never been more important.


I believe that the pay-off for SOA will be great if implemented carefully and correctly with the appropriate up-front emphasis given to the data. More controversially, I believe SOA will be a failure WITHOUT emphasis on the data. Obviously, data is not the only, nor perhaps the most, critical factor to SOA success. I’m just saying that it’s an important one that must not be ignored.

Your thoughts? Dive in with your experiences.

Next up “SOA’s Last Mile Part II: SOA’s Hidden Data-Centric Pitfalls”


Posted in Data Integration, Data Services