Data Integration - Informatica

Informatica Data Quality

Blog Update For Current Subscribers

Informatica has launched the Informatica Perspectives blog (RSS) where you can now find the latest Data Quality discussions among other topics. Please update your RSS subscription to track the following RSS feed for the latest blog posts on Data Quality.

Thanks,

The Informatica Team

National Security vs. Privacy Rights - the Role for Technology

Ivan Chong

I ran across an interesting article concerning the US initiative to broker data exchange with various EU nations. The intent is to gain greater access to information that would help in the global war on terror.

European governments are entering into these agreements much more readily than they were four, five years ago, because concerns about terrorism are no longer confined to one side of the Atlantic.

The article then highlights the concerns over violation of personal privacy rights and the potential for abuse.

The agreement, which was described by two European officials, also allows for the transmission of "personal data revealing racial or ethnic origin, political opinion or religious or other beliefs, trade union membership or information concerning health and sexual life" in cases where they are "particularly relevant to the purposes of this agreement." It defines personal data as "any information relating to an identified or identifiable natural person."

The technology challenge can be so consuming that we devote scant attention to the ethical issues involved. Data integration and identity resolution technology are continually advancing. By factoring ethical and moral considerations into the development of the technology, we should be able to support both objectives. Privacy and security do not have to be requirements that trade off against each other. In identity resolution, for example, the technology readily supports masking of personal attributes. Match results can be delivered independently of the conditions that trigger the match. Personal data used for matching can be stored transiently and safeguarded against open access. I'm sure we can debate how effectively the technology serves these objectives. But at the very least, we should include technology in the debate.
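To make the masking idea a little more concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any vendor's product: the `SECRET_KEY`, the field names, and the helper functions are all hypothetical, and it handles exact matches only, whereas real identity resolution usually needs fuzzy matching. Attributes are replaced with keyed hashes, and the matcher reports only whether two records agree, not which values triggered the match.

```python
import hashlib
import hmac

# Hypothetical secret held by the matching service; an assumption for this sketch.
SECRET_KEY = b"replace-with-a-real-secret"


def mask(value: str) -> str:
    """Return a keyed hash (HMAC-SHA256) of a normalized attribute value."""
    # Normalize case and whitespace so trivially different spellings still match.
    normalized = " ".join(value.lower().split())
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict) -> dict:
    """Mask every attribute so the raw personal data need not be stored."""
    return {field: mask(value) for field, value in record.items()}


def is_match(masked_a: dict, masked_b: dict, fields: tuple) -> bool:
    """Report only whether two masked records agree on all of the given fields."""
    return all(hmac.compare_digest(masked_a[f], masked_b[f]) for f in fields)


# Made-up sample records for illustration.
a = mask_record({"name": "Jane  Doe", "birth_date": "1970-01-01"})
b = mask_record({"name": "jane doe", "birth_date": "1970-01-01"})
c = mask_record({"name": "John Roe", "birth_date": "1970-01-01"})

print(is_match(a, b, ("name", "birth_date")))
print(is_match(a, c, ("name", "birth_date")))
```

The caller sees a yes or no answer and never the underlying names or dates, which is the sense in which match results can be separated from the personal data behind them.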

Data Governance at MIT

Ivan Chong

Just gave a presentation at MIT's Information Quality conference, hosted at the Sloan School of Management. Data Governance largely deals with softer topics like people, organizational strategies, and processes, not necessarily technology. The irony was not lost on anyone that a presentation given at MIT stressed that technology alone would not solve a company's data quality problems.

It was a real privilege and honor for me to return as a lecturer to some of the same classrooms I attended as a student. MIT's Sloan School is right next to the Media Lab, where I did undergraduate research some twenty years ago. The most profound takeaway from my time as an engineering student was the notion that technology alone could not solve hard problems. Back in 1986, we were experimenting with sending images and video over the network, and the professors were always stressing that social and organizational considerations factored heavily into technology adoption. This may sound obvious to grizzled IT veterans, but to the wide-eyed geeks studying at MIT, it came as quite a revelation. Certainly, this is the underlying driver behind Data Governance: it is a necessary framework so the enterprise can leverage and apply data quality, data integration, and metadata management technology.

The presentation covered several case studies involving successful customer deployments of enterprise-wide data governance programs.  Many of the attendees commented that they found it necessary to gain initial wins on tactical projects so they could gain credibility and navigate the political issues behind an enterprise deployment. There was certainly some really vigorous discussion and debate on this topic.

What experience have you had with implementing a data governance program?  Just like these MIT students, feel free to share your opinions with us.

Microsoft Buys Zoomix

Ivan Chong

It's been rumored for a while, but now it is official - Microsoft has announced an agreement to buy a data quality startup company, Zoomix, for the purpose of enhancing SQL Server.

Microsoft plans to add Zoomix's technology to future releases of its SQL Server database, the company said through its public relations firm. Zoomix said its development team will join the SQL Server team at Microsoft's research and development center in Israel.

While this is not a large transaction for Microsoft, the move does underscore the importance of Data Quality. However, it raises an interesting question. Who should you trust to deliver data quality? The people who brought you Vista? The folks who sold you SAP? At first glance, it seems quite convenient to be able to deal with data quality issues in conjunction with specific source systems. However, many IT experts would claim this approach is merely a stop-gap measure. Data must be managed apart from its host systems. Data Quality rules start to truly add value to the business when they span MS SQL Server, SAP, Oracle, and every other system that holds the data. It's still a topic of debate. But the discussion has moved beyond the question of "is data quality software useful?" to "where is the most useful place to deliver data quality software?"

Feel free to post your opinions!

Can Data Quality Solve World Hunger?

Ivan Chong

If you ever find yourself discussing the benefits of data quality for your business and one of your associates asks rhetorically, "Yes, but can it solve world hunger?" you now have an answer for them.

The Food and Agriculture Organization of the United Nations records the level of completeness for data collection from each member nation. On their website, their stated mission is to work towards "a world without hunger." A key element in their fight against hunger is the FAOSTAT database, and a key means of maintaining the efficacy of the data is their data quality dashboard.

For organizations working with the FAO, it's important that the data be accurate. Otherwise, perishable goods may be wasted by being shipped to locations whose populations are not malnourished. This example highlights something that I've seen very often in the context of enterprise data quality initiatives. Many prospective customers come to us and ask "how do we get started, given the complexities of coordinating across multiple organizations inside our company?" Within the Informatica customer base, there are many examples of successful initiatives starting off with Data Quality metrics and dashboards. The metrics offer a great way for organizations to maintain a dialog on how to prioritize their investment in data quality.

Already, I've received email comments on my posting. "Can Data Quality allow us to live longer? Facilitate the exploration of outer space?" Great questions… stay tuned for future postings!

Information Presentation Quality

Larry English

"Information Presentation Quality Characteristics"

This blog is the third and last of a series of blogs on the critical-to-quality characteristics of information quality required to achieve Total Information Quality Management. For information to have quality to knowledge workers:

  • It must be clearly defined so knowledge workers understand its meaning
  • It must be complete, accurate, and consistent across all data stores
  • It must be accessed and presented on a timely basis, and in an unbiased way that reveals the truth, so that the knowledge workers can take the right action or make the right decision

The last set of quality characteristics that knowledge workers require is presentation quality characteristics, which we discuss here.

  1. Information Product Specification Data Quality
  2. Information Content Quality
  3. Information Presentation Quality

It is a fatal mistake to measure only the quality of the data content to determine Information Quality. Many process and decision failures result from poor quality presentation of the information.

Presentation quality is part of the human-machine interface. Presentation quality characteristics represent the "look and feel" of the finished information product. These characteristics are not just the prettiness or flashiness of the information presented; they represent the degree to which the information communicates the message in the data accurately and clearly to information consumers so they can perform their work effectively.

Information Presentation Quality Characteristics:

Knowledge workers require different quality characteristics based on their need for the information. Based on my work with dozens of clients, the major information presentation (delivery or communication to information consumers) quality characteristics include:

  • Availability. Information is accessible when it is needed
  • Accessibility. Being able to get the information when needed
  • Presentation Media Appropriateness. Being presented in the right technology medium, such as online, hardcopy report, audio, or video
  • Relevancy. Information is appropriate for the task at hand, i.e., information required to perform a process or make a decision
  • Presentation Standardization. Formatted data is presented consistently in a standardized way across different media, such as in computer screens, generated reports, or manually prepared reports
  • Structured Values. Structured attributes like dates, time, telephone numbers, tax id numbers, product codes, and currency amounts should be presented in a consistent, standard way in any presentation. When numbers and identifiers are chunked, such as standard phone number formats (e.g., [1] (615) 837-1211) they are easier to remember and use
  • Structured Documents. Repeating reports should have a standard format with a style sheet that presents the information in a format that is consistent, easy to read, and easy to understand

Documents should use readability-enhancing techniques such as:

  • Information chunking
  • Use of simple words
  • Short sentences with active verbs
  • Bulleted items for lists
  • A readability index of three grade levels below the reading audience

Methods such as "Information Mapping" help improve readability of documents.

  • Presentation Clarity. Information is presented in a way that communicates the truth of the information. Clear labels, footnotes, other explanatory notes, references, or links to definitions and/or documentation that clearly communicate the meaning and any anomalies in the information enhance presentation clarity
    • Changes in data definition or in business rule specification can make comparisons of information across time boundaries inaccurate
  • Signage Clarity. Signs and other information-bearing mechanisms like traffic signals should be standardized and made universal across the broadest audience possible
    • Traffic signal lights are now standardized globally with red (stop), yellow (caution), and green (go) meanings. Furthermore, traffic signal lights have standard placements, with red on top and green at the bottom, so that people with color-blindness can consistently associate meaning with position. The "redundancy" in this message system reduces error among those affected by color-blindness
  • Presentation Objectivity. Information is presented without bias, enabling the knowledge worker to understand the meaning and significance without misinterpretation
    • Numeric or quantitative data often requires graphical presentation. Objectivity means that the graphical or visual presentation of the information does NOT distort the truth as evidenced in the data
  • Presentation Utility. Information is presented in a way that is intuitive and appropriate for the task at hand. The presentation of information will vary by the individual uses for which it is required. Some uses require concise presentation, while others require a complete, detailed presentation, and yet others require graphics, color-coding, or other highlighting techniques
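As a small illustration of the Structured Values characteristic above, the following Python sketch chunks a phone number into one standard display format. The function name `chunk_phone` and the choice of a 10-digit, North American style format are assumptions made for this example; a real system would follow whatever presentation standard the enterprise has agreed on.

```python
import re


def chunk_phone(raw: str) -> str:
    """Present a 10-digit phone number in one chunked, standard format."""
    # Strip everything except digits, so differently keyed inputs converge.
    digits = re.sub(r"\D", "", raw)
    # Drop a leading country code of 1 if one was entered (an assumption here).
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}")
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


print(chunk_phone("6158371211"))
print(chunk_phone("1-615-837-1211"))
```

Both calls present the number the same way, which is the point: whatever form the value was captured in, every screen and report shows it in the same chunked format.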

For more about Information Presentation Quality, see Chapter 6, "Assessing Information Quality," in Improving Data Warehouse and Information Quality. This contains a more comprehensive list of quality characteristics with examples. It also describes how to measure these quality characteristics.

What do you think? Share your experiences in measuring or improving information presentation quality.

Information Content Quality

Larry English

One of the root causes of poor quality information is defects in the data definition, specifically the "information product specifications." Because information is a product of our business, manufacturing and service processes, the analogy of an "information product" is real, and the requirement for quality in "information product specifications" is a critical requirement for Information Quality.

This blog is the second of a series of three blogs on the critical quality characteristics (or measures) of information quality required on the TIQM Quality System.

  1. Information Product Specification Data Quality
  2. Information Content Quality
  3. Information Presentation Quality

Information Content Quality Characteristics

  • Information standards
  • Data names
  • Data definitions
  • Attribute valid value set or range of values
  • Value format for structured attributes (VIN, SSN, Product Codes)
  • Business rule specifications of constraints on data
  • Information Steward accountable for data definition quality

Information Content Quality Characteristics: The major information content (data values) quality characteristics include:

  • Definition conformance. Data values are consistent with the attribute (fact) definition
  • Completeness. Each process or decision has all the information it requires
    • Record completeness. A record exists for every real world object or event the enterprise needs to know about
    • Value completeness. A given data element (fact) has a value stored for all records that should have a value
  • Validity. Data values conform to the information product specifications
    • Value validity. A data value is a valid value or within a specified range of valid values for this data element
    • Business rule validity. Data values conform to the specified business rules
    • Derivation validity. A derived or calculated data value is produced correctly according to a specified calculation formula or set of derivation rules. If the base values are accurate and the calculation is correctly performed, then the result will be accurate
  • Accuracy. Data values are correct.
    • Accuracy to surrogate source. The data agrees with an original, corroborative source record of data, such as a notarized birth certificate, document, or unaltered electronic data received from a party outside the control of the organization that is demonstrated to be a reliable source
    • Accuracy to reality. The data correctly reflects the characteristics of a real-world object or event being described. Accuracy and precision represent the highest degree of inherent information quality possible
  • Precision. Data values are correct to the right level of detail, such as price to the penny or weight to the nearest tenth of a gram
  • Non-duplication. There is only one record in a database representing a given real-world object or event
  • Source quality warranties/certifications. The source of information: (1) guarantees the quality of information it provides with remedies for non-compliance; (2) documents its certification in its information quality management capabilities to capture, maintain, and deliver quality information; or (3) provides objective and verifiable measures of the quality of information it provides in agreed-upon quality characteristics
  • Equivalence of redundant or distributed data. Data in one database is semantically equivalent to data about the same objects or events in another database
  • Concurrency of redundant or distributed data. The information float or lag time is minimal between (a) when data is knowable (created or changed) in one database and (b) when it is also knowable in a redundant or distributed database, and concurrent queries to each database produce the same result
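Several of the characteristics above lend themselves to simple measurement. The Python sketch below is one possible way to compute value completeness, value validity, and a non-duplication count over a handful of records. The sample records, the field names, and the `VALID_STATES` set are invented for illustration, and the functions are a sketch under those assumptions rather than the measurement method described in the book.

```python
from collections import Counter

# Made-up sample data; the last record deliberately duplicates id 3.
records = [
    {"id": 1, "email": "a@example.com", "state": "TN"},
    {"id": 2, "email": None, "state": "TN"},
    {"id": 3, "email": "c@example.com", "state": "ZZ"},
    {"id": 3, "email": "c@example.com", "state": "ZZ"},
]

# Hypothetical valid value set for the "state" attribute.
VALID_STATES = {"TN", "CA", "NY"}


def value_completeness(rows, field):
    """Share of records that have a stored value for the field."""
    if not rows:
        raise ValueError("no records to measure")
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def value_validity(rows, field, valid_values):
    """Share of records whose value is in the valid value set."""
    if not rows:
        raise ValueError("no records to measure")
    valid = sum(1 for r in rows if r.get(field) in valid_values)
    return valid / len(rows)


def duplicate_count(rows, key):
    """Number of records beyond the first that share the same key."""
    counts = Counter(r[key] for r in rows)
    return sum(n - 1 for n in counts.values() if n > 1)


print(value_completeness(records, "email"))
print(value_validity(records, "state", VALID_STATES))
print(duplicate_count(records, "id"))
```

Note that none of these checks measures accuracy. A value can be complete, valid, and unique and still be wrong, which is why accuracy has to be verified against a surrogate source or against reality.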

For more about Information Content Quality, see Chapter 6, "Assessing Information Quality," in Improving Data Warehouse and Information Quality. This contains a more comprehensive list of quality characteristics with examples. It also describes how to measure these quality characteristics. The next blog will discuss information presentation quality characteristics required for the finished Information Product presented to the knowledge workers.

What do you think? Share your experiences in measuring information content quality, especially accuracy.

2008 Magic Quadrant for Data Quality Tools and Informatica World Thoughts

Chris Cingrani

Gartner released the 2008 Magic Quadrant for Data Quality Tools on Wednesday, June 4th (Gartner, Inc. "Magic Quadrant for Data Quality Tools" by Ted Friedman, Andreas Bitterer June 4, 2008). In reading through the report, I was excited to see that Informatica was positioned in the Leaders Quadrant. If you haven't read the full report, I recommend heading to Informatica.com and requesting a complimentary copy, as it provides significant insight into many of the vendors in the space.

2008 Gartner Data Quality Magic Quadrant

Ivan Chong

Gartner just released this year's report on their Data Quality Tools Magic Quadrant, ranking Informatica right in the mix among vendors listed in the leaders quadrant. These vendors have been recognized leaders for many years. Some may ask "how is it that Informatica has grown so fast to be recognized as a leader in data quality only two years after entering the market?" Gartner cites many of the reasons, including significant adoption, strong data profiling, domain-agnostic data cleansing, ease of use and positive support and services experiences.

Another reason that has often been discussed by industry analysts is the convergence of the data integration and data quality markets. Customers benefit from a tremendous amount of synergy between data integration and data quality. Anyone doing a data migration, data consolidation, data warehouse, or MDM project would consider that project a complete failure if the data were not accurate and consistent. To be compelling, a data quality product must be built on top of a comprehensive data integration platform. Customers can then achieve the levels of scalability, volume processing speed, real-time responsiveness, and near-universal connectivity that the best data integration products provide.

There may be a cluster of vendors in the leaders quadrant for the Data Quality MQ, but in the Gartner Data Integration Magic Quadrant, the choices are much clearer. When making strategic buying decisions, customers can simply look at the intersection of those two reports and quickly discover which vendors offer the best products.

The Latest News and Updates from Informatica World

If you didn't make Informatica World 2008 this year, be sure to check out the latest news, announcements, photos, videos, and more on the Informatica World 2008 blog. Several Informatica thought leaders are now live blogging this week's events, sharing their thoughts on this year's event and product announcements. Take a look, and feel free to share your thoughts and questions.
