Category Archives: Metadata
I am seeing a major change in thinking in the analytics world as people realize that if they do not straighten out their data strategy, they will never really be competitive with their analytics strategy. Here are two quick examples:
Two weeks ago, I attended a Chief Data Officer (CDO) event. I expected it to be all about the role of the CDO and why anybody would take the job if they truly understood what the role was. When I got there I could hardly find a CDO in attendance. It turned out that the conference was actually an Analytics conference that had been running for several years. The reason for the recent change to the CDO title for the event was that they had all come to the realization that if they did not have a strategy to manage the delivery of trusted and timely data, the rest of the effort was irrelevant or even possibly damaging to their organizations.
A while back I read “Competing on Analytics” by Tom Davenport. It was a great and very thought provoking book for its time (2007) and I got a lot out of it. The one slight negative thing I remember about the book was that while the conversation about analytics for competitive advantage was very stimulating, the concept of delivering data to support that strategy did not appear until very late in the book (around page 155 actually). Fast forward to a very recent airplane flight where I read his follow-on book “Analytics at Work.” In this book Tom Davenport introduces his DART framework. Guess what the “D” in DART stands for? Right, Data! In this book, the second chapter is about data and the coverage of the subject is very thorough. (BTW: I highly recommend reading this book.)
These are two great examples of a major change in thinking that I am seeing more and more frequently. Sure, I work for a Data company and tend to talk to people with a data perspective, but I am seeing and hearing this everywhere now.
It is clear, from an architect’s point of view, that we are paying for decades of project-based architecture with little thought to how data might be a shared resource rather than locked into a specific project or application. The movement to cloud applications has only accelerated that trend with business groups within an organization standing up applications and analytics in literally minutes.
What should people do to compete on data? I would recommend several things:
Every other major function is managed with a standard set of tools that ensure repeatable processes, skill reuse, and a degree of automation. It is time to do the same for enterprise data management. It is time to standardize your data management tools. You may not be able to do this 100% without stifling innovation, but a recent survey of architects by Informatica showed that the thought-leaders were planning significantly more standardization than the average architect. It’s the way to deal with data volume and complexity while increasing IT efficiency. Any other strategy is highly risky at this point.
- Design for data as a shared resource. All data onboarded should be prepped and managed in a way that ensures that the data is discoverable, usable, and manageable by any project that needs to use it, not just the current project.
- Design for business self-service. You need to enable your business users to discover and use trusted and timely data by themselves, and without any IT assistance. IT is struggling to meet the need for speed of business value delivery demanded by the business. One way to free up resources is to enable business self-service. It will also dramatically shorten the time required for a lot of business-led projects that require data. Think of all those QlikView and Tableau users you have. What if they could discover and manage the data for themselves?
- Design for automation. Automation requires a standard platform for enterprise data management first. Next, it requires strong integrated metadata management across the platform. By collecting and understanding both technical and business metadata and matching that with actual user activity, it is increasingly possible to create systems that automate routine data management tasks and provide intelligent recommendations for more complex tasks.
Let’s be clear. This is not just about tools. It is about Strategy, People, Processes, Technology (tools) and Metrics. In that exact order. If you don’t have a strategy and if that strategy is not aligned with the business strategy and goals, you will be going nowhere fast. The other items are equally important, but that will be another future blog.
For next-generation thinking on data architectures and management see Think “Data First” to Drive Business Value.
For a free download of Informatica Rev, a tool that enables non-technical users to prepare data, click here.
Recently, at Oracle OpenWorld 2014, many of our customers have told us they are migrating their Data Warehouse into Oracle Exadata. The driver behind this migration is the need to combine and analyze high volumes of data, at increased speeds. Our customers have said they need their Data Integration process to match the speed and performance of Exadata. And they want to scale cost-effectively.
We have good news for our customers moving to Exadata. We can do that for you! We just completed a benchmark with Oracle to certify PowerCenter as optimized for Exadata and SuperCluster. The benchmark used PowerCenter Advanced Edition to load a 1-terabyte data table into Exadata. The results were conclusive and demonstrated a 5X performance improvement over a traditional Oracle Data Warehouse!
With PowerCenter, you can now extract and transform data from hundreds of data-sources, from social to mainframe, and load it into Exadata with the highest data throughput and scalability. This ensures the right data is available in the Data Warehouse in a timely manner. Trusted Analytics is enabled at the speed of your business.
You may be worried about your previous investments in Data Warehousing? When you use PowerCenter to migrate to Exadata, your investment is protected. You can reuse your existing data integration mappings to load data from existing and new sources into Exadata. Just faster. And more of it. No additional coding required!
You can further harness the Exadata horsepower by utilizing PowerCenter Advanced Edition PushDown Optimization. This unique capability allows you to seamlessly process portions of your Data Integration directly on Exadata and further improve performance and scalability.
What else is in the secret sauce? PowerCenter Advanced Edition boasts advanced Scaling capabilities, such as High Availability with automatic fail-over, Enterprise Grid and Partitioning. This will allow you to extract, transform and load (ETL) high volumes of data at unheard of speeds. You can cost-effectively scale your ETL process, with increased performance, by simply adding PowerCenter CPUs and cores. You can increase overall scalability and performance of your Data Management system while optimizing Total Cost of Ownership (TCO).
You want to increase scale and speed of your Data Analytics, as data volumes grow and more projects and stakeholders emerge. It is critical to deliver data insights to stakeholders via data lineage and facilitate impact analysis by developers. PowerCenter Advanced Edition delivers Metadata Management tools and a collaborative Business Glossary to ensure high-quality business analytics across the enterprise. You can accomplish that without compromising speed and cost-efficiency.
Our benchmark also included a profiling query over a 1 TB data set. It was performed 5x faster using Informatica Data Quality. 1.5 million objects could be captured for data lineage. That is Big Data management for many of you. Done faster with quality and governance.
If you, too, are thinking of migrating to Exadata, please think of PowerCenter as your on-ramp into Exadata. Together with Oracle, we can help you dramatically and cost-effectively increase the performance and throughput of your Data Warehouse. We can help you pave the way to the new age of high-speed Analytics. We enable you to expand Data Management investments to incorporate any new Data sources which come along. Our tools will help you bring order into the data chaos. The goal is to deliver the right data at the right time for your analysts to consume and analyze. The business demands no less.
If all you have is a hammer, everything looks like a nail.
Known as the law of the instrument, or Maslow’s hammer, this concept describes an over-reliance on a familiar tool. It is convenient to look for answers where it is easiest, i.e., in data, which raises the question –
Is Big Data overvalued?
The very term Big Data implies that quantity is paramount, and organizations collect it with the desire for accelerated returns. If your database grows 100 times larger, you expect to mine insights that are 100 times or even 300 times more accurate. From what I have seen, more of the same type of data provides only modest, incremental insights, not a quantum leap. Nothing in the data gathered, or in the way it is analyzed, answers the questions you are asking. All that segmenting and clustering and scoring only adds to the already increasing mass of data.
Tasked with buying a cell phone, my siblings, with common genetic and environmental influences, will likely arrive at different consumption choices from mine.
If those closest to me exhibit different preferences, then why are these “previous customers,” i.e., strangers with no common nature or nurture to me, being used to suggest products for me?
Big Data does not understand that consumers are boundedly rational humans optimized over generations for “fight or flight.” In this complex and rapidly changing world, analytical models know very little about a customer’s present preferences and circumstances.
The circumstances of markets, like those of individuals, can change in an instant. Products sell out, forcing consumers to choose from what is available or to wait. Products stagnate. Promotions and discounts alter the relative attractiveness of one product compared with another, stimulating sales of one and depressing sales of another.
Think Small Data …
Focus on small data instead, i.e., product attributes and prices which change over time. This is the data consumers – your customers and your competitors’ customers – are using when choosing. To the extent of their ability, each consumer is assessing, comparing and evaluating the products and services on offer taking into account one’s own dynamically altering preferences over the attributes and one’s own changeable circumstances.
What you should be doing is maximizing the “willingness to pay,” to tap into the customer’s surplus. They will then tend to choose your product in preference to that of your competitors, depending on the bundle of attributes provided by your product. Analyzing data to reduce the error of estimation is not helping your customers to solve their problems – it is increasing them. The manifold combinations and permutations are adding to the burden, not reducing the load.
Customers will pay you for simply reducing the time they need to make a decision. Faced as they are with overwhelming choice, customers want up-to-date, reliable, valid, and trustworthy recommendations.
I just discovered this post Information Governance is more than just Data Governance by E.G. Nadhan. In general, the terms “Data” and “Information” have been used by many to mean the same thing. Nadhan raises some valid points which I will reinforce in this post – specifically, that as data management practices mature there is value in differentiating between data and information. I first wrote about this a few months ago in To Engage Business, Focus on Information Management rather than Data Management. This blog takes the next step to discuss the difference between Information Governance and Data Governance.
Are you in one of those organizations that wants one version of the truth so badly that you have five of them? If so, you’re not alone. How does this happen? The same way the integration hairball happened; point solutions developed without a master plan in a culture of management by exception (that is, address opportunities as exceptions and deal with them as quickly as possible without consideration for broader enterprise needs). Developing a master plan to avoid a data governance hairball is a better approach – but there is a right way and a wrong way to do it.
Here at Informatica we are gearing up for the Informatica World 2014, which will be held in Las Vegas from May 13 – 15. I am excited to report that our Data Quality team has lined up a great agenda for you at the show.
Throughout this 3-day conference there will be a dozen breakout sessions focused on a variety of data quality topics. For those who are interested in knowing more about our Data Quality portfolio and getting their hands dirty, make sure you visit our hands-on labs at the show, where you can learn from our Data Quality experts about some of the new capabilities we delivered in the latest release, Data Quality 9.6. Also, don’t forget to stop by the “Holistic Data Governance” booth at the Pavilion in between the sessions, where you will find product demos and useful materials on the various offerings in the Informatica Data Quality product family.
Below is a quick rundown of the data quality events at Informatica World’14:
Over a dozen breakout sessions will be held throughout the 3-day conference. You will hear from our product experts on topics such as How to Build a Holistic Data Stewardship through the Informatica Platform; What’s New in Informatica Data Quality and Address Doctor; and Data Quality Tips and Tricks. You will also hear Informatica customers sharing their experiences of building holistic data quality and governance practices in their organizations.
Please take a look at the daily agenda for the Data Quality breakout sessions:
Tuesday, May 13
- 1PM – 2PM – Information Potential at Work Track – Holistic Data Stewardship through the Informatica Platform
- 2:15PM – 3:15PM – Information Potential at Work Track – Creative Uses of Data Quality and Data Services
- 4:45PM – 5:45PM – Platform & Products Track – What’s New in Informatica Data Quality and Address Doctor
Wednesday, May 14
- 11:30AM – 12:30PM – Information Potential at Work Track – One Year to Data Governance with Bank of New Zealand
- 2:30PM – 3:30PM – Information Potential at Work Track – Using IDE for Data On-Boarding Framework at HMS
- 5PM – 6PM – Information Potential at Work Track – Business Value of Data Quality
- 5PM – 6PM – Information Potential at Work Track – Establishing an Enterprise-Wide Data Quality Competency – A Financial Services Case Study
Thursday, May 15
- 9AM – 10AM – Developer and Admin Innovation Track – Data Quality Risk Management: CCAR, Basel II, and More
- 9AM – 10AM – Information Potential at Work Track – Building a Sustainable Data Governance Practice at Capital One
- 10:15AM – 11:15AM – Information Potential at Work Track – AddressDoctor: A Low-Effort, High-Impact Boost for Global Business
- 11:30AM – 12:30PM – Developer and Admin Innovation Track – New Features in Informatica Data Quality
- 11:30AM – 12:30PM – Information Potential at Work Track – Data Governance Success Stories
- 2:30PM – 3:30PM – Developer and Admin Innovation Track – Automated Data Quality Scorecards for Data Migration
- 2:30PM – 3:30PM – Developer and Admin Innovation Track – Data Quality Tips and Tricks
Hands-on Labs
Informatica Data Quality product experts will be setting up 10 hands-on labs at the show to demonstrate and educate you on the key capabilities we offer through various offerings in Informatica’s Data Quality portfolio. Below is a glance at the lab sessions:
Tuesday, May 13, 11:30AM – 5:15PM
Wednesday, May 14, 11:30AM – 5:15PM
Thursday, May 15, 9AM – 2:45PM
Lab Sessions (each session repeats every 45 minutes. Please check the master Agenda for individual session times):
- Table 10 – Understand Your Data Process with Metadata Manager
- Table 11 – Establishing Common Understanding of Data with Business Glossary
- Table 19a – DQ/Profiling 9.6
- Table 19b – Exploring your Data and Data Relationships with Profiling and Discovery
- Table 20 – DQ/Profiling 9.6 Upgrade
- Table 22 – Ensuring Trusted Data with Informatica Data Quality
- Table 23 – Data Quality Reporting and Solutions
- Table 24 – Address Doctor
- Table 26 – Empowering Business users through Informatica Analyst
- Table 27 – Getting the Most out of your Data Integration & Data Quality Platform – Performance Tips & Tricks
When you are not attending the sessions, please stop by the “Holistic Data Governance” booth at the Pavilion to pick up the latest materials on Informatica Data Quality offerings and chat with Informatica Data Quality experts about how to start building a holistic data quality process in your organization. Giveaways and party information can also be found at the booth. So come see us when you can!
We are looking forward to meeting you at Informatica World 2014!
Do you remember NASA’s $125 million mistake in 1999? The Mars orbiter was lost as the result of a failed information transfer in which one engineering team used metric units while another used imperial units.
I remember because I could relate. After moving to the U.S. from Canada for graduate school, I had to communicate my height in feet and inches instead of meters and centimeters and give directions in miles instead of kilometers.
On a trip to Vancouver, Canada, Andrew Donaher reminded me about NASA’s costly mistake and how it could have been avoided with a business-friendly data governance program. Following much positive feedback from our last blog, I invited Andy to discuss data governance. You may recall that Andy is the Director of Information Management Strategy at Groundswell Group, a Western Canadian consulting firm that specializes in information management services.
Q. According to www.governyourdata.com, data governance is not about the data. It’s about the business processes, decisions, and stakeholder interactions that you want to enable. What’s your take on the value of data governance?
A. The goal of data governance should be to give people confidence in the data they use to make decisions or take actions. They benefit by not wasting time and energy vetting data or creating new processes. That is a huge value to the organization both in terms of risk mitigation and opportunity. At the absolute highest level, data governance is critical to establish trust and confidence in data.
Q. Explain how IT leaders could approach data governance the wrong way.
A. Typically data governance is approached from a restrictive, security-focused and policing perspective. I have found it much more productive to approach it from an enablement, conversational and guiding perspective. The benefit and value of the rules, policies and procedures associated with governance are that people do not have to re-invent the wheel every time. All those things are set up so people can leverage them to provide value faster.
Think back to when you were learning to ride a bike. Hopefully your parent didn’t stand at a distance barking instructions on what to do and what not to do. He or she started by holding the back of your bike so you felt stable and supported, providing you with guidance on how to do it, words of encouragement about what you’re doing well, and constructive advice on what you could be doing better. Then something would click and you’d get it. When you looked back with a smile on your face, feeling proud of yourself, you’d see your parent was no longer holding your bike. He or she was a few steps behind you smiling back while you rode your bike all by yourself!
Remember that feeling of confidence and elation? That is a form of governance too. It isn’t about shutting things down, it is about enabling and supporting. To do this properly you need to listen and understand what the goals are and what is important. I encourage IT leaders to work closely with line of business leaders to ensure trust and confidence in the data. Everyone should know how to get the proper data they need to help the organization move forward.
Q. Can you share some examples of data governance rules, policies and procedures that are more policing than enabling?
A. An example is when “Hold” or “No” are the default responses to every access request. Typically every database request submitted sits in a queue until an administrator reviews the access request and contacts the person with a series of questions that typically add little value. Sometimes the request is granted or it’s escalated for further investigation. While there is absolutely a level of security and policing that needs to occur on sensitive information, sometimes security and governance can unnecessarily become synonymous.
A potential policy alternative is first distinguishing between sensitivity in data structures and then codifying access policies. For example, imagine someone requests read-only access to a generally available schema in the enterprise data warehouse. This person has a particular job title and works in a particular department. Another person with the same role has similar access. The process requires an “approver” to manually review and approve the request. In this instance, you could set up the access request for automatic approval. The risk will have been mitigated through the applied rules, so you have the necessary governance, but you’ve enabled the business to move faster. That’s a win for everyone involved.
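To make the idea of codified access policies a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the schema name, the job title and department pair, and the `decide` function are assumptions of mine, not part of any product or of Andy's actual process. It simply routes low-risk requests to automatic approval and sends everything else to a human approver.

```python
from dataclasses import dataclass

# Hypothetical policy data, for illustration only.
# Schemas classified as generally available (non-sensitive).
GENERALLY_AVAILABLE_SCHEMAS = {"edw_sales_summary"}
# (job title, department) pairs where peers in the same role already hold access.
PRE_APPROVED_ROLES = {("analyst", "finance")}


@dataclass(frozen=True)
class AccessRequest:
    job_title: str
    department: str
    schema: str
    access_level: str  # "read" or "write"


def decide(request: AccessRequest) -> str:
    """Return "auto-approve" for low-risk requests, otherwise "manual-review"."""
    low_risk = (
        request.access_level == "read"
        and request.schema in GENERALLY_AVAILABLE_SCHEMAS
        and (request.job_title, request.department) in PRE_APPROVED_ROLES
    )
    return "auto-approve" if low_risk else "manual-review"


# A read-only request to a generally available schema from a pre-approved role.
print(decide(AccessRequest("analyst", "finance", "edw_sales_summary", "read")))
```

The governance still lives in the rules; the sketch only removes the queue for requests the rules already cover. Anything touching a sensitive schema, or asking for write access, still lands with an approver.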
Q. Can you give some concrete advice about how to kick off a successful data governance initiative using an enabling approach?
A. I have two recommendations:
- Recruit Business Partners: Make certain you have some highly respected, experienced and motivated business partners to participate in the kick-off.
- Quantify the Value: As a group, quantify the value of risk mitigation and opportunity cost. For example:
  - To quantify the risk, measure the dollar value of a wrong metric going to the investor community, the impact on the market value and the percentage chance of it happening. Or quantify the executive team making a wrong decision based on incorrect information.
  - To quantify the opportunity, calculate the value of speed-to-market, getting a product to customers quicker than a competitor. You should be able to find examples of how much it cost your organization when you launched a product before a competitor and when you launched a product after a competitor. You can leverage that in your calculation to ensure everyone knows exactly how important enablement is.
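The arithmetic behind these two estimates can be sketched in a few lines of Python. This is one simple way to frame it, not a formula from the interview: the risk is treated as dollar impact multiplied by the chance of it happening, and the opportunity as the difference between launching first and launching second. The function names and all figures are made up for illustration.

```python
def expected_risk_cost(impact_dollars: float, probability: float) -> float:
    """Expected cost of a risk event: dollar impact times the chance it happens."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return impact_dollars * probability


def speed_to_market_value(result_when_first: float, result_when_second: float) -> float:
    """Opportunity value of launching before a competitor rather than after."""
    return result_when_first - result_when_second


# Made-up figures: a wrong metric reaching investors costs $20M in market value
# and has a 2% chance of happening in a given year.
risk = expected_risk_cost(impact_dollars=20_000_000, probability=0.02)

# Made-up figures: a past launch ahead of a competitor earned $12M,
# a comparable launch behind a competitor earned $9M.
opportunity = speed_to_market_value(12_000_000, 9_000_000)

print(f"Expected risk cost: ${risk:,.0f}")
print(f"Speed-to-market value: ${opportunity:,.0f}")
```

Putting both numbers side by side gives the kick-off group a shared, if rough, figure to argue about.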
When you work collaboratively, business and IT will be on the same page. Business leaders should understand the pressures the IT group is under to protect corporate data. The IT team should understand the pressure business leaders are under to get answers to questions quickly to cut costs and find opportunities for growth in revenue and profits.
Q. Any tips on how to enable data governance processes with technology?
A. You may want to consider these two valuable elements to make data governance and analysis even easier:
- Metadata Manager provides a frame of reference or the context to give data meaning. It enables IT staff to manage technical metadata and perform an impact analysis of a proposed change before it is implemented. Root cause analysis, in turn, enables business partners to dig into a term in a report to understand the source of the data and how it was moved and transformed before it was added to the report.
- Business Glossary maintains a standard set of business definitions, accountability for its terms and an audit trail for compliance. It enables business partners and IT to collaboratively manage business metadata. To use a healthcare example, does “Claim Paid Date” mean the date it was approved, the check was cut or the check cleared? Turn to Business Glossary to find out.
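For readers who think in code, impact analysis and root cause analysis are two directions of travel over the same lineage graph. The Python sketch below is a generic illustration of that idea and makes no claim about how Metadata Manager works internally; the `LINEAGE` mapping and the object names in it are hypothetical.

```python
from collections import deque

# Hypothetical lineage: each key feeds the objects listed as its values.
LINEAGE = {
    "crm.claims": ["staging.claims"],
    "staging.claims": ["edw.claim_facts"],
    "edw.claim_facts": ["report.claims_paid", "report.reserves"],
}


def downstream(graph: dict[str, list[str]], start: str) -> set[str]:
    """Impact analysis: everything reachable from `start` by following data flow."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(graph: dict[str, list[str]], target: str) -> set[str]:
    """Root cause analysis: reverse the edges, then reuse the same traversal."""
    reversed_graph: dict[str, list[str]] = {}
    for source, targets in graph.items():
        for t in targets:
            reversed_graph.setdefault(t, []).append(source)
    return downstream(reversed_graph, target)


# What is affected if the staging table changes?
print(sorted(downstream(LINEAGE, "staging.claims")))
# Where did the numbers in the claims-paid report come from?
print(sorted(upstream(LINEAGE, "report.claims_paid")))
```

The first call answers the IT question (what breaks if I change this?), the second answers the business question (where did this number come from?).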
Q. Can you rescue a data governance initiative that was built based on a policing approach?
A. Absolutely. It takes effort and thought but it can absolutely be done. The key to doing it is realizing the opportunity cost of having people create their own business rules and metrics. While there is a cost to the wasted labor, the greatest cost is lost opportunities. If people are spending time trying to recreate rules and reconcile numbers, they won’t have time to focus on the game changing insight you get from predictive analytics or optimization, which is where the real competitive advantage lies.
As I was scanning my BBC app on my iPhone a few weeks ago I noticed this article on how game companies are sharing files for distributed development. It talked about how EA was overcoming the development challenges of the multi-shooter game “Battlefield 4”. Not only were they handling the code itself but also very large graphics and sound files, as the complete game file was larger than 50GB.
Rather than file transfer or email the whole file (impossible) or chunks (too expensive, too time consuming), they were using Panzura’s cloud storage controller to store the “master file” in the cloud and handle code and content deltas (<5% of the total file). This is very similar to what an MDM environment does in the B2B space when it checks for duplicates and syncs only “approved” attribute-level net changes into the MDM hub and also back to the source system.
This is as much a file transfer challenge around compression as it is a logical challenge of detecting and automating updates when appropriate and flagging them for review when inappropriate. The similarities to an MDM system are shocking. Just as two or more CRM, billing or asset management systems handle their somewhat similar, yet still different, individual “master” files of an asset, a customer, an account or a product, the game development operation syncs its copies across development and QA locations based on whether it is a Sony Playstation or Microsoft XBox “view” of the same game.
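To illustrate what I mean by attribute-level net changes, here is a small Python sketch. It is a toy under stated assumptions, not how any MDM hub or cloud storage controller is actually implemented: records are plain dictionaries, and the function names and sample customer record are hypothetical.

```python
_MISSING = object()  # sentinel so a missing attribute differs from a None value


def attribute_delta(master: dict, source: dict) -> dict:
    """Return only the attributes whose values in `source` differ from `master`."""
    return {
        key: value
        for key, value in source.items()
        if master.get(key, _MISSING) != value
    }


def apply_approved(master: dict, delta: dict, approved: set) -> dict:
    """Merge approved changes into a copy of the master; the rest wait for review."""
    merged = dict(master)
    merged.update({k: v for k, v in delta.items() if k in approved})
    return merged


master = {"name": "Acme", "city": "Boston", "phone": "555-0100"}
source = {"name": "Acme", "city": "Chicago", "phone": "555-0199"}

delta = attribute_delta(master, source)
# Only the city change has been approved; the phone change stays flagged.
updated = apply_approved(master, delta, approved={"city"})
print(delta)
print(updated)
```

Only the changed attributes travel, and only the approved ones land in the master, which is the same economy the game studios get from shipping deltas instead of the whole 50GB file.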
In the event a cloud storage provider goes belly-up – just as it happened in the BBC article – there is obviously (as there is in MDM) the possibility of a cloud-onsite hybrid.
Now my juices got flowing – Informatica should be using its experience in ETL and SOA to use the MDM Hub for use cases where structured master data need to be used to sync chunks of large files relevant to a particular transaction, say a whole life insurance application, a mutual fund annual proxy statement, a car manual, etc. Rather than mail these massive booklets every quarter or year, these files should be developed and distributed based on preference attributes linked to a customer account and location and assembled on-the-fly for the particular object in question.
Surely, the risk disclaimers or the steps to change a spark plug are 80% similar between instruments or vehicles, so why reprint or duplicate them electronically?
Spinning this further, what happens if developers need to understand gamers’ behavior in terms of hacks they applied and attempts/behavior to get to the next save point? This then becomes increasingly a Big Data paradigm, especially for situations where the broadband signal is run through the XBox and constant switches between TV, web browsing, a voice call, VoD, OnDemand Games and XBox Games occur. My head is starting to smoke already. Would a switch from the game to a local TV station or HBO now indicate that the gamer was getting a bit tired or bored at a certain stage in the game? What happens if the Kinect detects they actually walked away? So much data – so little time.
These are my two cents….as I am pretty sure Informatica will not get into the gaming business any time soon. And I was so hoping I could expense this Christmas’ XBox for customer demo purposes (LoL).
Are you aware of any untraditional uses of master data, maybe in combination with knowledge or content management systems? Would love to hear some ideas.
Why Now is the Time for an Investment in Data Management
All application managers have gotten this question at some point or another. But it could be worse. Consider if the question never was asked and that bad data caused an error in a crucial business process or transaction. The damage can be significant and it happens every day.
Let’s suppose you do have bad data in an enterprise application. This raises a number of very difficult questions:
- Provenance. Where did this data come from? Is the data from the right source?
- Transformation. Was the data transformed correctly as it was moved from source to target?
- Operational. Was there an operational error along the way that caused a critical process to run only partially or not at all?
- Change Management. Did somebody make a change to the data integration / data management system that looked like a logical solution to their problem, but that caused your application to receive bad data?
Good data is the crude oil (we’re all going to hear that analogy a lot more!) that business processes run on. If you have bad (or dirty) oil, you are going to have problems with the process.
Why do application managers care? After all this is data integration, not application management. The answer is pretty straightforward:
- So you don’t get questions like the one above, questions that suck up the time of your staff. (15 hours per analyst per month, according to one customer source)
- So bad data does not lead to bad transactions and bad decisions.
- So that bad or inconsistent data does not damage the confidence of the users of your application, causing workarounds and lack of adoption.
So, what should be done to fix this? It is time to start thinking about data integration and data quality management as a single system rather than a somewhat random collection of expensive one-off projects. The result will be lower costs, higher productivity, and greater user confidence in your enterprise applications.
For more on this and related topics, visit our Potential at Work site for Application Leaders.
I’m glad you enjoyed my last letter explaining what data is and how people in my industry make a living managing it. After that letter, you confidently answered all data-related questions your knitting-circle friends could throw at you. But then Edward Snowden, former NSA contractor and world-renowned whistle-blower, came on the scene. Suddenly mainstream news anchors are talking about metadata.
I got your panicked voicemail and, as promised, I’m going to try to clarify what metadata is and how it relates to data.