Category Archives: Enterprise Data Management
This got me thinking: What is the biggest bottleneck in the delivery of business value today? I know I look at things from a data perspective, but data is the biggest bottleneck. Consider this prediction from Gartner:
“Gartner predicts organizations will spend one-third more on app integration in 2016 than they did in 2013. What’s more, by 2018, more than half the cost of implementing new large systems will be spent on integration.”
When we talk about application integration, we’re talking about moving data, synchronizing data, cleansing data, transforming data, and testing data. The question for architects and senior management is this: Do you have the Data Foundation for Execution you need to drive the business results you require to compete? The answer, unfortunately, for most companies is: No.
All too often data management is an add-on to larger application-based projects. The result is unconnected and non-interoperable islands of data across the organization. That simply is not going to work in the coming competitive environment. Here are a couple of quick examples:
- Many companies are looking to compete on their use of analytics. That requires collecting, managing, and analyzing data from multiple internal and external sources.
- Many companies are focusing on a better customer experience to drive their business. This again requires data from many internal sources, plus social, mobile and location-based data to be effective.
When I talk to architects about the business risks of not having a shared data architecture, and common tools and practices for enterprise data management, they “get” the problem. So why aren’t they addressing it? The issue is that they are funded only for the project they are working on, and they face very demanding timeframes. They have no funding or mandate to solve the larger enterprise data management problem, which is getting more complex and brittle with each new unconnected project or initiative that is added to the pile.
Studies such as “The Data Directive” by The Economist show that organizations that actively manage their data are more successful. But, if that is the desired future state, how do you get there?
Changing an organization to look at data as the fuel that drives strategy takes hard work and leadership. It also takes a strong enterprise data architecture vision and strategy. For fresh thinking on the subject of building a data foundation for execution, see “Think Data-First to Drive Business Value” from Informatica.
* By the way, Informatica is proud to announce that we are now a sponsor of the MIT Center for Information Systems Research.
According to the Financial Executives Institute, CFOs say their second highest priority this year is to harness business intelligence and big data. Their highest priority is to improve cash flow and working capital efficiency and effectiveness. This means CFOs’ two highest priorities are centered around data. At roughly the same time, KPMG has found in their survey of CFOs that 91% want to improve the quality of the financial and performance insight obtained from the data that they produce. Even more amazing, 51% of CFOs admitted that “collecting, storing, and retrieving financial and performance data at their company is primarily accomplished through a manual and/or spreadsheet-based exercise”. From our interviews of CFOs, we believe this number is much higher.
Your question at this point, if you are not a CFO, should be: how can this be the case? After all, strategy consultants like Booz and Company actively measure the degree of digitization and automation taking place in businesses by industry, and these numbers have shown a strong upward bias year after year. How can the finance organization be digitized for data collection but still largely manual in its processes for putting together the figures that management and the market need?
CFOs do not trust their data
In our interviews of CFOs, one CFO answered this question bluntly by saying “If the systems suck, then you cannot trust the numbers when you get them.” And this reality truly limits CFOs in how they respond to their top priorities. Things like management of the P&L, Expense Management, Compliance, and Regulatory are all impacted by the CFOs’ data problem. Instead of doing a better job at these issues, CFOs and their teams remain largely focused on “getting the numbers right”. Even worse, answering business questions, like how much revenue a customer provides or how profitable that customer is, involves manual pulls of data from more than one system today. And yes, similar data issues exist in financial services organizations which close the books nightly.
The CFOs that I have talked to admit without hesitation that data is a big issue for them. These CFOs say that they worry about data from the source and the ability to do meaningful financial or managerial analysis. They say they need to rely on data in order to report but, just as important, they need it to help drive synergies across businesses. This matters because CFOs say they want to move from being just “bean counters” to being participants in the strategy of their enterprises.
To succeed, CFOs say that they need timely, accurate data. However, they are the first to discuss how disparate systems get in their way. CFOs believe that making their lives easier starts with the systems that support them. What they believe is needed is real integration and consolidation of data. One CFO put it this way: “we need the integration of the right systems to provide the right information so we can manage and make decisions at the right time”. CFOs clearly want to know that the accounting systems are working and reliable. At the same time, CFOs want, for example, a holistic view of the customer. When asked why this isn’t a marketing activity, they say this is a business issue that CFOs need to help manage. “We want to understand the customer across business units. It is a finance objective because finance is responsible for business metrics and there are gaps in business metrics around customer. How much cross sell opportunities is the business as a whole pursuing?”
Chief Profitability Officers?
Jonathan Byrnes at the MIT Sloan School confirms this viewpoint is becoming a larger trend when he suggests that CFOs need to take on the function of “Chief Profitability Officers”. Wearing this hat, CFOs, in his view, need to determine which product lines, customers, segments, and channels are the most and the least profitable. Once again, this requires that CFOs tackle their data problem to have relevant, holistic information.
CIOs remain responsible for data delivery
CFOs believe that CIOs remain responsible for how data is delivered. CFOs say that they need to lead in creating validated data and reports. Clearly, if data delivery remains a manual process, then the CFO will be severely limited in their ability to adequately support their new and strategic charter. Yet CFOs, when asked if they see data as a competitive advantage, say that “every CFO would view data done well as a competitive advantage”. Some CFOs even suggest that data is the last competitive advantage. This fits really well with the view of Davenport in “Competing on Analytics”. The question is how soon CIOs and CFOs will work together to get the finance organization out of its mess of manually massaging and consolidating financial and business data.
Solution Brief: The Intelligent Data Platform
You probably know this already, but I’m going to say it anyway: It’s time you changed your infrastructure. I say this because most companies are still running infrastructure optimized for ERP, CRM and other transactional systems. That’s all well and good for running IT-intensive, back-office tasks. Unfortunately, this sort of infrastructure isn’t great for today’s business imperatives of mobility, cloud computing and Big Data analytics.
Virtually all of these imperatives are fueled by information gleaned from potentially dozens of sources to reveal our users’ and customers’ activities, relationships and likes. Forward-thinking companies are using such data to find new customers, retain existing ones and increase their market share. The trick lies in translating all this disparate data into useful meaning. And to do that, IT needs to move beyond focusing solely on transactions, and instead shine a light on the interactions that matter to their customers, their products and their business processes.
They need what we at Informatica call a “Data First” perspective. You can check out my first blog post about being Data First here.
A Data First POV changes everything from product development, to business processes, to how IT organizes itself and, most especially, the impact IT has on your company’s business. That’s because cloud computing, Big Data and mobile app development shift IT’s responsibilities away from running and administering equipment and onto aggregating, organizing and improving myriad data types pulled in from internal and external databases, online posts and public sources. And that shift makes IT a more-empowering force for business change. Think about it: The ability to connect and relate the dots across data from multiple sources finally gives you real power to improve entire business processes, departments and organizations.
I like to say that the role of IT is now “big I, little t,” with that lowercase “t” representing both technology and transactions. But that role requires a new set of priorities. They are:
- Think about information infrastructure first and application infrastructure second.
- Create great data by design. Architect for connectivity, cleanliness and security. Check out the eBook Data Integration for Dummies.
- Optimize for speed and ease of use – SaaS and mobile applications change often. Click here to try Informatica Cloud for free for 30 days.
- Make data a team sport. Get tools into your users’ hands so they can prepare and interact with it.
I never said this would be easy, and there’s no blueprint for how to go about doing it. Still, I recognize that a little guidance will be helpful. In a few weeks, Informatica’s CIO Eric Johnson and I will talk about how we at Informatica practice what we preach.
Malcolm Gladwell wrote an article in The New Yorker magazine in January, 2007 entitled “Open Secrets.” In the article, he pointed out that a national-security expert had famously made a distinction between puzzles and mysteries.
Osama bin Laden’s whereabouts were, for many years, a puzzle. We couldn’t find him because we didn’t have enough information. The key to the puzzle, it was assumed, would eventually come from someone close to bin Laden, and until we could find that source, bin Laden would remain at large. In fact, that’s precisely what happened. Al-Qaida’s No. 3 leader, Khalid Sheikh Mohammed, gave authorities the nickname of one of bin Laden’s couriers, who then became the linchpin of the CIA’s efforts to locate bin Laden.
By contrast, the problem of what would happen in Iraq after the toppling of Saddam Hussein was a mystery. It wasn’t a question that had a simple, factual answer. Mysteries require judgments and the assessment of uncertainty, and the hard part is not that we have too little information but that we have too much.
This was written before “Big Data” was a household word, and it raises the very interesting question of whether organizations and corporations that are, by anyone’s standards, totally deluged with data are facing puzzles or mysteries. Consider the amount of data that a company like Western Union deals with.
Western Union is a 160-year-old company. Having built scale in the money transfer business, the company is in the process of evolving its business model by enabling the expansion of digital products, growth of web and mobile channels, and a more personalized online customer experience. Sounds good – but get this: the company processes more than 29 transactions per second on average. That’s 242 million consumer-to-consumer transactions and 459 million business payments in a year. Nearly a billion transactions – a billion! As my six-year-old might say, that number is big enough “to go to the moon and back.” Layer on top of that the fact that the company operates in 200+ countries and territories, and conducts business in 120+ currencies. Senior Director and Head of Engineering Abhishek Banerjee has said, “The data is speaking to us. We just need to react to it.” That implies a puzzle, not a mystery – but only if data scientists are able to conduct statistical modeling and predictive analysis, systematically noting trends in sending and receiving behaviors. Check out what Banerjee and Western Union CTO Sanjay Saraf have to say about it here.
Or consider General Electric’s aggressive and pioneering move into what has been dubbed the industrial internet. In a white paper entitled “The Case for an Industrial Big Data Platform: Laying the Groundwork for the New Industrial Age,” GE reveals some of the staggering statistics related to the industrial equipment that it manufactures and supports (services comprise 75% of GE’s bottom line):
- A modern wind turbine contains approximately 50 sensors and control loops which collect data every 40 milliseconds.
- A farm controller then receives more than 30 signals from each turbine at 160-millisecond intervals.
- At every one-second interval, the farm monitoring software processes 200 raw sensor data points, with various associated properties, for each turbine.
Phew! I’m no electricity operations expert, and you probably aren’t either. And most of us will get no further than simply wrapping our heads around the simple fact that GE turbines are collecting a LOT of data. But what the paper goes on to say should grab your attention in a big way: “The key to success for this wind farm lies in the ability to collect and deliver the right data, at the right velocity, and in the right quantities to a wide set of well-orchestrated analytics.” And the paper goes on to recommend that anyone involved in the Industrial Internet revolution strongly consider its talent requirements, with the suggestion that Chief Data Officers and/or Data Scientists may be the next critical hires.
Which brings us back to Malcolm Gladwell. In the aforementioned article, Gladwell goes on to pull apart the Enron debacle, and argues that it was a prime example of the perils of too much information. “If you sat through the trial of (former CEO) Jeffrey Skilling, you’d think that the Enron scandal was a puzzle. The company, the prosecution said, conducted shady side deals that no one quite understood. Senior executives withheld critical information from investors…We were not told enough—the classic puzzle premise—was the central assumption of the Enron prosecution.” But in fact, that was not true. Enron employed complicated – but perfectly legal–accounting techniques used by companies that engage in complicated financial trading. Many journalists and professors have gone back and looked at the firm’s regulatory filings, and have come to the conclusion that, while complex and difficult to identify, all of the company’s shenanigans were right there in plain view. Enron cannot be blamed for covering up the existence of its side deals. It didn’t; it disclosed them. As Gladwell summarizes:
“Puzzles are ‘transmitter-dependent’; they turn on what we are told. Mysteries are ‘receiver dependent’; they turn on the skills of the listener.”
I would argue that this extremely complex, fast moving and seismic shift that we call Big Data will favor those who have developed the ability to attune, to listen and make sense of the data. Winners in this new world will recognize what looks like an overwhelming and intractable mystery, and break that mystery down into small and manageable chunks and demystify the landscape, to uncover the important nuggets of truth and significance.
A mid-sized insurer recently approached our team for help. They wanted to understand how they fell short in making their case to their executives. Specifically, they proposed that fixing their customer data was key to supporting the executive team’s highly aggressive 3-year growth plan. (This plan was 3x today’s revenue). Given this core organizational mission – aside from being a warm and fuzzy place to work supporting its local community – the slam dunk solution to help here is simple. Just reducing the data migration effort around the next acquisition or avoiding the ritual annual, one-off data clean-up project already pays for any tool set enhancing data acquisitions, integration and hygiene. Will it get you to 3x today’s revenue? It probably won’t. What will help are the following:
Hard cost avoidance via software maintenance or consulting elimination is the easy part of the exercise. That is why CFOs love it and focus so much on it. It is easy to grasp and immediate (aka next quarter).
Soft cost reductions, like staff redundancies, are a bit harder. Despite them being viable, in my experience very few decision makers want to work on a business case to lay off staff. My team has had one so far. They look at these savings as freed-up capacity, which can be re-deployed more productively. Productivity is also a bit harder to quantify, as you typically have to understand how data travels and gets worked on between departments.
However, revenue effects are even harder, and more esoteric to many people, as they include projections. They are often considered “soft” benefits, although they outweigh the other areas by 2-3 times in terms of impact. Ultimately, every organization runs its strategy based on projections (see the insurer in my first paragraph).
The hardest to quantify is risk. Not only is it based on projections – often from a third party (Moody’s, TransUnion, etc.) – but few people understand it. More often, clients don’t even accept you investigating this area if you don’t have an advanced degree in insurance math. Nevertheless, risk can generate extra “soft” cost avoidance (beefing up reserve account balance creating opportunity cost) but also revenue (realizing a risk premium previously ignored). Often risk profiles change due to relationships, which can be links to new “horizontal” information (transactional attributes) or vertical (hierarchical) from parent-child relationships of an entity and the parent’s or children’s transactions.
Given the above, my initial advice to the insurer would be to look at the heartache of their last acquisition, use a benchmark for IT productivity from improved data management capabilities (typically 20-26% – Yankee Group) and there you go. This is just the IT side, so consider increasing the upper range by 1.4x (Harvard Business School), as every attribute change (last mobile view date) requires additional meetings at the manager, director and VP level. These people’s time gets increasingly expensive. You could also use Aberdeen’s benchmark of 13 hours per average master data attribute fix instead.
You can also look at productivity areas, which are typically overly measured. Let’s assume a call center rep spends 20% of the average call time of 12 minutes (depending on the call type – account or bill inquiry, dispute, etc.) understanding:
- Who the customer is
- What he bought online and in-store
- If he tried to resolve his issue on the website or store
- How he uses equipment
- What he cares about
- If he prefers call backs, SMS or email confirmations
- His response rate to offers
- His/her value to the company
If he currently spends that 20% of every call stringing together insights from five applications and twelve screens, then presenting the same information in a single frame, within seconds, in every application he touches frees up 20% worth of his hourly compensation.
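The call center arithmetic above can be sketched in a few lines of Python. The 12-minute call and the 20% lookup share come from the text; the yearly call volume and the fully loaded hourly cost are hypothetical placeholders, so swap in your own figures.

```python
def call_handling_savings(calls_per_year, avg_call_minutes=12.0,
                          lookup_share=0.20, hourly_cost=30.0):
    """Estimate the yearly cost of the time reps spend assembling customer context.

    avg_call_minutes and lookup_share follow the example in the text;
    hourly_cost (fully loaded compensation) is an assumed placeholder.
    """
    lookup_minutes = calls_per_year * avg_call_minutes * lookup_share
    return lookup_minutes / 60.0 * hourly_cost


# Hypothetical volume: 500,000 calls a year.
savings = call_handling_savings(calls_per_year=500_000)
print(f"Estimated annual savings: ${savings:,.0f}")
```

The estimate is an upper bound: it assumes the whole lookup share disappears once the information sits in one frame.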
Then look at the software, hardware, maintenance and ongoing management of the likely customer record sources (pick the worst and best quality one based on your current understanding), which will end up in a centrally governed instance. Per DAMA, every duplicate record will cost you between $0.45 (party) and $0.85 (product) per transaction (edit touch). At the very least each record will be touched once a year (likely 3-5 times), so multiply your duplicated record count by that and you have your savings from just de-duplication. You can also use Aberdeen’s benchmark of 71 serious errors per 1,000 records, meaning the chance of transactional failure and required effort (% of one or more FTE’s daily workday) to fix is high. If this does not work for you, run a data profile with one of the many tools out there.
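As a rough sketch of the de-duplication estimate, the snippet below multiplies a duplicate count by the per-touch cost and the number of touches per year. The $0.45 and $0.85 figures and the 1 to 5 touches are the ones quoted above; the duplicate count is a made-up input you would replace with your own profiling result.

```python
PARTY_COST_PER_TOUCH = 0.45    # per duplicate party record, as quoted in the text
PRODUCT_COST_PER_TOUCH = 0.85  # per duplicate product record, as quoted in the text


def dedup_savings(duplicate_records, cost_per_touch, touches_per_year=1):
    """Yearly cost avoided by removing duplicates that would otherwise be touched."""
    return duplicate_records * cost_per_touch * touches_per_year


duplicates = 50_000  # hypothetical count of duplicate customer (party) records
low = dedup_savings(duplicates, PARTY_COST_PER_TOUCH, touches_per_year=1)
high = dedup_savings(duplicates, PARTY_COST_PER_TOUCH, touches_per_year=5)
print(f"De-duplication savings: ${low:,.0f} to ${high:,.0f} per year")
```

Running the same function with the product figure gives the equivalent range for duplicated product records.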
If standardization of records (zip codes, billing codes, currency, etc.) is the problem, ask your business partner how many customer contacts (calls, mailings, emails, orders, invoices or account statements) fail outright and/or require validation because of these attributes. Once again, if you apply the productivity gains mentioned earlier, there are your savings. If you look at the number of orders whose payment or revenue recognition gets delayed by a week or a month, and at the average order amount, you can quantify how much profit (multiply by operating margin) you would be able to pull into the current financial year from the next one.
The same is true for speeding up the introduction of a new product, or a change to it, which generates profits earlier. Note that the time value of funds realized earlier is too small to matter in most instances, especially in the current interest rate environment.
If emails bounce back or snail mail gets returned (no such address, no such name at this address, no such domain, no such user at this domain), e(mail) verification tools can help reduce the bounces. If every mail piece costs $1.25 (forget email due to its minuscule cost), and this will vary by type of mailing (catalog, promotion post card, statement letter), then every incorrect or incomplete record is wasted cost. If you can, use fully loaded print cost incl. 3rd party data prep and returns handling. You will never capture all cost inputs but take a conservative stab.
If it was an offer, reduced bounces should also improve your response rate (also true for email now). Prospect mail response rates are typically around 1.2% (Direct Marketing Association), whereas phone response rates are around 8.2%. If you know that your current response rate is half that (for argument’s sake) and you send out 100,000 emails of which 1.3% (Silverpop) have customer data issues, then fixing 81-93% of them (our experience) will drop the bounce rate to under 0.3%, meaning more emails will arrive/be relevant. This in turn multiplied by a standard conversion rate (MarketingSherpa) of 3% (industry and channel specific) and average order (your data) multiplied by operating margin gets you a benefit value for revenue.
If product data and inventory carrying cost or supplier spend are your issue, find out how many supplier shipments you receive every month and the average cost of a part (or cost range), then apply the Aberdeen master data failure rate (71 in 1,000) to use cases around lack of or incorrect supersession or alternate part data, to assess the value of a single shipment’s overspend. You can also just use the ending inventory amount from the 10-K report and apply a 3-10% improvement (Aberdeen) in a top-down approach. Alternatively, apply 3.2-4.9% to your annual supplier spend (KPMG).
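The email example can be worked through in code. This is a minimal sketch: the 100,000 emails, the 1.3% issue rate, the 81% fix rate (the low end of the quoted range) and the 3% conversion rate come from the text, while the average order value and operating margin are hypothetical stand-ins for "your data".

```python
def email_revenue_benefit(emails_sent, issue_rate, fix_rate,
                          conversion_rate, avg_order, operating_margin):
    """Return (remaining bounce rate, profit from emails that now arrive)."""
    bad_records = emails_sent * issue_rate
    recovered = bad_records * fix_rate           # emails that now reach someone
    remaining_rate = (bad_records - recovered) / emails_sent
    benefit = recovered * conversion_rate * avg_order * operating_margin
    return remaining_rate, benefit


remaining_rate, benefit = email_revenue_benefit(
    emails_sent=100_000,
    issue_rate=0.013,       # share of emails with customer data issues
    fix_rate=0.81,          # low end of the 81-93% range
    conversion_rate=0.03,
    avg_order=80.0,         # hypothetical average order value
    operating_margin=0.10,  # hypothetical operating margin
)
print(f"Remaining bounce rate: {remaining_rate:.3%}")
print(f"Profit from recovered emails: ${benefit:,.2f}")
```

Even at the low end of the fix range the remaining bounce rate lands under 0.3%, which matches the claim in the paragraph above.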
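The two top-down approaches at the end of this paragraph reduce to applying a percentage range to a base amount. Here is a small sketch using the quoted ranges (3-10% of ending inventory, 3.2-4.9% of annual supplier spend); both dollar amounts are invented for illustration.

```python
def improvement_range(base_amount, low_rate, high_rate):
    """Apply a benchmark improvement range to a base amount."""
    return base_amount * low_rate, base_amount * high_rate


ending_inventory = 40_000_000        # hypothetical figure from a 10-K report
annual_supplier_spend = 250_000_000  # hypothetical annual supplier spend

inv_low, inv_high = improvement_range(ending_inventory, 0.03, 0.10)
spend_low, spend_high = improvement_range(annual_supplier_spend, 0.032, 0.049)

print(f"Inventory improvement: ${inv_low:,.0f} to ${inv_high:,.0f}")
print(f"Supplier spend improvement: ${spend_low:,.0f} to ${spend_high:,.0f}")
```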
You could also investigate the expediting or return cost of shipments in a period due to incorrectly aggregated customer forecasts, wrong or incomplete product information or wrong shipment instructions in a product or location profile. Apply Aberdeen’s 5% improvement rate and there you go.
Consider that a North American utility told us that just fixing their 200 Tier 1 suppliers’ product information achieved an increase in discounts from $14 to $120 million. They also found that fixing one basic attribute out of sixty in one part category saves them over $200,000 annually.
So what ROI percentages would you find tolerable or justifiable for, say an EDW project, a CRM project, a new claims system, etc.? What would the annual savings or new revenue be that you were comfortable with? What was the craziest improvement you have seen coming to fruition, which nobody expected?
Next time, I will add some more “use cases” to the list and look at some philosophical implications of averages.
This post was written by guest author Dale Kim, Director of Industry Solutions at MapR Technologies, a valued Informatica partner that provides a distribution for Apache Hadoop that ensures production success for its customers.
Apache Hadoop is growing in popularity as the foundation for an enterprise data hub. An Enterprise Data Hub (EDH) extends and optimizes the traditional data warehouse model by adding complementary big data technologies. It focuses your data warehouse on high value data by reallocating less frequently used data to an alternative platform. It also aggregates data from previously untapped sources to give you a more complete picture of data.
So you have your data, your warehouses, your analytical tools, your Informatica products, and you want to deploy an EDH… now what about Hadoop?
Requirements for Hadoop in an Enterprise Data Hub
Let’s look at the characteristics required to meet your EDH needs for a production environment:
- Enterprise-grade
- Interoperability
- Multi-tenancy
- Security
- Operational capabilities
You already expect these from your existing enterprise deployments. Shouldn’t you hold Hadoop to the same standards? Let’s discuss each topic:
Enterprise-grade is about the features that keep a system running, i.e., high availability (HA), disaster recovery (DR), and data protection. HA helps a system run even when components (e.g., computers, routers, power supplies) fail. In Hadoop, this means no downtime and no data loss, but also no work loss. If a node fails, you still want jobs to run to completion. DR with remote replication or mirroring guards against site-wide disasters. Mirroring needs to be consistent to ensure recovery to a known state. Using file copy tools won’t cut it. And data protection, using snapshots, lets you recover from data corruption, especially from user or application errors. As with DR replicas, snapshots must be consistent, in that they must reflect the state of the data at the time the snapshot was taken. Not all Hadoop distributions can offer this guarantee.
Hadoop interoperability is an obvious necessity. Features like a POSIX-compliant, NFS-accessible file system let you reuse existing, file system-based applications on Hadoop data. Support for existing tools lets your developers get up to speed quickly. And integration with REST APIs enables easy, open connectivity with other systems.
You should be able to logically divide clusters to support different use cases, job types, user groups, and administrators as needed. To avoid a complex, multi-cluster setup, choose a Hadoop distribution with multi-tenancy capabilities to simplify the architecture. This gives you less risk of error and no duplication of data or effort.
Security should be a priority to protect against the exposure of confidential data. You should assess how you’ll handle authentication (with or without Kerberos), authorization (access controls), over-the-network encryption, and auditing. Many of these features should be native to your Hadoop distribution, and there are also strong security vendors that provide technologies for securing Hadoop.
Any large scale deployment needs fast read, write, and update capabilities. Hadoop can support the operational requirements of an EDH with integrated, in-Hadoop databases like Apache HBase™ and Accumulo™, as well as MapR-DB (the MapR NoSQL database). This in-Hadoop model helps to simplify the overall EDH architecture.
Using Hadoop as a foundation for an EDH is a powerful option for businesses. Choosing the correct Hadoop distribution is the key to deploying a successful EDH. Be sure not to take shortcuts – especially in a production environment – as you will want to hold your Hadoop platform to the same high expectations you have of your existing enterprise systems.
A few days ago, I came across a post, 5 C’s of MDM (Case, Content, Connecting, Cleansing, and Controlling), by Peter Krensky, Sr. Research Associate, Aberdeen Group and this response by Alan Duncan with his 5 C’s (Communicate, Co-operate, Collaborate, Cajole and Coerce). I like Alan’s list much better. Even though I work for a product company specializing in information management technology, the secret to successful enterprise information management (EIM) is in tackling the business and organizational issues, not the technology challenges. Fundamentally, data management at the enterprise level is an agreement problem, not a technology problem.
So, here I go with my 5 C’s:
Configuring your Oracle environment for using PowerExchange CDC can be challenging, but there are some best practices you can follow that will greatly simplify the process. There are two major factors to consider when approaching this: latency requirements for your data and the ability to restart your environment.
Data Latency Requirements
The first factor that will affect the latency of your data is the location of your PowerExchange CDC installation. From a best practice perspective, it is optimal to install the PowerExchange Listener on the source database server, as this eliminates the need to pass data across the network and will provide the smallest amount of latency from source to target.
The volume of data that PowerExchange CDC has to process can also have a significant impact on performance. Several items in addition to the changed data can affect performance, including, but not limited to, Oracle catalog dumps, Oracle workload monitor customizations and other non-Oracle tools that use the redo logs. You should review all the processes that access the Oracle redo logs and make an effort to minimize them in terms of both volume and frequency. For example, you could monitor the redo log switches and the creation of archived log files to see how busy the source database is. Knowing the size of your production archive logs and how often they are created will provide the information necessary to properly configure PowerExchange CDC.
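One way to get a feel for how busy the source database is: bucket the redo log switch times by hour. The sketch below assumes you have already pulled the switch timestamps yourself, for example from the FIRST_TIME column of Oracle’s V$LOG_HISTORY view. The sample timestamps are invented, and nothing here is part of PowerExchange.

```python
from collections import Counter
from datetime import datetime


def switches_per_hour(switch_times):
    """Count redo log switches per clock hour, in chronological order."""
    counts = Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in switch_times
    )
    return dict(sorted(counts.items()))


# Invented sample; in practice these would come from your own query.
sample = [
    datetime(2014, 6, 2, 9, 5),
    datetime(2014, 6, 2, 9, 40),
    datetime(2014, 6, 2, 10, 15),
]
hourly = switches_per_hour(sample)
busiest = max(hourly, key=hourly.get)
for hour, count in hourly.items():
    print(f"{hour:%Y-%m-%d %H}:00  {count} switch(es)")
print(f"Busiest hour starts at {busiest:%Y-%m-%d %H}:00")
```

Multiplying the hourly counts by your redo log size gives a rough volume figure to size against.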
Environment Restart Ability
When certain changes are made to the source database environment, the PowerExchange CDC process will need to be stopped and restarted. The amount of time this restart takes should be considered whenever this needs to occur. PowerExchange CDC must be restarted when any of the following changes occur:
- A change is made to the schema or a table that is part of the CDC process
- An existing Capture Registration is changed
- A change is made to the PowerExchange configuration files
- An Oracle patch is applied
- An Operating System patch or upgrade is applied
- A PowerExchange version upgrade or service pack is applied
If you use PowerExchange CDC with LogMiner, a copy of the Oracle catalog must be written to the archive logs for CDC to function properly. The frequency of these copies is site-specific and will have an impact on the amount of time it takes to restart the CDC process.
Once your PowerExchange CDC process is in production, any changes to the environment must have extensive impact analysis performed to ensure the integrity of the data and the transactions remains intact upon restart. Understanding the configurable parameters in the PowerExchange configuration files that will assist restart performance is of the utmost importance.
Even with the challenges presented when configuring PowerExchange CDC for Oracle, there are trusted and proven methods that can significantly improve your ability to complete this process and have real time or near real time access to your data. At SSG, we’re committed to always utilizing best practice methodology with our PowerExchange Baseline Deployments. In addition, we provide in-depth knowledge transfer to set end users up with a solid foundation for optimizing PowerExchange functionality.
Visit the Informatica Marketplace to learn more about SSG’s Baseline Deployment offerings.