Category Archives: Data Warehousing
I have two kids. In school. They generate a remarkable amount of paper. From math worksheets, permission slips, book reports (now called reading responses) to newsletters from the school. That’s a lot of paper. All of it is presented in different forms with different results – the math worksheets tell me how my child is doing in math, the permission slips tell me when my kids will be leaving school property and the book reports tell me what kind of books my child is interested in reading. I need to put the math worksheet information into a storage space so I can figure out how to prop up my kid if needed on the basic geometry constructs. The dates that permission slips are covering need to go into the calendar. The book reports can be used at the library to choose the next book.
We are facing a similar problem (albeit on a MUCH larger scale) in the insurance market. We are getting data from clinicians. Many of you are developing and deploying mobile applications to help patients manage their care, locate providers and improve their health. You may capture licensing data to assist pharmaceutical companies identify patients for inclusion in clinical trials. You have advanced analytics systems for fraud detection and to check the accuracy and consistency of claims. Possibly you are at the point of near real-time claim authorization.
The amount of data generated in our world is expected to increase significantly in the coming years. There are an estimated 50 petabytes of data in the Healthcare realm, which is predicted to grow by a factor of 50 to 25,000 petabytes by 2020. Healthcare payers already store and analyze some of this data. However in order to capture, integrate and interrogate large information sets, the scope of the payer information will have to increase significantly to include provider data, social data, government data, pharmaceutical and medical product manufacturers data, and information aggregator data.
Right now – you probably depend on a traditional data warehouse model and structured data analytics to access some of your data. This has worked adequately for you up to now, but with the amount of data that will be generated in the future, you need the processing capability to load and query multi-terabyte datasets in a timely fashion. You need the ability to manage both semi-structured and unstructured data.
Fortunately, a set of emerging technologies (called “Big Data”) may provide the technical foundation of a solution. Big Data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage and process data within a tolerable amount of time. While some existing technology may prove inadequate to future tasks, many of the information management methods of the past will prove to be as valuable as ever. Assembling successful Big Data solutions will require a fusion of new technology and old-school disciplines:
Which of these technologies do you have? Which of these technologies can integrate with on-premise AND cloud based solutions? On which of these technologies does your organization have knowledgeable resources that can utilize the capabilities to take advantage of Big Data?
In a previous life, I was a pastry chef in a now-defunct restaurant. One of the things I noticed while working there (and frankly while cooking at home) is that the better the ingredients, the better the final result. If we used poor quality apples in the apple tart, we ended up with a soupy, flavorless mess with a chewy crust.
The same analogy can be applied to Data Analytics. With poor quality data, you get poor results from your analytics projects. We all know that companies that can implement fantastic analytic solutions that can provide near real-time access to consumer trends are the same companies that can do successful targeted marketing campaigns that are of the minute. The Data Warehousing Institute estimates that data quality problems cost U.S. businesses more than $600 billion a year.
The business impact of poor data quality cannot be underestimated. If not identified and corrected early on, defective data can contaminate all downstream systems and information assets, jacking up costs, jeopardizing customer relationships, and causing imprecise forecasts and poor decisions.
- To help you quantify: Let’s say your company receives 2 million claims per month with 377 data elements per claim. Even at an error rate of .001, the claims data contains more than 754,000 errors per month and more than 9.04 million errors per year! If you determine that 10 percent of the data elements are critical to your business decisions and processes, you still must fix almost 1 million errors each year!
- What is your exposure to these errors? Let’s estimate the risk at $10 per error (including staff time required to fix the error downstream after a customer discovers it, the loss of customer trust and loyalty and erroneous payouts. Your company’s risk exposure to poor quality claims data is $10 million a year.
Once your company values quality data as a critical resource – it is much easier to perform high-value analytics that have an impact on your bottom line. Start with creation of a Data Quality program. Data is a critical asset in the information economy, and the quality of a company’s data is a good predictor of its future success.
To level set, let’s make sure you understand my definition of dark data. I prefer using visualizations when I can so, picture this: the end of the first Indiana Jones movie, Raiders of the Lost Ark. In this scene, we see the Ark of the Covenant, stored in a generic container, being moved down the aisle in a massive warehouse full of other generic containers. What’s in all those containers? It’s pretty much anyone’s guess. There may be a record somewhere, but, for all intents and purposes, the materials stored in those boxes are useless.
Applying this to data, once a piece of data gets shoved into some generic container and is stored away, just like the Arc, the data becomes essentially worthless. This is dark data.
Opening up a government agency to all its dark data can have significant impacts, both positive and negative. Here are couple initial tips to get you thinking in the right direction:
- Begin with the end in mind – identify quantitative business benefits of exposing certain dark data.
- Determine what’s truly available – perform a discovery project – seek out data hidden in the corners of your agency – databases, documents, operational systems, live streams, logs, etc.
- Create an extraction plan – determine how you will get access to the data, how often does the data update, how will handle varied formats?
- Ingest the data – transform the data if needed, integrate if needed, capture as much metadata as possible (never assume you won’t need a metadata field, that’s just about the time you will be proven wrong).
- Govern the data – establish standards for quality, access controls, security protections, semantic consistency, etc. – don’t skimp here, the impact of bad data can never really be quantified.
- Store it – it’s interesting how often agencies think this is the first step
- Get the data ready to be useful to people, tools and applications – think about how to minimalize the need for users to manipulate data – reformatting, parsing, filtering, etc. – to better enable self-service.
- Make it available – at this point, the data should be easily accessible, easily discoverable, easily used by people, tools and applications.
Clearly, there’s more to shining the light on dark data than I can offer in this post. If you’d like to take the next step to learning what is possible, I suggest you download the eBook, The Dark Data Imperative.
The verdict is in. Data is now broadly perceived as a source of competitive advantage. We all feel the heat to deliver good data. It is no wonder organizations view Analytics initiatives as highly strategic. But the big question is, can you really trust your data? Or are you just creating pretty visualizations on top of bad data?
We also know there is a shift towards self-service Analytics. But did you know that according to Gartner, “through 2016, less than 10% of self-service BI initiatives will be governed sufficiently to prevent inconsistencies that adversely affect the business”?1 This means that you may actually show up at your next big meeting and have data that contradicts your colleague’s data. Perhaps you are not working off of the same version of the truth. Maybe you have siloed data on different systems and they are not working in concert? Or is your definition of ‘revenue’ or ‘leads’ different from that of your colleague’s?
So are we taking our data for granted? Are we just assuming that it’s all available, clean, complete, integrated and consistent? As we work with organizations to support their Analytics journey, we often find that the harsh realities of data are quite different from perceptions. Let’s further investigate this perception gap.
For one, people may assume they can easily access all data. In reality, if data connectivity is not managed effectively, we often need to beg borrow and steal to get the right data from the right person. If we are lucky. In less fortunate scenarios, we may need to settle for partial data or a cheap substitute for the data we really wanted. And you know what they say, the only thing worse than no data is bad data. Right?
Another common misperception is: “Our data is clean. We have no data quality issues”. Wrong again. When we work with organizations to profile their data, they are often quite surprised to learn that their data is full of errors and gaps. One company recently discovered within one minute of starting their data profiling exercise, that millions of their customer records contained the company’s own address instead of the customers’ addresses… Oops.
Another myth is that all data is integrated. In reality, your data may reside in multiple locations: in the cloud, on premise, in Hadoop and on mainframe and anything in between. Integrating data from all these disparate and heterogeneous data sources is not a trivial task, unless you have the right tools.
And here is one more consideration to mull over. Do you find yourself manually hunting down and combining data to reproduce the same ad hoc report over and over again? Perhaps you often find yourself doing this in the wee hours of the night? Why reinvent the wheel? It would be more productive to automate the process of data ingestion and integration for reusable and shareable reports and Analytics.
Simply put, you need great data for great Analytics. We are excited to host Philip Russom of TDWI in a webinar to discuss how data management best practices can enable successful Analytics initiatives.
And how about you? Can you trust your data? Please join us for this webinar to learn more about building a trust-relationship with your data!
- Gartner Report, ‘Predicts 2015: Power Shift in Business Intelligence and Analytics Will Fuel Disruption’; Authors: Josh Parenteau, Neil Chandler, Rita L. Sallam, Douglas Laney, Alan D. Duncan; Nov 21 2014
Not so long ago, Google created a Web site to figure out just how many people had influenza. How they did this was by tracking “flu-related search queries”, “location of the query,” and applied it to an estimation algorithm. According to the website, at the flu season’s peak in January, nearly 11 percent of the United States population may have influenza. This means that nearly 44 million of us will have had the flu or flu-like symptoms. In its weekly report the Centers for Disease Control and Prevention put this at 5.6%, which means that less than 23 million of us actually went to the doctor’s office to be tested for flu or to get a flu-shot.
Now, imagine if I were a drug manufacturer. There is a theory about what went wrong. The problems may be due to widespread media coverage of this year’s flu season. Then add social media, which helped news of the flu spread quicker than the virus itself. In other words, the algorithm is looking only at the numbers, not at the context of the search results.
In today’s digitally connected world, data is everywhere: in our phones, search queries, friendships, dating profiles, cars, food, and reading habits. Almost everything we touch is part of a larger data set. The people and companies that interpret the data may fail to apply background and outside conditions to the numbers they capture.
Now, while we build our big data repositories, we have to spend some time to explain how we collected the data and under what context.
A couple comments on the importance of integration platforms like Informatica in an EDW/Hadoop environment.
- Hadoop does mean you can do some quick and inexpensive exploratory analysis with little or no ETL. The issue is that it will not perform at the level you need to take it to production. As the webinar points out, applying some structure to the data with columnar files (not RDBMS) will dramatically speed up query performance.
- The other thing that makes an integration platform more important than ever is the explosion of data complexity. As Dr. Kimball put it:
“Integration is even more important these days because you are looking at all sorts of data sources coming in from all sorts of directions.”
To perform interesting analyses, you are going to have to be able to join data with different formats and different semantic meaning. And that is going to require integration tools.
- Thirdly, if you are going to put this data into production, you will want to incorporate data cleansing, metadata management, and possibly formal data governance to ensure that your data is trustworthy, auditable, and has business context. There is no point in serving up bad data quickly and inexpensively. The result will be poor business decisions and flawed analyses.
For Data Warehouse Architects
The challenge is to deliver actionable content from the exploding amount of data available. You will need to be constantly scanning for new sources of data and looking for ways to quickly and efficiently deliver that to the point of analysis.
For Enterprise Architects
The challenge with adding Big Data to Your EDW Architecture is to define and drive a coherent enterprise data architecture across your organization that standardizes people, processes, and tools to deliver clean and secure data in the most efficient way possible. It will also be important to automate as much as possible to offload routine tasks from the IT staff. The key to that automation will be the effective use of metadata across the entire environment to not only understand the data itself, but how it is used, by whom, and for what business purpose. Once you have done that, then it will become possible to build intelligence into the environment.
For more on Informatica’s vision for an Intelligent Data Platform and how this fits into your enterprise data architecture see Think “Data First” to Drive Business Value
Every fall Informatica sales leadership puts together its strategy for the following year. The revenue target is typically a function of the number of sellers, the addressable market size and key accounts in a given territory, average spend and conversion rate given prior years’ experience, etc. This straight forward math has not changed in probably decades, but it assumes that the underlying data are 100% correct. This data includes:
- Number of accounts with a decision-making location in a territory
- Related IT spend and prioritization
- Organizational characteristics like legal ownership, industry code, credit score, annual report figures, etc.
- Key contacts, roles and sentiment
- Prior interaction (campaign response, etc.) and transaction (quotes, orders, payments, products, etc.) history with the firm
Every organization, no matter if it is a life insurer, a pharmaceutical manufacturer, a fashion retailer or a construction company knows this math and plans on getting somewhere above 85% achievement of the resulting target. Office locations, support infrastructure spend, compensation and hiring plans are based on this and communicated.
So why is it that when it is an open secret that the underlying data is far from perfect (accurate, current and useful) and corrupts outcomes, too few believe that fixing it has any revenue impact? After all, we are not projecting the climate for the next hundred years here with a thousand plus variables.
If corporate hierarchies are incorrect, your spend projections based on incorrect territory targets, credit terms and discount strategy will be off. If every client touch point does not have a complete picture of cross-departmental purchases and campaign responses, your customer acquisition cost will be too high as you will contact the wrong prospects with irrelevant offers. If billing, tax or product codes are incorrect, your billing will be off. This is a classic telecommunication example worth millions every month. If your equipment location and configuration is wrong, maintenance schedules will be incorrect and every hour of production interruption will cost an industrial manufacturer of wood pellets or oil millions.
Also, if industry leaders enjoy an upsell ratio of 17%, and you experience 3%, data (assuming you have no formal upsell policy as it violates your independent middleman relationship) data will have a lot to do with it.
The challenge is not the fact that data can create revenue improvements but how much given the other factors: people and process.
Every industry laggard can identify a few FTEs who spend 25% of their time putting one-off data repositories together for some compliance, M&A customer or marketing analytics. Organic revenue growth from net-new or previously unrealized revenue is what the focus of any data management initiative should be. Don’t get me wrong; purposeful recruitment (people), comp plans and training (processes) are important as well. Few people doubt that people and process drives revenue growth. However, few believe data being fed into these processes has an impact.
This is a head scratcher for me. An IT manager at a US upstream oil firm once told me that it would be ludicrous to think data has a revenue impact. They just fixed data because it is important so his consumers would know where all the wells are and which ones made a good profit. Isn’t that assuming data drives production revenue? (Rhetorical question)
A CFO at a smaller retail bank said during a call that his account managers know their clients’ needs and history. There is nothing more good data can add in terms of value. And this happened after twenty other folks at his bank including his own team delivered more than ten use cases, of which three were based on revenue.
Hard cost (materials and FTE) reduction is easy, cost avoidance a leap of faith to a degree but revenue is not any less concrete; otherwise, why not just throw the dice and see how the revenue will look like next year without a central customer database? Let every department have each account executive get their own data, structure it the way they want and put it on paper and make hard copies for distribution to HQ. This is not about paper versus electronic but the inability to reconcile data from many sources on paper, which is a step above electronic.
Have you ever heard of any organization move back to the Fifties and compete today? That would be a fun exercise. Thoughts, suggestions – I would be glad to hear them?
Every two years, the typical company doubles the amount of data they store. However, this Data is inherently “dumb.” Acquiring more of it only seems to compound its lack of intellect.
When revitalizing your business, I won’t ask to look at your data – not even a little bit. Instead, we look at the process of how you use the data. What I want to know is this:
How much of your day-to-day operations are driven by your data?
The Case for Smart Data
I recently learned that 7-Eleven Japan has pushed decision-making down to the store level – in fact, to the level of clerks. Store clerks decide what goes on the shelves in their individual 7-Eleven stores. These clerks push incredible inventory turns. Some 70% of the products on the shelves are new to stores each year. As a result, this chain has been the most profitable Japanese retailer for 30 years running.
Instead of just reading the data and making wild guesses on why something works and why something doesn’t, these clerks acquired the skill of looking at the quantitative and the qualitative and connected dots. Data told them what people are talking about, how it’s related to their product and how much weight it carried. You can achieve this as well. To do so, you must introduce a culture that emphasizes discipline around processes. A disciplined process culture uses:
- A template approach to data with common processes, reuse of components, and a single face presented to customers
- Employees who consistently follow standard procedures
If you cannot develop such company-wide consistency, you will not gain benefits of ERP or CRM systems.
Make data available to the masses. Like at 7-Eleven Japan, don’t centralize the data decision-making process. Instead, push it out to the ranks. By putting these cultures and practices into play, businesses can use data to run smarter.
In 2012, Forbes published an article predicting an upcoming problem.
The Need for Scalable Enterprise Analytics
Specifically, increased exploration in Big Data opportunities would place pressure on the typical corporate infrastructure. The generic hardware used to run most tech industry enterprise applications was not designed to handle real-time data processing. As a result, the explosion of mobile usages, and the proliferation of social networks, was increasing the strain on the system. Most companies now faced real-time processing requirements beyond what the traditional model was designed to handle.
In the past two years, the volume of data and speed of data growth has grown significantly. As a result, the problem has become more severe. It is now clear that these challenges can’t be overcome by simply doubling or tripling their IT spending on infrastructure sprawl. Today, enterprises seek consolidated solutions that offer scalability, performance and ease of administration. The present need is for scalable enterprise analytics.
A Clear Solution Is Available
Informatica PowerCenter and Data Quality is the market leading data integration and data quality platform. This platform has now been certified by Oracle as an optimal solution for both the Oracle Exadata Database Machine and the Oracle SuperCluster.
As the high-speed on-ramp for data into Oracle Exadata, PowerCenter and Data Quality deliver up-to five times faster performance on data load, query, profiling and cleansing tasks. Informatica’s data integration customers can now easily reuse data integration code, skills and resources to access and transform any data from any data source and load it into Exadata, with the highest throughput and scalability.
Customers adopting Oracle Exadata for high-volume, high-speed analytics can now be confident with Informatica PowerCenter and Data Quality. With these products, they can ingest, cleanse and transform all types of data into Exadata with the highest performance and scale required to maximize the value of their Exadata investment.
Proving the Value of Scalable Enterprise Analytics
In order to demonstrate the efficacy of their partnership, the two companies worked together on a Proof Of Value (POV) project. The goal is to prove that using PowerCenter with Exadata would improve both performance and scalability. The project involved PowerCenter and Data Quality 9.6.1 and x4-2 Exadata Machine. Oracle 11g was considered for both standard Oracle and Exadata versions.
The first test conducted a 1TB load test to Exadata and standard Oracle in a typical PowerCenter use case. The second test consisted of querying 1TB profiling warehouse database in Data Quality use case scenario. Performance data was collected for both tests. The scalability factor was also captured. A variant of the TPCH dataset was used to generate the test data. The results were significantly higher than prior Exabyte 1TB test. In particular:
- The data query tests achieved 5x performance.
- The data load tests achieved a 3x-5x speed increase.
- Linear scalability was achieved with read/write tests on Exadata.
What Business Benefits Could You Expect?
Informatica PowerCenter and Data Quality, along-with Oracle Exadata, now provide the best-of-breed combination of software and hardware, optimized to deliver the highest possible total system performance. These comprehensive tools drive agile reporting and analytics, while empowering IT organizations to meet SLAs and quality goals like never before.
- Extend Oracle Exadata’s access to even more business critical data sources. Utilize optimized out-of-the-box Informatica connectivity to easily access hundreds of data sources, including all the major databases, on-premise and cloud applications, mainframe, social data and Hadoop.
- Get more data, more quickly into Oracle Exadata. Move higher volumes of trusted data quickly into Exadata to support timely reporting with up-to-date information (i.e. up to 5x performance improvement compared to Oracle database).
- Centralize management and improve insight into large scale data warehouses. Deliver the necessary insights to stakeholders with intuitive data lineage and a collaborative business glossary. Contribute to high quality business analytics, in a timely manner across the enterprise.
- Instantly re-direct workloads and resources to Oracle Exadata without compromising performance. Leverage existing code and programming skills to execute high-performance data integration directly on Exadata by performing push down optimization.
- Roll-out data integration projects faster and more cost-effectively. Customers can now leverage thousands of Informatica certified developers to execute existing data integration and quality transformations directly on Oracle Exadata, without any additional coding.
- Efficiently scale-up and scale-out. Customers can now maximize performance and lower the costs of data integration and quality operations of any scale by performing Informatica workload and push down optimization on Oracle Exadata.
- Save significant costs involved in administration and expansion. Customers can now easily and economically manage large-scale analytics data warehousing environments with a single point of administration and control, and consolidate a multitude of servers on one rack.
- Reduce risk. Customers can now leverage Informatica’s data integration and quality platform to overcome the typical performance and scalability limitations seen in databases and data storage systems. This will help reduce quality-of-service risks as data volumes rise.
Oracle Exadata is a well-engineered system that offers customers out-of-box scalability and performance on demand. Informatica PowerCenter and Data Quality are optimized to run on Exadata, offering customers business benefits that speed up data integration and data quality tasks like never before. Informatica’s certified, optimized, and purpose-built solutions for Oracle can help you enable more timely and trustworthy reporting. You can now benefit from Informatica’s optimized solutions for Oracle Exadata to make better business decisions by unlocking the full potential of the most current and complete enterprise data available. As shown in our test results, you can attain up to 5x performance by scaling Exadata. Informatica Data Quality customers can perform profiling 1TB datasets, which is unheard before. We urge you to deploy the combined solution to solve your data integration and quality problems today while achieving high speed business analytics in these days of big data exploration and Internet Of Things.
Listen to what Ash Kulkarni, SVP, at OOW14 has to say on how @InformaticaCORP PowerCenter and Data Quality certified by Oracle as optimized for Exadata can deliver up-to five times faster performance improvement on data load, query, profiling, cleansing and mastering tasks, for Exadata.