Category Archives: Data Warehousing
The verdict is in. Data is now broadly perceived as a source of competitive advantage. We all feel the heat to deliver good data. It is no wonder organizations view Analytics initiatives as highly strategic. But the big question is, can you really trust your data? Or are you just creating pretty visualizations on top of bad data?
We also know there is a shift towards self-service Analytics. But did you know that according to Gartner, “through 2016, less than 10% of self-service BI initiatives will be governed sufficiently to prevent inconsistencies that adversely affect the business”?1 This means that you may actually show up at your next big meeting with data that contradicts your colleague’s data. Perhaps you are not working from the same version of the truth. Maybe your data is siloed on different systems that are not working in concert. Or is your definition of ‘revenue’ or ‘leads’ different from your colleague’s?
So are we taking our data for granted? Are we just assuming that it’s all available, clean, complete, integrated and consistent? As we work with organizations to support their Analytics journey, we often find that the harsh realities of data are quite different from perceptions. Let’s further investigate this perception gap.
For one, people may assume they can easily access all data. In reality, if data connectivity is not managed effectively, we often need to beg, borrow and steal to get the right data from the right person. If we are lucky. In less fortunate scenarios, we may need to settle for partial data or a cheap substitute for the data we really wanted. And you know what they say: the only thing worse than no data is bad data. Right?
Another common misperception is: “Our data is clean. We have no data quality issues”. Wrong again. When we work with organizations to profile their data, they are often quite surprised to learn that their data is full of errors and gaps. One company recently discovered, within one minute of starting their data profiling exercise, that millions of their customer records contained the company’s own address instead of the customers’ addresses… Oops.
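To make the profiling idea concrete, here is a minimal sketch of the kind of rule that catches a problem like that one. It assumes customer records are already loaded into a pandas DataFrame; the column names, sample rows and the company address are invented for illustration, not taken from any real profiling tool.

```python
import pandas as pd

# Hypothetical customer extract; in practice this would come from the CRM or billing system.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "billing_address": [
        "2100 Seaport Blvd, Redwood City, CA",   # company's own HQ address (bad)
        "17 Elm St, Springfield, IL",
        "2100 Seaport Blvd, Redwood City, CA",   # company's own HQ address (bad)
        "908 Birch Ave, Austin, TX",
    ],
})

COMPANY_ADDRESS = "2100 Seaport Blvd, Redwood City, CA"  # assumed for the example

# Simple profiling rule: flag records whose billing address is the company's own address.
suspect = customers[customers["billing_address"].str.strip() == COMPANY_ADDRESS]
print(f"{len(suspect)} of {len(customers)} records ({len(suspect) / len(customers):.0%}) "
      "carry the company's own address instead of the customer's.")
```

A check this simple is exactly what a profiling exercise runs at scale, which is why the surprises tend to arrive within the first few minutes.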
Another myth is that all data is integrated. In reality, your data may reside in multiple locations: in the cloud, on-premise, in Hadoop, on mainframe, and anything in between. Integrating data from all these disparate and heterogeneous data sources is not a trivial task, unless you have the right tools.
And here is one more consideration to mull over. Do you find yourself manually hunting down and combining data to reproduce the same ad hoc report over and over again? Perhaps you often find yourself doing this in the wee hours of the night? Why reinvent the wheel? It would be more productive to automate the process of data ingestion and integration for reusable and shareable reports and Analytics.
Simply put, you need great data for great Analytics. We are excited to host Philip Russom of TDWI in a webinar to discuss how data management best practices can enable successful Analytics initiatives.
And how about you? Can you trust your data? Please join us for this webinar to learn more about building a trust-relationship with your data!
1. Gartner Report, ‘Predicts 2015: Power Shift in Business Intelligence and Analytics Will Fuel Disruption’; Authors: Josh Parenteau, Neil Chandler, Rita L. Sallam, Douglas Laney, Alan D. Duncan; November 21, 2014
Not so long ago, Google created a Web site to figure out just how many people had influenza. They did this by tracking “flu-related search queries” and the “location of the query,” and applying an estimation algorithm to the results. According to the website, at the flu season’s peak in January, nearly 11 percent of the United States population may have had influenza. This means that nearly 44 million of us will have had the flu or flu-like symptoms. In its weekly report, the Centers for Disease Control and Prevention put this figure at 5.6%, which means that fewer than 23 million of us actually went to the doctor’s office to be tested for flu or to get a flu shot.
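To see how a purely numbers-driven estimate can go astray, here is a toy sketch of that style of model: fit a simple line from search-query volume to reported illness, then extrapolate. This is not Google’s actual algorithm, and every figure below is invented purely for illustration.

```python
import numpy as np

# Hypothetical weekly data: share of searches that are flu-related (%)
# and the CDC-reported % of doctor visits for influenza-like illness.
query_share = np.array([0.8, 1.1, 1.5, 2.2, 3.0, 3.9])
cdc_ili_pct = np.array([1.0, 1.3, 1.9, 2.6, 3.4, 4.3])

# Fit a simple linear model: estimated ILI % = a * query_share + b
a, b = np.polyfit(query_share, cdc_ili_pct, 1)

# The model happily turns a spike in searches into a spike in illness,
# even if the searches were driven by headlines rather than symptoms.
media_driven_spike = 6.5  # % of searches, inflated by news coverage
print(f"Predicted ILI at peak: {a * media_driven_spike + b:.1f}%")
```

The point is not the specific numbers; it is that a model fed only query counts has no way to tell searches prompted by symptoms from searches prompted by media coverage.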
Now, imagine if I were a drug manufacturer planning production around that estimate. There is a theory about what went wrong: the overestimate may be due to widespread media coverage of this year’s flu season, amplified by social media, which helped news of the flu spread quicker than the virus itself. In other words, the algorithm was looking only at the numbers, not at the context of the search results.
In today’s digitally connected world, data is everywhere: in our phones, search queries, friendships, dating profiles, cars, food, and reading habits. Almost everything we touch is part of a larger data set. The people and companies that interpret the data may fail to apply background and outside conditions to the numbers they capture.
Now, as we build our big data repositories, we have to spend some time explaining how we collected the data and in what context.
A couple of comments on the importance of integration platforms like Informatica in an EDW/Hadoop environment:
- Hadoop does mean you can do some quick and inexpensive exploratory analysis with little or no ETL. The issue is that it will not perform at the level you need to take it to production. As the webinar points out, applying some structure to the data with columnar files (not an RDBMS) will dramatically speed up query performance (see the sketch after this list).
- The other thing that makes an integration platform more important than ever is the explosion of data complexity. As Dr. Kimball put it:
“Integration is even more important these days because you are looking at all sorts of data sources coming in from all sorts of directions.”
To perform interesting analyses, you are going to have to be able to join data with different formats and different semantic meaning. And that is going to require integration tools.
- Thirdly, if you are going to put this data into production, you will want to incorporate data cleansing, metadata management, and possibly formal data governance to ensure that your data is trustworthy, auditable, and has business context. There is no point in serving up bad data quickly and inexpensively. The result will be poor business decisions and flawed analyses.
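As a concrete illustration of the first point above, here is a minimal PySpark sketch that takes raw delimited files landed in Hadoop and rewrites them as columnar Parquet files before querying them. The paths, column names and query are hypothetical; this shows the general technique of adding columnar structure, not a prescribed pipeline.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("columnar-staging").getOrCreate()

# Raw, schema-on-read data landed in HDFS as CSV (hypothetical path and layout).
raw = (spark.read
       .option("header", True)
       .option("inferSchema", True)
       .csv("hdfs:///landing/web_orders/"))

# Rewrite as columnar Parquet; queries touching a few columns scan far less data.
raw.write.mode("overwrite").parquet("hdfs:///warehouse/web_orders_parquet/")

# Exploratory query against the columnar copy.
orders = spark.read.parquet("hdfs:///warehouse/web_orders_parquet/")
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM orders
    GROUP BY region
    ORDER BY total_amount DESC
""").show()
```

Even this small step, trading raw text files for a columnar format, is usually the difference between exploratory queries that crawl and queries that return in time to be useful.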
For Data Warehouse Architects
The challenge is to deliver actionable content from the exploding amount of data available. You will need to be constantly scanning for new sources of data and looking for ways to quickly and efficiently deliver that to the point of analysis.
For Enterprise Architects
The challenge with adding Big Data to your EDW architecture is to define and drive a coherent enterprise data architecture across your organization that standardizes people, processes, and tools to deliver clean and secure data in the most efficient way possible. It will also be important to automate as much as possible to offload routine tasks from the IT staff. The key to that automation will be the effective use of metadata across the entire environment to understand not only the data itself, but how it is used, by whom, and for what business purpose. Once you have done that, it will become possible to build intelligence into the environment.
For more on Informatica’s vision for an Intelligent Data Platform and how this fits into your enterprise data architecture, see Think “Data First” to Drive Business Value.
Every fall, Informatica sales leadership puts together its strategy for the following year. The revenue target is typically a function of the number of sellers, the addressable market size and key accounts in a given territory, average spend and conversion rate given prior years’ experience, etc. This straightforward math has not changed in probably decades, but it assumes that the underlying data are 100% correct (a simple illustration follows the list below). This data includes:
- Number of accounts with a decision-making location in a territory
- Related IT spend and prioritization
- Organizational characteristics like legal ownership, industry code, credit score, annual report figures, etc.
- Key contacts, roles and sentiment
- Prior interaction (campaign response, etc.) and transaction (quotes, orders, payments, products, etc.) history with the firm
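As a rough sketch of that “straightforward math,” and of how quietly it breaks when the inputs are wrong, consider the toy calculation below. Every figure, including the 12% duplicate-account rate, is invented for illustration only.

```python
# Toy territory-planning math; all figures are hypothetical.
sellers = 40
accounts_per_territory = 250      # from the account database
avg_spend_per_account = 30_000    # prior years' average spend, in dollars
conversion_rate = 0.18            # prior years' win rate

target = sellers * accounts_per_territory * avg_spend_per_account * conversion_rate
print(f"Planned revenue target: ${target:,.0f}")

# Now assume 12% of those "accounts" are duplicates or have the wrong
# decision-making location: the plan is built on revenue that isn't there.
duplicate_rate = 0.12
real_accounts = accounts_per_territory * (1 - duplicate_rate)
achievable = sellers * real_accounts * avg_spend_per_account * conversion_rate
print(f"Achievable with clean data: ${achievable:,.0f} "
      f"({achievable / target:.0%} of plan before anyone sells a thing)")
```

The arithmetic itself is trivial; the point is that a double-digit error rate in the account data consumes most of the margin between the plan and the 85% achievement level discussed below.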
Every organization, whether it is a life insurer, a pharmaceutical manufacturer, a fashion retailer or a construction company, knows this math and plans on achieving somewhere above 85% of the resulting target. Office locations, support infrastructure spend, compensation and hiring plans are based on this and communicated accordingly.
So why is it that, even though it is an open secret that the underlying data is far from perfect (accurate, current and useful) and corrupts outcomes, so few believe that fixing it has any revenue impact? After all, we are not projecting the climate for the next hundred years with a thousand-plus variables.
If corporate hierarchies are incorrect, your spend projections based on incorrect territory targets, credit terms and discount strategy will be off. If every client touch point does not have a complete picture of cross-departmental purchases and campaign responses, your customer acquisition cost will be too high, because you will contact the wrong prospects with irrelevant offers. If billing, tax or product codes are incorrect, your billing will be off; this is a classic telecommunications example worth millions every month. If your equipment location and configuration data is wrong, maintenance schedules will be incorrect, and every hour of production interruption will cost an industrial manufacturer, whether of wood pellets or oil, millions.
Also, if industry leaders enjoy an upsell ratio of 17% and you experience 3%, data will have a lot to do with it (assuming you have no formal upsell policy because it would violate your independent-middleman relationships).
The question is not whether data can create revenue improvements, but how much it contributes relative to the other factors: people and process.
Every industry laggard can identify a few FTEs who spend 25% of their time putting one-off data repositories together for some compliance, M&A, customer or marketing analytics exercise. The focus of any data management initiative should be organic revenue growth from net-new or previously unrealized revenue. Don’t get me wrong; purposeful recruitment (people), comp plans and training (processes) are important as well. Few people doubt that people and process drive revenue growth. However, few believe that the data being fed into these processes has an impact.
This is a head scratcher for me. An IT manager at a US upstream oil firm once told me that it would be ludicrous to think data has a revenue impact. He just fixed data because it was important, so his consumers would know where all the wells are and which ones made a good profit. Isn’t that assuming data drives production revenue? (Rhetorical question)
A CFO at a smaller retail bank said during a call that his account managers know their clients’ needs and history, and that there is nothing more good data can add in terms of value. And this happened after twenty other folks at his bank, including his own team, had delivered more than ten use cases, of which three were based on revenue.
Hard cost (materials and FTE) reduction is easy, and cost avoidance is a leap of faith to a degree, but revenue is no less concrete. Otherwise, why not just throw the dice and see what revenue looks like next year without a central customer database? Let every department have each account executive gather their own data, structure it the way they want, put it on paper and make hard copies for distribution to HQ. This is not about paper versus electronic; it is about the inability to reconcile data from many sources, a problem that is even worse on paper than it is electronically.
Have you ever heard of an organization moving back to the Fifties and competing today? That would be a fun exercise. Thoughts or suggestions? I would be glad to hear them.
Every two years, the typical company doubles the amount of data it stores. However, this data is inherently “dumb.” Acquiring more of it only seems to compound its lack of intellect.
When revitalizing your business, I won’t ask to look at your data – not even a little bit. Instead, I look at how you use the data. What I want to know is this:
How much of your day-to-day operations are driven by your data?
The Case for Smart Data
I recently learned that 7-Eleven Japan has pushed decision-making down to the store level – in fact, to the level of clerks. Store clerks decide what goes on the shelves in their individual 7-Eleven stores. These clerks push incredible inventory turns. Some 70% of the products on the shelves are new to stores each year. As a result, this chain has been the most profitable Japanese retailer for 30 years running.
Instead of just reading the data and making wild guesses about why something works and why something doesn’t, these clerks acquired the skill of looking at the quantitative and the qualitative and connecting the dots. Data told them what people were talking about, how it related to their product and how much weight it carried. You can achieve this as well. To do so, you must introduce a culture that emphasizes discipline around processes. A disciplined process culture uses:
- A template approach to data with common processes, reuse of components, and a single face presented to customers
- Employees who consistently follow standard procedures
If you cannot develop such company-wide consistency, you will not gain the benefits of ERP or CRM systems.
Make data available to the masses. Like at 7-Eleven Japan, don’t centralize the data decision-making process. Instead, push it out to the ranks. By putting these cultures and practices into play, businesses can use data to run smarter.
In 2012, Forbes published an article predicting an upcoming problem.
The Need for Scalable Enterprise Analytics
Specifically, increased exploration of Big Data opportunities would place pressure on the typical corporate infrastructure. The generic hardware used to run most tech-industry enterprise applications was not designed to handle real-time data processing. As a result, the explosion of mobile usage and the proliferation of social networks were increasing the strain on these systems. Most companies now faced real-time processing requirements beyond what the traditional model was designed to handle.
In the past two years, the volume of data and the speed of data growth have increased significantly. As a result, the problem has become more severe. It is now clear that these challenges can’t be overcome by simply doubling or tripling IT spending on infrastructure sprawl. Today, enterprises seek consolidated solutions that offer scalability, performance and ease of administration. The present need is for scalable enterprise analytics.
A Clear Solution Is Available
Informatica PowerCenter and Data Quality form the market-leading data integration and data quality platform. This platform has now been certified by Oracle as an optimal solution for both the Oracle Exadata Database Machine and the Oracle SuperCluster.
As the high-speed on-ramp for data into Oracle Exadata, PowerCenter and Data Quality deliver up to five times faster performance on data load, query, profiling and cleansing tasks. Informatica’s data integration customers can now easily reuse data integration code, skills and resources to access and transform any data from any data source and load it into Exadata, with the highest throughput and scalability.
Customers adopting Oracle Exadata for high-volume, high-speed analytics can now be confident with Informatica PowerCenter and Data Quality. With these products, they can ingest, cleanse and transform all types of data into Exadata with the highest performance and scale required to maximize the value of their Exadata investment.
Proving the Value of Scalable Enterprise Analytics
In order to demonstrate the efficacy of their partnership, the two companies worked together on a Proof of Value (POV) project. The goal was to prove that using PowerCenter with Exadata would improve both performance and scalability. The project involved PowerCenter and Data Quality 9.6.1 and an Exadata X4-2 machine. Oracle 11g was used for both the standard Oracle and Exadata configurations.
The first test was a 1TB load into Exadata and into standard Oracle in a typical PowerCenter use case. The second test queried a 1TB profiling warehouse database in a Data Quality use case scenario. Performance data was collected for both tests, and the scalability factor was also captured. A variant of the TPC-H dataset was used to generate the test data. The results were significantly better than the prior Exadata 1TB test. In particular:
- The data query tests achieved 5x performance.
- The data load tests achieved a 3x-5x speed increase.
- Linear scalability was achieved with read/write tests on Exadata.
What Business Benefits Could You Expect?
Informatica PowerCenter and Data Quality, along with Oracle Exadata, now provide a best-of-breed combination of software and hardware, optimized to deliver the highest possible total system performance. These comprehensive tools drive agile reporting and analytics, while empowering IT organizations to meet SLAs and quality goals like never before.
- Extend Oracle Exadata’s access to even more business critical data sources. Utilize optimized out-of-the-box Informatica connectivity to easily access hundreds of data sources, including all the major databases, on-premise and cloud applications, mainframe, social data and Hadoop.
- Get more data, more quickly into Oracle Exadata. Move higher volumes of trusted data quickly into Exadata to support timely reporting with up-to-date information (i.e. up to 5x performance improvement compared to Oracle database).
- Centralize management and improve insight into large scale data warehouses. Deliver the necessary insights to stakeholders with intuitive data lineage and a collaborative business glossary. Contribute to high quality business analytics, in a timely manner across the enterprise.
- Instantly re-direct workloads and resources to Oracle Exadata without compromising performance. Leverage existing code and programming skills to execute high-performance data integration directly on Exadata by performing push down optimization.
- Roll-out data integration projects faster and more cost-effectively. Customers can now leverage thousands of Informatica certified developers to execute existing data integration and quality transformations directly on Oracle Exadata, without any additional coding.
- Efficiently scale-up and scale-out. Customers can now maximize performance and lower the costs of data integration and quality operations of any scale by performing Informatica workload and push down optimization on Oracle Exadata.
- Save significant costs involved in administration and expansion. Customers can now easily and economically manage large-scale analytics data warehousing environments with a single point of administration and control, and consolidate a multitude of servers on one rack.
- Reduce risk. Customers can now leverage Informatica’s data integration and quality platform to overcome the typical performance and scalability limitations seen in databases and data storage systems. This will help reduce quality-of-service risks as data volumes rise.
Oracle Exadata is a well-engineered system that offers customers out-of-the-box scalability and performance on demand. Informatica PowerCenter and Data Quality are optimized to run on Exadata, offering customers business benefits that speed up data integration and data quality tasks like never before. Informatica’s certified, optimized, and purpose-built solutions for Oracle can help you enable more timely and trustworthy reporting. You can now benefit from Informatica’s optimized solutions for Oracle Exadata to make better business decisions by unlocking the full potential of the most current and complete enterprise data available. As shown in our test results, you can attain up to 5x performance by scaling Exadata, and Informatica Data Quality customers can now profile 1TB datasets, which was previously unheard of. We urge you to deploy the combined solution to solve your data integration and quality problems today, while achieving high-speed business analytics in these days of big data exploration and the Internet of Things.
Listen to what Ash Kulkarni, SVP, had to say at OOW14 about how @InformaticaCORP PowerCenter and Data Quality, certified by Oracle as optimized for Exadata, can deliver up to five times faster performance on data load, query, profiling, cleansing and mastering tasks for Exadata.
This post was written by guest author Dale Kim, Director of Industry Solutions at MapR Technologies, a valued Informatica partner that provides a distribution for Apache Hadoop that ensures production success for its customers.
Apache Hadoop is growing in popularity as the foundation for an enterprise data hub. An Enterprise Data Hub (EDH) extends and optimizes the traditional data warehouse model by adding complementary big data technologies. It focuses your data warehouse on high value data by reallocating less frequently used data to an alternative platform. It also aggregates data from previously untapped sources to give you a more complete picture of data.
So you have your data, your warehouses, your analytical tools, your Informatica products, and you want to deploy an EDH… now what about Hadoop?
Requirements for Hadoop in an Enterprise Data Hub
Let’s look at the characteristics required to meet your EDH needs in a production environment: enterprise-grade reliability, interoperability, multi-tenancy, security, and operational performance.
You already expect these from your existing enterprise deployments. Shouldn’t you hold Hadoop to the same standards? Let’s discuss each topic:
Enterprise-grade is about the features that keep a system running, i.e., high availability (HA), disaster recovery (DR), and data protection. HA helps a system run even when components (e.g., computers, routers, power supplies) fail. In Hadoop, this means no downtime and no data loss, but also no work loss. If a node fails, you still want jobs to run to completion. DR with remote replication or mirroring guards against site-wide disasters. Mirroring needs to be consistent to ensure recovery to a known state. Using file copy tools won’t cut it. And data protection, using snapshots, lets you recover from data corruption, especially from user or application errors. As with DR replicas, snapshots must be consistent, in that they must reflect the state of the data at the time the snapshot was taken. Not all Hadoop distributions can offer this guarantee.
Hadoop interoperability is an obvious necessity. Features like a POSIX-compliant, NFS-accessible file system let you reuse existing, file system-based applications on Hadoop data. Support for existing tools lets your developers get up to speed quickly. And integration with REST APIs enables easy, open connectivity with other systems.
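As a small, hedged example of that REST connectivity, the sketch below lists a directory over WebHDFS using plain HTTP from Python. It assumes the cluster exposes the standard WebHDFS endpoint; the host, port, and path are placeholders, not values from any particular distribution.

```python
import requests

# Placeholders -- point these at your own cluster.
NAMENODE = "http://namenode.example.com:9870"
HDFS_PATH = "/data/landing"

# Standard WebHDFS call: list the files under a directory.
resp = requests.get(f"{NAMENODE}/webhdfs/v1{HDFS_PATH}",
                    params={"op": "LISTSTATUS"})
resp.raise_for_status()

for status in resp.json()["FileStatuses"]["FileStatus"]:
    print(status["pathSuffix"], status["type"], status["length"])
```

Because the interface is plain HTTP and JSON, the same pattern works from scripts, schedulers, or any other system that needs open connectivity to the hub.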
You should be able to logically divide clusters to support different use cases, job types, user groups, and administrators as needed. To avoid a complex, multi-cluster setup, choose a Hadoop distribution with multi-tenancy capabilities to simplify the architecture. This gives you less risk of error and no duplication of data or effort.
Security should be a priority to protect against the exposure of confidential data. You should assess how you’ll handle authentication (with or without Kerberos), authorization (access controls), over-the-network encryption, and auditing. Many of these features should be native to your Hadoop distribution, and there are also strong security vendors that provide technologies for securing Hadoop.
Any large scale deployment needs fast read, write, and update capabilities. Hadoop can support the operational requirements of an EDH with integrated, in-Hadoop databases like Apache HBase™ and Accumulo™, as well as MapR-DB (the MapR NoSQL database). This in-Hadoop model helps to simplify the overall EDH architecture.
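As a hedged illustration of that in-Hadoop operational model, here is a small sketch that writes and reads a record in HBase from Python using the happybase client. It assumes an HBase Thrift gateway is reachable and that a table named 'customer_profile' with a 'cf' column family already exists; the host and names are placeholders for illustration.

```python
import happybase

# Assumes the HBase Thrift server is reachable; host is a placeholder.
connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("customer_profile")  # hypothetical, pre-created table

# Operational-style write: upsert a single customer row.
table.put(b"cust-00042", {
    b"cf:name": b"Acme Industrial",
    b"cf:segment": b"manufacturing",
    b"cf:last_order_ts": b"2015-03-01T10:15:00Z",
})

# Low-latency point read of the same row.
row = table.row(b"cust-00042")
print(row[b"cf:segment"])

connection.close()
```

Keeping this kind of fast read/write store inside the Hadoop cluster, rather than bolting on a separate operational database, is what simplifies the overall EDH architecture.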
Using Hadoop as a foundation for an EDH is a powerful option for businesses. Choosing the correct Hadoop distribution is the key to deploying a successful EDH. Be sure not to take shortcuts – especially in a production environment – as you will want to hold your Hadoop platform to the same high expectations you have of your existing enterprise systems.
How are they accomplishing this? A new generation of hackers has learned to reverse engineer popular software programs (e.g., Windows, Outlook, Java) in order to find so-called “holes”. Once those holes are found, the hackers develop “bugs” that infiltrate computer systems, search for sensitive data and return it to the bad guys. These bugs are then sold on the black market to the highest bidder. When successful, these hackers can wreak havoc across the globe.
I recently read a Time Magazine article titled “World War Zero: How Hackers Fight to Steal Your Secrets.” The article discussed a new generation of software companies made up of former hackers. These firms help other software companies by identifying potential security holes, before they can be used in malicious exploits.
This constant battle between good (data and software security firms) and bad (smart, young programmers looking to make a quick/big buck) is happening every day. Unfortunately, average consumers (you and I) are the innocent victims of this crazy and costly war. As a consumer in today’s digital and data-centric age, I worry when I see these headlines of ongoing data breaches, from the Targets of the world to my local bank down the street. I wonder not “if” but “when” I will become the next victim. According to the Ponemon Institute, the average cost of a data breach to a company was $3.5 million US dollars, 15 percent more than what it cost the year before.
As a 20-year software industry veteran, I’ve worked with many firms across the global financial services industry. As a result, my concerns about data security exceed those of the average consumer. Here are the reasons for this:
- Everything is Digital: I remember the days when ATM machines were introduced, eliminating the need to wait in long teller lines. Nowadays, most of what we do with our financial institutions is digital and online, whether on our mobile devices or desktop browsers. As such, every interaction and transaction creates sensitive data that gets dispersed across tens, hundreds, sometimes thousands of databases and systems in these firms.
- The Big Data Phenomenon: I’m not talking about sexy next generation analytic applications that promise to provide the best answer to run your business. What I am talking about is the volume of data that is being generated and collected from the countless number of computer systems (on-premise and in the cloud) that run today’s global financial services industry.
- Increased use of Off-Shore and On-Shore Development: Financial services firms increasingly leverage off-shore development partners to offset their operational and technology costs, and new technology initiatives often put sensitive production data in the hands of these external development teams.
Now here is the hard part. Given these trends and heightened threats, do the companies I do business with know where the data resides that they need to protect? How do they actually protect sensitive data when using it to support new IT projects both in-house or by off-shore development partners? You’d be amazed what the truth is.
According to the recent Ponemon Institute study “State of Data Centric Security” that surveyed 1,587 Global IT and IT security practitioners in 16 countries:
- Only 16 percent of the respondents believe they know where all sensitive structured data is located and a very small percentage (7 percent) know where unstructured data resides.
- Fifty-seven percent of respondents say not knowing where the organization’s sensitive or confidential data is located keeps them up at night.
- Only 19 percent say their organizations use centralized access control management and entitlements and 14 percent use file system and access audits.
Even worse, there is a gap between the share of respondents who see this as a threat and the share who treat it as a priority. Seventy-nine percent of respondents agree that not knowing where sensitive and confidential information resides is a significant security risk facing their organizations, but a much smaller percentage (51 percent) believes that securing and/or protecting that data is a high priority in their organizations.
I don’t know about you, but this is alarming and worrisome to me. I am ready to reach out to my banker and my local retailer, let them know about my concerns, and make sure they communicate those concerns to the top of their organizations. In today’s globally and socially connected world, news travels fast, and given how hard it is to build trusting customer relationships, one would think every business from the local mall to Wall Street should be asking whether they are doing what they need to do to identify and protect their number one digital asset: their data.
This creative approach to problem solving came from the Swiss Army’s request to build a soldier’s knife. In the end, the solution was all about getting the right tool for the right job in the right place. In many cases soldiers didn’t need industrial-strength tools; all they really needed was a compact and lightweight tool to get the job at hand done quickly.
Putting this into perspective with today’s world of data integration, using enterprise-class data integration tools for the smaller data integration project is overkill and typically out of reach for the smaller organization. However, these smaller data integration projects are just as important as the larger enterprise projects, and they are often the innovation behind a new way of business thinking. The traditional hand-coding approach to the smaller data integration project is not scalable, not repeatable and prone to human error; what’s needed is a compact, flexible and powerful off-the-shelf tool.
Thankfully, over a century after the world embraced the Swiss Army Knife, someone at Informatica was paying attention to revolutionary ideas. If you’ve not yet heard the news, a version of the Informatica platform called PowerCenter Express has been released free of charge, so you can use it to handle an assortment of what I’d characterize as high-complexity / low-volume data integration challenges and experience a subset of the Informatica platform for yourself. I’d emphasize that PowerCenter Express doesn’t replace the need for Informatica’s enterprise-grade products, but it is ideal for rapid prototyping, profiling data, and developing quick proofs of concept.
PowerCenter Express provides a glimpse of the evolving Informatica platform by integrating four Informatica products into a single, compact tool. There are no database dependencies and the product installs in just under 10 minutes. Much to my own surprise, I use PowerCenter Express quite often in the various aspects of my job with Informatica. I have it installed on my laptop so it travels with me wherever I go. It starts up quickly, so it’s ideal for getting a little work done on an airplane.
For example, I recently wanted to explore building some rules for an upcoming proof of concept on a plane ride home, so I could claw back some personal time for my weekend. I used PowerCenter Express to profile some data and create a mapping. And this mapping wasn’t something I needed to throw away and recreate in an enterprise version after my flight landed. Vibe, Informatica’s build-once/run-anywhere metadata-driven architecture, allows me to export a mapping I create in PowerCenter Express to one of the enterprise versions of Informatica’s products such as PowerCenter, Data Quality or Informatica Cloud.
As I alluded to earlier in this article, being a free offering I honestly didn’t expect too much from PowerCenter Express when I first started exploring it. However, due to my own positive experiences, I now like to think of PowerCenter Express as the Swiss Army Knife of Data Integration.
To start claiming back some of your personal time, get started with the free version of PowerCenter Express, found on the Informatica Marketplace at: https://community.informatica.com/solutions/pcexpress