Tag Archives: data warehouse
Data Warehouse Optimization (DWO) is becoming a popular term that describes how an organization optimizes their data storage and processing for cost and performance while data volumes continue to grow from an ever increasing variety of data sources.
Data warehouses are reaching their capacity much too quickly as the demand for more data and more types of data are forcing IT organizations into very costly upgrades. Further compounding the problem is that many organizations don’t have a strategy for managing the lifecycle of their data. It is not uncommon for much of the data in a data warehouse to be unused or infrequently used or that too much compute capacity is consumed by extract-load-transform (ELT) processing. This is sometimes the result of business requests for one off business reports that are no longer used or staging raw data in the data warehouse. A large global bank’s data warehouse was exploding with 200TB of data forcing them to consider an upgrade that would cost $20 million. They discovered that much of the data was no longer being used and could be archived to lower cost storage thereby avoiding the upgrade and saving millions. This same bank continues to retire data monthly resulting in on-going savings of $2-3 million annually. A large healthcare insurance company discovered that fewer than 2% of their ELT scripts were consuming 65% of their data warehouse CPU capacity. This company is now looking at Hadoop as a staging platform to offload the storage of raw data and ELT processing freeing up their data warehouse to support the hundreds of concurrent business users. A global media & entertainment company saw their data increase by 20x per year and the associated costs increase 3x within 6 months as they on-boarded more data such as web clickstream data from thousands of web sites and in-game telemetry data.
In this era of big data, not all data is created equal with most raw data originating from machine log files, social media, or years of original transaction data considered to be of lower value – at least until it has been prepared and refined for analysis. This raw data should be staged in Hadoop to reduce storage and data preparation costs while the data warehouse capacity should be reserved for refined, curated and frequently used datasets. Therefore, it’s time to consider optimizing your data warehouse environment to lower costs, increase capacity, optimize performance, and establish an infrastructure that can support growing data volumes from a variety of data sources. Informatica has a complete solution available for data warehouse optimization.
The first step in the optimization process as illustrated in Figure 1 below is to identify inactive and infrequently used data and ELT performance bottlenecks in the data warehouse. Step 2 is to offload the data and ELT processing identified in step 1 to Hadoop. PowerCenter customers have the advantage of Vibe which allows them to map once and deploy anywhere so that ELT processing executed through PowerCenter pushdown capabilities can be converted to ETL processing on Hadoop as part of a simple configuration step during deployment. Most raw data, such as original transaction data, log files (e.g. Internet clickstream), social media, sensor device, and machine data should be staged in Hadoop as noted in step 3. Informatica provides near-universal connectivity to all types of data so that you can load data directly into Hadoop. You can even replicate entire schemas and files into Hadoop, capture just the changes, and stream millions of transactions per second into Hadoop such as machine data. The Informatica PowerCenter Big Data Edition makes every PowerCenter developer a Hadoop developer without having to learn Hadoop so that all ETL, data integration and data quality can be executed natively on Hadoop using readily available resource skills while increasing productivity up to 5x over hand-coding. Informatica also provides data discovery and profiling tools on Hadoop to help data science teams collaborate and understand their data. The final step is to move the resulting high value and frequently used data sets prepared and refined on Hadoop into the data warehouse that supports your enterprise BI and analytics applications.
To get started, Informatica has teamed up with Cloudera to deliver a reference architecture for data warehouse optimization so organizations can lower infrastructure and operational costs, optimize performance and scalability, and ensure enterprise-ready deployments that meet business SLA’s. To learn more please join the webinar A Big Data Reference Architecture for Data Warehouse Optimization on Tuesday November 19 at 8:00am PST.
Figure 1: Process steps for Data Warehouse Optimization
Data warehouses tend to grow very quickly because they integrate data from multiple sources and maintain years of historical data for analytics. A number of our customers have data warehouses in the hundreds of terabytes to petabytes range. Managing such a large amount of data becomes a challenge. How do you curb runaway costs in such an environment? Completing maintenance tasks within the prescribed window and ensuring acceptable performance are also big challenges.
We have provided best practices to archive aged data from data warehouses. Archiving data will keep the production data size at almost a constant level, reducing infrastructure and maintenance costs, while keeping performance up. At the same time, you can still access the archived data directly if you really need to from any reporting tool. Yet many are loath to move data out of their production system. This year, at Informatica World, we’re going to discuss another method of managing data growth without moving data out of the production data warehouse. I’m not going to tell you what this new method is, yet. You’ll have to come and learn more about it at my breakout session at Informatica World: What’s New from Informatica to Improve Data Warehouse Performance and Lower Costs.
I look forward to seeing all of you at Aria, Las Vegas next month. Also, I am especially excited to see our ILM customers at our second Product Advisory Council again this year.
In my previous blog, I explained how Column-oriented Database Management Systems (CDBMS), also known as columnar databases or CBAT, offer a distinct advantage over the traditional row-oriented RDBMS in terms of I/O workload, deriving primarily from basing the granularity of I/O operations on the column rather than the entire row. This technological advantage has a direct impact on the complexity of data modeling tasks and on the end-user’s experience of the data warehouse, and this is what I will discuss in today’s post. (more…)
I was at an IT conference a few years ago. The speaker was talking about application testing. At the beginning of his talk, he asked the audience:
“Please raise your hand if you flew here from out of town.”
Most of the audience raised their hands. The speaker then said:
“OK, now if you knew that the airplane you flew on had been tested the same way your company tests its applications, would you have still flown on that plane?
After some uneasy chuckling, every hand went down. Not a great affirmation of the state of application testing in most IT shops. (more…)
The reality in data warehousing is that the primary focus is on delivery. The data warehouse team is tasked with extracting, transforming, integrating, and loading data into the warehouse within increasingly tight timeframes. Twenty years ago, monthly data warehouse loads were common. Ten years ago, weekly loads became the norm. Five years ago, daily loads were called for. Nowadays, near-real-time analytics demands the data warehouse be loaded more frequently than once a day. (more…)
Has big data entered the “trough of disillusionment?” That’s what I’ve heard recently. Like many hyped up technology trends the trough can be deep and long as project failures accumulate, or for ‘hot’ trends that evolve and mature quickly the trough can be shallow and short, leading to broader and rapid adoption. Is the big data hype failing to deliver on its promise of increased revenue and competitive advantage for companies that leverage big data to introduce new products and services and improve business operations? Why is it that some big data projects fail to deliver on their promise? Svetlana Sicular, Research Director, Gartner points out in her blog Big Data is Falling into the Trough of Disillusionment that, “These [advanced client] organizations have fascinating ideas, but they are disappointed with a difficulty of figuring out reliable solutions.” There are several reasons why big data projects may fail to deliver on their promise: (more…)