Tag Archives: partitioning
Data warehouses tend to grow very quickly because they integrate data from multiple sources and maintain years of historical data for analytics. A number of our customers have data warehouses in the hundreds of terabytes to petabytes range. Managing such a large amount of data becomes a challenge. How do you curb runaway costs in such an environment? Completing maintenance tasks within the prescribed window and ensuring acceptable performance are also big challenges.
We have provided best practices for archiving aged data from data warehouses. Archiving keeps the production data size at a nearly constant level, reducing infrastructure and maintenance costs while keeping performance up. At the same time, you can still access the archived data directly from any reporting tool if you need to. Yet many are loath to move data out of their production system. This year at Informatica World, we’re going to discuss another method of managing data growth without moving data out of the production data warehouse. I’m not going to tell you what this new method is, yet. You’ll have to come and learn more about it at my breakout session at Informatica World: What’s New from Informatica to Improve Data Warehouse Performance and Lower Costs.
I look forward to seeing all of you at Aria, Las Vegas next month. Also, I am especially excited to see our ILM customers at our second Product Advisory Council again this year.
Columnar Deduplication and Column Tokenization: Improving Database Performance, Security and Interoperability
For some time now, a special technique called columnar deduplication has been implemented by a number of commercially available relational database management systems. In today’s blog post, I discuss the nature and benefits of this technique, which I will refer to as column tokenization for reasons that will become evident.
Column tokenization is a process in which a unique identifier (called a Token ID) is assigned to each unique value in a column and then used to represent that value wherever it appears in the column. Using this approach, data size reductions of up to 50% can be achieved, depending on the number of unique values in the column (that is, on the column’s cardinality). Some RDBMSs use this technique simply as a way of compressing data: the column tokenization process is integrated into the buffer and I/O subsystems, and when a query is executed, each row must be materialized and the token IDs replaced by their corresponding values. At Informatica, column tokenization is the core of our technology for the File Archive Service (FAS), part of the Information Lifecycle Management product family: the tokenized structure is actually used during query execution, with row materialization occurring only when the final result set is returned. We also use special compression algorithms to achieve further size reduction, typically on the order of 95%.
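The core idea can be sketched in a few lines of Python. This is a simplified illustration of dictionary-style tokenization in general, not FAS’s actual implementation; the function and variable names are hypothetical:

```python
def tokenize_column(values):
    """Assign a Token ID to each distinct value; store the column as
    integer Token IDs plus a small dictionary of unique values."""
    dictionary = {}   # value -> Token ID
    tokens = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        tokens.append(dictionary[v])
    # reverse mapping, needed only when rows are materialized
    lookup = {tid: v for v, tid in dictionary.items()}
    return tokens, lookup

def materialize(tokens, lookup):
    """Replace Token IDs with their values (final result set only)."""
    return [lookup[t] for t in tokens]

states = ["CA", "NY", "CA", "TX", "NY", "CA"]
tokens, lookup = tokenize_column(states)
# tokens is [0, 1, 0, 2, 1, 0]: six rows, but only 3 values stored once

# a predicate can be evaluated on Token IDs directly: resolve the
# literal to its token once, then compare small integers per row
ca_token = next(t for t, v in lookup.items() if v == "CA")
matches = [i for i, t in enumerate(tokens) if t == ca_token]
```

The point of executing queries on the tokenized structure, rather than materializing rows first, is that each repeated value is compared only as a small integer, and the full string is reconstructed just once per row in the final result set.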
The Oracle Application User Group Archive and Purge Special Interest Group held its semi-annual meeting on Sunday, September 30th, at Oracle OpenWorld. Once again, this session was very well attended – even more so this year because of the expert panel, which included Ahmed Alomari, Founder of Cybermoor; Isam Alyousfi, Senior Director, Oracle Applications Tuning Group; Sameer Barakat, Oracle Applications Tuning Group; and Ziyad Dahbour, now at Informatica (Founder of TierData and Outerbay). (more…)
Thousands of Oracle OpenWorld 2012 attendees visited the Informatica booth to learn how to leverage their combined investments in Oracle and Informatica technology. Informatica delivered over 40 presentations on topics ranging from cloud to data security to smart partitioning. Key Informatica executives and experts from product engineering and product management spoke with hundreds of users and answered questions on how Informatica can help them improve Oracle application performance, lower risk and costs, and reduce project timelines. (more…)
Alternative Methods of Managing Data Growth and Best Practices for Using Them as Part of an Enterprise Information Lifecycle Management Strategy
Data, whether manually created or machine generated, tends to live on forever, because people hold on to it for fear of losing information by destroying it.
There is a saying in Bhagavad Gita:
jaathasya hi dhruvo mr.thyur dhr.uvam janma mr.thasya cha |
thasmaad aparihaarye’rthe’ na thvam sochithum-arhasi ||
“For death is certain to one who is born; to one who is dead, birth is certain; therefore, thou shalt not grieve for what is unavoidable.” (more…)
Both partitioning and archiving are methods of improving database and application performance. Depending on a database administrator’s comfort level with one technology or method over another, either partitioning or archiving could be implemented to address performance issues caused by data growth in production applications. But what are the best practices for using one or the other, and how can they be used better together?
Database partitioning and database archiving are both methods for improving application performance. Many IT organizations use one or the other, but using them together can provide additional incremental value to an organization.
Database partitioning is a well-known method to DBAs and is supported by most of the commercially available databases. The benefits of partitioning include: (more…)
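One widely cited benefit of partitioning is partition pruning: queries that filter on the partition key skip whole partitions instead of scanning the entire table. A toy Python model of a range-partitioned table makes the idea concrete (the layout, table, and data here are hypothetical, not any particular database’s implementation):

```python
from datetime import date

# a "table" range-partitioned by year of the order date
partitions = {
    2010: [("ord-1", date(2010, 3, 5))],
    2011: [("ord-2", date(2011, 7, 9))],
    2012: [("ord-3", date(2012, 1, 2)), ("ord-4", date(2012, 6, 30))],
}

def query(start, end):
    """Return orders in [start, end], scanning only the partitions
    whose key range overlaps the predicate (partition pruning)."""
    rows = []
    for year, part_rows in partitions.items():
        if start.year <= year <= end.year:   # prune by partition key
            rows += [r for r in part_rows if start <= r[1] <= end]
    return rows
```

A query restricted to 2012, for example, never touches the 2010 and 2011 partitions at all, which is where much of the performance benefit comes from on large tables.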
In one of my earlier blogs, I wrote about why you still need database archiving when you already partition your database. In a similar vein, many people also ask me why you still need to archive when you already use database compression to reduce storage capacity and cost. The benefits of archiving, which you can’t achieve with compression and/or partitioning alone, are still the same:
- Archiving allows you to completely move data volumes out of the production system to improve response time and reduce infrastructure costs. Why keep unused data, even if compressed, on high cost server infrastructure when you don’t need to? Why add overhead to query processing when you can remove the data from being processed at all?
- Avoid server and software license upgrades. By removing inactive data from the database, you no longer require as much processing power and you can keep your existing server without having to add CPU cores and additional licenses for your database and application. This further eliminates costs.
- Reduce overall administration and maintenance costs. If you keep unused data around in your production system, you still need to back it up, replicate it for high availability, clone it for non-production copies, recover it in the event of a disaster, upgrade it, organize and partition it, and consider it as part of your performance tuning strategy. Yes, it will take less time to back up, copy, and restore, since the data is compressed and smaller, but why include that data in production maintenance activities at all if it’s infrequently used?
- Remove the multiplier effect. The cost of additional data volume in production systems is multiplied when you consider how many copies you have of that production data in mirrors, backups, clones, non-production systems, and reporting warehouses. The size multiplier is less since the data is compressed, but it’s still more wasted capacity in multiple locations. Not to mention the additional server, software license, and maintenance costs associated with the additional volumes in those multiple copies. So it’s best to just remove that data size at the source.
- Ensure compliance by enforcing retention and disposition policies. As I discussed in my previous blog on the difference between archiving and backup, archiving is the solution for long term data retention. Archiving solutions, such as Informatica Data Archive, have integration points with records management software or provide built-in retention management to enforce the retention of data for a specified period based on policies. During that period, the immutability and authenticity of the archived data is ensured, and when the retention period expires, records are automatically purged after the appropriate review and approval process. Regulated data needs to be retained long enough to comply with regulations, but keeping data for too long can also become a legal liability. So it’s important that expired records are purged in a timely manner. Just keeping data in production databases indefinitely doesn’t help you to reduce your compliance and legal risks.
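The retention-and-disposition idea described in the last point can be sketched in Python. This is a hypothetical model for illustration only; products such as Informatica Data Archive enforce retention through built-in policies and review/approval workflows, not code like this:

```python
from datetime import date, timedelta

# hypothetical retention policy: days to retain each record type
RETENTION_DAYS = {"invoice": 7 * 365, "purchase_order": 10 * 365}

def expired_records(records, today):
    """Return IDs of archived records whose retention period has
    elapsed; these become candidates for review and purge."""
    candidates = []
    for rec in records:
        keep = timedelta(days=RETENTION_DAYS[rec["type"]])
        if today - rec["archived_on"] > keep:
            candidates.append(rec["id"])
    return candidates

archive = [
    {"id": "r1", "type": "invoice", "archived_on": date(2000, 1, 1)},
    {"id": "r2", "type": "invoice", "archived_on": date(2024, 1, 1)},
]
# only r1 has outlived its 7-year retention as of mid-2024
to_purge = expired_records(archive, date(2024, 6, 1))
```

The key design point is that disposition is driven by policy and record type rather than ad hoc deletes, so expired data is purged consistently and on time, while unexpired data remains immutable.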
Implementing enterprise application and database archiving is just plain best practices. The best way to improve performance and reduce infrastructure and maintenance costs is to reduce the data volume in your production systems. Why increase overhead when you don’t have to? Today’s archiving solutions allow you to maintain easy access to the data after archival, so there is no reason to keep data around just for the sake of accessibility. By moving inactive but regulated data to a central archival store, you can uniformly enforce retention policies. At the same time, you can reduce the time and cost of eDiscovery by making all types of data centrally and easily searchable.
Many people ask me what additional benefits archiving can provide when you already partition your database. The answer is twofold:
- Archiving allows you to eliminate more data volumes from being processed to further improve response time.
- Partitioning doesn’t necessarily reduce your backup window. The only way to shorten your backup window is to remove data from your production database by archiving (or purging) it to another location.
Let me expand on these points. (more…)