Tag Archives: data warehouse archiving
Data warehouses are applications, so why not manage them like one? In fact, data grows at a much faster rate in data warehouses, since they integrate data from multiple applications and cater to many different groups of users who need different types of analysis. Data warehouses also keep historical data for a long time, so volumes keep accumulating in these systems. Infrastructure costs also escalate quickly, since analytical processing on large amounts of data requires big, beefy boxes. Not to mention the software license and maintenance costs for such a large amount of data. Imagine how much backup media is required to back up tens to hundreds of terabytes of data warehouse data on a regular basis. But do you really need to keep all that historical data in production?
One of the challenges of managing data growth in data warehouses is that it’s hard to determine which data is actually used, which data is no longer being used, or even whether the data was ever used at all. Unlike transactional systems, where the application logic determines when records are no longer being transacted upon, the usage of analytical data in data warehouses follows no definite business rules. Age or seasonality may determine data usage in data warehouses, but business users are usually loath to give up having all that data at their fingertips. The only clear-cut way to prove that some data is no longer being used in a data warehouse is to monitor its usage.
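To make the idea of usage monitoring a little more concrete, here is a minimal sketch in Python. It assumes you can already export an access log of (partition, access date) pairs from your warehouse's query history; the partition names, the log contents, and the one-year threshold are all hypothetical, and the function name `find_inactive_partitions` is made up for illustration.

```python
from datetime import date, timedelta


def find_inactive_partitions(partitions, access_log, as_of, inactive_days=365):
    """Return (partition, last_access) pairs with no access in the last `inactive_days`.

    `access_log` is an iterable of (partition_name, access_date) tuples.
    A partition that never appears in the log is reported with last_access=None.
    """
    cutoff = as_of - timedelta(days=inactive_days)

    # Track the most recent access date seen for each partition.
    last_access = {}
    for partition, accessed_on in access_log:
        if partition not in last_access or accessed_on > last_access[partition]:
            last_access[partition] = accessed_on

    inactive = []
    for partition in partitions:
        seen = last_access.get(partition)
        if seen is None or seen < cutoff:
            inactive.append((partition, seen))
    return inactive


# Hypothetical partitions and access history, for illustration only.
partitions = ["sales_2009", "sales_2010", "sales_2011", "sales_2012"]
access_log = [
    ("sales_2012", date(2012, 6, 1)),
    ("sales_2011", date(2012, 5, 20)),
    ("sales_2010", date(2010, 12, 31)),
]

inactive = find_inactive_partitions(partitions, access_log, as_of=date(2012, 6, 30))
for name, seen in inactive:
    print(name, "last accessed:", seen)
```

A partition that was never queried and one that was last queried long ago both end up on the candidate list, which is the kind of evidence that helps when business users are reluctant to let data go.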
In one of my earlier blogs, I wrote about why you still need database archiving when you already partition your database. In a similar vein, many people also ask me why you still need to archive when you already have database compression to reduce your storage capacity and cost. The benefits of archiving, which you can’t achieve with just compression and/or partitioning, are still the same:
- Archiving allows you to completely move data volumes out of the production system to improve response time and reduce infrastructure costs. Why keep unused data, even if compressed, on high cost server infrastructure when you don’t need to? Why add overhead to query processing when you can remove the data from being processed at all?
- Avoid server and software license upgrades. By removing inactive data from the database, you no longer require as much processing power and you can keep your existing server without having to add CPU cores and additional licenses for your database and application. This further eliminates costs.
- Reduce overall administration and maintenance costs. If you still keep unused data around in your production system, you still need to back it up, replicate it for high availability, clone it for non-production copies, recover it in the event of a disaster, upgrade it, organize and partition it, and consider it as part of your performance tuning strategy. Yes, it will take less time to back up, copy, restore, etc., since the data is compressed and is smaller, but why even include that data as part of production maintenance activities at all, if it’s infrequently used?
- Remove the multiplier effect. The cost of additional data volume in production systems is multiplied when you consider how many copies you have of that production data in mirrors, backups, clones, non-production systems, and reporting warehouses. The size multiplier is less since the data is compressed, but it’s still more wasted capacity in multiple locations. Not to mention the additional server, software license, and maintenance costs associated with the additional volumes in those multiple copies. So it’s best to just remove that data size at the source.
- Ensure compliance by enforcing retention and disposition policies. As I discussed in my previous blog on the difference between archiving and backup, archiving is the solution for long-term data retention. Archiving solutions, such as Informatica Data Archive, have integration points with records management software or provide built-in retention management to enforce the retention of data for a specified period based on policies. During that period, the immutability and authenticity of the archived data are ensured, and when the retention period expires, records are automatically purged after the appropriate review and approval process. Regulated data needs to be retained long enough to comply with regulations, but keeping data for too long can also become a legal liability. So it’s important that expired records are purged in a timely manner. Just keeping data in production databases indefinitely doesn’t help you reduce your compliance and legal risks.
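The multiplier effect in the list above is easy to see with some back-of-the-envelope arithmetic. The sketch below is only an illustration: the 10 TB production size, the five copies, the 50% compression ratio, and the 40% active fraction are all made-up numbers, so plug in your own.

```python
def total_footprint_tb(production_tb, copies, compression_ratio=1.0):
    """Total capacity for production plus `copies` full copies of it.

    `compression_ratio` is compressed size divided by original size,
    so 0.5 means the data shrinks to half its original size.
    """
    per_copy_tb = production_tb * compression_ratio
    return per_copy_tb * (1 + copies)


# Hypothetical figures, for illustration only.
production_tb = 10.0    # data volume in the production database
copies = 5              # e.g. mirror, backup, clone, non-production, reporting
active_fraction = 0.4   # share of the data that is still actively used

uncompressed = total_footprint_tb(production_tb, copies)
compressed = total_footprint_tb(production_tb, copies, compression_ratio=0.5)
archived = total_footprint_tb(production_tb * active_fraction, copies,
                              compression_ratio=0.5)

print(f"No compression, no archiving: {uncompressed:.1f} TB")
print(f"Compression only:             {compressed:.1f} TB")
print(f"Compression plus archiving:   {archived:.1f} TB")
```

Compression shrinks every copy, but the inactive data is still multiplied across all of them. Removing it at the source shrinks production and every downstream copy at once. The archived data still needs a home, of course, but typically on lower-cost storage and without the same number of copies.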
Implementing enterprise application and database archiving is simply a best practice. The best way to improve performance and reduce infrastructure and maintenance costs is to reduce the data volume in your production systems. Why increase overhead when you don’t have to? Today’s archiving solutions allow you to maintain easy access to the data after archival, so there is no reason to keep data around just for the sake of accessibility. By moving inactive but regulated data to a central archival store, you can uniformly enforce retention policies. At the same time, you can reduce the time and cost of eDiscovery by making all types of data centrally and easily searchable.
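To illustrate what uniformly enforcing a retention policy might look like in principle, here is a minimal Python sketch. It is not how Informatica Data Archive or any other product implements retention management; the `ArchivedRecord` class, the status names, and the seven-year period are hypothetical and exist only to show the logic of retain, review, then purge.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ArchivedRecord:
    """Hypothetical archived record with a policy-assigned retention period."""
    record_id: str
    archived_on: date
    retention_years: int


def retention_expiry(record):
    """Date on which the retention period ends."""
    start = record.archived_on
    target_year = start.year + record.retention_years
    try:
        return start.replace(year=target_year)
    except ValueError:
        # Archived on Feb 29 and the target year is not a leap year.
        return date(target_year, 3, 1)


def disposition(record, today, approved=False):
    """Decide what to do with a record under its retention policy.

    Returns "retain" while the retention period is running,
    "awaiting_approval" once it has expired but no one has signed off,
    and "purge" once it has expired and the purge has been approved.
    """
    if today < retention_expiry(record):
        return "retain"
    return "purge" if approved else "awaiting_approval"


# Hypothetical record, for illustration only.
invoice = ArchivedRecord("inv-001", archived_on=date(2005, 3, 15), retention_years=7)
print(disposition(invoice, today=date(2012, 3, 14)))
print(disposition(invoice, today=date(2012, 3, 15)))
print(disposition(invoice, today=date(2012, 3, 15), approved=True))
```

Because the same rule is applied to every record in one central store, nothing is purged too early, and nothing lingers past its expiry without someone having to decide on it.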