Data warehouses are applications– so why not manage them like one? In fact, data grows at a much faster rate in data warehouses, since they integrate date from multiple applications and cater to many different groups of users who need different types of analysis. Data warehouses also keep historical data for a long time, so data grows exponentially in these systems. The infrastructure costs in data warehouses also escalate quickly since analytical processing on large amounts of data requires big beefy boxes. Not to mention the software license and maintenance costs of such a large amount of data. Imagine how many backup media is required to backup tens to hundreds of terabytes of data warehouses on a regular basis. But do you really need to keep all that historical data in production?
One of the challenges of managing data growth in data warehouses is that it’s hard to determine which data is actually used, which data is no longer being used, or even if the data was ever used at all. Unlike transactional systems where the application logic determines when records are no longer being transacted upon, the usage of analytical data in data warehouses has no definite business rules. Age or seasonality may determine data usage in data warehouses, but business users are usually loath to let go of the availability of all that data at their fingertips. The only clear cut way to prove that some data is no longer being used in data warehouses is to monitor its usage.
Monitoring data usage is not that simple. Regular database monitoring tools will only give you query level monitoring. However, to better understand how data is being used, you need to monitor usage not only at the database level, but at the application level. For data warehouses, reports from business intelligence (BI) tools are a representative sample of how data is being used. You can also get a good understanding of which users, groups and departments are actually using the data and for what purposes and frequency, from these BI tools. So deploying a data usage monitoring tool that can track usage, not only at the database level, but at the BI tool and business user level is critical to determine how data is being used and to prioritize what data needs to be kept in the production data warehouse.
Once you have identified which data is no longer being used and which data was never used at all, then you need to take action. For those data that are no longer being used and you need to keep for regulatory compliance, you should archive them. For those data that you will never use again, you need to purge them. And for the information that is never used, you should remove them from regular data warehouse loads. Deploying a data warehouse archiving solution to manage data growth in data warehouses helps you to automatically move data that has been identified as dormant by your data usage monitoring tool, to a lower cost infrastructure or purge them entirely if they will never be used. Some archiving solutions also help you to automate the right partitioning strategy if you’re not ready to move the data out of production so that you can better optimize data warehouse queries. If data needs to be retained for compliance, then an archiving solution will further help you to enforce retention and disposal policies.
By monitoring and understanding how data is being used in your data warehouses and then taking action based on that information, you can reap the benefits of proactively managing data growth:
- Minimize overall infrastructure costs
- Reduce maintenance efforts
- Improve performance
- Ensure compliance to retention regulations
Including data usage monitoring and data warehouse archiving as part of your data warehouse management strategy just makes practical business sense. Especially if you’re managing data growth in your enterprise applications today by archiving, why not do the same for your data warehouses?