Data, whether manually created or machine generated, tends to live on forever, because people hold on to it for fear of losing information by destroying it.
There is a saying in the Bhagavad Gita:
jaathasya hi dhruvo mr.thyur dhr.uvam janma mr.thasya cha |
thasmaad aparihaarye’rthe’ na thvam sochithum-arhasi ||
“For death is certain to one who is born; to one who is dead, birth is certain; therefore, thou shalt not grieve for what is unavoidable.”
Mankind seems to love to preserve whatever it has created forever. However, this can be a very expensive proposition when it comes to keeping data around in perpetuity in enterprise IT departments.
Data growth occurs for various reasons, including mergers and acquisitions, business initiatives that introduce new applications, new IT initiatives, and added delivery channels such as the web and mobile devices. IT also needs to support the creation of new derived data instances every time a new analytical application is built. Moreover, for every copy of production data, there may be multiple copies in development and testing environments, creating a multiplier effect.
As data volumes grow, costs increase year after year. These include hard costs such as storage and server hardware, along with the associated maintenance contracts and software licenses, as well as soft costs such as the administrative effort associated with backup and recovery, tuning, design, data movement, master data management, and more.
With data growing exponentially, today’s organizations have realized that the value of data diminishes with time. This inverse correlation between the age and value of data presents a big cost-savings opportunity for IT departments.
The opportunity is to manage and maintain data of different value in different ways, with different cost structures, based on access frequency and performance requirements. There are multiple alternative methods of managing data growth, improving performance, and saving costs, all grounded in an understanding of how the data is accessed.
- You can first partition data based on your query patterns and access frequency to improve performance. It may be worth looking beyond the underlying database’s partitioning capabilities here: there are tools that make partition maintenance easier through automation rather than manual scripting, and that automate more complex partitioning based on related tables or entities, not just individual tables.
- Once you no longer use the data frequently you can start archiving it to another database instance with lower cost infrastructure, under a different level of maintenance, while maintaining seamless access to the combined production and archived data from the original application interface. By moving data out of the production system, you reduce the size of the production database, thereby improving response time and avoiding additional infrastructure and license costs. If you have partitioned the data in the first place, then you can optimize the archiving process by archiving entire older partitions, instead of moving and purging the production data record by record. Moving inactive data to another database instance is ideal for archiving data that’s less frequently accessed, but still requires relatively high performance access from the same original application interface.
- Once data is rarely accessed, it can be archived to an optimized file archive that dramatically reduces storage capacity requirements through compression (up to 98% compression is possible), while remaining easily accessible from any reporting tool. The tradeoff between archiving to a database and archiving to the optimized file archive is access performance versus extreme storage reduction and guaranteed immutability of the data.
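The first two techniques above, partitioning by date and archiving whole partitions to a lower-cost database instance while keeping seamless combined access, can be sketched as follows. This is a minimal illustration only: it uses SQLite in-memory databases as stand-ins for the production and archive instances, per-year tables as stand-ins for native range partitions, and hypothetical table and column names (`orders`, `order_date`); real deployments would use the database's native partitioning and a purpose-built archiving product.

```python
import sqlite3

# Production database, with a second "instance" attached as the archive.
prod = sqlite3.connect(":memory:")
prod.execute("ATTACH DATABASE ':memory:' AS archive")

# "Partitions": one table per year, so older data is physically separated.
for year in (2022, 2023, 2024):
    prod.execute(
        f"CREATE TABLE orders_{year} (id INTEGER, order_date TEXT, amount REAL)"
    )
prod.execute("INSERT INTO orders_2022 VALUES (1, '2022-03-01', 10.0)")
prod.execute("INSERT INTO orders_2023 VALUES (2, '2023-06-15', 20.0)")
prod.execute("INSERT INTO orders_2024 VALUES (3, '2024-01-09', 30.0)")

# Archive an entire old partition in one operation, instead of moving and
# purging production data record by record; the production table shrinks.
prod.execute("CREATE TABLE archive.orders_2022 AS SELECT * FROM orders_2022")
prod.execute("DROP TABLE orders_2022")

# Seamless access: a view unions live partitions with the archived one, so
# the original application interface still sees the combined data.
# (A TEMP view is used because SQLite views in the main schema may not
# reference attached databases.)
prod.execute("""
    CREATE TEMP VIEW all_orders AS
        SELECT * FROM orders_2023
        UNION ALL SELECT * FROM orders_2024
        UNION ALL SELECT * FROM archive.orders_2022
""")

visible = prod.execute("SELECT COUNT(*) FROM all_orders").fetchone()[0]
print(visible)  # all three rows remain visible through the view
```

The design point is that the application keeps querying one logical object (`all_orders`) while the physical placement of old partitions changes underneath it.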
IT organizations can choose to implement any combination of these methods of managing data growth and get the best of both worlds: keeping data for as long as required while reducing the cost of retaining it over the long term.