Don’t Forget to Manage the Retention and Disposal of Data on Hadoop

In “Forrester’s Kobielus: It’s time for a Hadoop standards body,” an interview by Mark Brunelli with James Kobielus of Forrester Research, Kobielus argues that Hadoop is still somewhat immature and needs standards. He goes on to point out that when implementing Hadoop, “whether it’s through a data warehouse or Hadoop cluster, you’re talking about petabytes or multiple hundreds of terabytes worth of storage.” Hadoop is designed to store and access these large data volumes (which can include social media data), but on its own it does nothing to manage how long that data is retained.

The “Big Data” aspect of Hadoop has pros and cons. Having lots of data (especially social media data) is great for many things, such as targeting your marketing dollars more accurately. However, it comes at a price. Big data volumes combined with explosive data growth create a growing mass of information that can fall under government and industry regulations requiring retention management. Failure to implement records retention policies can result in non-compliance and steep fines. The courts are also looking to more sources of digital information to support the legal process. As new ways of storing data emerge, the courts will expand their requirements to cover these new data sources, such as data stored on Hadoop clusters. When a court or regulator requires you to produce this data, it falls on you, the owner of the data, to produce it in a timely manner. This means you need to have it, know where it is, and have access to it.

Data can also hurt you if it is kept past its retention period. Once the required retention period has expired, you can legally delete the data. If you do not delete it and a legal or compliance issue comes up that involves the expired data, you must produce it if you still have it. If that data is damaging to your case, it can cost you millions. Had you deleted the data as soon as its retention period expired, you could have avoided the fine or the court settlement. That is why it is important to have not only data retention policies but also data disposal policies as part of your records retention process.

Implementing retention management is not simple, but there are some basic steps to follow. First, inventory the government and industry regulations that apply to your data retention (and expect the list to keep growing). Then classify your data so you know which data is covered by those regulations and where it is stored. Next, establish retention policies based on the regulations in combination with your business requirements. Add this to an already complex mix of data sources within your environment, and it may seem a daunting task. To make it a little more complex, you need to make sure data is not deleted if a legal process has begun, even when the expiration of the data involved is imminent. You have to find some way of retaining that data until the legal process has completed; only then can it be deleted.

Implementing Hadoop may provide benefits for storing and analyzing Big Data, but you need to remember that the data is subject to the same retention regulations as any other data in your enterprise. Don’t forget to incorporate retention and disposal policy management in your Hadoop implementations. It may add some cost and time to the actual implementation, but it could save you millions down the road in steep government fines.
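To make the steps described above more concrete (classify the data, assign a retention period, and never dispose of data under a legal hold), here is a minimal Python sketch. It is only an illustration under stated assumptions: the data classes, the retention periods, the `DataSet` record, and the `disposition` function are all hypothetical names invented for this example, and the periods are not taken from any actual regulation. It does not call any Hadoop API; it only decides what should happen to each data set.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Set

# Hypothetical retention periods per data class, in days.
# Real values must come from your own regulatory inventory.
RETENTION_DAYS: Dict[str, int] = {
    "financial_record": 7 * 365,
    "social_media": 365,
}


@dataclass(frozen=True)
class DataSet:
    path: str        # illustrative location, e.g. a directory on the cluster
    data_class: str  # classification assigned during the inventory step
    created: date


def disposition(ds: DataSet, today: date, legal_holds: Set[str]) -> str:
    """Return 'hold', 'review', 'retain', or 'dispose' for one data set."""
    # A legal hold always wins, even if the retention period has expired.
    if ds.path in legal_holds:
        return "hold"
    # Unclassified data cannot be judged; flag it for manual review.
    if ds.data_class not in RETENTION_DAYS:
        return "review"
    expires = ds.created + timedelta(days=RETENTION_DAYS[ds.data_class])
    return "dispose" if today >= expires else "retain"
```

A scheduled job could call `disposition` for every cataloged data set and act only on the ones marked `dispose`, leaving anything marked `hold` or `review` untouched until the hold is lifted or the data is classified.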
