How Long Data Should Be Kept in Warehouses: Two Sides of “Indefinite” Retention
*The original article by the author is posted on Sandhill.com.
Everyone knows that in the United States you are supposed to keep your tax data for seven years just in case you get audited. But is there a hard-and-fast rule for how long you should keep your business data or personal data? The question gets tougher as data proliferates and becomes more challenging to manage, both personally and professionally.
With data being the obsession of business executives, entrepreneurs and technology investors, there is every justification for wanting to store increasing amounts of it. Big data is opening incredible new opportunities, and the promise of potential insight is just too alluring to let the data go. Yet some would claim the economics of storing data has shifted from a matter of cost to one of risk. Do we have an obligation to retain data that could be used to improve humanity? And if the answer is yes, how long do we need to keep it? Or should we destroy data after it has served its purpose, to limit the impact of potential breaches? How do we know that the data has been safely disposed of so malicious intruders cannot get their hands on it?
Economics of long-term data retention
Let’s consider the economics of storing massive volumes of data for longer periods of time. If you search for “the fully burdened cost of disk storage,” you will find several references from 2009 citing $25 per gigabyte (or $25,000 per terabyte) per month for on-premises storage. At that rate, the total cost of ownership (TCO) of 10TB over five years works out to approximately $15 million (10TB × $25,000 × 60 months). Today, Amazon Web Services lists 10TB of standard storage for about $300/month.
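To make the comparison concrete, here is a minimal Python sketch of the monthly arithmetic. It takes the two figures quoted above at face value and assumes decimal units (1TB = 1,000GB); the function and constant names are made up for illustration.

```python
# Figures quoted in the text; decimal units (1 TB = 1,000 GB) are an assumption.
GB_PER_TB = 1_000
ON_PREM_2009_USD_PER_GB_MONTH = 25.0   # "fully burdened" on-premises figure
CLOUD_USD_PER_MONTH_FOR_10TB = 300.0   # approximate cloud list price


def monthly_cost(terabytes: float, usd_per_gb_month: float) -> float:
    """Monthly storage cost in USD for a given capacity and per-GB rate."""
    return terabytes * GB_PER_TB * usd_per_gb_month


on_prem_monthly = monthly_cost(10, ON_PREM_2009_USD_PER_GB_MONTH)
cloud_rate_per_gb = CLOUD_USD_PER_MONTH_FOR_10TB / (10 * GB_PER_TB)
ratio = on_prem_monthly / CLOUD_USD_PER_MONTH_FOR_10TB

print(f"On-premises (2009 rate): ${on_prem_monthly:,.0f} per month")
print(f"Implied cloud rate:      ${cloud_rate_per_gb:.3f} per GB per month")
print(f"Cost ratio:              about {ratio:,.0f}x")
```

On these assumptions, 10TB costs $250,000 per month at the 2009 on-premises rate, against roughly three cents per gigabyte per month in the cloud.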
Now apply the Jevons paradox from economics to cloud storage, or computing in general: reducing the cost of storage creates new opportunities for use. Add low-cost, distributed compute power to that low-cost commodity disk, sprinkle in some open source software, such as Hadoop, and voilà! You have a big data market expected to reach $50 billion by 2020.
Information Lifecycle Management
Does anyone remember the term Information Lifecycle Management, or ILM for short? Major storage vendors promoted ILM strategies in 2004 to identify ways storage administrators could lower the total cost of storage by introducing tiers of storage. Mission-critical data could be stored on highly available, redundant technology while aging data could be stored on lower-cost storage with lower service levels. Aligning the investment in infrastructure with the value of data to the business lowers total cost.
Applying ILM requires implementing a data classification system that allows you to tag data sets based on metadata and business rules, and to monitor changes to each over time. Business glossary and metadata management tools that are integrated with data integration and migration technologies are quite useful for automating the movement of data from transaction processing databases to the data warehouse, to analytical databases and to open source platforms such as Hadoop.
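As a rough illustration of rule-based tiering, the Python sketch below tags a data set with a storage tier based on two pieces of metadata. The tier names, the 90-day and 365-day thresholds and the `DataSet` fields are all hypothetical; a real ILM policy would take its rules from the business glossary.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DataSet:
    name: str
    last_accessed: date
    mission_critical: bool


def assign_tier(ds: DataSet, today: date) -> str:
    """Pick a storage tier from simple, hypothetical business rules."""
    age_days = (today - ds.last_accessed).days
    if ds.mission_critical and age_days <= 90:
        return "tier1-high-availability"   # redundant, highly available storage
    if age_days <= 365:
        return "tier2-warehouse"           # standard service levels
    return "tier3-archive"                 # low-cost storage, e.g. Hadoop


orders = DataSet("orders", date(2015, 5, 1), mission_critical=True)
old_logs = DataSet("web_logs_2012", date(2013, 1, 15), mission_critical=False)

print(assign_tier(orders, today=date(2015, 6, 1)))
print(assign_tier(old_logs, today=date(2015, 6, 1)))
```

Re-running the rule on a schedule is what moves aging data down the tiers, which is the part ILM tooling is meant to automate.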
Not that anyone needs a history lesson, but it sets important context for the advent of big data, the role of the data warehouse and the topic of data retention. Rather than focusing on cost, Cloudera founder Amr Awadallah writes about how Hadoop can be used as an Active Archive in his blog on big data’s New Use Cases. He points out that not all data deserves a “first-class ticket” in an analytical database. Detailed, granular data can be stored cost-effectively in an “economy class” seat in Hadoop.
Argument for keeping data indefinitely
Data scientists can now affordably keep reams of detailed, granular historical data for practically nothing, which is critical since the volume of data still to come will dwarf what we hold today. Marc Benioff, CEO of Salesforce.com, stated at Dreamforce 2014 that “90 percent of the world’s data was created in the last two years.” Can you imagine what is to come?!
Historical data gives context when searching for patterns. Full sets of data are extremely valuable when testing a hypothesis or a predictive model. Storage is no longer a cost concern. In fact, when you look at the adoption of Hadoop and some of its use cases, or at how DataKind is using data to serve humanity, the argument for keeping data indefinitely is quite profound. We are making a difference in our world – commercially and medically – with data.
Argument for enforcing limited data retention
While advocates of democratizing data for the sake of insight via analytics want data available forever, there is a darker side to the story. As data volumes grow, so do the number and severity of data theft and cybercrime incidents. Chief data officers want liberal access to data; chief information security officers want to protect data from foul play and innocent mistakes. For this reason, risk officers and records retention managers are weighing the pros and cons of their current data and records retention schedules.
Healthcare.gov manages a government data warehouse that stores millions of personal records. Currently, this system, known as MIDAS, has a retention period of “indefinite.” According to an article by the Associated Press, this is raising serious concerns given the type of information residing in the data warehouse – information that could be used for identity theft, insurance fraud and falsified tax claims.
This year, the White House appointed its first Chief Data Scientist, DJ Patil; one of his initial tasks will be to work on the Administration’s Precision Medicine Initiative. More than one million Americans will be asked to volunteer their health data. Predictive models will use individuals’ conditions and genetic makeup to determine better, more precise, personalized treatments. If individuals are asked to volunteer their data, they should also have a say in how long their data is kept in these research data stores – or at least have some assurance that specific data fields are de-identified when used in analytics.
Clearly there are two sides to the argument for keeping data indefinitely. It ultimately comes down to the type of data retained, the purpose or use case for retaining it and the risk of exposure should the data be breached.
As the number of data sources and the volume of data increase, keeping track of the data you have becomes extremely difficult, especially as data moves to and from the cloud without help from IT or the watchful eye of your security teams. In a recent research report conducted by The Ponemon Institute on behalf of Informatica, more than 50 percent of respondents stated that the top thing that keeps security practitioners up at night is “not knowing where sensitive and confidential data is.” You cannot protect data if you do not know where it is.
Best practices incorporate a compromise that follows a philosophy similar to Information Lifecycle Management. At the foundation is a governance process that incorporates a data classification discipline. Once you know what data is considered sensitive or confidential and where it resides, you can make precise, justified investments in data security technologies. As data ages, it may make sense to implement data de-identification, or data masking, in analytics environments to mitigate or even eliminate risk.
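As a hedged illustration of masking in an analytics environment, the Python sketch below replaces fields that classification has flagged as sensitive with keyed, deterministic tokens. The field names, the key and the 16-character token length are assumptions for the example, and keyed hashing is only one of several masking techniques. Note that this is pseudonymization; it does not by itself guarantee full de-identification.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical; keep real keys in a vault
SENSITIVE_FIELDS = {"name", "ssn"}             # assumed output of data classification


def pseudonymize(value: str) -> str:
    """Return a keyed, deterministic token so joins still work without exposing the value."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_record(record: dict) -> dict:
    """Copy a record, replacing classified-sensitive fields with tokens."""
    return {
        field: pseudonymize(str(value)) if field in SENSITIVE_FIELDS else value
        for field, value in record.items()
    }


patient = {"name": "Jane Doe", "ssn": "000-00-0000", "diagnosis_code": "E11"}
masked = mask_record(patient)
print(masked)
```

Because the same input always yields the same token, analysts can still count and join on the masked fields, while the original values stay out of the analytics environment.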
Where there is a desire to retain data for potential future use as systems age, another option that is regaining popularity is archiving technology. By moving sensitive and regulated data to a common, centralized, highly compressed and secure data store, you reduce the potential risk while incorporating fine-grained, attribute-based access controls.
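To show what an attribute-based check might look like, here is a minimal Python sketch that decides access from attributes of the user and of the field being requested. The attribute names (`role`, `purpose`, `classification`) and the sample policy are hypothetical and not taken from any particular archiving product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    role: str
    purpose: str


@dataclass(frozen=True)
class FieldPolicy:
    classification: str                        # e.g. "public" or "sensitive"
    allowed_roles: frozenset = frozenset()
    allowed_purposes: frozenset = frozenset()


def can_read(user: User, policy: FieldPolicy) -> bool:
    """Grant access only when the user's attributes satisfy the field's policy."""
    if policy.classification == "public":
        return True
    return user.role in policy.allowed_roles and user.purpose in policy.allowed_purposes


ssn_policy = FieldPolicy(
    classification="sensitive",
    allowed_roles=frozenset({"records-manager"}),
    allowed_purposes=frozenset({"audit"}),
)

print(can_read(User("records-manager", "audit"), ssn_policy))
print(can_read(User("analyst", "research"), ssn_policy))
```

The policy is attached to the field rather than to the user, so a single archived record can expose some attributes widely while keeping regulated ones restricted.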
I think we can all agree that data volumes will continue to grow and the number of unique sources will likely increase. Looking back at how we stored information 10 years ago should give you an idea of how much to expect things to change in the next 10 years. Incorporating data classification and retention management into enterprise data architectures, and resurrecting the good old days of ILM, will certainly improve our ability to make compromises that yield the best mix of advantage and reduced risk. Maybe this time around, we can make ILM seamless or invisible so that it just becomes part of the data-management platform.
While the IRS might be right that seven years is sufficient for keeping tax data, enterprise data is another matter. Given the direction we are heading, with data volumes increasing exponentially and the cost of storing that data and keeping it safe to consider, IT might just need a 70-year rule for keeping enterprise data!