Tag Archives: data warehouse archiving
Certainly, it is easy to see how it would be preferable to manage a database that is 5 TB rather than 40 TB in size, particularly when it comes to critical tasks like backup and recovery, disaster recovery, off-site backups and historical analytics. Today, however, I want to focus on another benefit of Informatica Data Vault that is less obvious but still very important: data modeling flexibility for data warehouses and data marts.Informatica Data Vault permits organizations to keep a much greater amount of useful data accessible, without requiring compromises on SLAs, TCO and reporting performance. This in turn makes a variety of flexible data modeling options available.
The Physical Table Partitioning Model
The first of these new data modeling options is based on physical table partitioning. The largest tables in a data warehouse or data mart can be physically divided between an online component and the archive counterpart. This allows the existing data model to be maintained, while introducing a “right-sizing” concept where only the regularly accessed data is kept online, and all data that doesn’t require such an expensive and/or hard to manage environment is put into the Informatica Data Vault solution. A typical rule of thumb for defining partition boundaries for data warehouses is based on the 90-day aging principle, so that any static data older than 90 days is migrated from the online warehouse to the Informatica Data Vault repository.
Now, many forms of enterprise data, such as CDR, POS, Web, Proxy or Log data, are static by definition, and are furthermore usually the main sources of data warehouse growth. This is very good news, because it means that as soon as the data is captured, it can be moved to the Informatica Data Vault store (in fact, it is conceivable that this kind of data could be fed directly to Informatica Data Vault from the source system – but that is a topic for another post). Because of the large volumes involved, this kind of detail data has usually been aggregated at one or more levels in the enterprise data warehouse. Users generally query the summary table in order to identify trends, only drilling down into the details for a specific range of records when specific needs or opportunities are identified. This data access technique is well known, and has been in use for quite some time.
The Online Summary Table Model
This leads me to the second novel design option offered by Informatica Data Vault : the ability to store all static detail data in the archive store, and then use this as the basis for building online summary tables, with the ability to quickly drill to detail in the Informatica Data Vault when required. More specifically, the Informatica Data Vault can be used to feed the online system’s summary tables directly because the data structures and SQL access remain intact. The advantage of this implementation is that it substantially reduces the size of the online database, optimizes its performance, and permits trend analysis on even very long periods. This is particularly useful when looking for emerging trends (positive or negative) related to specific products or offerings, because it gives managers the chance to analyze and respond to issues and opportunities within a realistic time frame.
Some organizations are already building this type of data hierarchy, using Data Marts or analytic cubes fed by the main Data Warehouse. I call this kind of architecture “data pipelining”. Informatica Data Vault can play an important role in such an implementation, since its repository can be shared between all the analytic platforms. This not only reduces data duplication, management/operational overhead, and requirements for additional hardware and software, it also relieves pressure on batch windows and lowers the risk of data being out of synch. Furthermore, this implementation can assist organizations with data governance and Master Data Management while improving overall data quality.
The Just-In-Case Data Model
Another important data modeling option offered by Informatica Data Vault relates to what we can call “just-in-case” data. In many cases, certain kinds of data will also be maintained outside the warehouse just in case an analyst requires ad hoc access to it for a specific study. Sometimes, for convenience, this “exceptional” data is stored in the data warehouse. However, keeping this data in an expensive storage and software environment, or even storing it on tape or inexpensive disks as independent files, can create a data management nightmare. At the same time, studies demonstrate that a very large portion of the costs associated with ad hoc analysis are concentrated in the data preparation phase. As part of this phase, the analyst needs to “shop” for the just-in-case data to be analyzed, meaning that he or she needs to find, “slice”, clean, transform and use it to build a temporary analytic platform, sometimes known as an Exploration Warehouse or “Exploration Mart.
Informatica Data Vault can play a very important role in such a scenario. Just-in-case data can be stored in the archive store, and analysts can then query it directly using standard SQL-based front-end tools to extract, slice and prepare the data for analytic use. Since much less time is spent on data preparation, far more time is available for data analysis — and there is no impact on the performance of the main reporting system. This acceleration of the data preparation phase results from the availability of a central catalog describing all the available data. The archive repository can be used to directly feed the expert’s preferred analytic platform, generally resulting in a substantial improvement in analyst productivity. Analysts can focus on executing their analyses, and on bringing more value to the enterprise, rather than on struggling to get access to clean and reliable data.
Data warehouses are applications– so why not manage them like one? In fact, data grows at a much faster rate in data warehouses, since they integrate date from multiple applications and cater to many different groups of users who need different types of analysis. Data warehouses also keep historical data for a long time, so data grows exponentially in these systems. The infrastructure costs in data warehouses also escalate quickly since analytical processing on large amounts of data requires big beefy boxes. Not to mention the software license and maintenance costs of such a large amount of data. Imagine how many backup media is required to backup tens to hundreds of terabytes of data warehouses on a regular basis. But do you really need to keep all that historical data in production?
One of the challenges of managing data growth in data warehouses is that it’s hard to determine which data is actually used, which data is no longer being used, or even if the data was ever used at all. Unlike transactional systems where the application logic determines when records are no longer being transacted upon, the usage of analytical data in data warehouses has no definite business rules. Age or seasonality may determine data usage in data warehouses, but business users are usually loath to let go of the availability of all that data at their fingertips. The only clear cut way to prove that some data is no longer being used in data warehouses is to monitor its usage.
In one of my earlier blogs, I wrote about why you still need database archiving, when you already partition your database. On a similar vein, many people also ask me why you still need to archive when you already have database compression to reduce your storage capacity and cost. The benefits of archiving, which you can’t achieve with just compression and/or partitioning are still the same:
- Archiving allows you to completely move data volumes out of the production system to improve response time and reduce infrastructure costs. Why keep unused data, even if compressed, on high cost server infrastructure when you don’t need to? Why add overhead to query processing when you can remove the data from being processed at all?
- Avoid server and software license upgrades. By removing inactive data from the database, you no longer require as much processing power and you can keep your existing server without having to add CPU cores and additional licenses for your database and application. This further eliminates costs.
- Reduce overall administration and maintenance costs. If you still keep unused data around in your production system, you still need to back it up, replicate it for high availability, clone it for non-production copies, recover it in the event of a disaster, upgrade it, organize and partition it, and consider it as part of your performance tuning strategy. Yes, it will take less time to backup, copy, restore, etc., since the data is compressed and is smaller, but why even include that data as part of production maintenance activities at all, if it’s infrequently used?
- Remove the multiplier effect. The cost of additional data volume in production systems is multiplied when you consider how many copies you have of that production data in mirrors, backups, clones, non-production systems, and reporting warehouses. The size multiplier is less since the data is compressed, but it’s still more wasted capacity in multiple locations. Not to mention the additional server, software license, and maintenance costs associated with the additional volumes in those multiple copies. So it’s best to just remove that data size at the source.
- Ensure compliance by enforcing retention and disposition policies. As I discussed in my previous blog on the difference between archiving and backup, archiving is the solution for long term data retention. Archiving solutions, such as Informatica Data Archive, have integration points with records management software or provide built-in retention management to enforce the retention of data for a specified period based on policies. During that period, the immutability and authenticity of the archived data is ensured, and when the retention period expires, records are automatically purged after the appropriate review and approval process. Regulated data needs to be retained long enough to comply with regulations, but keeping data for too long can also become a legal liability. So it’s important that expired records are purged in a timely manner. Just keeping data in production databases indefinitely doesn’t help you to reduce your compliance and legal risks.
Implementing enterprise application and database archiving is just plain best practices. The best way to improve performance and reduce infrastructure and maintenance costs is to reduce the data volume in your production systems. Why increase overhead when you don’t have to? Today’s archiving solutions allow you to maintain easy access to the data after archival, so there is no reason to keep data around just for the sake of accessibility. By moving inactive but regulated data to a central archival store, you can uniformly enforce retention policies. At the same time, you can reduce the time and cost of eDiscovery by making all types of data centrally and easily searchable.