Certainly, it is easy to see how it would be preferable to manage a database that is 5 TB rather than 40 TB in size, particularly when it comes to critical tasks like backup and recovery, disaster recovery, off-site backups and historical analytics. Today, however, I want to focus on another benefit of Informatica Data Vault that is less obvious but still very important: data modeling flexibility for data warehouses and data marts.Informatica Data Vault permits organizations to keep a much greater amount of useful data accessible, without requiring compromises on SLAs, TCO and reporting performance. This in turn makes a variety of flexible data modeling options available.
The Physical Table Partitioning Model
The first of these new data modeling options is based on physical table partitioning. The largest tables in a data warehouse or data mart can be physically divided between an online component and the archive counterpart. This allows the existing data model to be maintained, while introducing a “right-sizing” concept where only the regularly accessed data is kept online, and all data that doesn’t require such an expensive and/or hard to manage environment is put into the Informatica Data Vault solution. A typical rule of thumb for defining partition boundaries for data warehouses is based on the 90-day aging principle, so that any static data older than 90 days is migrated from the online warehouse to the Informatica Data Vault repository.
Now, many forms of enterprise data, such as CDR, POS, Web, Proxy or Log data, are static by definition, and are furthermore usually the main sources of data warehouse growth. This is very good news, because it means that as soon as the data is captured, it can be moved to the Informatica Data Vault store (in fact, it is conceivable that this kind of data could be fed directly to Informatica Data Vault from the source system – but that is a topic for another post). Because of the large volumes involved, this kind of detail data has usually been aggregated at one or more levels in the enterprise data warehouse. Users generally query the summary table in order to identify trends, only drilling down into the details for a specific range of records when specific needs or opportunities are identified. This data access technique is well known, and has been in use for quite some time.
The Online Summary Table Model
This leads me to the second novel design option offered by Informatica Data Vault : the ability to store all static detail data in the archive store, and then use this as the basis for building online summary tables, with the ability to quickly drill to detail in the Informatica Data Vault when required. More specifically, the Informatica Data Vault can be used to feed the online system’s summary tables directly because the data structures and SQL access remain intact. The advantage of this implementation is that it substantially reduces the size of the online database, optimizes its performance, and permits trend analysis on even very long periods. This is particularly useful when looking for emerging trends (positive or negative) related to specific products or offerings, because it gives managers the chance to analyze and respond to issues and opportunities within a realistic time frame.
Some organizations are already building this type of data hierarchy, using Data Marts or analytic cubes fed by the main Data Warehouse. I call this kind of architecture “data pipelining”. Informatica Data Vault can play an important role in such an implementation, since its repository can be shared between all the analytic platforms. This not only reduces data duplication, management/operational overhead, and requirements for additional hardware and software, it also relieves pressure on batch windows and lowers the risk of data being out of synch. Furthermore, this implementation can assist organizations with data governance and Master Data Management while improving overall data quality.
The Just-In-Case Data Model
Another important data modeling option offered by Informatica Data Vault relates to what we can call “just-in-case” data. In many cases, certain kinds of data will also be maintained outside the warehouse just in case an analyst requires ad hoc access to it for a specific study. Sometimes, for convenience, this “exceptional” data is stored in the data warehouse. However, keeping this data in an expensive storage and software environment, or even storing it on tape or inexpensive disks as independent files, can create a data management nightmare. At the same time, studies demonstrate that a very large portion of the costs associated with ad hoc analysis are concentrated in the data preparation phase. As part of this phase, the analyst needs to “shop” for the just-in-case data to be analyzed, meaning that he or she needs to find, “slice”, clean, transform and use it to build a temporary analytic platform, sometimes known as an Exploration Warehouse or “Exploration Mart.
Informatica Data Vault can play a very important role in such a scenario. Just-in-case data can be stored in the archive store, and analysts can then query it directly using standard SQL-based front-end tools to extract, slice and prepare the data for analytic use. Since much less time is spent on data preparation, far more time is available for data analysis — and there is no impact on the performance of the main reporting system. This acceleration of the data preparation phase results from the availability of a central catalog describing all the available data. The archive repository can be used to directly feed the expert’s preferred analytic platform, generally resulting in a substantial improvement in analyst productivity. Analysts can focus on executing their analyses, and on bringing more value to the enterprise, rather than on struggling to get access to clean and reliable data.