Certainly, it is easy to see how it would be preferable to manage a database that is 5 TB rather than 40 TB in size, particularly when it comes to critical tasks like backup and recovery, disaster recovery, off-site backups and historical analytics. Today, however, I want to focus on another benefit of Informatica Data Vault that is less obvious but still very important: data modeling flexibility for data warehouses and data marts.Informatica Data Vault permits organizations to keep a much greater amount of useful data accessible, without requiring compromises on SLAs, TCO and reporting performance. This in turn makes a variety of flexible data modeling options available.
The Physical Table Partitioning Model
The first of these new data modeling options is based on physical table partitioning. The largest tables in a data warehouse or data mart can be physically divided between an online component and the archive counterpart. This allows the existing data model to be maintained, while introducing a “right-sizing” concept where only the regularly accessed data is kept online, and all data that doesn’t require such an expensive and/or hard to manage environment is put into the Informatica Data Vault solution. A typical rule of thumb for defining partition boundaries for data warehouses is based on the 90-day aging principle, so that any static data older than 90 days is migrated from the online warehouse to the Informatica Data Vault repository.
Now, many forms of enterprise data, such as CDR, POS, Web, Proxy or Log data, are static by definition, and are furthermore usually the main sources of data warehouse growth. This is very good news, because it means that as soon as the data is captured, it can be moved to the Informatica Data Vault store (in fact, it is conceivable that this kind of data could be fed directly to Informatica Data Vault from the source system – but that is a topic for another post). Because of the large volumes involved, this kind of detail data has usually been aggregated at one or more levels in the enterprise data warehouse. Users generally query the summary table in order to identify trends, only drilling down into the details for a specific range of records when specific needs or opportunities are identified. This data access technique is well known, and has been in use for quite some time.
The Online Summary Table Model
This leads me to the second novel design option offered by Informatica Data Vault : the ability to store all static detail data in the archive store, and then use this as the basis for building online summary tables, with the ability to quickly drill to detail in the Informatica Data Vault when required. More specifically, the Informatica Data Vault can be used to feed the online system’s summary tables directly because the data structures and SQL access remain intact. The advantage of this implementation is that it substantially reduces the size of the online database, optimizes its performance, and permits trend analysis on even very long periods. This is particularly useful when looking for emerging trends (positive or negative) related to specific products or offerings, because it gives managers the chance to analyze and respond to issues and opportunities within a realistic time frame.
Some organizations are already building this type of data hierarchy, using Data Marts or analytic cubes fed by the main Data Warehouse. I call this kind of architecture “data pipelining”. Informatica Data Vault can play an important role in such an implementation, since its repository can be shared between all the analytic platforms. This not only reduces data duplication, management/operational overhead, and requirements for additional hardware and software, it also relieves pressure on batch windows and lowers the risk of data being out of synch. Furthermore, this implementation can assist organizations with data governance and Master Data Management while improving overall data quality.
The Just-In-Case Data Model
Another important data modeling option offered by Informatica Data Vault relates to what we can call “just-in-case” data. In many cases, certain kinds of data will also be maintained outside the warehouse just in case an analyst requires ad hoc access to it for a specific study. Sometimes, for convenience, this “exceptional” data is stored in the data warehouse. However, keeping this data in an expensive storage and software environment, or even storing it on tape or inexpensive disks as independent files, can create a data management nightmare. At the same time, studies demonstrate that a very large portion of the costs associated with ad hoc analysis are concentrated in the data preparation phase. As part of this phase, the analyst needs to “shop” for the just-in-case data to be analyzed, meaning that he or she needs to find, “slice”, clean, transform and use it to build a temporary analytic platform, sometimes known as an Exploration Warehouse or “Exploration Mart.
Informatica Data Vault can play a very important role in such a scenario. Just-in-case data can be stored in the archive store, and analysts can then query it directly using standard SQL-based front-end tools to extract, slice and prepare the data for analytic use. Since much less time is spent on data preparation, far more time is available for data analysis — and there is no impact on the performance of the main reporting system. This acceleration of the data preparation phase results from the availability of a central catalog describing all the available data. The archive repository can be used to directly feed the expert’s preferred analytic platform, generally resulting in a substantial improvement in analyst productivity. Analysts can focus on executing their analyses, and on bringing more value to the enterprise, rather than on struggling to get access to clean and reliable data.
In previous posts, we introduced the concept of the Informatica ILM Nearline and discussed how Informatica ILM Nearline could help your business. To recapitulate: the major advantage of Informatica ILM Nearline is its superior data access performance, which enables a more aggressive approach to migrating huge volumes of data out of the online repository to an accessible, highly compressed archive (on inexpensive 2nd and 3rd tier storage infrastructure).
Today, I will be considering the question of when an enterprise should consider implementing Informatica ILM Nearline. Broadly speaking, such implementations fall into two categories: they either offer a “cure” for an existing data management problem or represent a proactive implementation of data best practices within the organization.
Cure or Prevention?
The “cure” type of implementation is typically associated with a data warehouse or business application “rescue” project. This is undertaken when the production system grows to a point where database size causes major performance problems and affects the ability to meet Service Level Agreements (SLAs) and manage business processes in a timely manner. In these kinds of situation, it is mainly the operations division of the organization that is affected, and who demand an immediate fix that can take the form of an Informatica ILM Nearline implementation. The question here is: How quickly can the “cure” implementation stabilize performance and ensure satisfaction of SLAs?
On the other hand, the best practice approach, much like current practices related to healthy living, focuses on prevention rather than on curing. In this respect, best practices dictate that the Informatica ILM Nearline implementation should start as soon as some of the data in the production system becomes “infrequently accessed”, or “cold”. In data warehouses and data marts where the current month or two is being analyzed most often, this means data older than 90 days. For transactional systems the archiving cutoff may be a year or two, depending on typical length of your business processes. The main idea is to keep the size production databases from inflating for no good business reason and ‘nearlining’ the data as soon as possible without interrupting business operations or hurting the value of your data. Ultimately this should work to protect the enterprise from an operational crisis arising from deteriorating performance and unmet SLAs.
In order to better judge the impact of using either of these two approaches, it is important to understand the various steps involved in the “Nearlining” process. What do we find when we “dissect” the process of leveraging the Informatica ILM Nearline?
Dissecting the “Informatica ILM Nearline” Process
Informatica Informatica ILM Nearline involves multiple processes, whose performance characteristics can significantly influence the speed at which data is migrated out of the online database. The various processes are managed by the overall integrated nearline solution of Informatica coupled with a SAP Business Warehouse system:
- The first step is to lock the data that is targeted by the archiving process, in order to ensure that the data is not modified while the process is going on. SAP Business Warehouse does it automatically and you execute Data Archive Processes (DAP) for the cold data.
- Next comes the extraction of the data to be migrated. This is usually achieved via an SQL statement based on business rules for data migration. Often, the extraction can be performed using multiple extraction/consumer processes working in parallel.
- The next step is to secure the newly extracted data, so that it is recoverable.
- Then, the integrity of the extracted data must be validated (normally by comparing it to its online counterpart).
- Next, delete the online data that has been moved to nearline.
- Then, reorganize the tablespace of the deleted data.
- Finally, rebuild/reorganize the index associated with the online table from which data has been nearlined.
The Database Housekeeping process is often the slowest part of a Data Nearlining process, and thus can dictate the pace and scheduling of the implementation. In a production environment, the database housekeeping process is frequently decoupled from ongoing operations and performed over a weekend. It may be surprising to learn that deleting data can be a more expensive process than inserting it, but just ask an enterprise DBA about what is involved in deleting 1 TB from an Enterprise Data Warehouse and see what answer you get: for many, the task of fitting such a process into standard Batch Windows would be a nightmare.
So, it is easy to see that starting earlier in implementing Informatica ILM Nearline as a best practice can help to massively reduce not only the cost of the implementation, but also the time required to perform it. Therefore, the main recommendation to take away from this discussion is: Don’t wait too long to consider embarking on your Informatica ILM Nearline strategy!
That’s it for today. In my next post, I will take up the topic of which data should be initially considered as a candidate for migration.
In my last post, I discussed how our Informatica ILM Nearline allows vast amounts of detail data to be accessed at speeds that rival the performance of online systems, which in turn gives business analysts and application managers the power to assess and fine-tune important business initiatives on the basis of actual historical facts. We saw that the promise of Informatica ILM Nearline is basically to give you all the data you want, when and how you want it — without compromising the performance of existing data warehouse and business reporting systems.
Today, I want to consider what this capability means specifically for a business. What are the concrete benefits of implementing Informatica ILM Nearline? Here are a few of the most important ones.
Informatica ILM Nearline enables you to keep all your valuable data available for analysis.
Having more data accessible – more details, covering longer periods – enables a number of improvements in Business Intelligence processes:
- A clearer understanding of emerging trends in the business – what will go well in the future as well as what is now “going south”
- Better support for iterative analyses, enabling more intensive Business Performance Management (BPM)
- Better insight into customer behavior over the long term
- More precise target marketing, bringing a three- to five-fold improvement in campaign yield
Informatica ILM Nearline enables you to dramatically increase information storage and maintain service levels without increasing costs or administration requirements.
- Extremely high compression rates give the ability to store considerably more information in a given hardware configuration
- A substantially reduced data footprint means much faster data processing, enabling effective satisfaction of Service Level Agreements without extensive investments in processing power
- Minimal administration requirements bring reductions in resource costs, and ensure that valuable IT and business resources will not be diverted from important tasks just to manage and maintain the Informatica ILM Nearline implementation
- High data compression also substantially reduces the cost of maintaining a data center by reducing requirements for floor space, air conditioning and so on.
Informatica ILM Nearline simplifies and accelerates Disaster Recovery scenarios.
A reduced data footprint means more data can be moved across existing networks, making Informatica ILM Nearline an ideal infrastructure for implementing and securing an offsite backup process for massive amounts of data,
Informatica ILM Nearline keeps all detail data in an immutable form, available for delivery on request.
Having read-only detail data available on-demand enables quick response to audit requests, avoiding the possibility of costly penalties for non-compliance. Optional security packages can be used to control user access and data privacy.
Informatica ILM Nearline makes it easy to offload data from the online database before making final decisions about what is to be moved to an archiving solution.
The traditional archiving process typically involves extensive analysis of data usage patterns in order to determine what should be moved to relatively inaccessible archival storage. With an Informatica ILM Nearline solution, it’s a simple matter to move large amounts of data out of the online database — thereby improving performance and guaranteeing satisfaction of SLA’s, — while still keeping the data available for access when required. Data that is determined to be no longer used, but which still needs to be kept around to comply with data retention policies or regulations, can then be easily moved into an archiving solution.
Taken together, these benefits make a strong case for implementing an Informatica ILM Nearline solution when the data tsunami threatens to overwhelm the enterprise data warehouse. In future posts, I will be investigating each of these in more detail.
In today’s post, I want to write about the “Informatica ILM Nearline 6.1A″. Although this Nearline concept is not new, it is still not very known and represents the logical evolution of business applications, data warehouses and information lifecycle approaches that have struggled to maintain acceptable performance levels in the face of the increasingly intense “data tsunami” that looms over today’s business world. Whereas older archiving solutions based their viability on the declining prices of hardware and storage, ILM Nearline 6.1A embraces the dynamism of a software and services approach to fully leverage the potential of large enterprise data architectures.
Looking back, we can now see that the older data management solutions presented a paradox: in order to mitigate performance issues and meet Service Level Agreements (SLA) with users, they actually prevented or limited ad-hoc access to data. On the basis of system monitoring and usage statistics, this inaccessible data was then declared to be unused, and this was cited as an excuse for locking it away entirely. In effect, users were told: “Since you can’t get at it, you can’t use it, and therefore we’re not going to give it to you”!
ILM Nearline 6.1A, by contrast, allows historical data to be accessed with near-online speeds, empowering business analysts to measure and perfect key business initiatives through analysis of actual historical details. In other words, ILM Nearline 6.1A gives you all the data you want, when and how you want it (without impacting the performance of existing warehouse reporting systems!).
Aside from the obvious economic and environmental benefits of this software-centric approach and the associated best practices, the value of ILM Nearline 6.1A can be assessed in terms of the core proposition cited by Tim O’Reilly when he coined the term “Web 2.0″:
“The value of the software is proportional to the scale and dynamism of the data it helps to manage.”
In this regard, ILM Nearline 6.1A provides a number of important advantages over prior methodologies:
Keeps data accessible: ILM Nearline 6.1A enables optimal performance from the online database while keeping all data easily accessible. This massively reduces the work required to identify, access and restore archived data, while minimizing the performance hit involved in doing so in a production environment.
Keeps the online database “lean”: Because data archived to the ILM Nearline 6.1A can still be easily accessed by users at near-online speeds, it allows for much more recent data to be moved out of the online system than would be possible with archiving. This results in far better online system performance and greater flexibility to further support user requirements without performance trade-offs. It is also a big win for customers moving their systems to HANA.
Relieves data management stress: Data can be moved to ILM Nearline 6.1A without the substantial ongoing analysis of user access patterns that is usually required by archiving products. The process is typically based on a rule as simple as “move all data older than x months from the ten largest InfoProviders”.
Mitigates administrative risk: Unlike archived data, ILM Nearline 6.1A data requires little or no additional ongoing administration, and no additional administrative intervention is required to access it.
Lets analysts be analysts: With ILM Nearline 6.1A, far less time is taken up in gaining access to key data and “cleansing it”, so much more time can be spent performing “what if” scenarios before recommending a course of action for the company. This improves not only the productivity but also the quality of work of key business analysts and statistical gurus.
Copes with data structure changes: ILM Nearline 6.1A can easily deal with data model changes, making it possible to query data structured according to an older model alongside current data. With archive data, this would require considerable administrative work.
Leverages existing storage environments: Compared to older archiving products/strategies, the high degree of compression offered by ILM Nearline 6.1A greatly increases the amount of information that can be stored as well as the speed at which it can be accessed.
Keeps data private and secure: ILM Nearline 6.1A has privacy and security features that protect key information from being seen by ad-hoc business analysts (for example: names, social security numbers, credit card information).
In short, ILM Nearline 6.1A offers a significant advantage over other nearline and archiving technologies. When data needs be removed from the online database in order to improve performance, but still needs to be readily accessible by users to conduct long-term analysis, historical reporting, or to rebuild aggregates/KPIs/InfoCubes for period-over-period analysis, ILM Nearline 6.1A is currently the only workable solution available.
In my next post, I’ll discuss more specifically how implementing the ILM Nearline 6.1A solution can benefit your business apps, data warehouses and your business processes.
Under the hood: decommissioning an SAP system with Informatica Data Archive for Application Retirement
If you reached this blog, you are already familiar with the reasons why you need to do a house cleaning on your old applications. If not, this subject has been explored in other discussions, like this one from Claudia Chandra.
All the explanations below are based on Informatica Data Archive for application retirement.
Very often, customers are surprised to know that Informatica’s solution for application retirement can also decommission SAP system. The market has the feeling that SAP is different, or “another beast”. And it really is!
A typical SAP requires software licenses, maintenance contracts, and hardware for the transactional application itself, the corresponding data warehouse and databases, operating systems, server, storage, and any additional software and hardware licenses that you may have on top of the application. Your company may want to retire older versions of the application or consolidate multiple instances in order to save costs. Our engineering group has some very experienced SAP resources, including myself here, with more than 16 years of hands-on work with SAP technology. And we were able to simplify the SAP retirement process in a way that makes the Informatica Data Archive solution decommission SAP as any other type of application.
Next are the steps to decommission an SAP system using Informatica Data Archive.
Let’s start with some facts: SAP has some “special” tables which can only be read by the SAP kernel itself. In a typical SAP ECC 6.0, around 9% of these tables fall in these categories, representing around 6,000 tables.
More specifically, these tables are known as “clusters”, “pools” and I created a third category with transparent tables which have a binary column, or RAW data type, which only SAP application can unravel.
In this step, we will get all the metadata of the SAP system being retired, including all transparent, cluster and pools tables, all columns with data types. This metadata will be kept with the data in the optimized archive.
2) Extraction from source
Informatica Data Archive 6.1.x is able to connect to all database servers certified by SAP, to retrieve rows from the transparent tables.
On the SAP system, it is required to install an ABAP agent, which has the programs developed by Informatica to read all the rows from the special tables and archive files and to pull all the attachments in its original format. These programs are delivered as an SAP transport, which is imported in the SAP system prior to the beginning of the decommissioning process.
Leveraging the Java connector publicly available through the SAP portal (SAPJCo), Informatica Data Archive connects to an SAP application server on the system being decommissioned and make calls to the programs imported though the transport. The tasks are performed using background threads and the process is monitored from the Informatica Data Archive environment, including all the logging, status and monitoring of the whole retirement process happening in the SAP system.
Extraction of table rows in database
Below you can see what all SAP table types are and how our solution deals with it:
|Table type||Table name in SAP
|Table name in the database(Physical table)||How we handle it?|
|Cluster tables||BSEG||RFBLG||The engine reads all the rows from the logical tables by connecting to the SAP application level and store in the archive store as if the table existed in the database as a physical table.The engine also reads all rows of the physical tables and stores as they are, as a policy insurance only, since the data cannot be read without an SAP system up and running|
|Transparent tables with RAW field||PCL2STXL||PCL2STXL||The engine creates a new table in the archive store and read all rows from the original table, but the RAW field is unraveled.The engine reads all rows of the physical tables and store as they are, as a policy insurance only, since the data cannot be read without an SAP system up and running
The engine also reads all rows of the original table PCL2 or STXL and stores as they are, as a policy insurance only, since the data cannot be read without an SAP system up and running
The Informatica Data Archive will extract the data of all tables, independently of their types.
Table rows in archive files
Another source of table rows is the archived data. SAP has its own archiving framework, which is based on a creation of archiving files, also known as ADK files. These files store table rows in an SAP proprietary compacted form, which can only be read by ABAP code running in a SAP system.
Once created, these files are located in the file system and can be stored in an external storage using an ArchiveLink implementation.
The Informatica Data Archive engine also reads the table rows from all ADK files, independent of their location, as long as the files are accessible by the SAP application being retired. These table rows will be stored in the archive store as well, along with the original table.
Very important: After the SAP system is retired, any implementation or ArchiveLink can be retired as well, along with the storage that was holding the ADK files.
Business transactions in SAP systems have the ability to have attachments linked to them. The SAP Generic Object Services (GOS) is a way to upload documents, add notes to a transaction, add URLs relevant to the document, all still referencing a business document, like a purchase order or a financial document. Some other SAP applications, like CRM, have its own mechanism of attaching documents, complementing GOS features.
All these methods can store the attachments in the SAP database, or at SAP Knowledge Provider (KPro) or externally in storages, leveraging an ArchiveLink implementation.
Informatica’s engine is able to download all the attachment files, notes and URLs as discrete files, independent of where they are stored, keeping the relationship to the original business document. The relationship is stored in a table created by Informatica in the archive store, which contains the key of the business document and the link to the attachments, notes and URLs that were assigned to it in the original SAP system.
All these files are stored in the archive store, along with the structured data – or tables.
4) Load into optimized archive
All data and attachments are then loaded into Informatica’s optimized archive,. The archival store will compress the archived data up to 98%
5) Search and data visualization
All structured data are accessible though JDBC/ODBC, as any other relational database. The user has the option to use the search capability that comes with the product, which allows users to run simple queries and view data as business entities.
Another option is to use the integrated reporting, capability within the product, which allows users to create pixel-perfect reports, using drag and drop technology, querying the data using SQL and displaying the data as business entities, which are defined in prebuilt SAP application accelerators. .
Informatica also has a collection of reports for SAP to display data for customers, vendors, general ledger accounts, assets and financial documents.
Some customers prefer to use their own corporate standard 3rd party reporting tool. That is also possible as long as the tool can connect to JDBC/ODBC sources, which is a market standard for connecting to databases.
Hopefully this blog helped you to understand what Informatica Data Archive for Application Retirement does to decommission an SAP system. If you need any further information, please comment below. Thank you.
Adopting SAP HANA can offer significant new business value, but it can also be an expensive proposition. If you are contemplating or in the process of moving to HANA, it’s worth your time to understand your options for Nearlining your SAP data. The latest version of Informatica ILM Nearline, released in February, has been certified by SAP and can run with SAP BW systems running on HANA or any relational database supported by SAP.
Nearlining your company’s production SAP BW before migrating to a HANA-based BW can provide huge saving potentials. Even if your HANA project has already started, Nearlining the production data will help keep the database growth flat. We have customers that have actually been able to shrink InfoProviders by enforcing strict rules on data retention on the data stored in the live database.
Informatica World is around the corner, and I will be there with my peers to demo and talk about the latest version of Informatica ILM Nearline. Click here to learn more about Informatica World 2013 and make sure you sign up for one my Hands On Lab sessions on this topic. See you at the Aria in Las Vegas in June.
In my previous blog I briefly mentioned the term “data temperature.” But what exactly does this term mean? Picture yourself logging to your bank website to look for a transaction in your checking account. Very frequently you want to look for pending transactions and debits and credits that happened in the last 10 days. Frequently you need to look further, maybe one month statement, to search for a check that you don’t remember was for what. Maybe once in a quarter, you need to get information about a debit that happened three months ago, about a subscription of a new magazine that is not coming to your mailbox. And of course, once a year you check yearly statements for your tax return. Give or take a few other scenarios, I am pretty sure I covered most of your use cases, right? (more…)
The buzzword of the 2012 SAP TechEd was HANA. All sessions were fully booked with long lines in the corridors with IT managers, developers, DBAs and BASIS administrators anticipating the miracles of memory computing performance.
I attended sessions about archiving, near-line and preparedness to uptake HANA. On the other hand, those sessions were almost empty, some having less than 10 people in the room. It makes me wonder: do people realize that HANA may be the future, but to be able to deploy it economically and practically, you need to manage the data volume in your current production environment? (more…)