Engineering High Performance Data Virtualization with Informatica Data Services

Working on IDS

Informatica Data Services (IDS) is a data virtualization software stack built on a new platform (code-named ‘Mercury’) assembled from shared internals of the venerable PowerCenter. A significant motivation for the shift to the new platform was to make it easy to quickly write new ‘plug-ins’ for the integration platform and to improve time to market for new business initiatives, with scalability and extensibility; IDS was one such initiative.

The IDS suite is a set of “end-points” within the Data Integration Service (DIS) of Mercury. The DIS, which does resource and engine management, works with the Model Repository Service (MRS) for persistence on the Informatica Services Platform. Old hands at PowerCenter will recognize that the DIS and MRS are counterparts of PowerCenter’s LM (Load Manager) and the C++ Repository Service (CRS).

IDS allows users to access Informatica data objects, data sources, and mapping operations via standard interfaces such as SQL, JDBC/ODBC, and Web Services. I had the opportunity to design and engineer various parts of the new platform in the DIS and the SQL Data Service, colloquially termed the SQL End-point (SQLEP). Exposing a traditional Informatica mapping as a SQL-consumable table brings a host of benefits: a homogeneous view of disparate sources and targets, rapid sharing and prototyping with complex data-quality and cleansing transformations, lower development and maintenance costs, and seamless analytical integration with the many tools supporting JDBC/ODBC, to name a few.

Request Isolation

Over the course of developing SQLEP, I have dealt with a variety of technical challenges, some more interesting than others. SQLEP’s initial intended use case was rich transformation support for data aggregation, to be consumed quickly in a variety of BI tools. An enterprise may hold data across any number of data stores and appliances, and SQLEP provides a homogeneous, holistic view by leveraging the well-known and widely adopted Informatica Mapping Language (IML). SQLEP is responsible for the maintenance and book-keeping of client requests, along with translating and validating each SQL query into an equivalent Informatica Mapping.

Our engine, the DataTransformation Machine (DTM), is written in C++ and operates only on the IML, while the DIS, which manages the DTM as a resource within its own process, is written in Java. Initially, the communication and the serialization/deserialization (SerDe) of objects between the engine and the DIS used JNI. This worked quite well, especially in terms of throughput and performance, but one problem with housing the DTM within the DIS process was that a catastrophic failure during the execution of any mapping would terminate the DIS process along with all the other DTM runs. A secondary problem was unaddressed memory leaks, which would eventually render the DIS practically unusable.

This led to the development of a new execution paradigm within Mercury, termed Out-of-Process (OOP), in which we housed each DTM in its own process “outside of” the DIS. This was great for isolation and for memory leaks (the new process exited after the mapping executed), but we were now spawning a new process for each run! Predictably, it suffered tremendously from high latency compared to the JNI-based “in-process” (IP) model. I was tasked with rewriting OOP to a) achieve better latency, b) preserve concurrency and affinity, and c) provide some amount of isolation from the DIS process. Essentially, deliver everything IP offered without the drawbacks. We termed it OOP++!
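The naive OOP model can be pictured as a one-shot launcher: spawn a fresh OS process per mapping run and wait for it to exit. The sketch below is illustrative only; the class name is my own invention and `echo` stands in for the real DTM executable. It shows why the isolation came at the cost of per-run spawn latency.

```java
// Illustrative sketch of the naive OOP model: one fresh OS process per
// mapping run, awaited to completion. The child process exits when the run
// ends, taking any leaked resources with it -- at the cost of paying the
// process spawn latency on every single run.
class OneShotLauncher {
    // Launch a process for a single mapping run and block until it finishes.
    // Returns the child's exit code (0 on success), or -1 on launch failure.
    static int runMapping(String... command) {
        try {
            Process p = new ProcessBuilder(command)
                    .inheritIO()   // surface the child's output for the demo
                    .start();      // <-- per-run spawn cost lives here
            return p.waitFor();
        } catch (java.io.IOException | InterruptedException e) {
            return -1;
        }
    }
}
```

In this model a burst of N concurrent requests means N process spawns, which is exactly the latency problem OOP++ set out to remove.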

Request Isolation while Maintaining Throughput

We addressed the problems with a multi-pronged approach. To absorb any initial bursts of requests, a configurable number of processes now started up along with the DIS, avoiding the latency of on-demand spawning. We also introduced ‘process-sunset’ (a configurable amount of time after which a process was retired) and ‘process-affinity’ (grouping mappings that belong to the same application). Process-sunset ensured that any unaddressed resource leaks in a process were eventually cleaned out, while process-affinity allowed multiple concurrent executions within one process, provided they shared the same application affinity. This improved latency by reducing, and delaying, subsequent spawns of new processes, and still prevented a catastrophic failure in one mapping from affecting other clients.
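The pool mechanics above amount to simple bookkeeping: reuse a live worker with a matching application affinity, and retire any worker that has outlived its sunset. The sketch below is hypothetical (the names `ProcessPool` and `WorkerProcess` are mine, and the “processes” are simulated objects), not the actual Mercury implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Simulated worker process with an application affinity and a birth time.
class WorkerProcess {
    final String affinity;        // application this process serves
    final long startedAtMillis;   // used to enforce the sunset
    WorkerProcess(String affinity, long now) {
        this.affinity = affinity;
        this.startedAtMillis = now;
    }
}

// Pool keyed by application affinity, with a configurable 'sunset' after
// which a worker is retired and replaced.
class ProcessPool {
    private final long sunsetMillis;
    private final Map<String, WorkerProcess> byAffinity = new HashMap<>();

    ProcessPool(long sunsetMillis) { this.sunsetMillis = sunsetMillis; }

    // Reuse a live worker with matching affinity (no spawn latency);
    // otherwise spawn (here: construct) a replacement.
    synchronized WorkerProcess acquire(String affinity, long now) {
        WorkerProcess p = byAffinity.get(affinity);
        if (p != null && now - p.startedAtMillis < sunsetMillis) {
            return p;   // affinity hit within sunset: reuse
        }
        p = new WorkerProcess(affinity, now);   // sunset expired or no worker
        byAffinity.put(affinity, p);
        return p;
    }
}
```

The key design point is that the two knobs pull in opposite directions: a longer sunset means fewer spawns (better latency) but slower cleanup of leaks, while affinity bounds the blast radius of a crash to requests of a single application.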

Everything seemed fixed, and things were running great, right? NOT! The throughput between the BI client and the SQLEP service in OOP was half that of the in-process model! But because our intended use case for SQLEP was data aggregation, raw throughput was not a priority. Fast-forward a bit: Informatica’s Information Lifecycle Management (ILM) wanted to leverage SQLEP for replication, but replication sits at the opposite end of the spectrum from aggregation; it needs the entire data set moved. This handed me a new task: improving high-volume throughput.

At first glance, several areas for improvement stood out; for one, we were transmitting everything back in verbose XML. We quickly switched to a faster encoding, Fast Infoset, which, while better, was still verbose. We then went all out with a binary SerDe, which yielded substantially better results. It was still not enough, however: a ~30% overhead remained compared to the IP counterpart. A colleague and I, puzzled by the outcome, began analyzing a sample request very closely and discovered that the communication was no longer the bottleneck. The problem lay in the DIS, where another layer of inefficient, extraneous SerDe was being performed. Within a matter of days, we were able to alter that and run another test; the results soared! Well, OK, they didn’t soar, but they came within ~8-10% of the IP counterpart. Still, catching the offending piece of code in a completely different area was very satisfying.
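To see why a binary SerDe beats tagged text, it helps to compare the same row serialized both ways. This is a hand-rolled illustration, not Informatica’s actual wire format (Fast Infoset and the real binary protocol are far richer): markup and number-to-text conversion dominate the XML encoding, while the binary form is a handful of fixed-width bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative comparison: one row (an int id and a double amount)
// serialized as verbose XML text versus a compact binary encoding.
class RowSerDe {
    // XML-style SerDe: every value wrapped in tags, numbers spelled out
    // as text. The markup overhead is paid for every single row.
    static byte[] toXml(int id, double amount) {
        String xml = "<row><id>" + id + "</id><amount>" + amount
                + "</amount></row>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    // Binary SerDe: fixed-width fields, no markup, no number-to-text
    // conversion -- always 4 + 8 = 12 bytes for this row shape.
    static byte[] toBinary(int id, double amount) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(id);        // 4 bytes
            out.writeDouble(amount); // 8 bytes
            return buf.toByteArray();
        } catch (IOException e) {
            // Cannot happen with an in-memory stream.
            throw new UncheckedIOException(e);
        }
    }
}
```

On top of the size difference, the binary path also skips parsing on the receiving side, which is where much of the XML cost actually lands at high row volumes.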