ELT stands for Extract, Load, and Transform: a technique for moving and transforming data from one location and format to another in which the target DBMS becomes the transformation engine. This is in contrast to traditional ETL, or Extract, Transform, and Load, the traditional technique for moving and transforming data, in which an ETL engine separate from both the source and target DBMS performs the data transformations. Many of the discussions I've read on ETL vs. ELT focus on questions like how one works vs. the other and when one should use a particular approach. I'd like to talk about a few of those questions, but first I'd like to tackle a slightly different one: why should you care about ELT?

When thinking about data integration techniques, it's important to understand what you are optimizing for. Depending on the objectives, one could optimize for timeliness (average time to information availability), cost of integration, availability of information (uptime of the integration chain), data cleanliness and integrity, process auditability, or other factors. Understanding the primary optimization objective is the first step in determining an overall integration approach.
The reality is that with a scalable data integration platform like Informatica and a scalable DBMS platform, one can achieve essentially any level of performance using either integration technique. What we often lose sight of, however, is cost and efficiency. Some of the writing I've read touches on this, including the excellent blog post from Dan Lindstedt, "ETL, ELT – Challenges and Metadata".
Dan introduces what I would consider arbitrary volume metrics to determine when to use one technique vs. the other, which more or less end up looking like: high volume = ELT, lower volume = ETL.
Informatica supports both techniques. So why do we care about ELT? Because in certain integration scenarios it provides optimal efficiency and reduced cost, regardless of the data volumes involved. Put another way: ELT can be the most efficient, and therefore the proper, way to process data regardless of volume. It can also be the least efficient, regardless of volume. Let's explore this further.
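To make the two styles concrete before digging into costs, here is a minimal sketch using Python's standard-library sqlite3 as a stand-in for both the integration engine and the target DBMS. The table names, column names, and sample rows are purely illustrative; a real implementation would of course use a platform like Informatica and a production database.

```python
import sqlite3

# In-memory SQLite stands in for the target DBMS (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (name TEXT, amount TEXT)")
conn.execute("CREATE TABLE target (name TEXT, amount REAL)")

source_rows = [("  alice ", "10.50"), ("BOB", "3.25")]

# ETL style: transform in the integration engine (here, plain Python),
# then load the finished rows into the target.
etl_rows = [(n.strip().title(), float(a)) for n, a in source_rows]
conn.executemany("INSERT INTO target VALUES (?, ?)", etl_rows)

# ELT style: extract and load the raw data first, then let the target
# DBMS perform the transformation with its own SQL engine.
conn.executemany("INSERT INTO staging VALUES (?, ?)", source_rows)
conn.execute("""
    INSERT INTO target
    SELECT upper(substr(trim(name), 1, 1)) || lower(substr(trim(name), 2)),
           CAST(amount AS REAL)
    FROM staging
""")

rows = conn.execute(
    "SELECT name, amount FROM target ORDER BY name, amount").fetchall()
```

Both paths produce identical target rows; the question this post is concerned with is which path is cheaper and more efficient for a given workload.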
Everything in IT carries at least four categories of cost: the upfront licensing cost, the cost of operation (electricity, cooling, etc.), the implementation cost, and the maintenance cost, which includes both manpower and vendor software support. If we are optimizing integration workloads, we must consider most of these, and we must consider them in the context of both the integration platform (hardware and software) and the DBMS platform (hardware and software).
Integration jobs, even on the same project, can have tremendously different signatures. Some information comes through relatively unscathed (un-transformed), moving from source to target more or less as-is. For this kind of job, the assumption is that the CPU cost of format-oriented transformations (un-padding, splitting, standardizing text case, etc.) is negligible compared with the networking and process-switching costs of moving between data sources, or the I/O cost of reading from the source and writing to the target. Do the processing in the ETL engine, or do it ELT style in the source or target DBMS; it doesn't matter much, and it should all perform about the same.
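A sketch of this "pass-through" case, with a hypothetical row cleaner for a pipe-delimited extract (the layout and field names are invented for illustration). The point is that this is purely format-oriented, row-at-a-time work; its CPU cost disappears into the cost of reading and writing the rows, wherever it runs.

```python
# Hypothetical cleaner for a pipe-delimited extract: purely
# format-oriented work (trimming padding, splitting, case folding)
# applied one row at a time as data streams from source to target.
def clean_row(raw: str) -> tuple[str, str, str]:
    last, first, state = raw.split("|")
    return (last.strip().title(), first.strip().title(), state.strip().upper())

rows = [clean_row(r) for r in ["SMITH   |john    |NY", "doe|JANE  |ca "]]
```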
Selection and projection operations, hierarchical operations, and set-oriented transformations, such as heterogeneous joins or aggregates over either entire data sets or small sets within streams of real-time data, on the other hand, have high compute costs relative to the other costs (and in some cases introduce additional I/O).
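A set-oriented example of the same workload done both ways, again with sqlite3 standing in for the target DBMS and invented data. Unlike the pass-through case, an aggregate must materialize and process the whole set somewhere, so where it runs genuinely matters for cost and efficiency.

```python
import sqlite3
from collections import defaultdict

# Illustrative set-oriented workload: total sales by region.
sales = [("east", 10.0), ("west", 5.0), ("east", 2.5)]

# ETL style: the integration engine holds and aggregates the entire set.
engine_totals = defaultdict(float)
for region, amount in sales:
    engine_totals[region] += amount

# ELT style: load the raw set, then push the aggregate down to the DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
db_totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
```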
No single approach to integration (ELT or ETL) will give an optimal solution for all workloads from the perspective of cost, efficiency, or performance.
Some questions to consider:
1. What is the current workload on my source and target dbms platforms?
2. What is the cost for the hardware and software to add additional computing resources to my dbms environment?
3. What is the cost for the hardware and software to add additional computing resources to my integration environment?
4. What is the relative efficiency of performing a particular operation in my source system, target system or integration system?
The last question is, of course, the toughest, because while the first three have absolute answers, the fourth depends on the operation. It also varies, from a real-world throughput and performance perspective, with the rest of the workload in all three environments.
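One way to get a feel for question 4 is to measure the same operation in both places. This is a hypothetical micro-benchmark sketch, with plain Python standing in for the integration engine and sqlite3 for the DBMS; real answers depend on the actual platforms, data shapes, and, as noted above, the concurrent workload in each environment.

```python
import sqlite3
import time

# Synthetic data: 100,000 rows, 100 distinct keys (illustrative only).
rows = [(i % 100, float(i)) for i in range(100_000)]

# "Integration engine" side: aggregate in plain Python.
t0 = time.perf_counter()
totals = {}
for k, v in rows:
    totals[k] = totals.get(k, 0.0) + v
engine_secs = time.perf_counter() - t0

# "DBMS" side: load, then run the same aggregate in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER, v REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
t0 = time.perf_counter()
db_totals = dict(conn.execute("SELECT k, SUM(v) FROM t GROUP BY k"))
db_secs = time.perf_counter() - t0
```

Note that the timings answer only the narrow "which is faster in isolation" question; they say nothing about the licensing, hardware, and opportunity costs raised in questions 1 through 3.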
In the next blog entry, I’ll further explore specific integration problems and break down how each should be executed in order to optimize cost and efficiency.