Hitting the Batch Wall, Part 2: Hardware Scaling

This is the second installment of my multi-part blog series on “hitting the batch wall.” Well, it’s not so much about hitting the batch wall as about what you can do to avoid it. Today’s topic is “throwing hardware” at the problem (a.k.a. hardware scaling). I’ll discuss the common approaches to hardware scaling with Informatica software and their tradeoffs.

Before I discuss hardware scaling, a warning: faster hardware only improves the load window situation when it resolves a bottleneck. Data integration jobs are a lot like rush hour traffic: they can only run as fast as the slowest component. It doesn’t make any sense to buy a Ferrari if you will always be driving behind a garbage truck. In other words, if your ETL jobs are constrained by the source/target systems, by I/O, or even just by memory, then faster or more CPUs will rarely improve the situation. Understand your bottlenecks before you start throwing hardware at them!
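To make the garbage truck point concrete, here is a minimal sketch in Python. The stage names and the rows-per-second figures are invented for illustration, not measurements from any real system. The model simply assumes a job runs at the rate of its slowest stage.

```python
def pipeline_throughput(stage_rates):
    """Return (bottleneck_stage, rate) for a pipeline.

    Assumes the job as a whole runs at the rate of its slowest stage.
    """
    bottleneck = min(stage_rates, key=stage_rates.get)
    return bottleneck, stage_rates[bottleneck]


# Hypothetical rates in rows/second (illustrative numbers only).
rates = {
    "source_read": 20_000,
    "transform_cpu": 50_000,
    "target_write": 15_000,
}

before = pipeline_throughput(rates)

# "Buy the Ferrari": double the CPU-bound transform stage.
faster_cpu = dict(rates, transform_cpu=100_000)
after = pipeline_throughput(faster_cpu)

print("before upgrade:", before)
print("after upgrade: ", after)
```

In this toy model the target write stage limits the job both before and after the upgrade, so doubling the transform rate changes nothing.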

Assuming you’ve determined that you are CPU constrained – perhaps you observe high sustained CPU utilization rates during your data integration processing – the most common approach to resolving this is to get more CPU cores or a bigger/faster box. This is generally known as vertical scaling. With the price/performance of computers continuing to improve, one can often replace a three- to four-year-old server with a similarly priced (if not cheaper) computer that is actually much faster. This is often the easiest route. Buy a box with faster CPUs and your integration jobs will usually run faster without any additional tuning. The downside is that this approach eventually runs out of gas. If you need to double or triple your throughput, you may not find an economical single-computer solution. Also, if you find yourself replacing your computers more often than every three to four years, this may lead to an uncomfortable meeting with the finance folks.
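What counts as “high sustained CPU utilization” is a judgment call. As a rough sketch, assume you have already collected CPU utilization samples (in percent) during the load window using whatever monitoring tool you prefer. The function name, the 85% threshold, and the 90% fraction below are my own illustrative assumptions, not Informatica guidance.

```python
def is_cpu_bound(samples, threshold=85.0, min_fraction=0.9):
    """Return True if utilization stays at or above `threshold` percent
    for at least `min_fraction` of the samples.

    Both defaults are illustrative assumptions; tune them for your site.
    """
    if not samples:
        return False
    busy = sum(1 for s in samples if s >= threshold)
    return busy / len(samples) >= min_fraction


# Hypothetical samples taken at regular intervals during a load window.
sustained = [95, 92, 97, 90, 88, 93, 96, 91, 94, 89]
spiky = [20, 99, 15, 30, 98, 25, 10, 35, 40, 22]

print(is_cpu_bound(sustained))
print(is_cpu_bound(spiky))
```

The second series peaks just as high as the first, but the peaks are brief. Occasional spikes like these usually point to a bottleneck somewhere other than the CPU.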

If vertical scaling eventually runs out of gas, what is the alternative? Horizontal scaling – that is, adding computers to the data integration infrastructure. This is also known as “grid computing.” While vertical scaling offers the advantage of simplicity, horizontal scaling brings other benefits in addition to increased compute capacity. Grid computing increases system redundancy, so the system can offer higher availability. It can also increase the overall system I/O capacity, and I/O is the bottleneck in most ETL systems.

How does it increase the overall system I/O capacity? With an I/O heavy application, the motherboard bus speed is often the limiting factor. That is, CPU cores are very fast, but these very fast cores must be continually “fed” with data; otherwise, they sit idle. In a multi-core/multi-CPU computer, all the cores/CPUs share the same bus, and this bus is used to move data between memory and CPU, as well as move data from I/O devices to memory/CPU. Therefore, there is a limit to how many data hungry CPU cores can effectively be fed by this bus.

By scaling with additional computers, each computer brings additional I/O capacity to the system. It also provides for incremental growth. No longer are you asking to “replace” a computer (as with vertical scaling), but rather to “add” a computer. This also solves the problem where a computer that offers twice the performance may, in fact, cost three (or more) times the price (as you graduate from mid-range servers to “big iron”). Yes, a grid adds complexity: there are more moving pieces, and it requires some additional infrastructure (a shared file system). However, a grid of mid-tier servers is generally far more cost-effective than a single “big iron” server offering similar capacity. In addition, Informatica’s grid offerings have focused on easing the management of multi-computer grids to address the complexity of horizontal scaling.
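The shared-bus argument can be sketched as a toy model. Every number here is made up for illustration: the per-core rate and the bus limit are hypothetical. The model also assumes ideal grid scaling, ignoring network and shared file system overhead, so treat it as a way to see the shape of the tradeoff rather than a sizing tool.

```python
def node_throughput(cores, per_core_rate, bus_limit):
    """Throughput of one computer: the cores can only consume what the
    shared bus can deliver (simplified model)."""
    return min(cores * per_core_rate, bus_limit)


def grid_throughput(nodes, cores, per_core_rate, bus_limit):
    """Throughput of a grid of identical computers, assuming ideal
    scaling (no network or shared file system overhead)."""
    return nodes * node_throughput(cores, per_core_rate, bus_limit)


# Hypothetical figures in MB/s (illustrative only).
PER_CORE = 100
BUS_LIMIT = 1_000

one_small_box = node_throughput(8, PER_CORE, BUS_LIMIT)
one_big_box = node_throughput(32, PER_CORE, BUS_LIMIT)
two_small_boxes = grid_throughput(2, 8, PER_CORE, BUS_LIMIT)

print("one 8-core box: ", one_small_box)
print("one 32-core box:", one_big_box)
print("two 8-core boxes:", two_small_boxes)
```

In this model the 32-core box is capped by its bus no matter how many cores it has. The two smaller boxes each bring their own bus, so together they move more data with half as many cores.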

To learn more about Informatica’s grid computing offerings, please check out our website and this webinar “Tackle Big Data Using PowerCenter Grid” that I recorded while not writing updates to this blog.
