Cloud Data Integration Elastic – Understanding Auto Scaling
Co-authored with Anand Sridharan
Why Auto Scaling?
How should you configure your Cloud Data Integration Elastic (CDI-E) cluster? Is the cluster too large or too small? Does it have the right number of nodes to meet SLAs? All these are difficult questions to answer. Workloads may not stay the same and can differ hour by hour or week by week.
Do you feel like you are under-utilizing your CDI-E cluster nodes, or maybe wasting data analysts’ valuable time by running a cluster with too few nodes? How do you strike a balance between cost and performance? Informatica introduced CDI-E as part of its Informatica Intelligent Cloud Services platform in the summer of 2019. CDI-E uses auto scaling to address these challenges, saving data analysts’ time and reducing operational costs without sacrificing performance.
To enable auto scaling, enter a minimum and maximum number of worker nodes on the Cluster Configuration page. The minimum must be at least 1, and the maximum can be as high as hundreds of nodes, depending on your account's instance limit. When a job is submitted, CDI-E creates the cluster with 1 master node and the configured minimum number of worker nodes. Auto scaling then adds or removes worker nodes based on the demand of your workload. For a CDI-E static cluster, by contrast, the minimum and maximum number of worker nodes are the same.
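The minimum/maximum rule can be sketched in a few lines of Python. This is only an illustration of the constraints described above: the class name `ClusterScalingConfig` and its methods are hypothetical and are not part of any CDI-E API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterScalingConfig:
    """Hypothetical stand-in for the min/max worker node fields."""

    min_workers: int
    max_workers: int

    def __post_init__(self) -> None:
        # The minimum number of worker nodes must be at least 1.
        if self.min_workers < 1:
            raise ValueError("minimum number of worker nodes must be at least 1")
        if self.max_workers < self.min_workers:
            raise ValueError("maximum must not be smaller than minimum")

    @property
    def is_static(self) -> bool:
        # A static cluster has the same minimum and maximum.
        return self.min_workers == self.max_workers

    def clamp(self, desired_workers: int) -> int:
        # Keep any requested worker count inside the configured range.
        return max(self.min_workers, min(self.max_workers, desired_workers))
```

For example, `ClusterScalingConfig(1, 10)` describes an auto-scaling cluster like the one in our test, while `ClusterScalingConfig(10, 10)` describes the static one.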
Cost Benefit vs. Performance Penalty – Static and Auto Scaling Clusters
Static clusters have a fixed cost depending on the cluster run time. With auto scaling, cost varies depending on the demand arising from workloads submitted to the cluster. For workloads that require fewer nodes, auto-scaling clusters could offer significant cost benefits.
Static clusters might perform better than auto-scaling clusters for workloads that force the auto-scaling cluster to scale up, because those jobs have to wait while additional nodes are provisioned.
Essentially, the cost benefit of an auto-scaling cluster and the performance penalty it incurs are not constant. The difference in both cost and performance fades as resource demand, measured as a share of cluster capacity, increases.
It can be represented as:
T ∝ (1/ΔPERF, 1/ΔCOST), where T = load size in terms of cluster capacity
or visualized as:
With the cost benefit and performance penalty converging as load increases, one might ask whether an auto-scaling cluster is needed at all. The convergence applies only while the workload is executing; idle time still incurs cost overhead. To limit that overhead, Cloud Data Integration Elastic provides idle time-based termination of clusters (cluster lifecycle management) as part of cluster management, and scale-down for auto-scaling clusters.
The cost factor converges between a static and an auto-scaling cluster when jobs require 80% or more of the cluster's resources. Fig. B demonstrates that the cost factor converges as load increases.
The following sections demonstrate the potential benefit of using an auto-scaling cluster over a CDI-E static cluster. They also compare CDI-E with a standalone Kubernetes cluster in terms of cost over an extended period of time (7 days for this assessment).
Our test mapping contained an Amazon S3 source and target with multiple transformations. We ran it with different data volumes to orchestrate jobs with varying resource requirements on the Kubernetes cluster. We used 6 mappings to produce loads of 10% to 150% on a 10-node cluster.
A 10% load means the mapping requires 10% of the cluster’s resources to run in a single wave. For a cluster with 10 worker nodes, that is 1 node.
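The load arithmetic can be written out as a small Python sketch. The function names are hypothetical, and treating a load above 100% as needing more than one wave is our reading of the "single wave" definition, not something measured in the test.

```python
import math


def nodes_for_load(load_percent: int, cluster_nodes: int) -> int:
    """Worker nodes needed to run the load, capped at the cluster size."""
    needed = math.ceil(load_percent * cluster_nodes / 100)
    return min(max(needed, 1), cluster_nodes)


def waves_for_load(load_percent: int) -> int:
    """Assumption: a load above 100% of capacity runs in multiple waves."""
    return max(1, math.ceil(load_percent / 100))
```

With these definitions, `nodes_for_load(10, 10)` gives 1 node, matching the example above, while a 150% load occupies all 10 nodes and `waves_for_load(150)` gives 2.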
Orchestration of Jobs
We created a workflow containing 6 mappings that cover 6 scenarios based on load requirements on the cluster (10% to 150% load). Each mapping is executed twice: once during scale-up and once after scale-up. The workflow is repeated 4 times in a 24-hour period for 7 days.
During scale-up, there is a performance overhead because it takes 2 to 3 minutes for new nodes to be provisioned and start running pods. Jobs complete faster when the cluster does not need to scale up. If a node remains idle for 10 minutes, the cluster auto scaler removes it from the cluster.
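The 10-minute idle rule can be sketched as follows. This is a simplified illustration, not the cluster auto scaler's actual logic: the function name and the node-name-to-idle-minutes dictionary are hypothetical, and the floor of 1 node is taken from the minimum used in our test cluster.

```python
IDLE_THRESHOLD_MINUTES = 10  # idle time after which a node is removed


def nodes_to_remove(idle_minutes_by_node: dict[str, float], min_nodes: int = 1) -> list[str]:
    """Return idle nodes to remove, longest idle first, keeping at least min_nodes."""
    idle = sorted(
        (name for name, minutes in idle_minutes_by_node.items()
         if minutes >= IDLE_THRESHOLD_MINUTES),
        key=lambda name: idle_minutes_by_node[name],
        reverse=True,
    )
    # Never shrink the cluster below the configured minimum.
    removable = max(0, len(idle_minutes_by_node) - min_nodes)
    return idle[:removable]
```

A node that has been idle for 3 minutes stays, while nodes idle for 12 and 15 minutes become removal candidates as long as at least one node remains.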
Auto Scaling Cluster Lifecycle
As Fig. C shows, the cluster starts with only 1 node and provisions additional nodes, up to a maximum of 10, as jobs require. The cluster auto scaler provisions additional nodes when more pods are scheduled than the current nodes can hold. In our test environment, each node supports a maximum of 4 pods.
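As a rough sketch of that scale-up behavior, the target node count can be derived from the number of scheduled pods. The constants come from our test environment; the function names are hypothetical. Counting pods per node is a simplification, since the Kubernetes cluster autoscaler reacts to pods that cannot be scheduled because of their resource requests.

```python
import math

PODS_PER_NODE = 4  # maximum pods per node in the test environment
MIN_NODES = 1      # the cluster starts with 1 node
MAX_NODES = 10     # configured maximum


def target_node_count(scheduled_pods: int) -> int:
    """Nodes needed to hold the pods, kept within the min/max range."""
    needed = math.ceil(scheduled_pods / PODS_PER_NODE)
    return max(MIN_NODES, min(MAX_NODES, needed))


def pods_left_waiting(scheduled_pods: int) -> int:
    """Pods that do not fit even when the cluster is at its maximum size."""
    return max(0, scheduled_pods - MAX_NODES * PODS_PER_NODE)
```

Under these assumptions, 5 pods require a second node, and any pods beyond 40 have to wait for running pods to finish.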
Auto Scaling Cost Savings
To assess the benefit of auto scaling, we compared the cost of running the workflow for 7 days in a standalone Kubernetes cluster, a CDI-E static cluster without auto scaling (but with cluster lifecycle management), and a CDI-E auto scaling cluster (with cluster lifecycle management).
In our test of a 10-node auto-scaling cluster over a 7-day period, we observed savings of roughly 48% against a standalone Kubernetes cluster and roughly 28% against a static CDI-E cluster. Cluster lifecycle management terminates the cluster after 30 minutes of inactivity. Auto scaling brings up additional nodes when necessary and removes them once they have been unneeded for longer than the scale-down threshold.
Whenever you have questions like the ones in the section “Why Auto Scaling?”, consider auto scaling. If you are not sure, auto scaling will help for sure!
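One way to reason about such a comparison is to count node-hours for each cluster type. The sketch below is a minimal model with made-up one-day timelines; the numbers are purely illustrative, are not the measurements from our test, and ignore the master node and per-instance pricing.

```python
def node_hours(timeline: list[tuple[float, int]]) -> float:
    """Sum node-hours over (duration_hours, node_count) segments."""
    return sum(hours * nodes for hours, nodes in timeline)


def savings_percent(baseline: float, candidate: float) -> float:
    """Relative saving of candidate against baseline, in percent."""
    return 100 * (baseline - candidate) / baseline


# Hypothetical one-day timelines (illustrative only, not test data).
always_on = [(24, 10)]                          # standalone cluster, never terminated
static_with_termination = [(16, 10), (8, 0)]    # terminated while idle
auto_scaling = [(4, 10), (12, 3), (8, 0)]       # scales with demand, terminated while idle

for name, timeline in [
    ("always on", always_on),
    ("static + termination", static_with_termination),
    ("auto scaling", auto_scaling),
]:
    print(f"{name}: {node_hours(timeline)} node-hours")
```

The same two functions can be applied to real usage data: replace the hypothetical timelines with the node counts your cluster actually ran, then compare the totals with `savings_percent`.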
Appendix: Environment Details
Cluster Configuration Used:
- Static Cluster with 10 Nodes
- Auto Scaling Cluster with a minimum of 1 node and a maximum of 10 nodes
- Cluster auto-termination enabled with 30 minutes of inactivity
- Master node type and worker node type: m5.2xlarge (8 vCPUs and 32 GiB memory)
Default Spark Configurations:
- spark.executor.cores 2
- spark.executor.memory 6GB
- spark.driver.memory 4GB
All the performance claims in this blog were either observed in a development environment or shared with us by our customers. You may or may not achieve the same performance, as various factors influence performance results.