Make Your Cloud Data Warehouse Truly Elastic with Amazon Redshift and Informatica

Of all the advantages of having a data warehouse in the cloud, the biggest is elasticity: you can add or remove compute and storage as your demand grows and shrinks, as frequently as you need.

Resizing with Amazon Redshift

Amazon Redshift offers two main ways to resize a cluster: Classic resize and Elastic resize. They differ in flexibility and in downtime behavior. With Classic resize you can change both the node type and the number of nodes, but the cluster is read-only while the resize runs. With Elastic resize you can add or remove nodes but cannot change the node type, and the cluster remains available for reads and writes during the operation. For more information on resizing clusters in Amazon Redshift, visit the AWS resizing tutorial page. The image below shows how you can resize a cluster using Elastic resize.

Screenshot of the Resize Cluster screen in AWS (from the AWS page)

Once you resize a cluster, Amazon Redshift redistributes data onto the available nodes. The way data is distributed varies between Classic and Elastic resizing.
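The resize described above can also be requested programmatically. The following is a minimal sketch, assuming the boto3 Redshift client and its `resize_cluster` call; the helper names `build_resize_request` and `request_resize` and the cluster identifier in the comment are hypothetical and used only for illustration. The node-type check follows the description in this article rather than any guarantee from AWS.

```python
from typing import Any, Dict, Optional


def build_resize_request(
    cluster_id: str,
    number_of_nodes: int,
    node_type: Optional[str] = None,
    classic: bool = False,
) -> Dict[str, Any]:
    """Build keyword arguments for the Redshift ResizeCluster API call.

    classic=False requests an Elastic resize, classic=True a Classic resize.
    """
    if number_of_nodes < 1:
        raise ValueError("number_of_nodes must be at least 1")
    if node_type is not None and not classic:
        # Following the article: Elastic resize cannot change the node type.
        raise ValueError("changing the node type requires classic=True")
    request: Dict[str, Any] = {
        "ClusterIdentifier": cluster_id,
        "NumberOfNodes": number_of_nodes,
        "Classic": classic,
    }
    if node_type is not None:
        request["NodeType"] = node_type
    return request


def request_resize(
    client: Any,
    cluster_id: str,
    number_of_nodes: int,
    node_type: Optional[str] = None,
    classic: bool = False,
) -> Any:
    """Send the resize request through a boto3 Redshift client."""
    request = build_resize_request(cluster_id, number_of_nodes, node_type, classic)
    return client.resize_cluster(**request)


# Example (needs AWS credentials and boto3; the identifier is a placeholder):
#   import boto3
#   request_resize(boto3.client("redshift"), "my-cluster", number_of_nodes=4)
```

Keeping the request construction separate from the API call makes the parameters easy to inspect before anything is sent to AWS.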

Informatica and AWS

If you use Informatica to interact with Amazon Redshift, you benefit from a resize as soon as you run your next load or unload. Informatica has rich connectivity to the AWS ecosystem: its Cloud Amazon Redshift Connector uses AWS APIs to interact with Amazon S3 and Amazon Redshift. For more information, visit Informatica’s AWS connectors page.

Informatica has the broadest connectivity both inside and outside AWS. This allows you to fetch data directly from various sources such as ERP, CRM, and API-based endpoints. It also allows you to load data from SaaS applications, such as Salesforce, as well as on-premises and cloud file storage. This data can be loaded to your data warehouse in Amazon Redshift either directly or after applying transformations as you load it.

Informatica has an easy-to-use tool called Cloud Mapping Designer. It helps you configure your tasks to read data from your sources directly and write it to Amazon Redshift after configuring any transformations you want to apply over it. (Watch: Learn how to get started with Cloud Mapping Designer.)

Screenshot of the Cloud Mapping Designer

Informatica automatically optimizes its load or unload activities based on the available nodes and slices.

What Informatica does every time you load data to Amazon Redshift:

  • Informatica fetches the latest metadata about the cluster, the nodes, and especially the number of slices available.
  • Using that metadata, Informatica creates multiple parallel upload threads to load data from your source to Amazon S3. The number of threads is a multiple of the number of slices available.
  • Note that your data does NOT have to be on Amazon S3. As described above, you can configure your data flows to read data from your source application, database, or files, apply transformations, and write to Amazon Redshift all in one step.
  • When you configure a task to write data to Amazon Redshift, Informatica internally uploads the data to Amazon S3 and then copies it from there to Amazon Redshift.
  • Each upload to Amazon S3 is further tuned using default parameters that users can override. For example, a user can configure the threshold and part size for multipart uploads. Although these settings are not directly related to the cluster size, they let users tune load performance.
  • Once the files are on Amazon S3, they are copied into the Amazon Redshift table using COPY commands, in as many parallel threads as were used to upload them.
  • Users can also partition data on any field values. If the data you are reading is already partitioned, either by source partitions or by partitions configured in the mapping, that partitioning is “passed through” to the target: Informatica runs multiple parallel threads consistent with those partitions, so data is not redistributed among partitions.
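The steps above can be illustrated with a small sketch. This is not Informatica's implementation; it only shows the general idea under stated assumptions: the slice count is read from the Redshift system table `STV_SLICES`, the upload thread count is a multiple of that count, rows are split into one chunk per thread, and the staged files are loaded with a COPY statement. As a simplification, a single COPY over a shared S3 key prefix stands in for the parallel COPY commands described above. The function names, the `multiplier` parameter, and the table, bucket, and role values are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# STV_SLICES holds one row per slice, so COUNT(*) gives the slice count.
SLICE_COUNT_SQL = "SELECT COUNT(*) FROM stv_slices;"


def upload_thread_count(slices: int, multiplier: int = 1) -> int:
    """Return a thread count that is a multiple of the slice count."""
    if slices < 1 or multiplier < 1:
        raise ValueError("slices and multiplier must be at least 1")
    return slices * multiplier


def split_into_chunks(rows: Sequence[T], chunks: int) -> List[List[T]]:
    """Distribute rows round-robin into the requested number of chunks."""
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    return [list(rows[i::chunks]) for i in range(chunks)]


def build_copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a COPY statement that loads every CSV file under an S3 prefix."""
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV;"
    )


# Example: a cluster with 4 slices and 2 upload threads per slice.
threads = upload_thread_count(slices=4, multiplier=2)
parts = split_into_chunks(list(range(20)), threads)
copy_sql = build_copy_statement(
    "sales",
    "s3://example-bucket/staging/sales_part_",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

Because the thread count is derived from the slice count at load time, a resized cluster changes the degree of parallelism on the next run without any change to the task definition.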

Similar options are available when you read data from Amazon Redshift. Most of them can be configured by users based on their understanding of their data and resources. Together, these options let you configure highly efficient data loads to your data warehouse in Amazon Redshift. By combining dynamic optimization based on cluster size with user-configurable parameters, Informatica lets users take advantage of the elasticity that is vital to cloud data warehousing.
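On the read side, Amazon Redshift's own mechanism for writing query results to Amazon S3 in parallel is the UNLOAD command. The sketch below only builds such a statement; it does not describe how Informatica issues it. The function name `build_unload_statement` and the bucket and role values are hypothetical. Single quotes inside the query are doubled because UNLOAD takes the query as a quoted string.

```python
def build_unload_statement(
    query: str,
    s3_prefix: str,
    iam_role: str,
    parallel: bool = True,
) -> str:
    """Build an UNLOAD statement that writes query results to S3."""
    # UNLOAD takes the query as a quoted string, so embedded quotes are doubled.
    escaped_query = query.replace("'", "''")
    parallel_flag = "ON" if parallel else "OFF"
    return (
        f"UNLOAD ('{escaped_query}') TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' PARALLEL {parallel_flag};"
    )


unload_sql = build_unload_statement(
    "SELECT * FROM sales WHERE region = 'EU'",
    "s3://example-bucket/export/sales_",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

With PARALLEL ON, Redshift writes multiple files according to the number of slices in the cluster, so the number of output files also follows the cluster size after a resize.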

Next steps: