Guide to Ingesting Data Into Your Cloud Data Lake for BI and Real-Time Streaming Analytics

Last Published: Jul 12, 2022

Vishwanath Belur

Data drives strategic decisions within organizations. Because data is such an important asset, it is essential to capture it from a variety of sources across the enterprise, including partner ecosystems and third-party data. Many organizations have started initiatives to bring data from these various sources onto data lakes or messaging systems such as Kafka, so that they can integrate and analyze the data to help drive critical business decisions.

Business use cases

A cloud data platform typically supports a variety of business use cases, including BI and real-time streaming analytics.

Organizations typically ingest data into a cloud data lake before moving it into cloud data warehouses, where it can be made available for BI and analytics. The challenge is that you need to efficiently and accurately ingest large amounts of data from a variety of sources. That’s where your ingestion solution makes a difference.

Data may come from batch or real-time sources, and there are four primary types of data source:

  • Files such as local static files, file listeners, or files on FTP servers
  • Change data capture (CDC) data for relational databases
  • Streaming sources such as IoT data, logs, clickstream, or social media
  • Messaging systems such as Apache Kafka, Amazon Kinesis, or JMS

A typical data lake architecture involves ingesting data from the above sources into cloud data lakes or messaging systems (like Apache Kafka). Once the data is available in the lake, data integration techniques such as enrichment, transformation, and aggregation can be applied to make it ready for the business use cases described above.
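
To make the pattern concrete, here is a minimal Python sketch of one such ingestion path: consuming events from a Kafka topic and landing them as batched objects in an Amazon S3 data lake. This is an illustrative hand-rolled example, not how Informatica Cloud Mass Ingestion works internally; the topic, brokers, bucket, and batch size are hypothetical, and it assumes the kafka-python and boto3 client libraries.

```python
import json
import time

import boto3                     # AWS SDK for Python
from kafka import KafkaConsumer  # kafka-python client

# Hypothetical names: replace with your own topic, brokers, and bucket.
TOPIC = "clickstream-events"
BROKERS = ["localhost:9092"]
BUCKET = "my-data-lake-raw"
BATCH_SIZE = 1000

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
s3 = boto3.client("s3")

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= BATCH_SIZE:
        # Land each batch as one newline-delimited JSON object in the raw zone.
        key = f"raw/clickstream/{int(time.time())}.json"
        body = "\n".join(json.dumps(record) for record in batch)
        s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
        batch = []
```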

Fig 1: Cloud Mass Ingestion use cases

Customer requirements for mass ingestion solutions

Organizations struggle with mass ingestion deployments for a variety of technical and operational reasons, and they are seeking solutions that meet the following business and technical needs:

  1. Simple and unified experience for ingestion: It is difficult to ingest data from various sources using disparate systems, so customers need a single, unified solution that covers all of them. The experience also needs to be simple and easy to use, so that business analysts can perform ingestion themselves instead of depending on IT for each job.
  2. Versatile connectivity: The ingestion solution needs to offer connectivity to various sources such as files, databases, mainframes, IoT, and other streaming sources. It also needs to ingest the data onto various cloud data lakes, warehouses, and messaging systems.
  3. Edge transformations: Given that data is being ingested from remote systems, it is important that the ingestion solution can apply simple transformations at the edge (for example, filtering bad records) before the data lands in the data lake, as illustrated in the sketch after this list.
  4. Address schema drift: Changes in the structure of the source data (often referred to as schema drift) are a key pain point for customers. Customers expect the ingestion solution to automatically handle schema drift and propagate the changes to the target systems; the sketch below also illustrates drift detection.
  5. Real-time monitoring and lifecycle management: Given that ingestion jobs can run for long durations, and potentially never end, it is important that the ingestion solution provides real-time monitoring that shows what is happening in the system at any given moment. It is also important to be able to schedule and manage jobs, including pausing and resuming them.
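
To make points 3 and 4 above concrete, here is a minimal Python sketch, not Informatica's implementation, of an edge filter that drops bad records before ingestion and flags schema drift by comparing each record's fields against the fields seen so far. The required fields and sample records are hypothetical.

```python
from typing import Iterable, Iterator

REQUIRED_FIELDS = {"event_id", "timestamp", "payload"}  # hypothetical contract
known_schema = set()  # fields observed in previously ingested records

def filter_and_track_drift(records: Iterable[dict]) -> Iterator[dict]:
    """Drop records missing required fields; report newly seen fields."""
    global known_schema
    for record in records:
        fields = set(record)
        if not REQUIRED_FIELDS <= fields:
            continue  # bad record: filtered out at the edge
        drifted = fields - known_schema
        if known_schema and drifted:
            # Schema drift: new columns appeared in the source. A real
            # ingestion solution would propagate them to the target.
            print(f"schema drift detected, new fields: {sorted(drifted)}")
        known_schema |= fields
        yield record

# The second record is dropped; the third triggers a drift notice.
events = [
    {"event_id": 1, "timestamp": "2020-01-12T00:00:00Z", "payload": "a"},
    {"event_id": 2, "payload": "b"},  # missing "timestamp": filtered
    {"event_id": 3, "timestamp": "2020-01-12T00:01:00Z",
     "payload": "c", "region": "us-east-1"},  # new "region" field: drift
]
print(list(filter_and_track_drift(events)))
```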

How does Informatica help?

Informatica offers the industry’s first cloud-native, unified mass ingestion solution for ingesting data from various sources: Informatica Intelligent Cloud Services (IICS) Cloud Mass Ingestion.

Informatica Cloud Mass Ingestion addresses three main use cases:

  1. Cloud data lake or cloud data warehouse ingestion: Ingestion from files, database tables, and streaming and IoT sources onto cloud data lakes like Amazon S3 and Azure ADLS Gen2, to be used for batch analytics.
  2. Accelerate Kafka (cloud messaging service): Ingestion of logs, clickstream, IoT, and change data capture (CDC) data from relational sources onto Kafka for real-time analytics and distribution; a rough sketch of this pattern follows the list.
  3. Database or data warehouse modernization and migration: Ingestion of initial and incremental CDC data from on-premises databases and mainframe systems onto cloud data warehouses like Snowflake and Azure SQL DW.
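
As a rough illustration of the second use case, the sketch below publishes change records from a hypothetical CDC feed to a Kafka topic using the kafka-python client. The feed, topic name, and record shape are assumptions for the example, not Informatica's API; in practice a log-based CDC capture would supply the events.

```python
import json

from kafka import KafkaProducer  # kafka-python client

# Hypothetical CDC feed: in practice these events would come from a
# log-based capture of the source database, not a hard-coded list.
change_events = [
    {"op": "INSERT", "table": "orders", "row": {"id": 101, "total": 25.0}},
    {"op": "UPDATE", "table": "orders", "row": {"id": 101, "total": 30.0}},
]

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for event in change_events:
    # Key by table name so downstream consumers can partition per table.
    producer.send("cdc.orders", key=event["table"].encode("utf-8"), value=event)

producer.flush()  # block until all buffered records are sent
```

Keying by table keeps each table’s changes ordered within a partition, which matters when applying updates downstream.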

Cloud Mass Ingestion provides a simple, wizard-driven experience for building flows that ingest data from batch sources, such as files and relational databases, as well as real-time sources, such as CDC, IoT systems, and other streaming feeds. It also provides a consistent real-time monitoring and lifecycle management experience so that you can manage jobs effectively.

Fig 2: Unified ingestion solution for a variety of data sources

Fig 3: Simple, intuitive experience for designing and real-time monitoring 

Ingesting data from a variety of sources is a key first step in your journey toward cloud data lakes. It is important to have a unified solution that ingests data from various sources with a consistent design, deployment, monitoring, and lifecycle management experience. Informatica offers a unified, cloud-native Mass Ingestion solution within IICS to address customers’ ingestion use cases.

Learn more

Visit the Cloud Mass Ingestion product page for more details.

Try Cloud Mass Ingestion for 30 days.

First Published: Jan 12, 2020