How to go Hadoop-less with Informatica Data Engineering and Databricks

This blog was co-authored by Nauman Fakhar, Director of ISV Solutions at Databricks.

Apache Hadoop was born as an on-premises platform. Most of the use cases for early commercial Hadoop vendors focused on on-premises implementations of the open source data analytics platform. Eventually, Hadoop-as-a-Service (Hadoop running in the cloud) became increasingly popular.

However, the Hadoop-as-a-Service model ran into the following challenges:

  • Core Hadoop services such as YARN and HDFS were not designed from the ground up for cloud native environments.
  • The DevOps and management overhead of Hadoop still remained a challenge.
  • Reliability and performance at scale were missing in data lakes built on the Hadoop stack on cloud infrastructure.

With the acceleration of data engineering and AI workloads moving into the cloud, customers’ expectations from the underlying data platform also evolved. Customers now expect:

  • Convenient, on-demand, fully managed big data clusters delivered either as Platform as a Service (PaaS) or Software as a Service (SaaS)
  • Pay-as-you-go, flexible pricing models
  • Smart auto-scaling for managing TCO and SLAs
  • Reliability and higher quality of data housed in cloud data lakes to make sound data-driven business decisions
  • High performance of both ingestion and queries on very large datasets

Cloud providers currently offer convenient, on-demand managed big data clusters (PaaS) with a pay-as-you-go model. With PaaS, analytical engines such as Apache Spark come ready to use, with a general-purpose configuration and upgrade management system. As a result, most jobs in current data pipelines no longer need long-running Hadoop clusters.

With fast-changing trends and ecosystems, Informatica plays a critical role as an abstraction layer for customers. Customers can choose any technology and any vendor to process and store their data using Informatica Data Engineering Integration.

Informatica customers are able to leverage simple drag-and-drop functionality to build complex data pipelines against any big data vendor and technology. When they want to move to a different vendor or distribution service, the pipelines work seamlessly without any code changes for the customer. In this way, Informatica Data Engineering Integration customers can future-proof their big data management platform against changing big data technologies.

Informatica and Databricks

Databricks is a managed, cloud native, unified analytics platform built on Apache Spark. Databricks is also the creator of Delta Lake, which allows customers to create reliable and performant data lakes on their cloud of choice.

Informatica and Databricks have partnered to help organizations realize big data value sooner by making ingestion and preparation of data for analysis and machine learning easier. This integration dramatically increases productivity across the organization.

Data Engineering

Data engineers, data scientists, and administrators don’t need to spend time configuring and optimizing clusters and manually maintaining or scaling the data platform. Instead, data engineers can spend time building data pipelines for machine learning and analytics. And because Informatica Data Engineering Integration offers a visual paradigm for expressing data engineering workloads, organizations that don’t have Python, Scala, R or SQL programming language skill sets can still leverage the power and scale of Databricks from a GUI-based environment.

For customers who are looking to migrate from the traditional Hadoop architecture to a cloud-native platform like Databricks, this article highlights the issues and benefits of changing trends in big data architecture. The article also suggests several best practices for migrating to cloud and serverless technologies.

Architecture Changes for Hadoop vs Databricks on Different Services

YARN

In long-running Hadoop clusters, YARN manages capacity and job orchestration. It requires users to learn complex configurations to balance capacity and performance needs of multiple users.

A cluster in Databricks is a lightweight concept, which can be created on demand very quickly by leveraging the native elasticity and scale of the underlying cloud.

This on-demand model relieves users of the operational burden of managing capacity in shared, long-running clusters. Users can easily spin up elastic clusters that automatically expand or shrink with workload demands and shut down automatically in quiet periods. This allows Databricks users to focus on analytics instead of operations.
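As a rough illustration of what such an elastic cluster definition can look like, the sketch below builds a request body with an auto-scaling range and an idle shutdown timeout. The field names follow the Databricks Clusters API as we understand it. The helper function, the cluster name, and the placeholder runtime and node type values are assumptions made for this example and should be checked against your workspace.

```python
import json


def build_cluster_spec(name, min_workers, max_workers, idle_minutes):
    """Return a cluster definition that scales with load and stops when idle.

    Hypothetical helper for illustration; it only builds a dictionary and
    does not call any Databricks service.
    """
    if not 0 < min_workers <= max_workers:
        raise ValueError("need 0 < min_workers <= max_workers")
    if idle_minutes <= 0:
        raise ValueError("idle_minutes must be positive")
    return {
        "cluster_name": name,
        "spark_version": "<runtime-version>",  # placeholder, not a real version
        "node_type_id": "<node-type>",         # placeholder, cloud specific
        # The cluster grows and shrinks between these bounds with demand.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # The cluster shuts down after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }


spec = build_cluster_spec("nightly-etl", min_workers=2, max_workers=8, idle_minutes=30)
print(json.dumps(spec, indent=2))
```

The two settings that matter here are the auto-scaling bounds and the idle timeout: together they replace the capacity planning that a shared YARN cluster would require.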

HDFS

On Hadoop, HDFS is used as the storage layer. HDFS is a distributed file system whose storage is tied to the compute nodes of the cluster.

Databricks leverages cloud-native storage such as S3 on AWS or ADLS on Azure, which leads to an elastic, decoupled compute-storage architecture. Such an architecture allows users to scale compute independently of storage and relieves them from having to capacity plan their storage needs or deal with scalability limits of HDFS name nodes. 

Data Lake, SQL and NoSQL

Hadoop includes engines such as Hive, open source Spark and HBase. While these engines had their merits as first-generation big data products, they aren’t well suited to building a reliable and performant cloud native data lake today.

Databricks includes Databricks Runtime, a Databricks implementation of Apache Spark that is much more performant, scalable and enterprise ready than open source Spark.

Databricks also includes Delta Lake, which enables users to build reliable and performant data lakes on cloud storage. Support for transactional pipelines, autonomic caching and data clustering techniques makes it possible to build a truly enterprise grade data lake.

For NoSQL capability, Databricks integrates with cloud-native services.

For more information on architecture changes, refer to the Databricks documentation.

Best Practices to Migrate from Hadoop to Databricks Through Informatica Data Engineering Integration

Customers who plan to switch from Hadoop to Databricks should be aware of the following key changes:

Ingestion

Sqoop is not available on Databricks. Customers should use Data Engineering Integration mass ingestion to ingest data to any cloud storage layer that Databricks supports. Customers can also use Informatica’s JDBC V2 connector for Databricks to ingest data directly into Delta Lake.

Data Objects

Hive: Hive is a SQL layer on HDFS that allows you to access data on HDFS through a SQL representation. Customers migrating from Hadoop to Databricks should migrate their Hive datasets to Delta Lake.

Databricks Delta Lake: Delta Lake provides ACID transactions, versioning, and schema enforcement to Spark data sources.

Just as Data Engineering Integration users use Hadoop to access data on Hive, they can use Databricks to access data on Delta Lake.
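For teams that copy existing Hive tables to Delta Lake with Spark SQL rather than through an Informatica mapping, the sketch below generates one CREATE TABLE ... USING DELTA AS SELECT statement per table. This is only one possible approach, not a documented Informatica procedure. The table names and the helper function are hypothetical, and the code only builds the statements; on a Databricks cluster each one would typically be passed to spark.sql(...).

```python
import re

# Accept plain identifiers, optionally qualified as database.table.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def delta_copy_statement(hive_table, delta_table):
    """Build a statement that copies a Hive table into a new Delta table."""
    for name in (hive_table, delta_table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unexpected table name: {name!r}")
    return f"CREATE TABLE {delta_table} USING DELTA AS SELECT * FROM {hive_table}"


# Hypothetical source (Hive) and target (Delta) table names.
tables = {
    "sales.orders": "lake.orders",
    "sales.customers": "lake.customers",
}

statements = [delta_copy_statement(src, dst) for src, dst in tables.items()]
for stmt in statements:
    print(stmt)
```

Validating the names before building the statement keeps a typo or stray character in a table list from turning into malformed SQL.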

NoSQL

Hadoop customers who use NoSQL with HBase on Hadoop can migrate to Azure Cosmos DB, or DynamoDB on AWS, and use Data Engineering Integration connectors to process the data. This is a sound architectural strategy as customers expect a cloud-native, managed, and elastic alternative to HBase when migrating NoSQL workloads from Hadoop to cloud.

Data Processing

Customers can use Informatica transformations with drag-and-drop functionality, relieving developers of the need to write code to process data. Informatica transformations are compatible with any Hadoop or non-Hadoop vendors, making it easier for customers to switch between vendors and technologies. Customers need to follow some best practices while migrating to Databricks:

  • Jobs that were configured to run with the Hive engine must be updated to run with Databricks. Follow the link to learn how to update your Hive jobs.
  • Update all Hive-mode mappings to run on Databricks after upgrading to version 10.2.2.
  • Sequence Generator transformation: Use UUID4 or the monotonically_increasing_id() function in Spark.
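The two alternatives behave differently from a classic sequence, so a small illustration may help. The sketch below does not call Spark. It simulates the ID layout described in the Spark documentation for monotonically_increasing_id() (partition ID in the upper bits, record number within the partition in the lower 33 bits) and generates a UUID4 with the Python standard library. The function name simulated_monotonic_id is our own, used only for this example.

```python
import uuid

RECORD_BITS = 33  # per the Spark docs: record number occupies the lower 33 bits


def simulated_monotonic_id(partition_id, row_in_partition):
    """Mimic the documented layout of Spark's monotonically_increasing_id()."""
    return (partition_id << RECORD_BITS) | row_in_partition


# Three partitions with two rows each.
ids = [simulated_monotonic_id(p, r) for p in range(3) for r in range(2)]
print(ids)  # [0, 1, 8589934592, 8589934593, 17179869184, 17179869185]

# UUID4 values are random, so they are unique in practice but carry no order.
surrogate_key = uuid.uuid4()
print(surrogate_key)
```

The generated IDs are unique and increasing, but they are not consecutive across partitions. Mappings that rely on gap-free sequence values need to account for this when switching from a Sequence Generator.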

For more information on how to accelerate data pipelines for AI and analytics with Informatica and Databricks, refer to this documentation.

Performance and Concurrency

Job concurrency on a shared cluster behaves differently on Databricks and Hadoop. Hadoop uses YARN, which includes a job scheduler and resource pools to orchestrate jobs. YARN also launches a new Spark Driver for each Spark job to allow job recovery and concurrency on one cluster. But the resources available to YARN are limited by the overall capacity of the cluster. This may result in increased job completion times, resource contention, missed SLAs, and the operational burden of dividing limited capacity among multiple competing workloads.

Databricks is a cloud-native product that relieves customers from waiting for cluster resources. With Databricks, a single job is allowed to consume all resources on the cluster, which significantly improves job performance and reduces operational risk.

Because a Databricks cluster is backed by the elasticity of the underlying cloud, it is a much lighter-weight and more agile component than a monolithic, fixed-capacity, long-running YARN-Hadoop cluster. As a best practice, you should design your architecture to segregate independent data engineering pipelines into their own clusters. This allows for a “pay for what you use” elastic model and results in both minimal operational burden and lower TCO as no resources are wasted (Databricks clusters can automatically shut down once jobs are finished).

Some existing pipelines may require you to run multiple jobs on the same cluster. While this is possible on Databricks, there are some guidelines to keep in mind:

  • A cluster in Databricks has a single Spark Driver, so ensure that your Driver has enough resources by choosing larger virtual machines to accommodate the load.
  • The job that reaches the cluster first may be allowed to dominate its capacity, which can cause other jobs to queue. If this is an issue, auto-scaling policies can be used to ensure SLAs are met.
  • Try not to exceed 40 to 50 concurrent jobs on a shared cluster.
  • Consider following the best practice of segregating a pipeline into its own cluster for optimal performance and isolation on a cloud architecture.
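One way to respect the concurrency guideline above is to throttle submissions on the client side. The following is a minimal sketch under stated assumptions: FakeCluster is a hypothetical stand-in that only counts in-flight jobs, and run_with_cap is an illustrative helper, not part of any Informatica or Databricks API. In a real setup the submit callable would trigger a mapping or job run.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_JOBS = 40  # lower end of the 40 to 50 guideline


def run_with_cap(jobs, submit, cap=MAX_CONCURRENT_JOBS):
    """Call submit(job) for every job with at most `cap` running at once."""
    with ThreadPoolExecutor(max_workers=cap) as pool:
        return list(pool.map(submit, jobs))


class FakeCluster:
    """Hypothetical shared cluster that records peak concurrency."""

    def __init__(self):
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0

    def submit(self, job_name):
        with self._lock:
            self.running += 1
            self.peak = max(self.peak, self.running)
        time.sleep(0.01)  # pretend the job does some work
        with self._lock:
            self.running -= 1
        return f"{job_name}: done"


cluster = FakeCluster()
job_names = [f"mapping_{i}" for i in range(100)]
results = run_with_cap(job_names, cluster.submit)
print(len(results), "jobs finished; peak concurrency was", cluster.peak)
```

The thread pool size acts as the cap: jobs beyond the limit wait in the client's queue instead of competing for the single Spark Driver on the shared cluster.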

Future-Proof Data Engineering

With the evolution of cloud-based big data pipelines, Informatica plays a critical role in future-proofing the data engineering platform.

Informatica and Databricks together provide an efficient way to process your data and help reduce compute costs with the auto-scaling capabilities of Databricks. Customers can design some of the most advanced data pipelines with no coding involved and minimal cluster maintenance activities. For customers who are migrating from Hadoop to a Spark-based compute engine like Databricks, there can be architectural changes to consider, as outlined in the sections above. Please refer to the Informatica documentation for help with creating a Databricks connection and running mappings and workflows on Databricks from Data Engineering Integration.

To learn more, watch the on-demand webinar: Building Intelligent Data Pipelines for AI/ML Projects.
