Informatica Big Data Management Compatibility with Sentry Enabled CDH Cluster

Introduction

Informatica Big Data Management (BDM) is a GUI-based integrated development environment that organizations use to build Data Integration, Data Quality and Data Governance processes for their big data platforms. BDM has a built-in Smart Executor that supports several processing engines: Blaze, Spark and Hive on Map Reduce. It can be used to ingest data into a Hadoop cluster, process data on the cluster and extract data from it. In Blaze mode, the Informatica mapping is processed by Blaze™, Informatica’s native engine, which runs as a YARN-based application. In Spark mode, Informatica mappings are translated into Scala code; in Hive on Map Reduce mode, they are translated into Map Reduce code and executed natively on the Hadoop cluster.

Informatica BDM is compatible with Cloudera Hadoop clusters in all related aspects, including Cloudera’s default authorization system, Sentry. Sentry can be used to enforce role-based authorization for both data and metadata stored in a Cloudera Hadoop cluster. This document explains in detail how each of Informatica BDM’s processing engines works with Sentry.

Authentication

Authentication is the process of reliably verifying that a user is who they claim to be. Kerberos is the widely accepted authentication mechanism on Hadoop platforms, including Cloudera Hadoop clusters. The Kerberos protocol relies on a Key Distribution Center (KDC), a network service that issues tickets permitting access. Informatica BDM supports Kerberos authentication with both Active Directory and MIT-based Key Distribution Centers, and in all of its modes of execution.
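As a small illustration of what Kerberos means for a client of the cluster, the sketch below builds the JDBC URL that a HiveServer2 client typically uses on a Kerberized cluster, where the server's Kerberos principal is appended as a `principal=` session parameter. The helper name `hive_jdbc_url`, the host `hs2.example.com` and the realm `EXAMPLE.COM` are hypothetical placeholders, and this is not a description of how Informatica BDM builds its own connections.

```python
def hive_jdbc_url(host: str, port: int, database: str, service_principal: str) -> str:
    """Build a HiveServer2 JDBC URL for a Kerberos-secured cluster.

    service_principal is the HiveServer2 service principal, in the
    usual Kerberos form service/host@REALM.
    """
    if "/" not in service_principal or "@" not in service_principal:
        raise ValueError("expected a principal of the form service/host@REALM")
    return f"jdbc:hive2://{host}:{port}/{database};principal={service_principal}"


# Hypothetical host and realm, for illustration only.
url = hive_jdbc_url(
    "hs2.example.com", 10000, "default", "hive/hs2.example.com@EXAMPLE.COM"
)
print(url)
```

The client still needs a valid Kerberos ticket (for example from a keytab) before this URL is usable; the URL only tells the driver which service principal to authenticate against.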

Authorization

Authorization is the process of determining whether a user is permitted to perform a given operation on a system. In Hadoop clusters, authorization ensures that users access only the data a Hadoop administrator has allowed them to access. When working with HDFS sources/targets, it is recommended to turn on HDFS ACL synchronization in Sentry.
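To make the idea of role-based authorization concrete, here is a minimal sketch of the kind of check such a system performs: groups are mapped to roles, roles carry privileges on databases and tables, and a request is allowed only if one of the user's groups reaches a matching privilege. The `Privilege` and `Policy` classes and the sample names (`analysts`, `sales_reader`, `sales`) are invented for illustration and are not Sentry's API or data model in full.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Privilege:
    action: str        # "SELECT", "INSERT" or "ALL"
    database: str
    table: str = "*"   # "*" stands for every table in the database


@dataclass
class Policy:
    # group name -> role names granted to that group
    group_roles: Dict[str, Set[str]] = field(default_factory=dict)
    # role name -> privileges granted to that role
    role_privileges: Dict[str, Set[Privilege]] = field(default_factory=dict)

    def is_allowed(self, user_groups: Iterable[str], action: str,
                   database: str, table: str) -> bool:
        """Return True if any role of any group grants the requested action."""
        for group in user_groups:
            for role in self.group_roles.get(group, ()):
                for priv in self.role_privileges.get(role, ()):
                    if (priv.action in (action, "ALL")
                            and priv.database == database
                            and priv.table in (table, "*")):
                        return True
        return False


# Hypothetical policy: analysts may read every table in the sales database.
policy = Policy(
    group_roles={"analysts": {"sales_reader"}},
    role_privileges={"sales_reader": {Privilege("SELECT", "sales")}},
)
can_read = policy.is_allowed(["analysts"], "SELECT", "sales", "orders")
can_write = policy.is_allowed(["analysts"], "INSERT", "sales", "orders")
print(can_read, can_write)
```

The sketch denies by default: a request that matches no privilege is rejected, which is the behavior an administrator generally wants from an authorization layer.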

Blaze

When a mapping runs on Informatica Blaze, the optimizer first calls HDFS Service / Hive Server 2 to fetch metadata such as a Hive table’s partitioning details. The job is then submitted to the Blaze runtime. The illustration below shows how Blaze interacts with HDFS Service / Hive Server 2. When an Informatica mapping is executed in Blaze mode with Hive sources/targets, a call is made to the Hive Metastore to determine the structure of the table(s). The optimizer processes this information and passes an optimized mapping to the Blaze runtime, which loads it into memory. The mapping then interacts with the corresponding HDFS Service / Hive Server 2 to read or write data. HDFS Service / Hive Server 2 is itself integrated with Sentry and ensures that authorization takes place before the request is served. The Blaze mode of execution is available starting with version 10.0.

[Figure: Blaze interaction with HDFS Service / Hive Server 2]

Spark

Informatica BDM can execute mappings as Spark Scala code on the Hadoop cluster. The illustration below details the steps involved in Spark execution mode. In this mode, the Spark executor calls the Hive Metastore (if Hive sources/targets are involved) to determine the structure of the table(s). The optimizer processes this information and translates the Informatica mapping into optimized Spark Scala code, which is then submitted to YARN for execution. When the Spark code accesses data, the corresponding HDFS Service / Hive Server 2 relies on Sentry for authorization.

[Figure: Steps involved in Spark execution mode]

Map Reduce

Informatica BDM can execute mappings as Map Reduce code on the Hadoop cluster. The illustration below details the steps involved in Hive on Map Reduce mode.

[Figure: Steps involved in Hive on Map Reduce mode]

When a mapping is executed in Hive on Map Reduce mode, the Hive executor calls the Hive Metastore (if Hive sources/targets are involved) to determine the structure of the table(s). The optimizer processes this information, translates the Informatica mapping into Map Reduce code and submits the job to the Hadoop cluster. As the Map Reduce job interacts with HDFS Service / Hive Server 2, those services authorize its requests with Sentry.
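The pattern common to the Blaze, Spark and Map Reduce modes is that the engine does not decide access itself: it sends read and write requests to the data service, and the service authorizes each request before serving it. The sketch below models that division of responsibility with a toy in-memory service. `DataService`, `run_mapping` and the sample user and table names are hypothetical, and the grants are flattened to (user, action, table) tuples for brevity, whereas Sentry grants privileges to roles rather than directly to users.

```python
class AuthorizationError(Exception):
    """Raised when the data service rejects a request."""


class DataService:
    """Toy stand-in for HDFS Service / Hive Server 2 with an authorization check."""

    def __init__(self, tables, grants):
        self._tables = tables   # table name -> list of rows
        self._grants = grants   # set of (user, action, table) tuples

    def _authorize(self, user, action, table):
        # Authorization happens inside the service, before any data is touched.
        if (user, action, table) not in self._grants:
            raise AuthorizationError(f"{user} may not {action} {table}")

    def read(self, user, table):
        self._authorize(user, "SELECT", table)
        return list(self._tables[table])

    def write(self, user, table, rows):
        self._authorize(user, "INSERT", table)
        self._tables.setdefault(table, []).extend(rows)


def run_mapping(service, user, source, target, transform):
    """Toy 'engine': read the source, transform each row, write the target.

    It contains no permission logic; every request goes through the service.
    """
    rows = service.read(user, source)
    service.write(user, target, [transform(row) for row in rows])


service = DataService(
    tables={"orders": [1, 2, 3]},
    grants={
        ("etl_user", "SELECT", "orders"),
        ("etl_user", "INSERT", "orders_doubled"),
        ("etl_user", "SELECT", "orders_doubled"),
    },
)
run_mapping(service, "etl_user", "orders", "orders_doubled", lambda row: row * 2)
result = service.read("etl_user", "orders_doubled")
print(result)
```

Because the check lives in the service, swapping one engine for another leaves the security behavior unchanged, which is the property the three execution modes above rely on.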

Summary

Informatica BDM is compatible with Sentry in the Blaze, Spark and Map Reduce modes of execution. Its Smart Executor enables organizations to run Informatica mappings seamlessly in one or more modes of execution within their existing security setup.
