This post was written by guest author Dale Kim, Director of Industry Solutions at MapR Technologies, a valued Informatica partner that provides a distribution for Apache Hadoop that ensures production success for its customers.
Apache Hadoop is growing in popularity as the foundation for an enterprise data hub. An Enterprise Data Hub (EDH) extends and optimizes the traditional data warehouse model by adding complementary big data technologies. It focuses your data warehouse on high value data by reallocating less frequently used data to an alternative platform. It also aggregates data from previously untapped sources to give you a more complete picture of data.
So you have your data, your warehouses, your analytical tools, your Informatica products, and you want to deploy an EDH… now what about Hadoop?
Requirements for Hadoop in an Enterprise Data Hub
Let’s look at characteristics required to meet your EDH needs for a production environment:
You already expect these from your existing enterprise deployments. Shouldn’t you hold Hadoop to the same standards? Let’s discuss each topic:
Enterprise-grade is about the features that keep a system running, i.e., high availability (HA), disaster recovery (DR), and data protection. HA helps a system run even when components (e.g., computers, routers, power supplies) fail. In Hadoop, this means no downtime and no data loss, but also no work loss. If a node fails, you still want jobs to run to completion. DR with remote replication or mirroring guards against site-wide disasters. Mirroring needs to be consistent to ensure recovery to a known state. Using file copy tools won’t cut it. And data protection, using snapshots, lets you recover from data corruption, especially from user or application errors. As with DR replicas, snapshots must be consistent, in that they must reflect the state of the data at the time the snapshot was taken. Not all Hadoop distributions can offer this guarantee.
Hadoop interoperability is an obvious necessity. Features like a POSIX-compliant, NFS-accessible file system let you reuse existing, file system-based applications on Hadoop data. Support for existing tools lets your developers get up to speed quickly. And integration with REST APIs enables easy, open connectivity with other systems.
You should be able to logically divide clusters to support different use cases, job types, user group, and administrators as needed. To avoid a complex, multi-cluster setup, choose a Hadoop distribution with multi-tenancy capabilities to simplify the architecture. This gives you less risk for error and no data/effort duplication.
Security should be a priority to protect against the exposure of confidential data. You should assess how you’ll handle authentication (with or without Kerberos), authorization (access controls), over-the-network encryption, and auditing. Many of these features should be native to your Hadoop distribution, and there are also strong security vendors that provide technologies for securing Hadoop.
Any large scale deployment needs fast read, write, and update capabilities. Hadoop can support the operational requirements of an EDH with integrated, in-Hadoop databases like Apache HBase™ and Accumulo™, as well as MapR-DB (the MapR NoSQL database). This in-Hadoop model helps to simplify the overall EDH architecture.
Using Hadoop as a foundation for an EDH is a powerful option for businesses. Choosing the correct Hadoop distribution is the key to deploying a successful EDH. Be sure not to take shortcuts – especially in a production environment – as you will want to hold your Hadoop platform to the same high expectations you have of your existing enterprise systems.