Hadoop Security: Part 6 of Hadoop Series

Security remains a work in progress for the Apache Hadoop project and its sub-projects, as I discuss in an O’Reilly Hadoop tutorial, “Get started with Hadoop: from evaluation to your first production cluster.” Below are several of the security tips and best practices covered in that tutorial.

Earlier versions of the Hadoop Distributed File System (HDFS) did not provide robust user authentication. A user with a correct password could access the cluster, and beyond the password nothing verified that users were who they claimed to be. Now you can enable user authentication for HDFS with the Kerberos network authentication protocol. Hadoop’s Kerberos integration also adds delegation tokens, job tokens, and block access tokens.
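
As a rough illustration of what enabling Kerberos looks like in configuration, the Python sketch below reads a core-site.xml fragment and reports whether Kerberos authentication and service-level authorization are switched on. The property names hadoop.security.authentication and hadoop.security.authorization are the ones used by security-enabled Hadoop releases. The sample XML and the helper names read_properties and security_summary are assumptions made for this example and are not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml fragment for a security-enabled cluster.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a dict of name -> value from a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def security_summary(props):
    """Report the two core security switches, using Hadoop's defaults
    ("simple" authentication, authorization off) when a property is absent."""
    auth = props.get("hadoop.security.authentication", "simple").lower()
    authz = props.get("hadoop.security.authorization", "false").lower()
    return {
        "kerberos_authentication": auth == "kerberos",
        "service_authorization": authz == "true",
    }
```

Calling security_summary(read_properties(SAMPLE_CORE_SITE)) on the sample reports both switches as enabled. A real secure deployment also needs Kerberos principals and keytabs for each daemon, which this sketch does not check.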

Kerberos authentication is a welcome addition, but by itself it does not bring Hadoop up to enterprise-grade security. As noted by Andrew Becherer in a presentation and white paper for the BlackHat USA 2010 security conference, “Hadoop Security Design: Just add Kerberos? Really?”, the remaining security weak points include:

  • Symmetric cryptographic keys are widely distributed.
  • Some web tools for the JobTracker, TaskTracker, nodes, and Oozie rely on a pluggable web user interface (UI) with static user authentication. A JIRA issue (HADOOP-7119), which is patch-available, adds a SPNEGO-based web authentication plugin.
  • Some implementations that provide an HTTP front end for HDFS rely on proxy IP addresses and a database of roles to authorize access for bulk data transfer.

Given the current state of Hadoop security, even if your Hadoop cluster will operate in a local- or wide-area network behind an enterprise firewall, you may want to consider a cluster-specific firewall to more fully protect non-public data that resides in the cluster. In this deployment model, think of the Hadoop cluster as an island within your IT infrastructure: for every bridge to that island, you should consider an edge node for security.

If necessary, you may also want to consider separate clusters. One of Hadoop’s key selling points is its ability to serve as a “grab bag” for data from disparate sources. However, particularly sensitive data, such as complete customer records with Social Security number, name, address, and credit card number, may not be a good fit for storage in a large multi-use cluster. Because HDFS replicates data across nodes, it is difficult to restrict access to subsets of the data unless you set up a separate cluster with its own access rights.

In addition to requiring authentication for access to HDFS and MapReduce, you may need further steps to secure other Hadoop tools, or third-party applications outside the cluster that are authorized and authenticated to access it. For example, if you choose to use the Oozie workflow manager, Oozie becomes an approved “super user” that can perform actions on behalf of any Hadoop user. Accordingly, if you decide to adopt Oozie, you should consider an additional authentication mechanism to plug into Oozie. Yahoo offers an explanation of how to set up the Oozie workflow manager for Hadoop Kerberos authentication.
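
One way to see how far Oozie’s “super user” reach extends is to look at Hadoop’s proxy-user settings, which list the hosts a super user may connect from and the groups whose members it may impersonate. The Python sketch below flags settings that are missing or left wide open. It assumes the Oozie service runs as a user named oozie, so the properties are hadoop.proxyuser.oozie.hosts and hadoop.proxyuser.oozie.groups. The function name proxyuser_findings and the example values are hypothetical.

```python
def proxyuser_findings(props, user="oozie"):
    """Return a list of warnings about the proxy-user settings for `user`.

    `props` is a dict of Hadoop configuration names to values, for example
    the contents of core-site.xml. The account name is an assumption.
    """
    findings = []
    for suffix in ("hosts", "groups"):
        key = "hadoop.proxyuser.%s.%s" % (user, suffix)
        value = props.get(key)
        if value is None:
            findings.append("%s is not set" % key)
            continue
        entries = [entry.strip() for entry in value.split(",")]
        if "*" in entries:
            # A wildcard places no restriction on hosts or groups.
            findings.append("%s is a wildcard" % key)
    return findings


# Hypothetical configuration: hosts restricted, groups left wide open.
example = {
    "hadoop.proxyuser.oozie.hosts": "oozie-host.example.com",
    "hadoop.proxyuser.oozie.groups": "*",
}
print(proxyuser_findings(example))
```

An empty list means both settings are present and neither is a wildcard. It does not mean the listed hosts and groups are appropriate, which still needs a human review.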

You may be able to address some of the above security concerns using paid software. For example, Cloudera Enterprise 3.5 includes tools that help simplify configuration of the security features available in current-generation Hadoop.

With a platform approach, you manage access to Hadoop clusters from within your data integration platform. Instead of allowing direct access to HDFS or to tools such as Oozie, you require all Hadoop access to go through the platform. This lets you reuse the security assignments you have already built for roles and groups, and continue to use more granular security controls that current-generation Hadoop may not support.

Regardless of which approaches you take for Hadoop security, verify that Hadoop web and API interfaces are controlled (or in some cases disabled) so that you have not created the potential for “back door” access.
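
A simple way to start that verification is to request each web interface without credentials and see what comes back. The Python sketch below is a minimal example under stated assumptions: the verdict labels and the function names classify, probe, and audit are invented for illustration, and the list of URLs to check is something you would supply for your own cluster.

```python
import urllib.error
import urllib.request


def classify(status):
    """Map an HTTP status code (or None for no response) to a coarse verdict."""
    if status is None:
        return "unreachable"   # blocked, disabled, or host not listening
    if status in (401, 403):
        return "protected"     # the server demanded credentials
    if 200 <= status < 400:
        return "open"          # content served to an anonymous request
    return "other"


def probe(url, timeout=5.0):
    """Send one unauthenticated GET and return the HTTP status, or None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions that carry the code.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def audit(urls):
    """Return a dict of url -> verdict for each interface to check."""
    return {url: classify(probe(url)) for url in urls}
```

For example, audit(["http://namenode.example.com:50070/"]) would check a NameNode web UI on a hypothetical host. Any interface reported as "open" deserves a closer look. Run checks like this only against clusters you are authorized to test.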
