Don’t Get Trampled by Wild Elephants, Make Sure Your Big Data is Protected!

Protect your big data

Although many organizations use Hadoop-based big data projects to predict the next best thing for their customers or to reduce customer churn, security is too often treated as an afterthought, tacked on to a project’s implementation.

Lately, it appears that many companies’ security infrastructure has been breached. Some companies know they have been breached, while others are completely unaware of it! As more big data projects go into production, the security breaches involving these projects will be big too, with the potential for even more serious reputational damage and legal repercussions than at present.

A growing number of companies are using big data technology to store and analyze petabytes of data including web and machine logs, click stream data, transaction data over longer time horizons, mobile data, and social media content to gain better insights about their customers and their business. As a result, big data security becomes much more critical.

This blog post describes my view of what big data security needs to entail. I divide big data security into the following components:

  • Infrastructure Level Security
  • Data Security

In this post, I will focus on infrastructure level security and discuss data security in a future blog post.

Infrastructure security entails protecting your big data infrastructure against malicious access. The requirements are similar to those for conventional data environments and need to be closely aligned to specific business needs.

The following unique challenges arise with big data security:

  1. Data Lake: As massive amounts of raw data are staged in Hadoop, it often becomes a data lake or data hub that feeds downstream analytic applications. The first challenge is to make sure that the data is available only to those who have a legitimate business need to examine or interact with it.
  2. Multi-tenant: As multiple applications access the data, each application might need an API to protect itself from unauthorized access.
  3. Data Variety: Different kinds of data from mobile devices and social networks exponentially increase both the amount of data and the opportunities for security threats. Therefore, organizations might require a multi-perimeter approach to security.

Keeping these challenges in mind, the open source community and Hadoop vendors have jointly come up with the following layers of infrastructure security:

  1. Authentication: Who am I?
  2. Authorization: What can I do?
  3. Audit: What did I do?
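
To make the three questions concrete, here is a minimal Python sketch of how the layers might fit together on a single access request. It is an illustration only, not how any Hadoop distribution implements security: the in-memory `CREDENTIALS` and `PERMISSIONS` tables, the `AuditLog` class, and the `access` function are all hypothetical names made up for this example. In a real cluster, each step would be delegated to the platform's own security services.

```python
import hmac
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical in-memory stores, used only for illustration.
CREDENTIALS = {"alice": "s3cret", "bob": "hunter2"}
PERMISSIONS = {
    "alice": {("/data/sales", "read"), ("/data/sales", "write")},
    "bob": {("/data/sales", "read")},
}


@dataclass
class AuditLog:
    """Audit: What did I do? Keeps a record of every access attempt."""
    entries: list = field(default_factory=list)

    def record(self, user, action, resource, allowed):
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "resource": resource,
            "allowed": allowed,
        })


def authenticate(user, password):
    """Authentication: Who am I? Checks the claimed identity."""
    expected = CREDENTIALS.get(user)
    return expected is not None and hmac.compare_digest(expected, password)


def authorize(user, resource, action):
    """Authorization: What can I do? Checks the requested permission."""
    return (resource, action) in PERMISSIONS.get(user, set())


def access(user, password, resource, action, log):
    """Run one request through all three layers, in order."""
    allowed = authenticate(user, password) and authorize(user, resource, action)
    log.record(user, action, resource, allowed)  # denied attempts are logged too
    return allowed


if __name__ == "__main__":
    log = AuditLog()
    print(access("alice", "s3cret", "/data/sales", "write", log))
    print(access("bob", "hunter2", "/data/sales", "write", log))
    for entry in log.entries:
        print(entry)
```

Note that the sketch records denied requests as well as granted ones, since failed attempts are often the most interesting entries in an audit trail.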

As with any Hadoop project, there is a “zoo of technologies” available in the big data world to address big data security.

The following diagram describes how I’ve simplified infrastructure level security:


Technologies highlighted in green are primarily backed by Hortonworks; those in blue are primarily backed by Cloudera. Some of these technologies, such as Kerberos, are widely adopted across Hadoop distributions.

Any application running on big data needs to work with these security technologies. Before picking any application, whether or not it is a data integration application, make sure that it works with infrastructure level security.

Informatica’s Big Data Edition product integrates with every component of infrastructure level security. You can find more details in my blog post on the Hortonworks website:

Big Data Edition’s security integration is also available for Cloudera and other distributions.

Summary: Infrastructure security is the first layer of security in the big data world. It is important for any application running on top of the big data platform to integrate with it.

Stay tuned to read my next security blog post on Data Security.