Securing Sensitive Information in the Big Data World

The last year has seen many improvements in big data security.  Some are available natively as part of the Apache framework and the various Hadoop distributions, though different distributions offer different levels of security capability.  Additional security is offered through distribution-neutral software solutions, like Informatica Data Security.  Let’s review the following security categories and the available solutions:

    • Sensitive data discovery & classification
    • Analysis of sensitive data proliferation
    • Authentication
    • Authorization
    • Advanced data protection
    • Auditing
    • Sensitive data risk analytics

Sensitive Data Discovery & Classification

Informatica Secure@Source automates the discovery and classification of sensitive data on Hive.  Secure@Source also identifies the protection status of the data.
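To make the idea of discovery and classification concrete, here is a minimal, illustrative sketch (not Secure@Source’s actual mechanism) of classifying a sampled Hive column by matching values against patterns for sensitive data classes; the two patterns shown are assumptions, and real tools ship far richer rule sets:

```python
import re

# Illustrative patterns for two common sensitive data classes (assumptions).
PATTERNS = {
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
}

def classify_column(samples, threshold=0.8):
    """Label a column with a sensitive-data class if enough
    sampled values match one of the patterns."""
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in samples if pattern.match(v))
        if samples and hits / len(samples) >= threshold:
            return label
    return None

print(classify_column(["123-45-6789", "987-65-4321", "111-22-3333"]))  # SSN
print(classify_column(["alice@example.com", "bob@example.org"]))       # EMAIL
```

A real classifier would also weigh column names and metadata, not just value patterns, and would sample rather than scan full tables.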

Analysis of Sensitive Data Proliferation

Informatica Secure@Source analyzes the proliferation of sensitive data into and out of Hive through Informatica Big Data Integration and Management.  In the future, it will also integrate proliferation information from Cloudera Navigator.  As sensitive data proliferates, the threat surface grows, and the risk of a sensitive data breach and potential exposure increases.
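Proliferation analysis boils down to following data lineage: every downstream copy of a sensitive data set widens the threat surface.  A minimal sketch, using hypothetical lineage edges (the store names and the idea of deriving edges from integration metadata are assumptions):

```python
from collections import deque

# Hypothetical lineage edges: source data store -> downstream targets.
lineage = {
    "hive.customers": ["hdfs./staging/customers", "hive.marketing_extract"],
    "hive.marketing_extract": ["hdfs./exports/campaign"],
    "hdfs./staging/customers": [],
    "hdfs./exports/campaign": [],
}

def proliferation(store):
    """Return every downstream store reachable from `store`.
    Each reachable copy is another place the sensitive data lives."""
    seen, queue = set(), deque(lineage.get(store, []))
    while queue:
        s = queue.popleft()
        if s not in seen:
            seen.add(s)
            queue.extend(lineage.get(s, []))
    return seen

print(sorted(proliferation("hive.customers")))
```

Here the customer table has proliferated to three downstream stores, each of which would need the same protection as the original.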


Authentication

Authentication on Hadoop is generally delivered through Kerberos.  In addition, Apache Knox provides centralized authentication for Hadoop services.  It integrates with LDAP, Active Directory, and identity management / cloud single sign-on providers.  It covers the Hive, HBase, HDFS, Oozie, and HCatalog services.
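Knox authentication is wired up in a gateway topology descriptor.  The sketch below shows the general shape of an LDAP-backed configuration; the host names, port, DN template, and service URL are placeholder assumptions, so consult the Knox documentation for your version before use:

```xml
<topology>
  <gateway>
    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      <param>
        <name>main.ldapRealm</name>
        <value>org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm</value>
      </param>
      <param>
        <name>main.ldapRealm.userDnTemplate</name>
        <value>uid={0},ou=people,dc=example,dc=com</value>
      </param>
      <param>
        <name>main.ldapRealm.contextFactory.url</name>
        <value>ldap://ldap.example.com:389</value>
      </param>
      <param>
        <name>urls./**</name>
        <value>authcBasic</value>
      </param>
    </provider>
  </gateway>
  <service>
    <role>WEBHDFS</role>
    <url>http://namenode.example.com:50070/webhdfs</url>
  </service>
</topology>
```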


Authorization

There are multiple layers and levels of authorization available on Hadoop.  Service-level authorization is delivered by Apache Knox.  Apache Sentry (still in incubation) provides role-based access control at the server, database, table, and view levels (select, insert, and transform operations) for Hive and Impala.  Cloudera also recently introduced RecordService (in beta) for fine-grained (row- and column-level permissions) unified access-control enforcement across storage frameworks, including HDFS and HBase, and compute frameworks, including Spark, MapReduce, Hive, and Impala.  RecordService enforces security on the read path and leverages existing Apache Sentry permissions.  Apache Ranger also provides central policy management to control access to files, folders, databases, tables, or columns on HDFS, Hive, and HBase, as well as Knox, Solr, Kafka, and YARN.  Ranger supports multiple authorization methods, including role-based access control (RBAC) and attribute-based access control (ABAC).
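At its core, the role-based model that Sentry and Ranger apply reduces to: a user holds roles, a role holds grants of (action, object), and a request is allowed only if some role covers it.  A minimal sketch (the roles, users, and objects are illustrative, not any product’s actual policy format):

```python
# Minimal RBAC sketch: roles grant (action, object) pairs, users hold roles.
ROLE_GRANTS = {
    "analyst": {("select", "sales.orders"), ("select", "sales.customers")},
    "etl_job": {("insert", "sales.orders")},
}
USER_ROLES = {"maria": {"analyst"}, "batch_svc": {"etl_job"}}

def is_allowed(user, action, obj):
    """Allow if any of the user's roles grants (action, object)."""
    return any((action, obj) in ROLE_GRANTS.get(r, set())
               for r in USER_ROLES.get(user, set()))

print(is_allowed("maria", "select", "sales.orders"))   # True
print(is_allowed("maria", "insert", "sales.orders"))   # False
```

ABAC extends this by evaluating attributes of the user, object, and context (for example, data classification tags) instead of fixed grant lists.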

Advanced Protection

Cloudera offers filesystem-based encryption to secure data in HDFS files, HBase records, Hive metadata, and audit logs (both at rest and in transit) as part of Navigator.  Secure key management is available as part of Cloudera Security.

Hortonworks offers HDFS data encryption and a key management system through Apache Ranger.

Informatica Big Data Management with Informatica Persistent Data Masking de-identifies sensitive data in Hive and in text files while maintaining the original characteristics of the data for testing and analytics.  Sensitive data can be masked on ingestion into Hadoop, or in place once the data has landed in Hadoop.


Auditing

Cloudera enables auditing of data access on HDFS, Impala, Hive, HBase, and Sentry with Navigator.

Hortonworks supports auditing of policy updates and data access on HDFS, HBase, and Hive, as well as Knox, Kafka, YARN, Solr, and Storm, through Apache Ranger.

Sensitive Data Risk Analytics

Informatica Secure@Source correlates information about the existence and volume of sensitive data, its protection status, the level of sensitive data proliferation, how many people have access to the data, who is actually accessing it, and the cost if the sensitive data were exposed.  From this, it calculates a risk score for data stores, regions, and departments to assess the overall risk to the organization.  Secure@Source identifies the highest-risk data stores and groups so that remediation efforts to secure the sensitive data can be prioritized.  It also detects high-risk conditions, such as when sensitive data moves outside of a highly regulated country or is accessed by a user outside of its country of residency.  With Secure@Source, organizations can also track their progress in reducing the overall sensitive data risk.
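The idea of a composite risk score can be sketched in a few lines.  The weights, scale, and factors below are assumptions for illustration, not Secure@Source’s actual model:

```python
# Toy risk score combining the factors described above (weights are assumed).
def risk_score(records, protected, downstream_copies, users_with_access,
               cost_per_record):
    exposure = records * cost_per_record           # worst-case breach cost
    protection_factor = 0.1 if protected else 1.0  # protected data scores lower
    spread = 1 + downstream_copies                 # proliferation widens surface
    access = 1 + users_with_access / 100           # more access, more risk
    return exposure * protection_factor * spread * access

stores = {
    "hr_hive":    risk_score(50_000, protected=False, downstream_copies=3,
                             users_with_access=40, cost_per_record=0.5),
    "sales_hive": risk_score(200_000, protected=True, downstream_copies=1,
                             users_with_access=10, cost_per_record=0.5),
}
# Rank stores so remediation can target the riskiest first.
print(max(stores, key=stores.get))  # hr_hive
```

Note that the smaller, unprotected, widely proliferated store outranks the larger protected one, which is exactly the prioritization effect described above.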

The table below summarizes the security capabilities available natively on Hadoop and those that are available from distribution-neutral data security software providers, such as Informatica.


Security is starting to be top of mind for big data environments.  New security features from the Hadoop distributors as well as independent software vendors like Informatica are starting to provide sensitive data discovery, risk analytics, authentication, authorization, audit, and advanced protection capabilities for various data formats and services.  Expect more capabilities to come in the near future.