Data Lake Management: Maintaining Data Lake Order with “Fish Management”
Every data lake is dynamic, and no two are exactly alike, which makes managing one a daunting task. In prior blog posts, we learned what’s in the lake and introduced a variety of fish to it. Now we need to manage the lake efficiently: keeping the water quality good, keeping weeds under control, and keeping out unwanted inhabitants, since all of these elements affect the fish population.
As data is generated from myriad sources (databases, IoT devices, third-party data, and social media), Big Data Governance & Quality becomes another key component of big data management, with the following capabilities:
- Data Profiling helps data analysts understand the structure, completeness, and relevance of data to ensure consistency enterprise-wide while identifying and remediating bad data.
- Data Quality is the ability to cleanse, standardize, and enrich data in the Data Lake using data quality rules and techniques.
- Data Matching is the ability to match and link duplicate records within and across multiple sources to create a single view of a data entity. Some data matching products, such as Informatica’s, also provide algorithms that discover relationships, for example for customer householding.
- Master Data Management is the creation of a single authoritative master reference record for all critical business data, leading to fewer errors and less redundancy in business processes.
- Data Retention & Lifecycle Management is the ability to automatically archive redundant and legacy applications for compliance and to control costs while maintaining necessary access to inactive data.
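To make the first of these capabilities concrete, here is a minimal data-profiling sketch (a hypothetical illustration, not any vendor’s API) that reports completeness, distinct counts, and the most frequent value per column — the kind of summary a data analyst would use to spot bad data:

```python
from collections import Counter

def profile(records, columns):
    """Minimal data-profiling sketch: completeness, distinct count,
    and most common value for each requested column."""
    total = len(records)
    report = {}
    for col in columns:
        values = [r.get(col) for r in records]
        non_null = [v for v in values if v not in (None, "")]
        report[col] = {
            "completeness": len(non_null) / total if total else 0.0,
            "distinct": len(set(non_null)),
            "top_value": Counter(non_null).most_common(1)[0][0] if non_null else None,
        }
    return report

# Toy customer records with a missing email and a missing state.
customers = [
    {"id": "1", "email": "a@example.com", "state": "CA"},
    {"id": "2", "email": "", "state": "CA"},
    {"id": "3", "email": "b@example.com", "state": None},
]
print(profile(customers, ["id", "email", "state"]))
```

A real profiler would also infer types, value distributions, and cross-column relationships, but the idea is the same: measure the data before trusting it.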
To prevent unwanted access to the lake, security measures are put in place to ensure the data is protected.
While Big Data Governance & Quality focuses on ensuring data is trustworthy, Big Data Security governs access to data throughout its lifecycle. Big Data Security is the process of analyzing data risk, which includes discovering, identifying, and classifying data, as well as analyzing its risk based on value, location, protection, and proliferation. Big Data Security includes the following capabilities:
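The discover-and-classify step can be sketched simply. The patterns and labels below are illustrative assumptions, not a real classification catalog; a production scanner would use far richer detectors:

```python
import re

# Hypothetical patterns for discovering sensitive values in raw records.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify(records):
    """Scan every string value and tag each column with the
    sensitive-data labels found in it."""
    tags = {}
    for record in records:
        for col, value in record.items():
            for label, pattern in PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    tags.setdefault(col, set()).add(label)
    return tags

rows = [{"contact": "a@example.com", "note": "SSN 123-45-6789"}]
print(classify(rows))
```

Once columns are tagged, the risk analysis can weigh those tags against where the data lives and who can reach it.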
- Data Masking is the ability to de-identify and de-sensitize sensitive data when used for support, analytics, testing, or outsourcing.
- Data Encryption is the ability to encrypt data using algorithms with an encryption key, rendering it unreadable to any user who doesn’t have the decryption key.
- Authentication is used to determine whether a user has access to the Data Lake, while authorization determines whether that user has permission to access specific data or types of data within the Data Lake.
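As a small illustration of data masking, here is a hedged sketch (the function names and salted-hash approach are assumptions for this example, not a description of any product) that de-identifies email addresses while preserving the domain so analytics still work:

```python
import hashlib

def mask_email(email, salt="demo-salt"):
    """De-identify an email: hash the local part with a salt,
    keep the domain so aggregate analytics remain possible."""
    local, _, domain = email.partition("@")
    digest = hashlib.sha256((salt + local).encode()).hexdigest()[:10]
    return f"{digest}@{domain}"

def mask_record(record, sensitive=("email",)):
    """Return a copy of the record with the sensitive fields masked."""
    return {k: (mask_email(v) if k in sensitive else v) for k, v in record.items()}

print(mask_record({"id": "42", "email": "jane.doe@example.com", "state": "CA"}))
```

The same pattern extends to other fields: choose a transformation that destroys the identifying detail but preserves whatever shape downstream consumers need.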
I’ve introduced you to the concept of Data Lake Management and covered the three pillars of big data management: big data integration, big data governance and quality, and big data security. In the next blog post, I’ll talk about intelligent data applications, which provide a self-service collaborative platform for data analysts, data scientists, data stewards, and data architects to discover, catalog, prepare, and secure data in the data lake.
Update [11/29]: We’ve recently published a reference architecture to help you fish for greater insights; check out the technical reference architecture paper at http://infa.media/2gDyhmL