Data Lake Management: Do You Know the Type of “Fish” You Caught?
“Different types of fish live in a community and when you understand their relationship to each other, you have a better chance of catching what you want.” (thompsonadvertisinginc.com)
You navigated your way to the lake, read up on the fundamentals of fish management, and were introduced to the data lake management principles. As you drove up to the lake, you were thinking about what type of fish you’re looking to catch: are you after a trophy fish, or a fish to eat? Without knowing the type of fish you want to catch, you don’t know how far offshore you have to boat for a catch.
The primary charter for any data lake initiative is the ability to catalog all the data, enterprise-wide regardless of form (variety) or where it’s stored, whether on Hadoop, NoSQL or an enterprise data warehouse, along with the associated business, technical, and operational metadata. To carry on with our analogy, cataloging fish into off-shore, near-shore or bottom fish can determine the type of fish you catch and how far out you go fishing.
The catalog must enable business analysts, data architects, and data stewards to easily search and discover data assets, data set patterns, data domains, data lineage and understand the relationships between data assets – a 360 degree view of the data. A catalog provides advanced discovery capabilities, smart tagging, data set recommendations, metadata versioning, a comprehensive business glossary, and drill down to finer grained metadata.
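To make the catalog’s role concrete, here is a minimal, hypothetical sketch of a catalog entry that carries the three metadata facets (business, technical, and operational) plus tags for discovery. The `DataAsset` and `Catalog` names and fields are illustrative assumptions, not part of any specific product:

```python
from dataclasses import dataclass, field

@dataclass
class DataAsset:
    """One cataloged data set with its three metadata facets (illustrative)."""
    name: str
    business: dict = field(default_factory=dict)     # e.g. glossary term, data owner
    technical: dict = field(default_factory=dict)    # e.g. source system, file format
    operational: dict = field(default_factory=dict)  # e.g. refresh cadence
    tags: set = field(default_factory=set)           # smart tags for discovery

class Catalog:
    """A toy in-memory catalog supporting tag-based search and discovery."""
    def __init__(self):
        self.assets = []

    def register(self, asset: DataAsset):
        self.assets.append(asset)

    def search(self, tag: str):
        """Return all assets carrying the given tag."""
        return [a for a in self.assets if tag in a.tags]

catalog = Catalog()
catalog.register(DataAsset(
    name="customer_orders",
    business={"owner": "Sales Ops", "glossary_term": "Order"},
    technical={"source": "Hadoop", "format": "Parquet"},
    operational={"refresh": "daily"},
    tags={"customer", "pii"},
))
print([a.name for a in catalog.search("pii")])  # ['customer_orders']
```

A real catalog adds semantic search, lineage graphs, and versioned metadata on top of this basic register-and-search pattern.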
Universal metadata services manage the vast amounts of metadata from a variety of data sources, both traditional and big data. They provide the following capabilities:
- Metadata Management – Effectively governed metadata provides a transparent view into the data provenance through end-to-end data lineage, the ability to perform impact analysis, a common business vocabulary and accountability for its terms and definitions, and finally an audit trail for compliance. The management of metadata becomes an important capability to oversee changes while delivering trusted, secure data.
- Data Index – provides the capability to enable data analysts to quickly find data through semantic search and understand data relationships. The data index empowers data analysts to easily collaborate by tagging, annotating and sharing data assets and contribute their knowledge about the data back to the data index.
- Data Discovery – allows analytic teams to find data patterns, trends, relationships, and anomalies or business scenario occurrences across all data sources. For example, the data security team may need to identify where personally identifiable information (PII) is used and how that relates to specific business rules/processes. Data masking can then be used to obfuscate sensitive data when dealing with customer PII.
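As a simple illustration of the data masking mentioned above, the sketch below obfuscates common PII fields before records are exposed for exploration. The field names (`ssn`, `email`) and masking rules are assumptions for the example, not a prescribed standard:

```python
def mask_pii(record: dict) -> dict:
    """Return a copy of the record with common PII fields obfuscated."""
    masked = dict(record)
    if "ssn" in masked:
        # Keep only the last four digits of a US-style SSN (assumed format).
        masked["ssn"] = "***-**-" + masked["ssn"][-4:]
    if "email" in masked:
        # Keep the first character of the local part and the full domain.
        user, _, domain = masked["email"].partition("@")
        masked["email"] = user[:1] + "***@" + domain
    return masked

record = {"name": "Pat", "ssn": "123-45-6789", "email": "pat@example.com"}
print(mask_pii(record))
# {'name': 'Pat', 'ssn': '***-**-6789', 'email': 'p***@example.com'}
```

In practice, masking policies would be driven by the catalog’s tags, so any data set tagged as containing PII is masked consistently wherever it is discovered.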
Without metadata, data exploration and data discovery become a challenge: it is hard to know what the data lake, or the wider enterprise, actually contains. Remember, the core purpose of metadata is to give the analytics team insight into the provenance, meaning, and relevance of data, which has a direct impact on building successful data lakes and data products.
In the next post, I will discuss the three pillars of big data management, starting with big data integration.
Update [11/29]: We’ve recently published a reference architecture which will allow you to better fish for greater insights, check out the technical reference architecture paper at http://infa.media/2gDyhmL