Data Lake Management: Introducing a Variety of “Fish” to Your Lake
You’ve built a data catalog of fish (I’m using fish here as an analogy for data), identifying three categories (near-shore, bottom-shore and off-shore fish) and tagging fish according to various categories. Your catalog also goes one level deeper by identifying different levels of plankton!
What about introducing new varieties of fish to the data lake? How rapidly do you introduce new fish? What happens if you introduce larger species to the lake? Maybe you send a fishing boat to catch and distribute the fish (are you thinking of a pub/sub model?). Do you know how to sift through a net full of many types of fish to extract the ones you’re after (think of parsing application logs or machine data to extract specific elements)?
What an ecosystem!
One of the core principles of data lake management is big data management, which is the process of integrating, cataloging, cleansing, preparing, governing and securing data assets on a big data platform. The goal of big data management is to ensure data is successfully managed from ingestion through consumption. Let’s look at the first pillar of big data management, big data integration.
Big Data Integration brings together data from disparate sources, at any latency. It lets you rapidly develop extract-load-transform (ELT) or extract-transform-load (ETL) data flows and deploy them anywhere: on Hadoop, pushed down to a data warehouse, on-premises or in the cloud. Key capabilities for Big Data Integration include the following:
- Data Ingestion – Ingest data from any source (relational databases, SaaS applications, IoT, third-party data, social media) at any speed (real-time, near real-time or batch). High-performance connectivity through native APIs to source and target systems, combined with parallel processing, ensures high-speed data ingestion and extraction.
- Data Transformation – Gives data engineers access to an extensive library of prebuilt data integration transformations that run natively in Hadoop.
- Data Parsing – The ability to access and parse complex, multi-structured and unstructured data such as Web logs, JSON, XML, and machine device data.
- Data Integration Hub – A centralized, hub-based publish/subscribe architecture for agile enterprise data integration.
- Data Pipeline Abstraction – The ability to abstract data flows from the underlying processing technology, such as MapReduce, Spark, or Spark Streaming. This insulates the solution from rapidly changing big data processing frameworks and promotes re-use of data flow artifacts and data engineering skill sets.
- Data Preparation – Provides data analysts and data scientists with self-service data preparation to rapidly find, blend, and transform trusted data sets and gain deeper insight into related data sets.
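To make the parsing and publish/subscribe capabilities above a little more concrete, here is a minimal Python sketch of sifting a mixed "net" of records and handing each catch to the right subscriber. Everything in it is an assumption made for illustration: the `LOG_PATTERN` regex, the `parse_record` helper, the toy in-memory `Hub` class and the sample lines are hypothetical and are not part of any particular product.

```python
import json
import re
from collections import defaultdict

# Hypothetical pattern for a Common Log Format style web log line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_record(line):
    """Return a dict for a JSON or web-log line, or None if neither matches."""
    line = line.strip()
    if line.startswith("{"):
        try:
            return json.loads(line)
        except json.JSONDecodeError:
            return None
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


class Hub:
    """Toy in-memory publish/subscribe hub (illustrative only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, record):
        # Deliver the record to every subscriber of this topic.
        for callback in self._subscribers[topic]:
            callback(record)


# Made-up sample input: one web log line, one JSON device reading, one bad line.
raw_lines = [
    '127.0.0.1 - - [10/Oct/2016:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '{"device": "sensor-7", "temp_c": 21.5}',
    'not a recognizable record',
]

web_records, device_records, rejected = [], [], []
hub = Hub()
hub.subscribe("web", web_records.append)
hub.subscribe("device", device_records.append)

for line in raw_lines:
    record = parse_record(line)
    if record is None:
        rejected.append(line)  # keep unparseable lines for later inspection
    elif "status" in record:
        hub.publish("web", record)
    else:
        hub.publish("device", record)
```

The publisher never needs to know who consumes each topic, which is the property that makes a hub-based approach attractive when new "species" of data keep arriving.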
Big Data Integration plays an important role as a core capability of big data management: it allows data engineers to extract data from various source systems and applications, apply business logic as defined by a data analyst, and load the data into a big data store such as Hadoop.
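The extract, apply-business-logic and load steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV source, the `min_amount` rule and the in-memory list standing in for the big data store are all hypothetical, and a real flow would run on a platform such as Hadoop rather than in a single process.

```python
import csv
import io


def extract(csv_text):
    """Extract: read rows from a CSV source (here, an in-memory string)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows, min_amount):
    """Transform: apply a hypothetical business rule defined by an analyst."""
    kept = []
    for row in rows:
        amount = float(row["amount"])
        if amount >= min_amount:
            kept.append(
                {"customer": row["customer"].strip().upper(), "amount": amount}
            )
    return kept


def load(rows, store):
    """Load: append rows to the target store and report how many were loaded."""
    store.extend(rows)
    return len(rows)


# Made-up source data; the list below stands in for the target data store.
source = "customer,amount\nacme,120.50\nglobex,15.00\ninitech,300\n"
data_store = []
loaded = load(transform(extract(source), min_amount=100.0), data_store)
```

Keeping the three steps as separate functions means the business rule can change without touching how data is read or written.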
We’ve covered the first pillar of Big Data Management. In the next blog post we will look at the strategy for managing and controlling the quality of data, so that cleansed and trusted data sets are available for consumption by analytic applications.
Update [11/29]: We’ve recently published a reference architecture that will allow you to better fish for greater insights. Check out the technical reference architecture paper at http://infa.media/2gDyhmL