I had the opportunity to review and comment on the draft of a new Hadoop technical guide. It’s great to see the published paper: Technical Guide: Unleashing the Power of Hadoop with Informatica. This guide outlines the following five steps to get started with Hadoop from a data integration perspective.
(1) Select the Right Projects for Hadoop Implementation
Choose projects that fit Hadoop’s strengths and minimize its disadvantages. Enterprises use Hadoop in data-science applications for log analysis, data mining, machine learning and image processing involving unstructured or raw data. Hadoop’s lack of a fixed schema works particularly well for answering ad-hoc queries and exploratory “what if” scenarios. The Hadoop Distributed File System (HDFS) and MapReduce address growth in enterprise data volumes from terabytes to petabytes and beyond, as well as the increasing variety of complex, multi-dimensional data from disparate sources.
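To make the programming model concrete, here is a minimal sketch of the map and reduce phases simulated in plain Python, not on Hadoop itself. The log lines, the mapper, and the reducer are all hypothetical; the point is that no schema is imposed on the raw records until read time, which is what makes ad-hoc exploration cheap.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to each raw record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def reduce_phase(pairs, reducer):
    """Group values by key (the 'shuffle' step), then reduce each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical raw log lines -- no schema is imposed until read time.
raw_logs = [
    "2012-03-01 GET /index.html 200",
    "2012-03-01 GET /missing 404",
    "2012-03-02 POST /login 200",
]

# Mapper: emit (status_code, 1) per request; reducer: sum the counts.
mapper = lambda line: [(line.split()[-1], 1)]
reducer = lambda key, values: sum(values)

counts = reduce_phase(map_phase(raw_logs, mapper), reducer)
print(counts)  # {'200': 2, '404': 1}
```

On a real cluster the map and reduce functions run in parallel across HDFS blocks, and the shuffle moves data over the network; the logic, however, is the same.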
Hadoop does have important speed limitations for applications that require faster velocity for real-time or “right-time” data processing. Apache HBase adds a distributed, column-oriented database on top of HDFS, and there is work in the Hadoop community to support stream processing, but these do not yet close the gap. Likewise, compared to an enterprise data warehouse, current-generation Apache Hadoop does not offer a comparable level of feature sophistication: it cannot guarantee deterministic query response times, balance mixed workloads, define role- and group-based user access, or place limits on individual queries.
(2) Rethink and Adapt Existing Architectures to Hadoop
For most organizations, Hadoop is one extension or component of a broader data architecture. Hadoop can serve as a “data bag” for data aggregation and pre-processing before loading into a data warehouse. At the same time, organizations can offload data from an enterprise data warehouse into Hadoop to create virtual sandboxes for use by data analysts.
As part of your multi-year data architecture roadmap, be ready to accommodate changes from Hadoop and other technologies that impact Hadoop deployment. Devise an architecture and tools to efficiently implement the data processing pipeline and provision the data to production. Start small and grow incrementally with a data platform and architecture that enable you to build once and deploy wherever it makes sense — using Hadoop or other systems, on premise or in the cloud.
(3) Plan Availability of Skills and Resources Before You Get Started
One of the constraints on deploying Hadoop is the shortage of trained personnel. There are many projects and sub-projects in the Apache ecosystem, making it difficult to stay abreast of all of the changes. Consider a platform approach that hides the complexity of the underlying technologies from analysts and other line-of-business users.
(4) Prepare to Deliver Trusted Data for Areas That Impact Business Insight and Operations
Compared to the decades of feature development in relational and transactional systems, current-generation Hadoop offers fewer capabilities to track metadata, enforce data governance, verify data authenticity, or comply with regulations that protect non-public customer information. The Hadoop community will continue to introduce improvements and additions (HCatalog, for example, is designed for metadata management), but it takes time for those features to be developed, tested, and validated for integration with third-party software. Hadoop is not a replacement for master data management (MDM): lumping data from disparate sources into a Hadoop “data bag” does not by itself solve broader business or compliance problems with inconsistent, incomplete or poor-quality data that may vary by business unit or by geography.
You can anticipate that data will require cleansing and matching for reporting and analysis. Consider your end-to-end data processing pipeline, and determine your needs for security, cleansing, matching, integration, delivery and archiving. Adhere to a data governance program to deliver authoritative and trustworthy data to the business, and adopt metadata-driven audits to add transparency and increase efficiency in development.
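As an illustration of the cleansing-and-matching step in such a pipeline, here is a small sketch in plain Python under stated assumptions: the record fields (`name`, `email`) and the match rule (identical normalized email) are hypothetical stand-ins, far simpler than what an MDM or data quality tool would apply.

```python
def cleanse(record):
    """Trim whitespace, normalize case, and standardize the email field."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
    }

def match_key(record):
    """A deliberately simple match rule: same normalized email == same entity."""
    return record["email"]

def dedupe(records):
    """Cleanse every record, then keep the first record per match key."""
    seen = {}
    for record in map(cleanse, records):
        seen.setdefault(match_key(record), record)
    return list(seen.values())

# Hypothetical customer records arriving from two source systems.
sources = [
    {"name": "ada  lovelace", "email": "Ada@Example.com "},
    {"name": "ADA LOVELACE", "email": "ada@example.com"},
    {"name": "grace hopper", "email": "grace@example.com"},
]

golden = dedupe(sources)
print(len(golden))  # 2 surviving records
```

A metadata-driven approach would externalize the cleansing rules and match keys into governed configuration rather than hard-coding them, which is what makes audits and reuse across projects practical.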
(5) Adopt Lean and Agile Integration Principles
To transfer data between Hadoop and other elements of your data architecture, the HDFS API provides the core interface for loading or extracting data. Other useful tools include Chukwa, Scribe or Flume for the collection of log data, and Sqoop for loading data from or to relational databases. Hive enables ad-hoc query and analysis of data in HDFS using a SQL interface. Informatica PowerCenter version 9.1 includes connectivity for HDFS, to load data into Hadoop or extract data from Hadoop.
The Informatica platform offers data integration, data quality, and master data management capabilities for data whether it is stored in Hadoop or in another part of your data architecture. Consider establishing an Integration Competency Center (ICC), or extending an existing one, to support a shared-services model: it can reduce cost and risk, improve utilization of Hadoop clusters and resources to control sprawl, and bridge the disconnect between IT and the business to become lean and agile. Look for a solution that supports unified administration of projects, with self-service and data virtualization as part of the platform.
For more, download and review Technical Guide: Unleashing the Power of Hadoop with Informatica, and tune into the Cloudera and Informatica Hadoop Tuesdays webinar series.