The Harry Potter books and movies were a particularly popular inspiration for project names. For example, to power features such as “People You May Know” and “Jobs You May Be Interested In”, LinkedIn uses Hadoop together with the Azkaban batch workflow scheduler and the Voldemort key-value store. We’ll see if the Twilight series has a similar impact on project names.
In most cases, organizations mix and match, using one or more of the following in addition to Hadoop:
- Relational database management systems (RDBMS) such as Microsoft SQL Server, Oracle, MySQL (Oracle), IBM DB2 or Teradata;
- The many variants of online transaction processing (OLTP) systems;
- SAP, SAS, IBM SPSS, KXEN, Fuzzy Logix analytics libraries, Mahout, or the R programming language for statistics and analytics;
- PostgreSQL-based data warehouse and analytic systems such as EMC Greenplum, EnterpriseDB or Teradata Aster Data;
- Databases that began as columnar and have been adding support for rows, such as HP Vertica, ParAccel or SAP Sybase;
- Key value stores derived from the Amazon Dynamo distributed database (Apache Cassandra, Riak or Voldemort);
- Document-oriented databases such as CouchDB, MongoDB, Raven DB or MarkLogic’s XML-based system;
- Complex event processing or other stream processing systems;
- Graph databases such as GraphDB, InfiniteGraph, Neo4J or FlockDB (Twitter); and
- Collaboration and visualization tools, such as Salesforce.com Chatter and Tableau Software.
Where does Hadoop fit in this long and ever-growing list? There are four key points to consider.
(1) Remember Hadoop’s Strengths and Weaknesses
Hadoop clusters can be an excellent fit for data-science applications involving raw or unstructured data, such as log analysis, data mining, machine learning and image processing.
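As a minimal illustration of the kind of raw-data job Hadoop handles well, here is a sketch of a log-analysis task in the map/reduce style (the kind of script you might run via Hadoop Streaming). The Apache-style log format and the field position of the status code are assumptions for the example:

```python
from collections import Counter

def map_line(line):
    """Map phase: emit (status_code, 1) for one log line, or None if malformed.
    Assumes Apache common log format, where the status code is the 9th field."""
    fields = line.split()
    if len(fields) > 8 and fields[8].isdigit():
        return (fields[8], 1)
    return None

def reduce_counts(pairs):
    """Reduce phase: sum counts per key, as a reducer would after the shuffle."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

if __name__ == "__main__":
    # Local simulation of the map and reduce phases over sample log lines.
    sample = [
        '127.0.0.1 - - [10/Oct/2011:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
        '127.0.0.1 - - [10/Oct/2011:13:55:38 -0700] "GET /missing HTTP/1.0" 404 209',
    ]
    pairs = [p for p in (map_line(line) for line in sample) if p]
    print(reduce_counts(pairs))
```

On a real cluster, the map and reduce functions would run as separate Streaming processes over terabytes of raw logs; the point is that no schema needs to be imposed on the data before the job runs.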
In contrast, Hadoop does not offer industry-leading performance for structured relational data. A recent SIGMOD paper, “Efficient Processing of Data Warehousing Queries in a Split Execution Environment”, from Yale and University of Wisconsin academics, found that Hadoop was as much as 50 times less efficient on structured relational data. Importing relational data into Hadoop is like taking a finished jigsaw puzzle and breaking it back into little pieces to fit into the puzzle box; the pieces are still there, but there is less structure. Despite Hadoop’s limited support for structured relational data, some organizations do import some structured data into Hadoop as one component of a “grab bag” of data from disparate sources, such as for analysis of ad hoc “what if” scenarios.
(2) Consider the Architectural Fit
Second, from an enterprise architecture perspective, Hadoop can fit well either as a pre-processing step before loading into an enterprise data warehouse (EDW) or production system, or to spin out analytic sandboxes from an EDW or collection of data warehouses, marts or source systems. I’ll explain each of these options a bit more.
You can use Hadoop clusters for data aggregation, pre-processing, and cleansing before loading verified data into an EDW or production system. In some cases, you may want to archive raw data and then decide down the road how to use it. A related option is to use Hadoop to give analysts access to the most immediate raw data available, before it goes through the traditional data quality checks and transformations. The EDW continues to be the repository of record for a “single view of the truth”.
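A cleansing step of the kind described above might look like the following sketch: validate and normalize raw records, rejecting those that fail basic checks, before the survivors are exported to the EDW. The field names and validation rules here are illustrative assumptions, not part of any standard:

```python
import csv
import io

def cleanse(row):
    """Normalize one raw record before loading into the EDW, or return None
    to reject it. Field names and rules are illustrative assumptions."""
    customer_id = row.get("customer_id", "").strip()
    country = row.get("country", "").strip().upper()
    amount = row.get("amount", "").strip()
    if not customer_id.isdigit():
        return None  # reject records without a numeric customer key
    try:
        amount_val = round(float(amount), 2)
    except ValueError:
        return None  # reject records with an unparseable amount
    return {"customer_id": customer_id, "country": country, "amount": amount_val}

if __name__ == "__main__":
    # Simulate cleansing a small raw extract; the second row is rejected.
    raw = io.StringIO(
        "customer_id,country,amount\n"
        " 42 ,us, 19.99\n"
        "abc,DE,5.00\n"
    )
    cleansed = [r for r in map(cleanse, csv.DictReader(raw)) if r]
    print(cleansed)
```

In a Hadoop deployment, a function like `cleanse` would run as a map task over the raw files in HDFS, with the verified output landing in a staging area for bulk load into the warehouse.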
At the same time, a Hadoop cluster is one option for spinning out a self-service virtual data mart for data scientists, analysts, or line-of-business users to study a particular problem or address “what if” scenarios. When their study is finished, those storage and compute resources can be re-provisioned for other tasks, avoiding ongoing shadow data marts and protecting data consistency. Both are viable approaches to incorporating Hadoop into your enterprise architecture.
(3) Distinguish between Hadoop and Cloud Computing
Third, it’s important to understand that while Hadoop and cloud computing overlap, there are important distinctions. Amazon Elastic MapReduce, for example, builds proprietary versions of Apache Hadoop, Hive, and Pig optimized for running on the web-scale infrastructure of Amazon Elastic Compute Cloud (EC2) and Simple Storage Service (S3). But Hadoop itself is not well suited to running a cluster that spans more than one data center, whether it’s on-premises or hosted by a Hadoop cloud service provider such as Amazon Web Services, Rackspace, or SoftLayer.
Even at Yahoo, which runs the largest private-sector Hadoop production clusters, no Hadoop file system or MapReduce job is divided across multiple data centers. Organizations such as Facebook are considering ways to federate Hadoop clusters across multiple data centers, but doing so introduces challenges with time boundaries and overhead for common dimension data.
(4) Be Realistic about the Relative Maturity of Hadoop Feature Development
Fourth, compared to the decades of feature development by relational and transactional systems, current-generation Hadoop offers fewer capabilities to track metadata, enforce data governance, verify data authenticity, comply with regulations to secure customer non-public information, or manage mixed workloads. The Hadoop community will continue to introduce improvements and additions – for example, HCatalog is designed for metadata management – but it takes time for those features to be developed, tested, and validated for integration with third-party software.
To address those needs, you may benefit from a platform approach: one that offers data integration, data quality, and master data management capabilities for data regardless of its source or format, so you can view operations across customers, products, locations, and other entities, whether the data comes from machine learning, mobile networks, web analytics, or social media.
Visit my O’Reilly Radar blog for a case study of how one organization, T-Mobile USA, adapted its data architecture with a platform approach and use of Hadoop.