Dating With Data: Part 4 In Hadoop Series

eHarmony, an online dating service, uses Hadoop processing and the Hive data warehouse for analytics to match singles based on each individual’s “29 Dimensions® of Compatibility”, per a June 2011 press release by eHarmony and one of its suppliers, SeaMicro. According to eHarmony, an average of 542 eHarmony members marry daily in the United States.

To marry data between Hadoop and other elements of your data architecture, the Hadoop Distributed File System (HDFS) API provides the core interface for loading or extracting data. Other useful tools include Chukwa, Scribe or Flume for the collection of log data, and Sqoop for data loading from or to relational databases. Hive enables ad hoc query and analysis of data in HDFS using a SQL-like interface. Each of these tools connects to the HDFS API.
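As a small illustration of programmatic HDFS access, newer Hadoop releases also expose the file system over REST via WebHDFS. The sketch below builds a WebHDFS request URL in Python; the host name, user, and the availability of WebHDFS on your cluster are assumptions for this example (port 50070 is the common NameNode HTTP default):

```python
# Sketch: constructing a WebHDFS REST URL for a basic HDFS operation.
# Host, port, and WebHDFS availability depend on your Hadoop version
# and configuration -- treat the names below as placeholders.

def webhdfs_url(host, path, op, user, port=50070):
    """Build a WebHDFS v1 URL for the given file operation."""
    return (f"http://{host}:{port}/webhdfs/v1{path}"
            f"?op={op}&user.name={user}")

# An HTTP GET on this URL would open the file as user 'analyst'
url = webhdfs_url("namenode.example.com", "/logs/2011/06/21.log",
                  "OPEN", "analyst")
print(url)
```

The same URL scheme covers other operations (LISTSTATUS, CREATE, and so on), which is what makes a thin REST client enough for simple load and extract jobs.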

Informatica PowerCenter version 9.1 includes connectivity for HDFS, to load data into Hadoop or extract data from Hadoop. Customers can use Informatica data quality, data governance and other tools both before and after writing the data into HDFS.

The Java API enables connections to programs written in Java, and the Thrift API supports connections to programs written in C++, Perl, PHP, Python, Ruby or other programming languages. With the Java and Thrift APIs, you can enable business intelligence, analytics or other enterprise applications outside the cluster to access data and run queries in Hadoop.

You can query large data sets using Apache Pig, or the R and Hadoop Integrated Processing Environment (RHIPE). For analytics, Mahout offers a library of data mining and machine learning algorithms.
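Under the covers, a Pig query such as a GROUP BY compiles down to MapReduce: the map phase emits key/value pairs, a shuffle groups them by key, and the reduce phase aggregates each group. A minimal pure-Python sketch of that flow, using a word count as the aggregate (no Hadoop required; the input lines are made up for illustration):

```python
from collections import defaultdict

def map_phase(records):
    # map: emit a (word, 1) pair for every word in every record
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # shuffle: group values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate each group -- here, a COUNT per word
    return {key: sum(values) for key, values in groups.items()}

lines = ["hadoop hive pig", "hive hadoop", "hadoop"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 3, 'hive': 2, 'pig': 1}
```

Pig, RHIPE, and Mahout all generate jobs of this shape for you; the value they add is expressing the map and reduce logic at a higher level than hand-written code.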

Hadoop User Experience (HUE) provides “desktop-like” access to Hadoop via a browser. With HUE, you can browse the file system, create and manage user accounts, monitor cluster health, create MapReduce jobs, and access Beeswax, a front end for Hive. Beeswax provides wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format. Cloudera contributed HUE as an open source project.

There are a growing number of connectors to enable your enterprise data warehouse (EDW) or other data warehouses or marts to “date” with Hadoop. Both Teradata Aster Data and EMC Greenplum provide a two-way, parallelized data connector between their PostgreSQL-derived data stores and HDFS. For Oracle customers, Quest Data Connector for Hadoop allows for data transfer between Oracle databases and Hadoop using a freeware plug-in to Sqoop. One benefit of these connectors is that they support faster data transfer than what is typically possible using standard ODBC or JDBC drivers.
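One reason these connectors (and Sqoop itself) outrun a single ODBC/JDBC pipe is parallelism: the import is split across mappers, each pulling one slice of the table. A hedged Python sketch of the kind of range splitting Sqoop performs on a numeric split-by column — the boundary arithmetic here is a simplification of what the real tool does:

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide [min_id, max_id] into roughly equal, contiguous slices,
    one per mapper -- a simplified version of Sqoop-style splitting."""
    total = max_id - min_id + 1
    size, remainder = divmod(total, num_mappers)
    splits, lo = [], min_id
    for i in range(num_mappers):
        # the first `remainder` slices absorb one extra row each
        hi = lo + size - 1 + (1 if i < remainder else 0)
        splits.append((lo, hi))
        lo = hi + 1
    return splits

# Each tuple becomes one mapper's WHERE clause, e.g. id BETWEEN 1 AND 250
print(split_ranges(1, 1000, 4))  # [(1, 250), (251, 500), (501, 750), (751, 1000)]
```

With four (or forty) mappers issuing bounded range queries concurrently, the transfer saturates the cluster's ingest bandwidth rather than a single client connection.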

Microsoft announced plans to release community technology previews (CTPs) of Hadoop connectors for SQL Server and its SQL Server 2008 R2 Parallel Data Warehouse, designed to support two-way transfer of both structured and unstructured data. This is welcome news, given the challenges of migrating data from a SQL database to Hadoop, as described in a blog post by Nathan Marz at BackType.

Cascading is an open-source data-processing API that sits atop MapReduce, with commercial support from Concurrent; it supports job and workflow management. According to Concurrent founder and CTO Chris Wensel, a single library gives you Pig/Hive/Oozie functionality without all the XML and text syntax. Nathan Marz wrote db-migrate, a Cascading-based JDBC tool for importing to and exporting from HDFS. At BackType and previously Rapleaf, Nathan also authored Cascalog, a Hadoop/Cascading-based query language hosted in Clojure. Cascading’s Multitool lets you “grep”, “sed”, or join large datasets on HDFS or Amazon S3 from the command line.

Your choice of hardware and networking infrastructure affects the speed of transferring data into and out of the HDFS storage layer. eHarmony chose SeaMicro servers and moved Hadoop in-house after previously running it on cloud services; the move reduced both operating expenses and variability in job completion times.

For most on-premise Hadoop clusters, there are only a handful of enterprise-grade servers – specifically for the NameNode, Secondary NameNode and JobTracker – with commodity hardware, without RAID or virtualization, for the data nodes. Users are protected from data loss by having file blocks replicated across multiple data nodes, with a default replication factor of three.
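The replication factor is a cluster-wide setting in hdfs-site.xml (and can also be overridden per file); the default of three looks like this:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default number of block replicas per file.</description>
  </property>
</configuration>
```

Raising the value buys more fault tolerance at the cost of disk; lowering it below three is generally only sensible on test clusters.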

This is not the only architectural option, however, to marry Hadoop and hardware. In addition to the SeaMicro offerings, “Hadooplers” (Hadoop Open Storage Solutions) are the first of several new storage appliances that NetApp is shipping for Hadoop, based on NetApp’s new E-Series array family. They are designed to offload Hadoop storage I/O processing and disk failure recovery from the data nodes, provide high availability for the NameNode, and offer the option of RAID-protected data nodes.
