Hadoop MapReduce to Apache Spark: Data Storage and Processing Strategy Transformation

“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers” — Grace Hopper

Data storage and processing technologies have gone through a dramatic transformation, from early flat-file systems to relational database management systems (RDBMS). For a long time the RDBMS has been the de facto database system for both small and large scale industries because of its relational model (structured, tabular storage of data).

As a business grows, so does its volume of data, and the RDBMS becomes an inefficient and less cost-effective database technology. It does not provide a scalable solution for high volumes of data arriving at high velocity. Horizontal scaling (distributing data across multiple nodes/servers) is difficult to achieve in a traditional relational database; vertical scaling (adding processing speed and storage to a single system) is the usual option, and it has an upper limit.
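To make the idea of horizontal scaling concrete, here is a minimal sketch of distributing records across nodes by hashing a key. It is only an illustration: the names node_for and partition and the sample records are hypothetical, and real distributed stores use more elaborate placement schemes.

```python
import zlib


def node_for(key, num_nodes):
    """Pick a node index for a key using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(records, num_nodes):
    """Distribute (key, value) records across num_nodes in-memory 'nodes'."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in records:
        nodes[node_for(key, num_nodes)].append((key, value))
    return nodes


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    records = [("user-1", "login"), ("user-2", "click"), ("user-1", "logout")]
    for index, rows in enumerate(partition(records, num_nodes=3)):
        print(index, rows)
```

Adding capacity in this model means adding nodes rather than buying a bigger machine, although changing num_nodes in this simple scheme moves most keys to a different node.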

An organization’s success lies in its ability to extract value from its own and other organizations’ data. This large volume of data, of different varieties and flowing at high velocity, gave rise to the term Big Data (on the order of TB or PB). To deal with these attributes of big data (high volume, velocity and variety), Doug Cutting and Michael J. Cafarella, inspired by Google’s white papers on the Google File System and the MapReduce paradigm, developed a distributed file system (HDFS) and a data processing engine (Hadoop MapReduce).

Classic Hadoop 1.x – HDFS and MapReduce are the two major components of Hadoop. Hadoop provides a data storage abstraction, and HDFS is one of its implementations. MapReduce is a programming paradigm used for batch data processing in Hadoop. The DataNode and NameNode are sub-components of HDFS, and the JobTracker and TaskTracker are sub-components of MapReduce.
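The MapReduce paradigm mentioned above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce steps using the classic word count example; it does not use the Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big clusters", "data flows fast"]
    print(word_count(sample))
```

In a real cluster the map and reduce functions run in parallel on many nodes and the framework performs the shuffle over the network; here everything runs in one process to keep the three steps visible.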
Hadoop 1.x is known for starting the big data evolution and for providing an efficient data processing framework along with integrated distributed storage capabilities. However, over time Hadoop 1.x proved inefficient and of limited use in several respects: clusters were limited to about 4000 nodes, map and reduce slots were static, only MapReduce jobs could be executed, and the JobTracker was a bottleneck because resource management, scheduling and monitoring were all the responsibility of a single node. The diagram below gives an overview of the tight coupling of HDFS and MapReduce in Hadoop 1.x.




Hadoop 2.x – To resolve the limitations mentioned above, Hadoop revamped its architecture and introduced a new component named YARN (Yet Another Resource Negotiator) for job scheduling and cluster resource management. By doing so, resource management and job scheduling were decentralized, and YARN makes it possible to run jobs other than MapReduce jobs on a Hadoop cluster. Other advantages are: the number of cluster nodes increased from 4000 to 10000, multiple namespaces for managing HDFS, and more efficient cluster utilization. The diagram below shows a high-level overview of the Hadoop 2.x components.



Hadoop is known as a general batch processing engine; if our application does not care about low latency, MapReduce might be the right choice. However, there are various specialized workloads (interactive, iterative, graph and machine learning) which require a low-latency processing engine. Between 2007 and 2015, various independent projects such as Apache Storm, GraphLab, Apache Tez and Giraph were developed to serve these requirements, and they are part of the Hadoop ecosystem. The diagram below shows a high-level view of the various data processing systems developed before the introduction of Apache Spark.

data processing system

Apache Spark – Apache Spark is a general, unified engine developed at AMPLab to fulfill the requirements of both batch and iterative processing, including machine learning, streaming and graph processing.
Apache Spark has in-memory processing capability, which makes it apt for tuning algorithms in machine learning, streaming systems and graph processing, where low latency is required. Apache Spark relies on other storage systems such as HDFS, the local file system or Amazon S3 for storage.

The two fundamental concepts of Apache Spark are a data abstraction called the Resilient Distributed Dataset (RDD) and the lineage graph, a directed acyclic graph (DAG). An RDD is an immutable data object which cannot be modified once created. The lineage graph records the transformations from which each RDD was derived, and this provides fault tolerance: if at any point an RDD is lost, it can be regenerated from its lineage graph. The diagram below gives a high-level overview of Apache Spark and its major components.
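The idea of immutability plus lineage can be illustrated with a toy, single-machine class. This is not the Spark API: ToyRDD and its methods are hypothetical names chosen for the sketch. Each transformation returns a new object that remembers its parent, so lost results can be recomputed on demand.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, source=None, parent=None, transform=None):
        # A root dataset keeps an immutable copy of its source data;
        # a derived dataset keeps only its parent and the transformation.
        self._source = tuple(source) if source is not None else None
        self._parent = parent
        self._transform = transform
        self._cache = None  # materialized rows, may be lost at any time

    def map(self, fn):
        """Return a new ToyRDD; the current one is never modified."""
        return ToyRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        """Return a new ToyRDD containing only rows that satisfy pred."""
        return ToyRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        """Materialize the rows, walking back up the lineage if needed."""
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return list(self._cache)

    def lose_data(self):
        """Simulate losing the materialized rows, e.g. after a node failure."""
        self._cache = None


if __name__ == "__main__":
    numbers = ToyRDD(source=range(1, 6))
    even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
    print(even_squares.collect())
    even_squares.lose_data()
    print(even_squares.collect())  # recomputed from the lineage
```

Both print calls show the same rows, because the second result is rebuilt from the recorded transformations rather than restored from a replicated copy of the data.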




Is Apache Spark a replacement for MapReduce – The one-word answer is no. Apache Spark was developed to deal with relatively smaller volumes of data where latency is the prime factor to be considered. Since Apache Spark does its processing in memory, it is faster than Hadoop MapReduce; but if we have a very large data set and do not have sufficient memory to load it all at once, Apache Spark does not work like a charm.

Apache Spark is more promising than Hadoop MapReduce in terms of its efficiency metrics. Apache Spark is reported to be 10x-100x faster than MapReduce. It also deserves special mention that with Spark we can develop in several languages, such as Java, Scala, Python and R, whereas MapReduce is tightly integrated with verbose Java. Read in detail about the tug of war between Hadoop MapReduce and Apache Spark.

For more Big Data/MapReduce related posts, refer to: http://www.devinline.com/search/label/BigData