YARN and MRv2: The New Incarnations In The Cloud!

As you probably know, MapReduce underwent a complete overhaul in hadoop-0.23 (or from CDH4 onwards, for Cloudera fans like me), and we now have what we call MapReduce 2.0 (MRv2, or simply MR2), running on top of YARN. You might have read a lot about it already, and I’m not going to explain it all over again here. But I feel a little recap will be useful for all of us.

Aging MapReduce

The major change is that, from CDH4 onwards, there is no JobTracker and no TaskTracker (I really miss you guys!). In their place we now have YARN and MR2. So what exactly triggered this large overhaul? What was wrong with the cool JobTracker and TaskTrackers that took care of all our MapReduce jobs? That was the question I was confronted with at first. To answer it you have to research a bit, and then you’ll find that the JobTracker and TaskTrackers run out of steam on a cluster of more than 4000 nodes, where each node has two quad-core Xeon CPUs @ 2.5GHz and the cluster capacity is around 16PB. That is, shall we say, a mouthful to digest!

Actually, the complaint came from Yahoo! itself. They opened a JIRA in early 2008, and the result is MRv2. For more details, read the JIRA itself at https://issues.apache.org/jira/browse/MAPREDUCE-279. The major complaints after scaling out beyond 4000 nodes were the unpredictable behavior of the cluster, with cascading crashes being the foremost (http://issues.apache.org/jira/browse/HADOOP-572), network flooding caused by the back-calls sent across the network, the malfunctioning TaskTrackers that resulted, and excessive resource usage, with memory/CPU consumption shooting up…

So you get the picture: what we needed was a new architecture rather than a few more bug fixes. Yes, MapReduce was showing its age, and that led to the materialization of YARN.

So what exactly is YARN?

To be precise, YARN stands for “Yet Another Resource Negotiator”. Yup, the name is pretty much self-explanatory: it is a global resource negotiator and task scheduler for the cluster. Moreover, it is a framework that allows us to build our own distributed processing frameworks and distributed applications. So what we have in CDH4 is a MapReduce framework built on top of YARN, which is called MRv2 or MR2.

Yes, YARN is the underlying framework of MRv2, and it provides the capabilities needed to schedule resources and administer applications. So we are not limited to just MapReduce; we can develop applications that do not follow the MapReduce model.

OK, so YARN and MR2 are different things. So what is MR2, actually?

If you are familiar with the JobTracker and TaskTrackers, you know that it was the JobTracker’s job to deploy MapReduce jobs across the cluster and to schedule and monitor them. With the arrival of YARN in Apache Hadoop, we no longer need a JobTracker to schedule and manage jobs, or TaskTrackers to carry out tasks. The MR1 codebase has been rewritten to run on top of YARN, and the result is what we call MR2.

So now we have:

A global, cluster-wide ResourceManager
A per-node slave, the NodeManager, and
A per-application ApplicationMaster
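
To make the division of labour between these three a bit more concrete, here is a toy sketch in Python. It is not the YARN API: the class and method names (allocate, launch, free_slots, run) are made up for illustration, a "container" is reduced to a plain string, and everything runs in one process with no heartbeats or failures. It only shows the idea that the ResourceManager hands out capacity, the NodeManagers host the work, and each application brings its own ApplicationMaster, whether or not that application is MapReduce.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node worker: tracks how many containers this node can still host."""
    name: str
    capacity: int
    running: list = field(default_factory=list)

    def free_slots(self):
        return self.capacity - len(self.running)

    def launch(self, app_id, task):
        # A "container" in this toy model is just a record of what runs where.
        self.running.append((app_id, task))
        return f"{self.name}:{app_id}:{task}"


class ResourceManager:
    """Cluster-wide arbiter: decides where capacity is available.

    It knows nothing about what the application actually does.
    """

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self):
        # Pick the node with the most free slots; None if the cluster is full.
        node = max(self.nodes, key=lambda n: n.free_slots())
        return node if node.free_slots() > 0 else None


class ApplicationMaster:
    """Per-application coordinator: knows its own tasks and asks for containers."""

    def __init__(self, app_id, tasks, rm):
        self.app_id = app_id
        self.tasks = tasks
        self.rm = rm

    def run(self):
        launched = []
        for task in self.tasks:
            node = self.rm.allocate()
            if node is None:
                continue  # no capacity left; a real system would wait and retry
            launched.append(node.launch(self.app_id, task))
        return launched


# Two applications share one cluster; only the first one is MapReduce-shaped.
nodes = [NodeManager("node1", 2), NodeManager("node2", 2)]
rm = ResourceManager(nodes)
mr_job = ApplicationMaster("mr-job", ["map-0", "map-1", "reduce-0"], rm)
shell_app = ApplicationMaster("shell", ["cmd-0", "cmd-1"], rm)

mr_containers = mr_job.run()
shell_containers = shell_app.run()
print(mr_containers)
print(shell_containers)
```

Notice that the ResourceManager class never looks at the task names. That is the whole point of the split: scheduling of cluster resources lives in one place, and job-specific logic (what the JobTracker used to do for MapReduce) moves into a per-application master.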

We will talk about these in more detail in the coming days anyway. In the meantime, there is a distributed shell PoC in trunk; I’m yet to see a fully YARN’ed MR2 app, so keep an eye on it and stay tuned…

References:

http://hadoop.apache.org/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
http://blog.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/