Saturday, April 23, 2011

Hadoop Debugging Pains

Hadoop is an open source MapReduce framework. On paper, it allows you to create a very scalable distributed program by filling in only a handful of functions. Its benefit lies in solving problems with very large datasets. By shaping a problem into Mappers and Reducers, a program can be divided across any number of computers. Sounds pretty easy, right? At least that’s what I thought going in.

Debugging Hadoop jobs can be a nightmare, as I have seen in the last few weeks. But I have learned a great deal about distributed programming, and how to go about using Hadoop in general.

When I started working on this project, I began by designing how I was going to transform a normal sequential program into a MapReduce-style program with key/value pairs. Looking back now, this really wasn’t the best place to start. The first thing I should have done was make sure the Hadoop installation I planned to use was completely stable. This would have saved me a ton of time.

Most of my debugging time was spent trying to decipher whether the error I was seeing was due to a fault in the Hadoop framework or a fault in my own program. If I could have eliminated this problem at the start, I could have saved myself many hours of frustration. Looking back on the process now, I would advise anyone looking to play with Hadoop to first ensure that the Hadoop Distributed File System (HDFS) is completely stable.

Once the framework is stable and ready to go, the focus now shifts to the details of the Hadoop jobs themselves. Before jumping into the Hadoop programming, I was given two language options to use: Java or Python via Hadoop Streaming.

Since Hadoop is written in Java, it would seem to make sense to write the application itself in Java as well. But by picking Java, you have to deal with Java’s faults. Making sure the inputs and outputs of multiple jobs are correct can be time consuming in itself, and that work is unnecessary in Python.

With Python, all that’s needed is a simple script for the Mapper and the Reducer. Everything else is taken care of. The time required to develop the application itself is drastically decreased.
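To give a feel for how little code that is, here is a minimal word-count sketch in the Hadoop Streaming style. It is not the code from my project. It assumes the usual streaming convention of one tab-separated key/value pair per line, and that the reducer's input arrives sorted by key. The function names mapper and reducer are just illustrative.

```python
def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word. Assumes the lines are sorted by key."""
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            # A new key has started: flush the previous one.
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    # Flush the last key, if there was any input at all.
    if current is not None:
        yield f"{current}\t{total}"
```

In a real streaming job, each function would live in its own script that reads from sys.stdin and prints every line it yields. Sorting the mapper output locally before feeding it to the reducer is a cheap way to test the logic without touching a cluster.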

I painfully learned a few other lessons during my Hadoop time.

  • A single reducer can receive the values for several different keys, not necessarily just one, so it has to notice where one key ends and the next begins.
  • Most problems will require many Hadoop jobs to solve. Try to think in terms of transforming the data into a more suitable format with each job.
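Both lessons can be tried out locally with a small simulation. This is only a sketch: run_job is a hypothetical stand-in for a Hadoop job (map, sort by key, reduce), not part of any Hadoop API, and the two jobs are made-up examples. The loop inside run_job plays the role of one reducer that sees several key groups, and the second job simply reshapes the output of the first.

```python
from itertools import groupby
from operator import itemgetter


def run_job(map_fn, reduce_fn, records):
    """Simulate one job locally: map, sort by key, reduce each key group."""
    pairs = sorted(
        (kv for rec in records for kv in map_fn(rec)), key=itemgetter(0)
    )
    out = []
    # One reducer sees many keys; groupby marks where each key's run ends.
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reduce_fn(key, [value for _, value in group]))
    return out


# Job 1: count how often each word occurs.
def count_map(line):
    return [(word, 1) for word in line.split()]


def count_reduce(word, ones):
    return [(word, sum(ones))]


# Job 2: reshape job 1's output so that each count lists its words.
def invert_map(pair):
    word, count = pair
    return [(count, word)]


def invert_reduce(count, words):
    return [(count, sorted(words))]


counts = run_job(count_map, count_reduce, ["a b a", "b c"])
by_count = run_job(invert_map, invert_reduce, counts)
print(by_count)  # [(1, ['c']), (2, ['a', 'b'])]
```

Neither job is interesting on its own. The point is that the second one only works because the first one already put the data into a convenient (word, count) shape.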

These are just some words of wisdom regarding Hadoop and Map/Reduce programming. By knowing these ahead of time, many hours can be saved.

1 comment:

  1. For doing several, possibly chained, map-reduce processes on a dataset, take a look into PIG and pig streaming.
