Java tutorials - Distributed computing

Hadoop

Installing Hadoop

Installing the Hadoop Distributed File System on a single machine and using it in pseudo cloud mode. Not everyone has a cluster to experiment with, but Hadoop can be installed on a single machine.

I'm going to cover how to install Hadoop on a single workstation in this tutorial. I don't currently have multiple networked machines (i'm working on it) but do have a PC running Fedora 11 which works just fine for experimentation and getting to know the different parts of the Hadoop project. Installation on Windows isn't as easy as Linux. I'm only going to look at Linux, sorry.

Getting Hadoop running

Hadoop needs to be downloaded from its Apache releases page here: Hadoop Common Releases.

Following the links on the releases page download the version of Hadoop you want. I downloaded the hadoop-0.20.1.tar.gz package from the suggested mirror. 

Unpack the downloaded package on your file system. The shell command below should do this for you ( x=extract z=gzip v=verbose f=file). I ran this using my user account and not as root.

$ tar -xzvf hadoop-0.20.1.tar.gz

Check your JAVA_HOME environment variable is set to point to your installation of Java. If its not set, then its helpful to do that at this point.

$ export JAVA_HOME=/usr/lib/jvm/jre-1.6.0
$ env | grep JAVA_HOME
JAVA_HOME=/usr/lib/jvm/jre-1.6.0 

Now you need to add the hadoop bin directory to your system path. You need to know where you unpacked the Hadoop package for this.

$ export PATH=$PATH:/home/nerd/Hadoop/hadoop-0.20.1/bin

That's all you need to to to install Hadoop. You can check it runs now by seeing what version you have.

$ hadoop version
Hadoop 0.20.1
Subversion http://svn.apache.org/........./release-0.20.1rc1 

The default configuration for Hadoop is for it to run in Standalone mode where every thing runs on a single workstation and in one JVM. This mode is fine for doing some testing and debugging MapReduce jobs. Pseudo-distributed mode simulates a cluster locally, running the Hadoop daemons. Fully-distributed mode utilizes the full power of Hadoop on a cluster of machines.

I'll add how to configure Hadoop to run in Pseudo-distributed mode soon.....