- Single-node Hadoop Setup
- Run a word-count MapReduce job
Single-node Hadoop Setup
Prerequisites:
1). Java 7:
Hadoop 2.7.1 requires a working Java 1.7+ (aka Java 7) installation. We will install OpenJDK 7 in this tutorial.
ubuntu@master:~$ sudo apt-get update
ubuntu@master:~$ sudo apt-get install openjdk-7-jre
ubuntu@master:~$ sudo apt-get install openjdk-7-jdk
Note: there are other ways to install Java on Linux.
Set JAVA_HOME:
Append the following export line to ~/.bashrc, then reload it:
ubuntu@master:~$ vim ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
ubuntu@master:~$ source ~/.bashrc
After installation, make a quick check that the JDK is correctly set up:
ubuntu@master:~$ java -version
java version "1.7.0_111"
OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
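Beyond `java -version`, a small scripted check can confirm that the path used for JAVA_HOME really contains a JDK. This is a convenience sketch, not part of the original steps; the path below is the Ubuntu amd64 OpenJDK 7 location assumed above.

```shell
# Sketch: verify the assumed JAVA_HOME path actually contains javac
# (javac is present only in the JDK package, not the JRE alone).
JDK_PATH=/usr/lib/jvm/java-7-openjdk-amd64
if [ -x "$JDK_PATH/bin/javac" ]; then
  JDK_STATUS=ok
else
  JDK_STATUS=missing
fi
echo "JDK at $JDK_PATH: $JDK_STATUS"
```

If this reports `missing`, check `ls /usr/lib/jvm/` for the directory name your distribution actually used.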
2). Create a User Account for Hadoop:
We do not want to run Hadoop as root, so we will create a dedicated user and group for Hadoop-related jobs.
ubuntu@master:~$ sudo useradd -m hadoop -s /bin/bash
ubuntu@master:~$ sudo passwd hadoop
ubuntu@master:~$ sudo adduser hadoop sudo
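To confirm the account was created as intended, `getent` can look it up in the passwd database (a generic check, not part of the original steps):

```shell
# Check that the hadoop user exists and note its login shell,
# which should be /bin/bash per the useradd -s flag above.
if getent passwd hadoop >/dev/null 2>&1; then
  HADOOP_SHELL=$(getent passwd hadoop | cut -d: -f7)
  echo "hadoop user exists, shell: $HADOOP_SHELL"
else
  HADOOP_SHELL=none
  echo "hadoop user not found; re-run the useradd command above"
fi
```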
3). Install SSH:
SSH (“Secure SHell”) is a protocol for securely accessing one machine from another. Hadoop uses SSH to log in to slave nodes and start and manage all HDFS and MapReduce daemons; even in single-node mode it connects to localhost the same way. An SSH server may already be installed. Otherwise, install one with:
ubuntu@master:~$ sudo apt-get install openssh-server
ubuntu@master:~$ ssh localhost
ubuntu@master:~$ cd ~/.ssh/
ubuntu@master:~$ ssh-keygen -t rsa
ubuntu@master:~$ cat ./id_rsa.pub >> ./authorized_keys
(The first `ssh localhost` creates the ~/.ssh directory if it does not exist yet; type yes at the prompt and then exit.)
After generating the key, check that you can now ssh to localhost without entering a password:
ubuntu@master:~$ ssh localhost
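For a non-interactive check, SSH's BatchMode option makes the client fail instead of prompting for a password, so success here means key-based login really works. This is a convenience sketch, not part of the original steps:

```shell
# BatchMode=yes forbids password prompts: the command succeeds only if
# the key in authorized_keys is accepted by the server.
if ssh -o BatchMode=yes -o StrictHostKeyChecking=no localhost 'true' 2>/dev/null; then
  SSH_STATUS=ok
else
  SSH_STATUS=failed
fi
echo "passwordless SSH to localhost: $SSH_STATUS"
```

If this reports `failed`, check the permissions on ~/.ssh (700) and ~/.ssh/authorized_keys (600); sshd rejects keys in group- or world-writable files.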
Install Hadoop
1). Download and extract the Hadoop package
ubuntu@master:~$ wget http://ftp.cuhk.edu.hk/pub/packages/apache.org/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
ubuntu@master:~$ sudo tar -zxf hadoop-2.7.1.tar.gz -C /usr/local
ubuntu@master:~$ cd /usr/local
ubuntu@master:~$ sudo mv ./hadoop-2.7.1 ./hadoop
ubuntu@master:~$ sudo chown -R hadoop ./hadoop
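After the chown, it is worth confirming that the tree landed in the right place with the right owner. A small sketch (`stat -c` is the GNU coreutils form, which Ubuntu ships):

```shell
# Sketch: confirm /usr/local/hadoop exists and report its owner,
# which should be "hadoop" after the chown above.
HADOOP_HOME=/usr/local/hadoop
if [ -d "$HADOOP_HOME" ]; then
  HADOOP_OWNER=$(stat -c %U "$HADOOP_HOME" 2>/dev/null || echo unknown)
else
  HADOOP_OWNER=absent
fi
echo "owner of $HADOOP_HOME: $HADOOP_OWNER"
```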
2). Check and Run a MapReduce Job
We will now run a first Hadoop MapReduce job to verify the installation. The commands below use the grep example from the bundled examples jar, which reads text files and counts how often lines match a regular expression; the output is text files, each line of which contains a match and its count, separated by a tab. The WordCount example, which counts how often each word occurs, is run the same way.
ubuntu@master:~$ su hadoop
hadoop@master:~$ cd /usr/local/hadoop
hadoop@master:~$ ./bin/hadoop version
hadoop@master:~$ mkdir ./input
hadoop@master:~$ cp ./etc/hadoop/*.xml ./input
hadoop@master:~$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep ./input ./output 'dfs[a-z.]+'
hadoop@master:~$ cat ./output/*
hadoop@master:~$ rm -r ./output
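The grep job above verifies the installation; the word-count job promised at the top of this tutorial uses the `wordcount` class from the same examples jar. A sketch, assuming Hadoop was installed at /usr/local/hadoop as above and run as the hadoop user (the guard lets it degrade gracefully if that path is absent):

```shell
# Run the bundled WordCount example over the same ./input directory.
# Output lines have the form "<word><TAB><count>". Guarded so the
# script still completes when Hadoop is not installed at this path.
HADOOP_HOME=/usr/local/hadoop
if [ -x "$HADOOP_HOME/bin/hadoop" ]; then
  cd "$HADOOP_HOME"
  ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      wordcount ./input ./wc-output
  cat ./wc-output/*
  rm -r ./wc-output
  WC_RAN=yes
else
  echo "hadoop not found at $HADOOP_HOME; complete the install steps above first"
  WC_RAN=no
fi
```

As with the grep example, the output directory must not exist before the job runs, which is why it is removed afterwards.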
References
- Hadoop streaming: http://hadoop.apache.org/docs/stable1/streaming.html
- Write your own scripts in python: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
- Generic command options: http://hadoop.apache.org/docs/stable1/streaming.html#Generic+Command+Options
- Streaming command options: http://hadoop.apache.org/docs/stable1/streaming.html#Streaming+Command+Options
- Secondary Sort Java Example: https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/