- Single-node Hadoop Setup
- Run a word-count MapReduce job
Single-node Hadoop Setup
Prerequisites:
1). Java 7:
Hadoop 2.7.1 requires a working Java 1.7+ (aka Java 7) installation. We will install OpenJDK 7 in this tutorial.
ubuntu@master:~$ sudo apt-get update
ubuntu@master:~$ sudo apt-get install openjdk-7-jre
ubuntu@master:~$ sudo apt-get install openjdk-7-jdk
Note: there are other ways to install Java on Linux.
Set JAVA_HOME:
Append the following export line to ~/.bashrc, then reload it:
ubuntu@master:~$ vim ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
ubuntu@master:~$ source ~/.bashrc
After installation, make a quick check that the JDK is correctly set up:
ubuntu@master:~$ java -version
java version "1.7.0_111"
OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
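Beyond `java -version`, a small scripted check can confirm that the path used for JAVA_HOME really contains a JDK. This is a convenience sketch, not part of the original steps; the path below is the Ubuntu amd64 OpenJDK 7 location assumed above.

```shell
# Sketch: verify the assumed JAVA_HOME path actually contains javac
# (javac is present only in the JDK package, not the JRE alone).
JDK_PATH=/usr/lib/jvm/java-7-openjdk-amd64
if [ -x "$JDK_PATH/bin/javac" ]; then
  JDK_STATUS=ok
else
  JDK_STATUS=missing
fi
echo "JDK at $JDK_PATH: $JDK_STATUS"
```

If this reports `missing`, check `ls /usr/lib/jvm/` for the directory name your distribution actually used.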
2). Create a User Account for Hadoop:
We do not want to run Hadoop as root, so we will create a dedicated user and group for Hadoop-related jobs.
ubuntu@master:~$ sudo useradd -m hadoop -s /bin/bash
ubuntu@master:~$ sudo passwd hadoop
ubuntu@master:~$ sudo adduser hadoop sudo
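To confirm the account was created as intended, `getent` can look it up in the passwd database (a generic check, not part of the original steps):

```shell
# Check that the hadoop user exists and note its login shell,
# which should be /bin/bash per the useradd -s flag above.
if getent passwd hadoop >/dev/null 2>&1; then
  HADOOP_SHELL=$(getent passwd hadoop | cut -d: -f7)
  echo "hadoop user exists, shell: $HADOOP_SHELL"
else
  HADOOP_SHELL=none
  echo "hadoop user not found; re-run the useradd command above"
fi
```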
3). Install SSH:
SSH (“Secure SHell”) is a protocol for securely accessing one machine from another. Hadoop uses SSH to log in to slave nodes and start and manage all HDFS and MapReduce daemons; even in single-node mode it connects to localhost the same way. An SSH server may already be installed. Otherwise, install one with:
ubuntu@master:~$ sudo apt-get install openssh-server
ubuntu@master:~$ ssh localhost
ubuntu@master:~$ cd ~/.ssh/
ubuntu@master:~$ ssh-keygen -t rsa
ubuntu@master:~$ cat ./id_rsa.pub >> ./authorized_keys
(The first `ssh localhost` creates the ~/.ssh directory if it does not exist yet; type yes at the prompt and then exit.)
After generating the key, check that you can now ssh to localhost without entering a password:
ubuntu@master:~$ ssh localhost
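For a non-interactive check, SSH's BatchMode option makes the client fail instead of prompting for a password, so success here means key-based login really works. This is a convenience sketch, not part of the original steps:

```shell
# BatchMode=yes forbids password prompts: the command succeeds only if
# the key in authorized_keys is accepted by the server.
if ssh -o BatchMode=yes -o StrictHostKeyChecking=no localhost 'true' 2>/dev/null; then
  SSH_STATUS=ok
else
  SSH_STATUS=failed
fi
echo "passwordless SSH to localhost: $SSH_STATUS"
```

If this reports `failed`, check the permissions on ~/.ssh (700) and ~/.ssh/authorized_keys (600); sshd rejects keys in group- or world-writable files.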
Install Hadoop
1). Download and extract the Hadoop package
ubuntu@master:~$ wget http://ftp.cuhk.edu.hk/pub/packages/apache.org/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
ubuntu@master:~$ sudo tar -zxf hadoop-2.7.1.tar.gz -C /usr/local
ubuntu@master:~$ cd /usr/local
ubuntu@master:~$ sudo mv ./hadoop-2.7.1 ./hadoop
ubuntu@master:~$ sudo chown -R hadoop ./hadoop
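After the chown, it is worth confirming that the tree landed in the right place with the right owner. A small sketch (`stat -c` is the GNU coreutils form, which Ubuntu ships):

```shell
# Sketch: confirm /usr/local/hadoop exists and report its owner,
# which should be "hadoop" after the chown above.
HADOOP_HOME=/usr/local/hadoop
if [ -d "$HADOOP_HOME" ]; then
  HADOOP_OWNER=$(stat -c %U "$HADOOP_HOME" 2>/dev/null || echo unknown)
else
  HADOOP_OWNER=absent
fi
echo "owner of $HADOOP_HOME: $HADOOP_OWNER"
```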
2). Check and Run a MapReduce Job
We will now run a first Hadoop MapReduce job to verify the installation. The commands below use the grep example from the bundled examples jar, which reads text files and counts how often lines match a regular expression; the output is text files, each line of which contains a match and its count, separated by a tab. The WordCount example, which counts how often each word occurs, is run the same way.
ubuntu@master:~$ su hadoop
hadoop@master:~$ cd /usr/local/hadoop
hadoop@master:~$ ./bin/hadoop version
hadoop@master:~$ mkdir ./input
hadoop@master:~$ cp ./etc/hadoop/*.xml ./input
hadoop@master:~$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep ./input ./output 'dfs[a-z.]+'
hadoop@master:~$ cat ./output/*
hadoop@master:~$ rm -r ./output
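The grep job above verifies the installation; the word-count job promised at the top of this tutorial uses the `wordcount` class from the same examples jar. A sketch, assuming Hadoop was installed at /usr/local/hadoop as above and run as the hadoop user (the guard lets it degrade gracefully if that path is absent):

```shell
# Run the bundled WordCount example over the same ./input directory.
# Output lines have the form "<word><TAB><count>". Guarded so the
# script still completes when Hadoop is not installed at this path.
HADOOP_HOME=/usr/local/hadoop
if [ -x "$HADOOP_HOME/bin/hadoop" ]; then
  cd "$HADOOP_HOME"
  ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      wordcount ./input ./wc-output
  cat ./wc-output/*
  rm -r ./wc-output
  WC_RAN=yes
else
  echo "hadoop not found at $HADOOP_HOME; complete the install steps above first"
  WC_RAN=no
fi
```

As with the grep example, the output directory must not exist before the job runs, which is why it is removed afterwards.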
References
- Hadoop streaming: http://hadoop.apache.org/docs/stable1/streaming.html
- Write your own scripts in python: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
- Generic command options: http://hadoop.apache.org/docs/stable1/streaming.html#Generic+Command+Options
- Streaming command options: http://hadoop.apache.org/docs/stable1/streaming.html#Streaming+Command+Options
- Secondary Sort Java Example: https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/