Tutorial 2

Tutorial Logistics

Static Sites
Build Hadoop with Docker
Run jobs

Static Sites

The image that we are going to use is a single-page website that I've already created for the purpose of this demo and hosted on the registry - prakhar1989/static-site. We can download and run the image directly in one go using docker run.

$ docker run prakhar1989/static-site
Digest: sha256:48ca6254c16b81a7960ad874e901f027fbcaac66509cfd79ce8f3f6da424d3b1

Since the image doesn't exist locally, the client first fetches it from the registry and then runs it. If all goes well, you should see a "Nginx is running..." message in your terminal. Now that the server is running, how do we see the website? What port is it running on? And more importantly, how do we access the container directly from our host machine?

Well in this case, the client is not exposing any ports so we need to re-run the docker run command to publish ports. While we're at it, we should also find a way so that our terminal is not attached to the running container. This way, you can happily close your terminal and keep the container running. This is called detached mode.

$ docker run -d -P --name static-site prakhar1989/static-site

In the above command, -d detaches our terminal, -P publishes all exposed ports to random host ports, and --name assigns the container a name of our choosing. Now we can see the ports by running the docker port [CONTAINER] command:

$ docker port static-site
443/tcp -> 0.0.0.0:32768
80/tcp -> 0.0.0.0:32769

You can open http://localhost:32769 in your browser.
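
Because -P picks random host ports, the number above will differ from run to run. As a small sketch, the function below pulls the host port for a given container port out of the docker port output. The name host_port_for is made up for this example and is not part of Docker.

```shell
# Read `docker port` output on stdin (lines such as "80/tcp -> 0.0.0.0:32769")
# and print the host port mapped to the given container port.
host_port_for() {
    awk -v p="$1/tcp" '$1 == p { n = split($3, parts, ":"); print parts[n]; exit }'
}

# Usage (assumes the static-site container from above is running):
#   docker port static-site | host_port_for 80
```

docker port also accepts the container port as a second argument (docker port static-site 80/tcp), which prints only the matching mapping.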

Note: If you're using docker-toolbox, then you might need to use docker-machine ip default to get the IP.

You can also specify a custom port to which the client will forward connections to the container.

$ docker run -p 8888:80 prakhar1989/static-site
Nginx is running...

To stop a detached container, run docker stop with the container's name or ID, for example docker stop static-site.
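
Putting the flags together, running the site detached on a fixed port and stopping it again might look like the sketch below. The function names start_site and stop_site are made up for illustration. The docker subcommands and flags are the ones used above, plus docker rm to free the container name for the next run.

```shell
# Start the site in detached mode, publishing container port 80 on the
# given host port (8888 if no argument is passed).
start_site() {
    host_port="${1:-8888}"
    docker run -d -p "${host_port}:80" --name static-site prakhar1989/static-site
}

# Stop the container by name, then remove it so the name can be reused.
stop_site() {
    docker stop static-site && docker rm static-site
}

# Usage:
#   start_site 8888    # then open http://localhost:8888
#   stop_site
```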

I'm sure you agree that was super simple. To deploy this on a real server you would just need to install Docker and run the above docker run command. Now that you've seen how to run a webserver inside a Docker container, you must be wondering - how do I create my own Docker image? This is the question we'll be exploring in the next section.

Build Hadoop with Docker

We've looked at images before, but in this section we'll dive deeper into what Docker images are and build our own image!

Docker images are the basis of containers. To see the list of images that are available locally, use the docker images command.

# docker images
REPOSITORY                TAG                 IMAGE ID            CREATED             SIZE
<none>                    <none>              aef319ff5fb2        2 days ago          643MB
ubuntu                    spark               7f1e3360a281        2 days ago          2.26GB
ubuntu                    hadoop              507c6b330971        2 days ago          933MB
ubuntu                    16.04               7e87e2b3bf7a        4 days ago          117MB
mesosphere/spark          latest              5c25c7985707        14 months ago       1.42GB
java                      openjdk-8-jdk       d23bdf5b1b1b        2 years ago         643MB
prakhar1989/static-site   latest              f01030e1dcf3        3 years ago         134MB

The above gives a list of images that I've pulled from the registry, along with ones that I've created myself (we'll shortly see how). The TAG refers to a particular snapshot of the image and the IMAGE ID is the corresponding unique identifier for that image.
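
To make the REPOSITORY and TAG columns concrete, here is a small sketch that lists the tags available locally for one repository by reading the docker images table. The function name tags_for_repo is an assumption of this example, not a Docker command.

```shell
# Read the `docker images` table on stdin and print the TAG column for
# every row whose REPOSITORY matches the argument. NR > 1 skips the header.
tags_for_repo() {
    awk -v repo="$1" 'NR > 1 && $1 == repo { print $2 }'
}

# Usage:
#   docker images | tags_for_repo ubuntu
```

With the listing above, this prints spark, hadoop and 16.04, one per line. Running docker images ubuntu gives a similar filtered view directly.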

For simplicity, you can think of an image as akin to a git repository - images can be committed with changes and have multiple versions. If you don't provide a specific version number, the client defaults to latest. For example, you can pull a specific version of the ubuntu image:

# docker pull ubuntu:16.04

To get a new Docker image you can either get it from a registry (such as the Docker Hub) or create your own. There are tens of thousands of images available on Docker Hub. You can also search for images directly from the command line using docker search.

After getting a Docker image, you can run it as follows.

# docker run -ti ubuntu:16.04
root@adb26705d7f1:/#

Install Java

root@adb26705d7f1:/# apt update
root@adb26705d7f1:/# apt install software-properties-common python-software-properties
root@adb26705d7f1:/# add-apt-repository ppa:webupd8team/java
root@adb26705d7f1:/# apt update
root@adb26705d7f1:/# apt install oracle-java8-installer
root@adb26705d7f1:/# java -version

After installing Java in the ubuntu:16.04 container, we need to install the following tools in the same container.

root@adb26705d7f1:/# apt update
root@adb26705d7f1:/# apt install wget
root@adb26705d7f1:/# apt install vim
root@adb26705d7f1:/# apt install net-tools       # ifconfig 
root@adb26705d7f1:/# apt install iputils-ping     # ping

To prepare the cluster environment, we also need to install SSH and rsync.

root@adb26705d7f1:/# apt install ssh 
root@adb26705d7f1:/# apt install rsync
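
After the install steps, a quick check can confirm that the tools really are on the PATH before moving on. The sketch below uses only the shell built-in command -v. The function name missing_tools is made up for this example.

```shell
# Print each named command that cannot be found on PATH.
# Prints nothing when every command is available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage inside the container, after the apt install steps above:
#   missing_tools wget vim ifconfig ping ssh rsync
```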

Set up passphraseless SSH. Inside a Docker container, setting up SSH is a little different. First, create the following directory and start the SSH daemon:

root@adb26705d7f1:/# mkdir /var/run/sshd
root@adb26705d7f1:/# /usr/sbin/sshd

This is needed because the SSH server is not started automatically when the container runs, so we have to start it manually. Then, to set up passphraseless SSH, we use the following commands.

root@adb26705d7f1:/# cd ~/
root@adb26705d7f1:/# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
root@adb26705d7f1:/# cd .ssh
root@adb26705d7f1:/# cat id_rsa.pub >> authorized_keys
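
If the cat command above is run more than once, the same key ends up in authorized_keys several times, and sshd may refuse key login when ~/.ssh or authorized_keys is writable by other users. A minimal sketch of a repeatable version follows. The function name authorize_key and its arguments are assumptions of this example.

```shell
# Append a public key to authorized_keys only if it is not already there,
# then tighten the permissions on the directory and the file.
# Arguments: path to the .pub file, and the .ssh directory (default ~/.ssh).
authorize_key() {
    pub_file="$1"
    ssh_dir="${2:-$HOME/.ssh}"
    mkdir -p "$ssh_dir"
    touch "$ssh_dir/authorized_keys"
    # -F: literal strings, -x: whole-line match, -f: read patterns from the key file.
    grep -qxF -f "$pub_file" "$ssh_dir/authorized_keys" ||
        cat "$pub_file" >> "$ssh_dir/authorized_keys"
    chmod 700 "$ssh_dir"
    chmod 600 "$ssh_dir/authorized_keys"
}

# Usage, after ssh-keygen has created the key pair:
#   authorize_key ~/.ssh/id_rsa.pub
```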

Test ssh and exit:

root@adb26705d7f1:/# ssh localhost
root@adb26705d7f1:/# exit

Install Hadoop:

root@adb26705d7f1:/# wget https://www-us.apache.org/dist/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
root@adb26705d7f1:/# tar -zxvf hadoop-2.9.2.tar.gz

Set environment variables:

root@adb26705d7f1:/# vim ~/.bashrc
  export HADOOP_HOME=/hadoop-2.9.2
  export JAVA_HOME=/usr/lib/jvm/java-8-oracle
  export PATH=$HADOOP_HOME/bin:$PATH:$JAVA_HOME/bin
root@adb26705d7f1:/# source ~/.bashrc
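
The same three lines can also be added without opening an editor, which is handy when the steps are scripted. This is a sketch under the paths used above. The function name add_hadoop_env is made up for this example.

```shell
# Append the Hadoop environment variables to a shell rc file (default ~/.bashrc).
# The quoted 'EOF' keeps $HADOOP_HOME, $PATH and $JAVA_HOME unexpanded, so
# they are written literally and resolved each time the file is sourced.
add_hadoop_env() {
    rc_file="${1:-$HOME/.bashrc}"
    cat >> "$rc_file" <<'EOF'
export HADOOP_HOME=/hadoop-2.9.2
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$HADOOP_HOME/bin:$PATH:$JAVA_HOME/bin
EOF
}

# Usage:
#   add_hadoop_env
#   source ~/.bashrc
```

Note that calling it twice appends the lines twice, so run it once per container.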

Configure the Hadoop environment:

root@adb26705d7f1:/# cd hadoop-2.9.2
root@adb26705d7f1:/# vim etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

Configure core-site.xml:

root@adb26705d7f1:/# vim etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Configure hdfs-site.xml:

root@adb26705d7f1:/# vim etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
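
Both files have the same shape: a configuration element holding one property. As a sketch, they can be generated from the shell instead of edited by hand. The function names below are made up for this example, and the default directory assumes Hadoop was unpacked to /hadoop-2.9.2 as above. Note that this overwrites the existing files.

```shell
# Print a Hadoop configuration document containing a single property.
# Arguments: property name, property value.
single_property_config() {
    cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>$1</name>
        <value>$2</value>
    </property>
</configuration>
EOF
}

# Write core-site.xml and hdfs-site.xml with the values used in this tutorial.
write_hdfs_config() {
    conf_dir="${1:-/hadoop-2.9.2/etc/hadoop}"
    single_property_config fs.defaultFS hdfs://localhost:9000 > "$conf_dir/core-site.xml"
    single_property_config dfs.replication 1 > "$conf_dir/hdfs-site.xml"
}

# Usage:
#   write_hdfs_config
#   bin/hdfs getconf -confKey fs.defaultFS    # prints the value Hadoop picked up
```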

Format the filesystem:

root@adb26705d7f1:/# bin/hdfs namenode -format

Start hdfs:

root@adb26705d7f1:/# ./sbin/start-dfs.sh

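Check the NameNode health page (dfshealth). In this example the container's host name is adb26705d7f1 and its IP address is 172.17.0.4, so open the following address in a browser on the host machine:

http://172.17.0.4:50070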
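
If the page does not load, it helps to check from inside the container whether the HDFS daemons are actually running. The sketch below reads the output of jps (the JDK's Java process lister, assumed here to be on the PATH via $JAVA_HOME/bin) and reports which of the daemons started by start-dfs.sh are missing. The function name missing_daemons is made up for this example.

```shell
# Read `jps` output on stdin and print every expected HDFS daemon that
# does not appear in it. Prints nothing when all three are running.
missing_daemons() {
    jps_output="$(cat)"
    for daemon in NameNode DataNode SecondaryNameNode; do
        # jps prints "<pid> <main class>"; compare the class name exactly so
        # that SecondaryNameNode is not mistaken for NameNode.
        printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
            echo "$daemon"
    done
}

# Usage inside the container:
#   jps | missing_daemons
```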

After setting up the whole environment in the container, we can commit it as a new image and use that image as the base for other images.

root@adb26705d7f1:/hadoop-2.9.2# exit
root@wutong:~# docker commit -m "hadoop install" adb26705d7f1 ubuntu:hadoop0127
root@wutong:~# docker images
REPOSITORY                TAG                 IMAGE ID            CREATED             SIZE
ubuntu                    hadoop0127          41b876015594        29 seconds ago      2.17GB
<none>                    <none>              aef319ff5fb2        2 days ago          643MB
ubuntu                    spark               7f1e3360a281        2 days ago          2.26GB
ubuntu                    hadoop              507c6b330971        2 days ago          933MB
ubuntu                    16.04               7e87e2b3bf7a        4 days ago          117MB
mesosphere/spark          latest              5c25c7985707        14 months ago       1.42GB
java                      openjdk-8-jdk       d23bdf5b1b1b        2 years ago         643MB
prakhar1989/static-site   latest              f01030e1dcf3        3 years ago         134MB

Run jobs

Start a container from the saved image:

root@wutong:~# docker run -it ubuntu:hadoop0127 /bin/bash
root@1676378bda6f:/#
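
A container started from the committed image keeps the files and configuration, but not the running processes, so sshd and HDFS have to be started again before any job can run. A sketch of that step is below. The function name start_hadoop_services and the RUN variable are assumptions of this example: RUN is a dry-run hook, so setting RUN=echo prints the commands instead of executing them.

```shell
# Start sshd and the HDFS daemons inside a fresh container.
# HADOOP_HOME falls back to the install path used in this tutorial.
start_hadoop_services() {
    hadoop_home="${HADOOP_HOME:-/hadoop-2.9.2}"
    ${RUN:-} mkdir -p /var/run/sshd &&
    ${RUN:-} /usr/sbin/sshd &&
    ${RUN:-} "$hadoop_home/sbin/start-dfs.sh"
}

# Usage:
#   RUN=echo start_hadoop_services    # preview the commands
#   RUN= start_hadoop_services        # run them
#   cd /hadoop-2.9.2                  # the commands below use relative paths
```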

Make the HDFS directories required to execute MapReduce jobs (run from the /hadoop-2.9.2 directory, with HDFS started):

bin/hdfs dfs -mkdir /user

Generate input data with TeraGen:

# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar teragen 100000 terasort/input
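
TeraGen writes rows of 100 bytes each, so the 100000 rows above come to about 10 MB. The generated data is normally fed to the terasort and teravalidate programs from the same examples jar. The sketch below shows both steps. The function names, the output paths terasort/output and terasort/validate, and the RUN dry-run hook (RUN=echo prints the commands) are assumptions of this example.

```shell
EXAMPLES_JAR=share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar

# Convert a target size in bytes to the row count teragen expects.
teragen_rows() {
    echo $(( $1 / 100 ))
}

# Sort the generated input, then check that the output is globally sorted.
# Run from the /hadoop-2.9.2 directory.
sort_and_validate() {
    ${RUN:-} bin/hadoop jar "$EXAMPLES_JAR" terasort terasort/input terasort/output &&
    ${RUN:-} bin/hadoop jar "$EXAMPLES_JAR" teravalidate terasort/output terasort/validate
}

# Usage:
#   teragen_rows 10000000    # row count for 10 MB of input
#   sort_and_validate
```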

Supplements

Open/Delete Docker images:

(Linux) # docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
ubuntu              hadoop              c4947022e03e        About an hour ago   2.18GB
ubuntu              java                71a751b9551c        3 hours ago         933MB
ubuntu              16.04               7e87e2b3bf7a        12 hours ago        117MB
hello-world         latest              fce289e99eb9        3 weeks ago         1.84kB
busybox             latest              3a093384ac30        3 weeks ago         1.2MB

(Linux) # docker run -it ubuntu:java /bin/bash 

(Linux) # docker rmi fce289e99eb9

Open/Stop Docker container:

(Linux) $ sudo su -

(Linux) # docker ps -a

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                      PORTS               NAMES
996b7da00089        ubuntu:java         "/bin/bash"         11 minutes ago      Up 11 minutes                                   thirsty_bose
448e80221a66        ubuntu:16.04        "/bin/bash"         About an hour ago   Up About an hour                                inspiring_chatelet
75b6b0563555        ubuntu:java         "/bin/bash"         3 hours ago         Exited (0) 14 minutes ago                       elastic_elbakyan
860d21c19e1b        ubuntu:java         "/bin/bash"         3 hours ago         Exited (0) 2 hours ago                          suspicious_panini
aa9d3c99c3f3        ubuntu:16.04        "/bin/bash"         3 hours ago         Exited (127) 3 hours ago                        condescending_rosalind
f2c276cb124a        ubuntu:16.04        "/bin/bash"         5 hours ago         Exited (0) 3 hours ago                          admiring_yalow

(Linux) # docker stop $CONTAINER_ID

(Linux) # docker start $CONTAINER_ID
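
Exited containers like the ones in the listing above stay on disk until they are removed. As a sketch, the function below removes every exited container in one go, using the --filter and -q options of docker ps. The function name remove_exited_containers is made up for this example.

```shell
# Remove every container whose status is "exited".
# Running containers are left untouched.
remove_exited_containers() {
    exited="$(docker ps -a -q --filter status=exited)"
    if [ -n "$exited" ]; then
        # Unquoted on purpose: each container ID becomes its own argument.
        docker rm $exited
    else
        echo "no exited containers"
    fi
}

# Usage:
#   remove_exited_containers
```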

References

Docker installation: https://www.docker.com/get-started
Previous Tutorial: http://mobitec.ie.cuhk.edu.hk/ierg4330Spring2017/tutorial/tutorial3/tutorial3.html

IERG 4330 / ESTR 4316
Programming Big Data Systems

(Offered in 2019 Spring)

~ Tutorial 2 ~