Spark on IE DIC Kubernetes Cluster
Prerequisites
Make sure you have done the following (a quick verification snippet follows this list):
- Install Docker on your local machine / VM
- Download spark-2.4.7-bin-hadoop2.7.tgz from the Apache Spark downloads page and extract it on your local machine / VM
- Install kubectl on your local machine / VM
- Set up the IE VPN on your local machine / VM
  - https://sslvpn.ie.cuhk.edu.hk/global-protect/login.esp
  - Guide: IE_SSLVPN_user_guide.pdf
  - If you are not an IE student, please send me an email with your Student ID attached, and I will apply for a temporary IE account for you.
- Sign up for Docker Hub
- Download the <sid>.config file from your email, e.g. 1155049830.config
- In this tutorial, we will assume <sid> = 1155049830
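As a quick check before moving on, you can verify the tools are in place and unpack the Spark distribution (a minimal sketch; adjust the path to wherever you downloaded the tarball):

$ docker --version
$ kubectl version --client
# Extract the Spark distribution into the current directory
$ tar -xzf spark-2.4.7-bin-hadoop2.7.tgz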
Step 1: Configure your local kubectl to access the remote Kubernetes cluster
First, make sure you have connected to the IE VPN!
Change the current directory to where the <sid>.config file is located and run the following:
$ mkdir ~/.kube
$ mv 1155049830.config ~/.kube/config
$ kubectl get pods
Error from server (Forbidden): pods is forbidden: User "321" cannot list resource "pods" in API group "" in the namespace "default"
$ kubectl get pods -n 1155049830
No resources found in 1155049830 namespace.
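The Forbidden error in the default namespace is expected: your account is only allowed to operate inside your own <sid> namespace. If you want to confirm which cluster and user your kubectl is pointing at, these purely client-side commands read the kubeconfig you just installed (a small sketch; no API permissions are needed):

$ kubectl config current-context
# Show the API server address and user taken from ~/.kube/config
$ kubectl config view --minify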
Step 2: Create a Kubernetes service account under your namespace
Again, make sure you have connected to the IE VPN!
$ kubectl create serviceaccount spark -n 1155049830
serviceaccount/spark created
$ kubectl create rolebinding spark-role --clusterrole=edit --serviceaccount=1155049830:spark --namespace=1155049830
rolebinding.rbac.authorization.k8s.io/spark-role created
$ kubectl get serviceaccounts -n 1155049830
NAME      SECRETS   AGE
default   1         4m56s
spark     1         70s
$ kubectl get rolebindings -n 1155049830
NAME            ROLE                             AGE
spark-role      ClusterRole/edit                 72s
sparkrolebind   ClusterRole/spark-cluster-role   5m51s
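If you want to sanity-check the binding, kubectl can show it and evaluate permissions for you (a sketch; the --as form only works if your own account is allowed to impersonate service accounts, so skip it if it returns Forbidden):

# Inspect the role binding that was just created
$ kubectl get rolebinding spark-role -n 1155049830 -o yaml
# Ask the API server whether the spark service account may create pods (should answer "yes")
$ kubectl auth can-i create pods -n 1155049830 --as=system:serviceaccount:1155049830:spark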
Step 3: Build the Docker image for your Spark job and push it to Docker Hub
$ cd spark-2.4.7-bin-hadoop2.7
$ ./bin/docker-image-tool.sh -r docker.io/handasontam -t v1.0.0 build
...
...
$ docker images
REPOSITORY             TAG      IMAGE ID       CREATED              SIZE
handasontam/spark-r    v1.0.0   0f31917a370e   4 seconds ago        1.11GB
handasontam/spark-py   v1.0.0   d91e8923e6cb   About a minute ago   1.06GB
handasontam/spark      v1.0.0   67dcbcc53d66   2 minutes ago        558MB
$ ./bin/docker-image-tool.sh -r docker.io/handasontam -t v1.0.0 push
...
...
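The images built above only contain the stock Spark examples. If you later want to run your own code, one common approach is to layer your application on top of the spark-py image and push the result to Docker Hub (a sketch; my_app.py and the my-spark-app image name are hypothetical, assuming /opt/spark/work-dir as the working directory used by the stock image):

# Bake a hypothetical PySpark script into a derived image
$ cat > Dockerfile <<'EOF'
FROM handasontam/spark-py:v1.0.0
COPY my_app.py /opt/spark/work-dir/
EOF
$ docker build -t docker.io/handasontam/my-spark-app:v1.0.0 .
$ docker push docker.io/handasontam/my-spark-app:v1.0.0

At submission time you would then reference the script as local:///opt/spark/work-dir/my_app.py and point spark.kubernetes.container.image at the new image (see the PySpark section at the end).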
Step 4: Submit the Spark job to the IE DIC Kubernetes cluster
Make sure you have connected to the IE VPN!
$ cd spark-2.4.7-bin-hadoop2.7
$ bin/spark-submit \
--master k8s://https://172.16.5.172:6443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.app.name=sparkpi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=1155049830 \
--conf spark.kubernetes.container.image=docker.io/handasontam/spark:v1.0.0 \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.7.jar \
1000
$ kubectl get pods -n 1155049830
NAME READY STATUS RESTARTS AGE
spark-pi-1615874511548-driver 1/1 Running 0 10s
spark-pi-7bb7227839a068ca-exec-1 1/1 Running 0 4s
spark-pi-7bb7227839a068ca-exec-2 1/1 Running 0 4s
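The run above got two executor pods, which is what Spark falls back to when nothing is configured. To request a different number of executors or different resources, you can insert the standard Spark configuration flags into the spark-submit command above, before the application jar path (a sketch; the numbers are placeholders, pick values that fit your namespace's quota):

--conf spark.executor.instances=3 \
--conf spark.executor.memory=1g \
--conf spark.executor.cores=1 \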
Check the logs of your submitted Spark job(s)
$ kubectl get pods -n 1155049830
NAME                            READY   STATUS      RESTARTS   AGE
spark-pi-1615874511548-driver   0/1     Completed   0          2m13s
$ kubectl describe pod spark-pi-1615874511548-driver -n 1155049830
Name:         spark-pi-1615874511548-driver
Namespace:    1155049830
Priority:     0
Node:         dicvm2.ie.cuhk.edu.hk/172.16.5.187
Start Time:   Tue, 16 Mar 2021 14:01:52 +0800
Labels:       spark-app-selector=spark-baf19e3097564fe5bb4e6ebd11c1da04
              spark-role=driver
Annotations:  <none>
Status:       Succeeded
IP:           10.44.0.1
IPs:
  IP:  10.44.0.1
Containers:
  spark-kubernetes-driver:
    Container ID:  docker://d21a166c3bbb281a33b4e40e851baeceda7f0f08968a2e942551b971c7b9bab9
    Image:         docker.io/handasontam/spark:v3.1.1
    Image ID:      docker-pullable://handasontam/spark@sha256:00fe4357de12171292a1dbda028a75309f3344c0f4c00ad0c17ec5ec1e503630
    Ports:         7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      driver
      --properties-file
      /opt/spark/conf/spark.properties
      --class
      org.apache.spark.examples.SparkPi
      spark-internal
      1000
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 16 Mar 2021 14:01:54 +0800
      Finished:     Tue, 16 Mar 2021 14:02:10 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  1408Mi
    Requests:
      cpu:     1
      memory:  1408Mi
    Environment:
      SPARK_DRIVER_BIND_ADDRESS:   (v1:status.podIP)
      SPARK_LOCAL_DIRS:            /var/data/spark-c96c7672-b4b9-496e-8b87-ac12276c9cfc
      SPARK_CONF_DIR:              /opt/spark/conf
    Mounts:
      /opt/spark/conf from spark-conf-volume (rw)
      /var/data/spark-c96c7672-b4b9-496e-8b87-ac12276c9cfc from spark-local-dir-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from spark-token-n85vz (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  spark-local-dir-1:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  spark-conf-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      spark-pi-1615874511548-driver-conf-map
    Optional:  false
  spark-token-n85vz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  spark-token-n85vz
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
$ kubectl logs -f spark-pi-1615874511548-driver -n 1155049830
...
...
$ kubectl logs -f spark-pi-1615874511548-driver -n 1155049830 | grep Pi
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.44.0.1 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal 1000
21/03/16 06:01:56 INFO SparkContext: Submitted application: Spark Pi
21/03/16 06:02:02 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
21/03/16 06:02:02 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 1000 output partitions
21/03/16 06:02:02 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
21/03/16 06:02:02 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
21/03/16 06:02:02 INFO DAGScheduler: Submitting 1000 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
21/03/16 06:02:09 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 7.027 s
21/03/16 06:02:09 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 7.109577 s
Pi is roughly 3.141714031417140
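The executor pods are removed automatically when the application finishes, but the completed driver pod (Status: Completed above) stays around so that you can still read its logs. Once you are done with it, you can clean it up (a small sketch; substitute the actual driver pod name from kubectl get pods):

# Delete one finished driver pod
$ kubectl delete pod spark-pi-1615874511548-driver -n 1155049830
# Or delete every driver pod in your namespace, using the spark-role=driver label shown earlier
$ kubectl delete pods -l spark-role=driver -n 1155049830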
PySpark
Modify Step 4 above as follows:
$ bin/spark-submit \
--master k8s://https://172.16.5.172:6443 \
--deploy-mode cluster \
--name spark-pi \
--conf spark.app.name=sparkpi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=1155049830 \
--conf spark.kubernetes.container.image=docker.io/handasontam/spark-py:v1.0.0 \
local:///opt/spark/examples/src/main/python/pi.py \
10
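If you built a custom image containing your own script, as sketched at the end of Step 3, the submission looks the same apart from the image name and the application path (hypothetical names, matching that earlier sketch):

$ bin/spark-submit \
--master k8s://https://172.16.5.172:6443 \
--deploy-mode cluster \
--name my-spark-app \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=1155049830 \
--conf spark.kubernetes.container.image=docker.io/handasontam/my-spark-app:v1.0.0 \
local:///opt/spark/work-dir/my_app.py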