Spark on IE DIC Kubernetes Cluster
Prerequisites
Make sure you have done the following (a quick verification snippet follows this list):
- Install Docker on your local machine / VM
- Download spark-2.4.7-bin-hadoop2.7.tgz from the Apache Spark downloads page and extract it on your local machine / VM
- Install kubectl on your local machine / VM
- Set up the IE VPN on your local machine / VM
  - https://sslvpn.ie.cuhk.edu.hk/global-protect/login.esp
  - Guide: IE_SSLVPN_user_guide.pdf
  - If you are not an IE student, please send me an email with your Student ID attached, and I will apply for a temporary IE account for you.
- Sign up for Docker Hub
- Download the <sid>.config file from your email, e.g. 1155049830.config
- In this tutorial, we will assume <sid> = 1155049830
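As a quick check before moving on, you can verify the tools are in place and unpack the Spark distribution (a minimal sketch; adjust the path to wherever you downloaded the tarball):

$ docker --version
$ kubectl version --client
# Extract the Spark distribution into the current directory
$ tar -xzf spark-2.4.7-bin-hadoop2.7.tgz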
Step 1: Configure your local kubectl to access the remote Kubernetes cluster
First, make sure you have connected to the IE VPN!
Change the current directory to where the <sid>.config file is located and run the following:
$ mkdir ~/.kube
$ mv 1155049830.config ~/.kube/config
$ kubectl get pods
Error from server (Forbidden): pods is forbidden: User "321" cannot list resource "pods" in API group "" in the namespace "default"
$ kubectl get pods -n 1155049830
No resources found in 1155049830 namespace.
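The Forbidden error in the default namespace is expected: your account is only allowed to operate inside your own <sid> namespace. If you want to confirm which cluster and user your kubectl is pointing at, these purely client-side commands read the kubeconfig you just installed (a small sketch; no API permissions are needed):

$ kubectl config current-context
# Show the API server address and user taken from ~/.kube/config
$ kubectl config view --minify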
Step 2: Create a Kubernetes service account under your namespace
Again, make sure you have connected to the IE VPN!
$ kubectl create serviceaccount spark -n 1155049830
serviceaccount/spark created
$ kubectl create rolebinding spark-role --clusterrole=edit --serviceaccount=1155049830:spark --namespace=1155049830
rolebinding.rbac.authorization.k8s.io/spark-role created
$ kubectl get serviceaccounts -n 1155049830
NAME      SECRETS   AGE
default   1         4m56s
spark     1         70s
$ kubectl get rolebindings -n 1155049830
NAME            ROLE                             AGE
spark-role      ClusterRole/edit                 72s
sparkrolebind   ClusterRole/spark-cluster-role   5m51s
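If you want to sanity-check the binding, kubectl can show it and evaluate permissions for you (a sketch; the --as form only works if your own account is allowed to impersonate service accounts, so skip it if it returns Forbidden):

# Inspect the role binding that was just created
$ kubectl get rolebinding spark-role -n 1155049830 -o yaml
# Ask the API server whether the spark service account may create pods (should answer "yes")
$ kubectl auth can-i create pods -n 1155049830 --as=system:serviceaccount:1155049830:spark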
Step 3: Build the Docker image for your Spark job and push it to Docker Hub
$ cd spark-2.4.7-bin-hadoop2.7
$ ./bin/docker-image-tool.sh -r docker.io/handasontam -t v1.0.0 build
...
...
$ docker images
REPOSITORY             TAG      IMAGE ID       CREATED              SIZE
handasontam/spark-r    v1.0.0   0f31917a370e   4 seconds ago        1.11GB
handasontam/spark-py   v1.0.0   d91e8923e6cb   About a minute ago   1.06GB
handasontam/spark      v1.0.0   67dcbcc53d66   2 minutes ago        558MB
$ ./bin/docker-image-tool.sh -r docker.io/handasontam -t v1.0.0 push
...
...
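The images built above only contain the stock Spark examples. If you later want to run your own code, one common approach is to layer your application on top of the spark-py image and push the result to Docker Hub (a sketch; my_app.py and the my-spark-app image name are hypothetical, assuming /opt/spark/work-dir as the working directory used by the stock image):

# Bake a hypothetical PySpark script into a derived image
$ cat > Dockerfile <<'EOF'
FROM handasontam/spark-py:v1.0.0
COPY my_app.py /opt/spark/work-dir/
EOF
$ docker build -t docker.io/handasontam/my-spark-app:v1.0.0 .
$ docker push docker.io/handasontam/my-spark-app:v1.0.0

At submission time you would then reference the script as local:///opt/spark/work-dir/my_app.py and point spark.kubernetes.container.image at the new image (see the PySpark section at the end).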
Step 4: Submit the Spark job to the IE DIC Kubernetes cluster
Make sure you have connected to the IE VPN!
$ cd spark-2.4.7-bin-hadoop2.7
$ bin/spark-submit \
--master k8s://https://172.16.5.172:6443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.app.name=sparkpi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=1155049830 \
--conf spark.kubernetes.container.image=docker.io/handasontam/spark:v1.0.0 \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.7.jar \
1000
$ kubectl get pods -n 1155049830
NAME READY STATUS RESTARTS AGE
spark-pi-1615874511548-driver 1/1 Running 0 10s
spark-pi-7bb7227839a068ca-exec-1 1/1 Running 0 4s
spark-pi-7bb7227839a068ca-exec-2 1/1 Running 0 4s
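The run above got two executor pods, which is what Spark falls back to when nothing is configured. To request a different number of executors or different resources, you can insert the standard Spark configuration flags into the spark-submit command above, before the application jar path (a sketch; the numbers are placeholders, pick values that fit your namespace's quota):

--conf spark.executor.instances=3 \
--conf spark.executor.memory=1g \
--conf spark.executor.cores=1 \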
Check the logs of your submitted Spark job(s)
$ kubectl get pods -n 1155049830
NAME                            READY   STATUS      RESTARTS   AGE
spark-pi-1615874511548-driver   0/1     Completed   0          2m13s
$ kubectl describe pod spark-pi-1615874511548-driver -n 1155049830
Name:         spark-pi-1615874511548-driver
Namespace:    1155049830
Priority:     0
Node:         dicvm2.ie.cuhk.edu.hk/172.16.5.187
Start Time:   Tue, 16 Mar 2021 14:01:52 +0800
Labels:       spark-app-selector=spark-baf19e3097564fe5bb4e6ebd11c1da04
              spark-role=driver
Annotations:  <none>
Status:       Succeeded
IP:           10.44.0.1
IPs:
  IP:  10.44.0.1
Containers:
  spark-kubernetes-driver:
    Container ID:  docker://d21a166c3bbb281a33b4e40e851baeceda7f0f08968a2e942551b971c7b9bab9
    Image:         docker.io/handasontam/spark:v3.1.1
    Image ID:      docker-pullable://handasontam/spark@sha256:00fe4357de12171292a1dbda028a75309f3344c0f4c00ad0c17ec5ec1e503630
    Ports:         7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      driver
      --properties-file
      /opt/spark/conf/spark.properties
      --class
      org.apache.spark.examples.SparkPi
      spark-internal
      1000
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 16 Mar 2021 14:01:54 +0800
      Finished:     Tue, 16 Mar 2021 14:02:10 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  1408Mi
    Requests:
      cpu:     1
      memory:  1408Mi
    Environment:
      SPARK_DRIVER_BIND_ADDRESS:   (v1:status.podIP)
      SPARK_LOCAL_DIRS:            /var/data/spark-c96c7672-b4b9-496e-8b87-ac12276c9cfc
      SPARK_CONF_DIR:              /opt/spark/conf
    Mounts:
      /opt/spark/conf from spark-conf-volume (rw)
      /var/data/spark-c96c7672-b4b9-496e-8b87-ac12276c9cfc from spark-local-dir-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from spark-token-n85vz (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  spark-local-dir-1:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  spark-conf-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      spark-pi-1615874511548-driver-conf-map
    Optional:  false
  spark-token-n85vz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  spark-token-n85vz
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
$ kubectl logs -f spark-pi-1615874511548-driver -n 1155049830
...
...
$ kubectl logs -f spark-pi-1615874511548-driver -n 1155049830 | grep Pi
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.44.0.1 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal 1000
21/03/16 06:01:56 INFO SparkContext: Submitted application: Spark Pi
21/03/16 06:02:02 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
21/03/16 06:02:02 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 1000 output partitions
21/03/16 06:02:02 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
21/03/16 06:02:02 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
21/03/16 06:02:02 INFO DAGScheduler: Submitting 1000 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
21/03/16 06:02:09 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 7.027 s
21/03/16 06:02:09 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 7.109577 s
Pi is roughly 3.141714031417140
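The executor pods are removed automatically when the application finishes, but the completed driver pod (Status: Completed above) stays around so that you can still read its logs. Once you are done with it, you can clean it up (a small sketch; substitute the actual driver pod name from kubectl get pods):

# Delete one finished driver pod
$ kubectl delete pod spark-pi-1615874511548-driver -n 1155049830
# Or delete every driver pod in your namespace, using the spark-role=driver label shown earlier
$ kubectl delete pods -l spark-role=driver -n 1155049830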
PySpark
Modify Step 4 above as follows:
$ bin/spark-submit \
--master k8s://https://172.16.5.172:6443 \
--deploy-mode cluster \
--name spark-pi \
--conf spark.app.name=sparkpi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=1155049830 \
--conf spark.kubernetes.container.image=docker.io/handasontam/spark-py:v1.0.0 \
local:///opt/spark/examples/src/main/python/pi.py \
10
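If you built a custom image containing your own script, as sketched at the end of Step 3, the submission looks the same apart from the image name and the application path (hypothetical names, matching that earlier sketch):

$ bin/spark-submit \
--master k8s://https://172.16.5.172:6443 \
--deploy-mode cluster \
--name my-spark-app \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=1155049830 \
--conf spark.kubernetes.container.image=docker.io/handasontam/my-spark-app:v1.0.0 \
local:///opt/spark/work-dir/my_app.py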