IE DIC Information
Instructions on IE DIC Cluster
We have set up the IE DIC (Data-Intensive Cluster) account for you. Students who could not set up the single-node Hadoop cluster in HW#0 should contact the TAs: you can either set up a single-node Hadoop cluster with the TAs' help, or use your IE DIC Cluster account to run MapReduce programs.
Hadoop is already installed on the IE DIC Cluster, and you can log in to the cluster to submit jobs via the following command:
ssh s[your student_id]@dicvmd10.ie.cuhk.edu.hk
where student_id is your student ID number. You can find the password on the My Grades page of the eLearning system.
Note that this machine can only be accessed from within the IE network. You can follow the instruction document, placed in the eLearning system under the Course Contents directory, to set up the IE VPN using your IE account.
For those who are from other departments and would like to use the DIC cluster, please contact the TAs to get a temporary IE account.
Please note that the IE DIC can only be used for your homework. Any user found using it for other purposes will be removed from the system immediately and punished. To better allocate resources, a program/job will be terminated without notification if it consumes more than 25 GB of RAM.
The overview of the DIC cluster is provided below.
Cluster Overview:
- 10 nodes
- Memory: 256 GB * 10
- Virtual CPU Cores: 32 * 10
- Disk: 10 TB
- Resource management platform: YARN
- Installed applications: MapReduce
Cluster login:
- Log in to the cluster via:
ssh s[your student_id]@dicvmd10.ie.cuhk.edu.hk
- The cluster can only be accessed within the IE network. You can follow the instruction document to set up your IE VPN.
Find the logs of applications:
- Users can find the logs of all applications in the cluster via the web UI: http://dicvmd2.ie.cuhk.edu.hk:19888/
- Users can find the details of a particular application via the web UI: e.g., http://dicvmd2.ie.cuhk.edu.hk:19888/jobhistory/job/job_1694578679658_0003, where job_1694578679658_0003 is the ID of the job you created.
- The log information of an application includes:
  - how many containers were allocated,
  - the scheduling time and the completion time of each container,
  - the stderr file, which can help you find bugs in your code.
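If you prefer the command line, and log aggregation is enabled on the cluster, you can also fetch the full logs of a finished application with the yarn CLI. A minimal sketch using the example job above (the application ID is the job ID with the job_ prefix replaced by application_):
yarn logs -applicationId application_1694578679658_0003 | less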
Some Useful URLs and Paths for completing HW#1 on IE DIC
- YARN Resource Manager Web UI: http://dicvmd1.ie.cuhk.edu.hk:8088
  View overall information for each job (both ongoing and completed).
- YARN Job History Server: http://dicvmd2.ie.cuhk.edu.hk:19888
  View detailed information (incl. program stderr & Hadoop system logs) for each completed job and mapper/reducer task.
- SSH Connection / SFTP File Transfer (see the file-transfer example after this list):
  - Username is s+SID, e.g., s1155123456
  - Password is posted on CUHK Blackboard
  - Server address is dicvmd10.ie.cuhk.edu.hk
- Hadoop Streaming jar (see the job-submission example after this list), located at
  /usr/hdp/2.6.5.0-292/hadoop-mapreduce/hadoop-streaming.jar
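For file transfer, any SFTP/SCP client will do. A minimal sketch using scp from your own machine, assuming a local script mapper.py and the example student ID s1155123456 (both placeholders):
scp mapper.py s1155123456@dicvmd10.ie.cuhk.edu.hk:~/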
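To submit a Hadoop Streaming job with the jar above, here is a minimal sketch to run on the cluster after logging in, assuming hypothetical Python scripts mapper.py and reducer.py in your home directory and hypothetical HDFS input/output paths:
hadoop jar /usr/hdp/2.6.5.0-292/hadoop-mapreduce/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -input /user/s1155123456/input \
    -output /user/s1155123456/output \
    -mapper "python mapper.py" \
    -reducer "python reducer.py"
Note that the -output directory must not exist before the job runs, or the submission will fail.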
Run with Resource Limits
- If you are taking up more than 200 GB in HDFS (the total amount of data generated by all of tasks (a)~(e)), very likely your algorithm needs optimization, or you are outputting redundant/unnecessary data to HDFS. If your files/folders on HDFS are too big, do not bother copying them to the Linux file system.
- PLEASE DELETE YOUR DATA from HDFS if it is taking too much space, once you have finished saving the final output (see the cleanup example after this list). You can view the total size of a folder in HDFS with
  hdfs dfs -du -s -h <folder path>
  and view the size of files with
  hdfs dfs -ls -h <file/folder path>
  If you take up so much disk space that the normal functioning of the DIC is affected, we reserve the right to delete your files on HDFS without asking (and will inform you after the deletion).
- For HW1 Q1 tasks (b) and (e), set the number of mapper and reducer tasks in data-intensive MapReduce jobs to at least 10. You may customize that according to your needs. In Hadoop Streaming, this can be set by specifying
  -D mapred.map.tasks=xxx
  and
  -D mapred.reduce.tasks=xxx
  (see the full submission example after this list).
- Specify
  -D mapreduce.map.output.compress=true
  to enable compression of the intermediate results generated by mappers, so as to avoid taking up too much disk space.
- Do not submit so many jobs at a time that your DIC resource usage exceeds the pre-defined per-user limit; otherwise, your outstanding jobs will wait for the previous job(s) to finish and appear to be stuck. You can check all uncompleted jobs submitted by yourself with
  hadoop job -list | grep s1155xxxxxx
  where s1155xxxxxx is your student ID. To kill unwanted jobs, try
  hadoop job -kill <job id>
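A minimal sketch of the cleanup workflow, assuming a hypothetical output folder /user/s1155123456/output on HDFS:
# check how much space the folder takes in total
hdfs dfs -du -s -h /user/s1155123456/output
# save the final output locally first (only if it is reasonably small)
hdfs dfs -get /user/s1155123456/output ./output
# then delete it from HDFS; -skipTrash frees the space immediately
hdfs dfs -rm -r -skipTrash /user/s1155123456/output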
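Putting the -D options together, a sketch of a full Hadoop Streaming submission with at least 10 mappers/reducers and compressed map output (the Python scripts and HDFS paths are placeholders, as in the earlier example; note that -D options are generic options and must come before the streaming-specific options):
hadoop jar /usr/hdp/2.6.5.0-292/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.map.tasks=10 \
    -D mapred.reduce.tasks=10 \
    -D mapreduce.map.output.compress=true \
    -files mapper.py,reducer.py \
    -input /user/s1155123456/input \
    -output /user/s1155123456/output \
    -mapper "python mapper.py" \
    -reducer "python reducer.py"
Keep in mind that mapred.map.tasks is only a hint: the actual number of map tasks also depends on how the input is split.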