• Please delete your data for task (e) if you've finished your HW. Thank you for your cooperation.

  • If you are taking up more than 600 GB in HDFS (the total amount of data generated by tasks (a)~(e)), it is very likely that your algorithm needs optimization, or that you are outputting redundant/unnecessary data to HDFS. If your files/folders in HDFS are too big, do not bother copying them to the Linux file system.

    PLEASE DELETE YOUR DATA from HDFS if it is taking too much space, once you have finished saving the final output. To check your disk usage, you can view the total size of a folder in HDFS with hdfs dfs -du -s -h <folder path>, and view the sizes of individual files with hdfs dfs -ls -h <file/folder path>. If you take up so much disk space that the normal functioning of DIC is affected, we reserve the right to delete your files on HDFS (and will inform you by email after the deletion).
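
    To free the space, delete folders recursively with hdfs dfs -rm -r. A minimal sketch, assuming your output lives under /user/s1155xxxxxx/hw_output (a hypothetical path; substitute your own):

      hdfs dfs -du -s -h /user/s1155xxxxxx/hw_output   # check the folder's total size first
      hdfs dfs -rm -r /user/s1155xxxxxx/hw_output      # delete it recursively

    Adding -skipTrash to the -rm command frees the space immediately instead of first moving the files to the trash directory.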

  • For tasks (b) and (e), set the number of mapper and reducer tasks in data-intensive MapReduce jobs to at least 10. You may adjust this according to your needs. In Hadoop Streaming, this can be set by specifying -D mapred.map.tasks=xxx and -D mapred.reduce.tasks=xxx.
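
    For example, a complete Hadoop Streaming invocation with 10 map and 10 reduce tasks might look like the sketch below (the streaming jar path, the HDFS input/output paths, and the mapper/reducer scripts are placeholders; substitute the ones for your setup). Note that the -D generic options must come before the streaming options such as -input:

      hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
          -D mapred.map.tasks=10 \
          -D mapred.reduce.tasks=10 \
          -input /user/s1155xxxxxx/input \
          -output /user/s1155xxxxxx/output \
          -mapper mapper.py \
          -reducer reducer.py \
          -file mapper.py \
          -file reducer.py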

  • Specify -D mapreduce.map.output.compress=true to enable compression of the intermediate results generated by mappers, so as to avoid excessive disk-space consumption.
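
    As a sketch, this option is added alongside the task-count options in the streaming command above; the second -D line selecting a codec is an optional extra (whether Snappy is available depends on the cluster's Hadoop build):

      hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
          -D mapreduce.map.output.compress=true \
          -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
          -D mapred.map.tasks=10 \
          -D mapred.reduce.tasks=10 \
          -input /user/s1155xxxxxx/input \
          -output /user/s1155xxxxxx/output \
          -mapper mapper.py \
          -reducer reducer.py \
          -file mapper.py \
          -file reducer.py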

  • Do not submit so many jobs at a time that your DIC resource usage exceeds the pre-defined per-user limit. If you do, your outstanding jobs will wait for the previous job(s) to finish and appear to be stuck.

    You can check all of your uncompleted jobs with the command hadoop job -list | grep s1155xxxxxx, where s1155xxxxxx is your student ID.

    To kill an unwanted job, use hadoop job -kill <job id>. But remember: do not kill jobs submitted by others.
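
    A minimal sketch of the whole workflow (job_1234567890_0001 is a made-up job id; use the one shown by the listing):

      hadoop job -list | grep s1155xxxxxx     # list your own uncompleted jobs
      hadoop job -kill job_1234567890_0001    # kill one of them by its job id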