# ~ Home ~

## Description

The course covers data-intensive analytics and the automated processing of very large amounts of structured and unstructured information. We focus on leveraging MapReduce and related paradigms to create parallel algorithms that scale to massive data sets, such as those collected from the World Wide Web or other Internet systems and applications. The course is organized around a list of large-scale data analytics problems encountered in practice, and the theories and methodologies required to tackle each problem will be introduced along the way. As such, students are only expected to have a solid background in probability, statistics, linear algebra, and computer programming.

Topics to be covered include:

• The MapReduce computational model, its system architecture, and its realization in practice
• Finding Frequent Item-sets and Association Rules
• Finding Similar Items in high-dimensional data
• Dimensionality Reduction techniques
• Clustering
• Recommendation systems
• Analysis of Massive Graphs and its applications on the World Wide Web
• Large-scale supervised machine learning
• Processing and mining of Data Streams and their applications in large-scale network/online-activity monitoring

## Course Information

The lectures and the tutorials will be conducted in ZOOM:

• ZOOM meeting ID: 944 3762 5322
• Students who have not registered for the course but want to access the meeting should email a TA to obtain the meeting password. (See below for the TAs' email addresses.)

Lecture time:

• WED 09:30 - 11:15
• FRI 09:30 - 11:15

Lecture time (ESTR4300):

• WED 17:30 - 18:15

Tutorial:

• Time: WED 11:30 - 12:15
• Time: THU 19:30 - 20:15

TA Office Hours: (If you need help from a TA outside these periods, please email the TA in advance to make an appointment.)

• Da Sun Handason Tam: THU 21:30 - 22:15
• Siyue Xie: TUE 10:00 - 10:45
• Liu Yang: THU 15:00 - 15:45

Instructor:

• Prof. Wing Cheong Lau. wclau [at] ie [dot] cuhk [dot] edu [dot] hk
• Office hours: TBA

Teaching Assistants:

• Da Sun Handason Tam tds019 [at] ie [dot] cuhk [dot] edu [dot] hk
• Siyue Xie xs019 [at] ie [dot] cuhk [dot] edu [dot] hk
• Yang Liu ly016 [at] ie [dot] cuhk [dot] edu [dot] hk

Website account:

User: ierg4300


## Tentative Timetable

| Week | Date | Topics | Lecture slots | Readings | Additional references |
|---|---|---|---|---|---|
| 1 | Sept 9, 11 | Course Admin; Era of Big Data Analytics | W2-3, F2-3 | [JLin] Ch1 | [DataCenter] |
| 2 | Sept 16, 18 | Computing as a Utility; Data-center Architecture | W2-3, F2-3 | [MMDS] Ch1 | - |
| 3-4 | Sept 23, 25, 30 | MapReduce/Hadoop; The Big Data Processing stack | W2-3, F2-3, W2-3 | [MMDS] Ch2.1-2.4; [JLin] Ch2; [JLin] Ch3.1-3.4 | [CloudData] |
| | Oct 2 | **Public holiday: The day following the Chinese Mid-Autumn Festival** | | | |
| 4-5 | Oct 7, 9 | Frequent Item-Set Mining and Association Rules | W2-3, F2-3 | [MMDS] Ch6.1-6.4 | - |
| 5-6 | Oct 14, 16, 21 | Finding Similar Items and LSH | W2-3, F2-3, W2-3 | [MMDS] Ch3.1-3.5 | [ZG] |
| 6-7 | Oct 23, 28, 30 | Clustering and GMM | F2-3, W2-3, F2-3 | [MMDS] Ch7.1-7.4; [MMDS] Ch11; [CBishop] Ch9; [MLE/MAP] | - |
| 8 | Nov 4, 6 | Dimension Reduction | W2-3, F2-3 | [MMDS] Ch11 | [PCA]; [GuruswamiKannan] |
| 9 | Nov 11, 13 | Recommendation Systems | W2-3, F2-3 | [SVDPCA]; [ANgCS229PCA]; [ShaliziADAEPV] Ch17 | - |
| 10-11 | Nov 18, 20 | Regression and Gradient Descent; Recommendation Systems (cont'd) | W2-3, F2-3 | [MMDS] Ch9 | [Netflix09]; [KorenTalk]; [ANg] |
| 12 | Nov 25, 27 | Data Stream Algorithms | W2-3, F2-3 | [MMDS] Ch4.1-4.5 | - |
| 13 | Dec 2, 4 | Data Stream Algorithms (cont'd) | W2-3, F2-3 | [ChakDataStream] Ch0, Ch1, Ch4.4, Ch6 | - |

## Course Assessment

For IERG4300:

• Homework (5 sets in total): 65%
• Q&A Design Assignment(s): 25%
• Class Participation: 10%

For ESTR4300:

NEW course assessment scheme:

• Homework (5 sets in total): 67%
• Q&A Design Assignment(s): 22%
• Class Participation: 11%

OLD course assessment scheme:

• Homework (5 sets in total): 60%
• Q&A Design Assignment(s): 20%
• Oral Presentation: 10%
• Class Participation: 10%
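
As a rough illustration of how the weights in this section combine, the sketch below computes an overall score as a weighted average. It assumes each component is scored on a 0-100 scale and that the final score is a plain weighted sum; the course may aggregate marks differently. The names `SCHEMES` and `overall_score` and the example scores are hypothetical.

```python
# Assessment weights in percent, copied from the schemes listed in this section.
SCHEMES = {
    "IERG4300": {"homework": 65, "qa_design": 25, "participation": 10},
    "ESTR4300 (new)": {"homework": 67, "qa_design": 22, "participation": 11},
    "ESTR4300 (old)": {
        "homework": 60,
        "qa_design": 20,
        "oral_presentation": 10,
        "participation": 10,
    },
}


def overall_score(weights, component_scores):
    """Return the weighted average of component scores (each on a 0-100 scale).

    Assumption: the final score is a simple weighted sum of the components.
    """
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    if set(weights) != set(component_scores):
        raise ValueError("scores must cover exactly the weighted components")
    return sum(weights[name] * component_scores[name] for name in weights) / 100


# Hypothetical component scores, for illustration only.
example_scores = {"homework": 80, "qa_design": 70, "participation": 90}
ierg_total = overall_score(SCHEMES["IERG4300"], example_scores)
estr_total = overall_score(SCHEMES["ESTR4300 (new)"], example_scores)
print(ierg_total, estr_total)
```

Under these assumed scores, the same performance gives slightly different totals under the two schemes because the ESTR4300 scheme shifts weight from the Q&A assignment to homework and participation.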

## Guidelines for Q&A assignment

The "Q&A" assignment asks each student to design and submit a set of questions AND model answers/suggested solutions for a future 2-hr final examination of IERG4300. To avoid trivial questions that merely test the exam takers' memorization ability, you should assume the exam is open-book/open-note, or one that allows each student to bring a couple of pages of cheat sheets into the examination venue.

The assignment will be assessed on the following criteria:

• Originality and thoughtfulness of the questions, i.e., the questions should be non-trivial and be able to highlight and test/promote the most important concepts/ ideas/ techniques which have been taught in our class.
• Correctness of the suggested solutions/ model answers.
• Comprehensiveness (or the lack of it), i.e., your set of questions, taken together, should cover multiple (the more, the better) key concepts/ideas/techniques taught in our class so far. In other words, setting a single MapReduce question that takes up the entire 2-hr exam period won't be a good choice.
• Suitability of the overall set of questions for a time-limited 2-hr exam. In other words, it should be reasonable for a student to complete your proposed set of questions within the 2-hr limit.
• Diversity of the questions in terms of difficulty, so that they differentiate students with different levels of competence in the subject being tested.

Since the originality and thoughtfulness of the proposed questions are the key considerations, you MUST NOT copy or merely adapt/rephrase questions found elsewhere (e.g., from past papers of IERG4300 or similar courses, or from textbooks) and submit them as your own work. Instead, study our course materials, ask yourself which are the most important concepts you have learned from this course, and try to design a related question for each (or some) of those concepts to promote/strengthen a student's understanding of that concept. In other words, you should view your questions as training exercises for the exam taker.

## Student/Faculty Expectations on Teaching and Learning

http://mobitec.ie.cuhk.edu.hk/StaffStudentExpectations.pdf

You are expected to do your own work and acknowledge the use of anyone else's words or ideas. You MUST put down in your submitted work the names of people with whom you have had discussions.