Description
The course discusses data-intensive analytics and the automated processing of very large amounts of structured and unstructured information. We focus on leveraging MapReduce and other related paradigms to create parallel algorithms that can be scaled up to handle massive data sets, such as those collected from the World Wide Web or other Internet systems and applications. We organize the course around a list of large-scale data analytics problems that arise in practice. The theories and methodologies required to tackle each problem will be introduced. As such, the course only expects students to have solid knowledge of probability, statistics and linear algebra, as well as computer programming skills. Topics to be covered include: the MapReduce computational model and its system architecture and realization in practice; Finding Frequent Item-sets and Association Rules; Finding Similar Items in high-dimensional data; Dimensionality Reduction techniques; Clustering; Recommendation systems; Analysis of Massive Graphs and its applications on the World Wide Web; Large-scale supervised machine learning; Processing and mining of Data Streams and their applications in large-scale network/online-activity monitoring.
Course Information
The lectures and the tutorials will be conducted on Zoom:
- Zoom meeting link: https://cuhk.zoom.us/j/94437625322
- Zoom meeting ID: 944 3762 5322
- Students who have not registered for the course but want to access the meeting should email a TA to obtain the meeting password. (See below for the TAs' email addresses.)
Lecture time:
- WED 09:30 - 11:15
- FRI 09:30 - 11:15
Lecture time (ESTR4300):
- WED 17:30 - 18:15
Tutorial:
- Time: WED 11:30 - 12:15
- Time: THU 19:30 - 20:15
TAs' Office Hours: (If you want to ask the TAs for help outside these periods, please email the TA in advance to make an appointment.)
- Da Sun Handason Tam: THU 21:30 - 22:15
- Siyue Xie: TUE 10:00 - 10:45
- Liu Yang: THU 15:00 - 15:45
Instructor:
- Prof. Wing Cheong Lau
wclau [at] ie [dot] cuhk [dot] edu [dot] hk
- Office hours: TBA
Teaching Assistants:
- Da Sun Handason Tam
tds019 [at] ie [dot] cuhk [dot] edu [dot] hk
- Siyue Xie
xs019 [at] ie [dot] cuhk [dot] edu [dot] hk
- Yang Liu
ly016 [at] ie [dot] cuhk [dot] edu [dot] hk
Website account:
User: ierg4300
Password: fall2020ierg
Highly Recommended Textbooks
[MMDS] Mining of Massive Datasets (Download version 1.3) by Anand Rajaraman, Jeff Ullman and Jure Leskovec, Cambridge University Press. Latest version can be downloaded from Mining of Massive Datasets.pdf
[JLin] Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, Morgan and Claypool Publishers, 2010, can be freely downloaded from http://lintool.github.io/MapReduceAlgorithms/
[CBishop] Pattern Recognition and Machine Learning by Christopher M. Bishop, Published by Springer Science and Business, 2007.
[MLE/MAP] Estimating Probabilities: MLE and MAP http://www.cs.cmu.edu/~tom/mlbook/Joint_MLE_MAP.pdf
[HTF] Elements of Statistical Learning 2nd Edition by Trevor Hastie, Robert Tibshirani, Jerome H. Friedman, Published by Springer, 2009. Ebook version can be downloaded from: http://link.springer.com/book/10.1007/978-0-387-84858-7 via a CUHK IP address
[JWHT] An Introduction to Statistical Learning with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Published by Springer, 2013. Ebook version can be downloaded from: http://link.springer.com/book/10.1007/978-1-4614-7138-7 via a CUHK IP address
[PCA] Principal Component Analysis, 2nd Edition, by I.T. Jolliffe, Published by Springer, 2002. Ebook version can be downloaded from: http://www.springerlink.com/content/h41v76/?p=e8e028e1c9ba414690c9179ee7c0e388&pi=3 via a CUHK IP address
[ShaliziADAEPV] Cosma Rohilla Shalizi, "Advanced Data Analysis from an Elementary Point of View", Cambridge University Press, 2014. Draft available for download from: http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/
[ShaprioStockman] Shapiro and Stockman, Computer Vision, 2000, Chapter 4.2-4.9, https://courses.cs.washington.edu/courses/cse576/book/ch4.pdf
[Blum] Blum, Avrim, John Hopcroft, and Ravindran Kannan. "Foundations of Data Science." (2017): https://www.cs.cornell.edu/jeh/book.pdf
Tentative Timetable
Week | Lecture Date | Topic | Period | Recommended Readings | Additional References |
---|---|---|---|---|---|
1 | Sept 9, 11 | Course Admin; Era of Big Data Analytics | W2-3, F2-3 | [JLin]Ch1 | [DataCenter] |
2 | Sept 16, 18 | Computing as a Utility; Data-center Architecture | W2-3, F2-3 | [MMDS]Ch1 | - |
3-4 | Sept 23, 25, 30 | MapReduce/ Hadoop ; The Big Data Processing stack | W2-3, F2-3, W2-3 | [MMDS]Ch2.1-2.4; [JLin]Ch2; [JLin]Ch3.1-3.4 | [CloudData] |
**Oct 2 Public holiday: The day following the Chinese Mid-Autumn Festival** | | | | | |
4-5 | Oct 7, 9 | Frequent Item-Set Mining and Association Rules | W2-3, F2-3 | [MMDS]Ch6.1-6.4 | - |
5-6 | Oct 14, 16, 21 | Finding Similar Items and LSH | W2-3, F2-3, W2-3 | [MMDS]Ch3.1-3.5 | [ZG] |
6-7 | Oct 23, 28, 30 | Clustering and GMM | F2-3, W2-3, F2-3 | [MMDS] Ch7.1-7.4; [MMDS] Ch11; [CBishop] Ch.9; [MLE/MAP] | - |
8 | Nov 4, 6 | Dimension Reduction | W2-3, F2-3 | [MMDS] Ch11 | [PCA], [GuruswamiKannan] |
9 | Nov 11, 13 | Recommendation Systems | W2-3, F2-3 | [SVDPCA], [ANgCS229PCA], [ShaliziADAEPV]Ch17 | - |
10-11 | Nov 18, 20 | Regression and Gradient Descent ; Recommendation Systems (cont'd) | W2-3, F2-3 | [MMDS] Ch9 | [Netflix09]; [KorenTalk]; [ANg] |
12 | Nov 25, 27 | Data Stream Algorithms | W2-3, F2-3 | [MMDS] Ch4.1-4.5 | - |
13 | Dec 2, 4 | Data Stream Algorithms (cont'd) | W2-3, F2-3 | [ChakDataStream] Ch0,Ch1,Ch4.4,Ch6 | - |
Course Assessment
Your grade will be based on the following components:
For IERG4300:
- Homework (5 sets in total): 65%
- Q&A Design Assignment(s): 25%
- Class Participation: 10%
For ESTR4300:
NEW course assessment scheme:
- Homework (5 sets in total): 67%
- Q&A Design Assignment(s): 22%
- Class Participation: 11%
OLD course assessment scheme:
- Homework (5 sets in total): 60%
- Q&A Design Assignment(s): 20%
- Oral Presentation: 10%
- Class Participation: 10%
Guidelines for Q&A assignment
The "Q&A" assignment asks each student to design and submit a set of questions AND model answers/suggested solutions for a future 2-hr-long final examination of IERG4300. To avoid trivial questions that merely test the exam takers' ability to memorize, you should assume the exam to be an open-book/open-note exam, or an exam that allows a student to bring a couple of pages of cheat sheets into the examination venue.
Your submission will be graded according to its:
- Originality and thoughtfulness of the questions, i.e., the questions should be non-trivial and be able to highlight and test/promote the most important concepts/ ideas/ techniques which have been taught in our class.
- Correctness of the suggested solutions/ model answers.
- Comprehensiveness (or the lack thereof), i.e., your set of questions together should cover multiple (the more, the better) key concepts/ideas/techniques taught in our class so far. In other words, setting a single MapReduce question to take up the entire 2-hr exam period won't be a good choice.
- Suitability of the overall set of questions for a time-limited 2hr exam. In other words, it should be reasonable for a student to complete your proposed set of questions within a 2hr limit.
- Diversity of the questions in terms of difficulty, so as to differentiate students with different levels of competence in the subject being tested.
Since the originality and thoughtfulness of the proposed questions are the key considerations, you MUST NOT copy or merely adapt/re-phrase questions found elsewhere (e.g., from past papers of IERG4300 or similar courses, or from textbooks) and submit them as your own work. Instead, study our course materials, ask yourself which are the most important concepts you have learned from this course, and try to design a related question for each (or some) of those concepts to promote/strengthen a student's understanding of that concept. In other words, you should view your questions as training exercises for the exam taker.
Student/Faculty Expectations on Teaching and Learning
http://mobitec.ie.cuhk.edu.hk/StaffStudentExpectations.pdf
Academic Honesty
You are expected to do your own work and acknowledge the use of anyone else's words or ideas. You MUST put down in your submitted work the names of people with whom you have had discussions.
Refer to http://www.cuhk.edu.hk/policy/academichonesty for details.
When scholastic dishonesty is suspected, the matter will be turned over to the University authority for action.
You MUST include the following signed statement in all of your submitted homework, project assignments and examinations. Submission without a signed statement will not be graded.
I declare that the assignment here submitted is original except for source material explicitly acknowledged, and that the same or related material has not been previously submitted for another course. I also acknowledge that I am aware of University policy and regulations on honesty in academic work, and of the disciplinary guidelines and procedures applicable to breaches of such policy and regulations, as contained in the website http://www.cuhk.edu.hk/policy/academichonesty/.