Description
This course covers data-intensive analytics and the automated processing of very large amounts of structured and unstructured information. We focus on leveraging MapReduce and related paradigms to create parallel algorithms that scale to massive data sets, such as those collected from the World Wide Web or other Internet systems and applications. The course is organized around a list of large-scale data analytics problems that arise in practice, and the theories and methodologies required to tackle each problem will be introduced. As such, the course only expects students to have solid knowledge of probability, statistics and linear algebra, together with computer programming skills. Topics to be covered include: the MapReduce computational model and its system architecture and realization in practice; finding frequent item-sets and association rules; finding similar items in high-dimensional data; dimensionality reduction techniques; clustering; recommendation systems; analysis of massive graphs and its applications to the World Wide Web; large-scale supervised machine learning; and processing and mining of data streams and their applications to large-scale network and online-activity monitoring.
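For readers new to the MapReduce model named above, the following is a minimal single-machine sketch of its three stages (map, shuffle, reduce) applied to word counting. It is an illustration only and not course material: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` and the sample documents are assumptions made for this example, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three stages in sequence on an in-memory list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, chosen only for illustration.
docs = ["big data big ideas", "data streams and big graphs"]
counts = word_count(docs)
print(counts["big"], counts["data"])  # prints: 3 2
```

Because each call to `map_phase` looks at one document and each call to `reduce_phase` looks at one key, both stages can in principle be distributed without coordination; only the shuffle step needs to move data between workers.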
Course Information
Lecture time and venue:
TUE
9:30am - 10:15am, ERB 703
THUR
9:30am - 11:15am, ERB 703
Tutorial:
TUE
6:30pm - 7:15pm
Instructor:
- Prof. Wing Cheong Lau.
wclau [at] ie [dot] cuhk [dot] edu [dot] hk
- Office hours: Tue 10:45am to 12:15pm, or by appointment
Teaching Assistant:
- YANG Ronghai
yr013 [at] ie [dot] cuhk [dot] edu [dot] hk
- Office hour: Tue 7:15 pm - 8:15 pm
Website account:
User: engg4030
Password: fall4030engg
Highly Recommended Textbooks
[MMDS] Mining of Massive Datasets (Download version 1.3) by Anand Rajaraman, Jeff Ullman and Jure Leskovec, Cambridge University Press. Latest version can be downloaded from http://i.stanford.edu/~ullman/mmds.html#latest
[JLin] Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, Morgan and Claypool Publishers, 2010, can be freely downloaded from http://lintool.github.io/MapReduceAlgorithms/
[CBishop] Pattern Recognition and Machine Learning by Christopher M. Bishop, Published by Springer Science and Business, 2007.
[ANg] Machine Learning by Andrew Ng, Stanford CS229 Course Notes, http://cs229.stanford.edu/materials.html
[AMoore] Statistical Data Mining Tutorials – Tutorial Slides by Andrew W. Moore, http://www.autonlab.org/tutorials/list.html
[MJordan] Introduction to Graphical Models by Michael Jordan and Chris Bishop, http://people.csail.mit.edu/yks/documents/classes/mlbook
[HTF] Elements of Statistical Learning 2nd Edition by Trevor Hastie, Robert Tibshirani, Jerome H. Friedman, Published by Springer, 2009. Ebook version can be downloaded from: http://link.springer.com/book/10.1007/978-0-387-84858-7 via a CUHK IP address
[JWHT] An Introduction to Statistical Learning with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Published by Springer, 2013. Ebook version can be downloaded from: http://link.springer.com/book/10.1007/978-1-4614-7138-7 via a CUHK IP address
[PCA] Principal Component Analysis, 2nd Edition, by I.T. Jolliffe, Published by Springer 2002, http://www.springerlink.com/content/h41v76/?p=e8e028e1c9ba414690c9179ee7c0e388&pi=3
[ShaliziADAEPV] Cosma Rohilla Shalizi, "Advanced Data Analysis from an Elementary Point of View", Cambridge University Press, 2014. Draft available for download from: http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/
[GraphLabPapers] http://graphlab.org/resources/publications.html
[ChakDataStream] CS49 Data Stream Algorithms, Amit Chakrabarti, Dartmouth College, Fall 2011, http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/Notes/lecnotes.pdf
[ShaprioStockman] Shapiro and Stockman, Computer Vision, 2000, Chapter 4.2-4.9, https://courses.cs.washington.edu/courses/cse576/book/ch4.pdf
[Top10] X. Wu et al., "Top 10 Algorithms in Data Mining", Knowledge and Information Systems (2008) 14:1-37, also available at: http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf
Tentative Timetable
Lecture Date | Topic | Period | Recommended Readings | Additional References | |
---|---|---|---|---|---|
Sep 2, 4 | Course Admin ; Overview of Big Data and Era of Cloud Computing | T2, H2-3 | [JLin] Ch1 ; [MMDS] Ch1 | [DataCenter] | |
**Sep 9: Chinese Mid-Autumn Festival** | |||||
Sep 11 | MapReduce | H2-3 | [MMDS]Ch2.1-2.4 | - | |
**Sep 16: Class cancelled due to Typhoon** | |||||
Sep 18, 23, 25 | MapReduce (cont'd) | T2, H2-3 | [JLin]Ch2 ; [JLin]Ch3.1-3.4 | [CloudData] | |
Sep 25, Oct 7, 9 | Frequent Item-Set Mining and Association Rules | T2, H2-3 | [MMDS]Ch6.1-6.4 | ||
**No class on Sep 30: the instructor will be on conference leave. A make-up class is scheduled for Dec 1, 9:30am to 12:30pm.** | |||||
**Oct 2: Chung Yeung Festival** | |||||
Oct 14, 16 | Data Stream Algorithms | T2, H2-3 | [MMDS] Ch4.1-4.5 | - | |
**An in-class Mid-term will be held on Oct 21 (Tue)** | |||||
Oct 23 | Data Stream Algorithms (cont'd) | H2-3 | [ChakDataStream] Ch0,Ch1,Ch4.4,Ch6 | - | |
Oct 28, 30 | Finding Similar Items and Locality Sensitive Hash (LSH) | T2, H2-3 | [MMDS]Ch3.1-3.5 | [ZG] | |
Nov 4, 6 | Clustering and GMM | T2, H2-3 | [MMDS] Ch7.1-7.4, [CBishop] Ch.9 | - | |
Nov 11, 13 | Dimension Reduction | T2, H2-3 | [MMDS] Ch11 ; [SVDPCA], [ANgCS229PCA], [ShaliziADAEPV] Ch17 | [PCA], [GuruswamiKannan] | |
Nov 18 | Recommendation Systems | T2, H2-3 | [MMDS] Ch9 | [Netflix09]; [KorenTalk] | |
**Class Suspended for Nov 20 (Thu) due to University Congregation** | |||||
Nov 25, 27 | Analyzing Massive Graphs | T2, H2-3 | [JLin] Ch5 | - | |
Dec 1 | Graph-based Distributed Proc. Systems | Mon 9:30am to 12:30pm at ERB1009 | [GraphLabPapers] | - | |
Dec 1 | If time permits: Supervised Learning Overview ; Decision Tree | Mon 9:30am to 12:30pm at ERB1009 | [ShaprioStockman] Ch4.2-4.9 | [ANg], [AMoore] | |
Course Assessment
Your grade will be based on the following components:
- Homework and programming assignments (4-5 sets in total): 50%
- Mid-term: 15%
- Final Exam: 35% (2-hour final examination)
Student/Faculty Expectations on Teaching and Learning
http://www.erg.cuhk.edu.hk/Student-Faculty-Expectations
Academic Honesty
You are expected to do your own work and acknowledge the use of anyone else's words or ideas. You MUST put down in your submitted work the names of people with whom you have had discussions.
Refer to http://www.cuhk.edu.hk/policy/academichonesty for details.
When scholastic dishonesty is suspected, the matter will be turned over to the University authority for action.
You MUST include the following signed statement in all of your submitted homework, project assignments and examinations. Submission without a signed statement will not be graded.
I declare that the assignment here submitted is original except for source material explicitly acknowledged, and that the same or related material has not been previously submitted for another course. I also acknowledge that I am aware of University policy and regulations on honesty in academic work, and of the disciplinary guidelines and procedures applicable to breaches of such policy and regulations, as contained in the website http://www.cuhk.edu.hk/policy/academichonesty/.