~ Home ~

Description

This course aims to provide students an understanding in the operating principles and hands-on experience with mainstream Big Data Computing systems. Open-source platforms for Big Data processing and analytics would be discussed. Topics to be covered include:

Programming models and design patterns for mainstream Big Data computational frameworks ;
System Architecture and Resource Management for Data-center-scale Computing ;
System Architecture and Programming Interface of Distributed Big Data stores ;
High-level Big Data Query languages and their processing systems ;
Operational and Programming tools for different stages of the Big Data processing pipeline including data collection/ ingestion, serialization and migration, workflow coordination.

Course Pre-requisite:

This course contains substantial hands-on components which require solid background in programming and hands-on operating systems experience. If you have never used a command-line interface to install/configure/manage an operating system, e.g. a linux-based one, you will need to pick-up the skills yourself and IT CAN BE VERY TIME-CONSUMING for you to complete the homeworks. (Students without the aforementioned required background may take several 10's of hours to finish EACH homework assignment).

Course Information

The lectures and the tutorials will be conducted in ZOOM:

ZOOM meeting link:https://cuhk.zoom.us/j/96033060088
ZOOM meeting ID: 960 3306 0088
For students who have registered for the course, the password of the Zoom meeting has been posted to Blackboard.
For students who did not register for the course but wants to access the meeting, please email the TA to obtain the meeting password. (See below for TAs' email addresses)

Lecture time and venue:

Tue 7:00pm - 10:00pm;

Instructor:

Prof. Wing Cheong Lau. wclau [at] ie [dot] cuhk [dot] edu [dot] hk
Office hours: TBA

Lab/Tutorial:

Teaching Assistant:

Da Sun Handason Tam
- tds019 [at] ie [dot] cuhk [dot] edu [dot] hk
- TBA
LIU Yang
- yangliu476730 [at] yahoo [dot] com (Preferred)
- ly016 [at] ie [dot] cuhk [dot] edu [dot] hk
- TBA

Website account:

User: bigdata
Password: spring2021bigdata

Recommended Text

[HadoopAppArch] Hadoop Application Architectures 1st Edition, by Mark Grover, Ted Malaska, Jonathan Seidman and Gwen Shapira, Publisher: O’Reilly Media, July 2015.
[Hadoop] Hadoop: The Definitive Guide 4th Edition, by Tom White, published by Oreilly, April 2015.
[JLin] Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, Morgan and Claypool Publishers, 2010, can be freely downloaded from http://lintool.github.io/MapReduceAlgorithms/
[DataIntensive] Designing Data-Intensive Applications: The Big Ideas behind Reliable, Scalable and Maintainable Systems, Preview Edition, by Martin Kleppmann, Publisher: O'Reilly Media, 1st Edition, March 2017.
[StormApplied] Storm Applied, by Sean T. Allen, Matthew Jankowski and Peter Pathirana, Publisher: Manning, 2015
[BigData] Big Data: Principles and Best Practices of Scalable Realtime Data Systems, by Nathan Marz and James Warren, Publisher: Manning, 2015
[NoSQL] NoSQL Overview, Appendix A of the book titled "Graph Databases", by Ian Robinson, Jim Webber and Emil Eifrem (Can request a free copy from http://graphdatabases.com)
[MMDS] Mining of Massive Datasets (Download version 1.3) by Anand Rajaraman, Jeff Ullman and Jure Leskovec, Cambridge University Press. Latest version can be downloaded from http://i.stanford.edu/~ullman/mmds.html#latest
[LearnSpark] Learning Spark: Lightning-Fast Big Data Analysis, 1st Edition, by Karau, Konwinski, Wendell and Zaharia, published by Oreilly, 2015
[LearnSpark2ndEd] Learning Spark: Lightning-Fast Big Data Analysis, 2nd Edition, by Jules S. Damji, Brooke Wenig, Tathagata Das and Denny Lee, published by O'reilly, July 2020
[Spark 1] Apache® Spark™ Analytics Made Simple, http://go.databricks.com/apache-spark-analytics-made-simple-databricks
[Spark 2] Mastering Advanced Analytics with Apache Spark®,http://go.databricks.com/mastering-advanced-analytics-apache-spark
[Spark 3] Lessons for Large-Scale Machine Learning Deployments on Apache Spark,http://go.databricks.com/large-scale-machine-learning-deployments-spark-databricks
[Spark 4] Mastering Apache Spark 2.0, http://go.databricks.com/mastering-apache-spark-2.0
[CloudComputing] Cloud Computing for Science and Engineering, by Ian Foster and Dennis B. Gannon, https://cloud4scieng.org/chapters/
[KafkaBook] Neha Narkhede, Gwen Shapira, Todd Palino, Kafka: The Definitive Guide, published by O'Reilly Media, July 2017, https://book.huihoo.com/pdf/confluent-kafka-definitive-guide-complete.pdf
[KleppmannMSSS] Martin Kleppmann, Making Sense of Stream Processing, published by O'Reilly Media, Mar 2016, https://www.oreilly.com/data/free/stream-processing.csp
[Samza] Martin Kleppmann, "Apache Samza," a chapter on Apache Samza for the Encyclopedia of Big Data Technologies, March 2018, https://martin.kleppmann.com/papers/samza-encyclopedia.pdf
[EncyclopBigData] Encyclopedia of Big Data Technologies, Springer Link, First Online: April 2018. https://link.springer.com/referenceworkentry/10.1007/978-3-319-63962-8_303-1
[Streaming101] Tyler Akidau, "Streaming 101: The world beyond batch - A high-level tour of modern data-processing concepts," Aug 2015 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[Streaming102] Tyler Akidau, "Streaming 102: The world beyond batch - The what, where, when and how of unbounded data processing," Jan 2016, https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
[StreamingSys] Tyler Akidau, Slava Chernyak, Reuven Lax, Streaming Systems, published by O'Reilly Media, July 2018, http://shop.oreilly.com/product/0636920073994.do
[Flink] Paris Carbone et al, "Apache Flink: Stream and Batch Processing in a Single Engine," Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, http://asterios.katsifodimos.com/assets/publications/flink-deb.pdf
[FlinkBook1] Ellen Friedman, Kostas Tzoumas, Introduction to Apache Flink, published by O'Reilly Media, Oct 2016, online free version accessible from: https://mapr.com/ebooks/intro-to-apache-flink/
[FlinkBook2] Vasiliki Kalavri, Fabian Hueske, Stream Processing with Apache Flink, (Early Release Edition), published by O'Reilly Media, Feb 2018, https://www.oreilly.com/library/view/stream-processing-with/9781491974285/ https://info.lightbend.com/rs/558-NCX-702/images/preview-apache-flink.pdf
[Spark2018] Bill Chambers, Matei Zaharia, Spark: The Definitive Guide: Big Data Processing Made Simple (1st Edition), published by O'Reilly Media Feb 2018, http://shop.oreilly.com/product/0636920034957.do

Tentative Timetable

Lecture Date	Topic	Period	Recommended Readings	Optional Readings/ Additional Resources
Jan 12 (Tue), Jan 19 (Tue)	Course Admin; Era of Big Data ; Data-center Architecture	7:00pm - 10:00pm	[Jin1]Ch1, [DataCenter]	[Borg], [Omega], [Sparrow], [Apollo], [Mercury], [MapReduceFamilySurvey2013], [Kubernetes2]
Jan 26 (Tue)	Programming Models for Big Data Computing: MapReduce/ Hadoop, GFS/HDFS	7:00pm - 10:00pm	[MapReduce], [GoogleFileSystem], [MMDS]Ch2.1-2.4, [JLin]Ch2, Ch3.1-3.4
Feb 2 (Tue)	Big Data Processing Stack	7:00pm - 10:00pm	[Hadoop]Ch2-3	[Kubernetes1], [Kubernetes2]
Feb 9 (Tue)	High-level Data Query Languages for Big Data Analytics: Pig and Hive	7:00pm - 10:00pm	[PigLatin], [Hive1], [Hadoop]Ch.16-17	[Hive2], [Hive3], [HiveAdvances], [Pig], [Hive]
Feb 16 (Tue)	No class	Chinese New Year Holidays
Feb 23 (Tue)	Distributed Big Data Stores; BigTable/ HBase, Dynamo, Cassandra	7:00pm - 10:00pm	[Dynamo], [BigTable], [RealtimeHadoopFacebook], [NoSQL], [Hadoop]Ch.20, [Cassandra]	[HBase], [CassandraBook]
Mar 2 (Tue)	Programming Models (beyond MapReduce) for Big Data Computing: Spark: Spark and BDAS, Quick Tour of Scala, Spark RDDs	7:00pm - 10:00pm	[Spark2018]	[SparkScaling], [MapReduceVsSpark], [LearnSpark]Ch.1, Ch.10 ; [SparkAnalytics] Appendix A
Mar 9 (Tue)	SparkSQL	7:00pm - 10:00pm	[SparkSQL], [LearnSpark2]	[SharkSQL], [SparkMBase], [LearnSpark2ndEd] Ch 3-6
Mar 16 (Tue), Mar 23 (Tue)	Big Stream Processing frameworks: Unified Log via Apache Kafka; Storm ; Spark Streaming ; Spark Structural Streaming ; Lambda & Kappa Architecture;	7:00pm - 10:00pm	[Storm@Twitter], [Heron], [SparkStreaming], [LearnSpark2ndEd] Ch 8	[KafkaBook],[KleppmannMSSS],[StormApplied]
Mar 30 (Reading Week) (Tue)	Optional Seminars in Reading Week: More Streaming Concepts: Event-time vs. Ingestion Time vs. Processing Time !! Windows: Sliding vs. Tumbling vs. Session; Trigger; Loop Iteration? Lambda vs. Advanced Streaming Systems: Apache Beam; Apache Flink	7:00pm - 10:00pm	[StreamingSys], [FlinkBook1], [FlinkBook2], [Flink]
Apr 6	No class	The day following Easter Monday
Apr 13 (Tue)	Big Graph Processing frameworks: Pregel/Giraph and GraphLab ; GraphX, GraphFrame	7:00pm - 10:00pm	[GraphLab1], [PowerGraph], [GraphX]	[GraphChi]
Apr 20 (Tue), Apr 27 (Tue)	Overflow ; Time-permitting: Spark MLib, MLflow, Spark 3.0 and Beyond	7:00pm - 10:00pm	[SparkMLlib], [LearnSpark] Ch.11, [LearnSpark2ndEd] Ch 9

Course Assessment

Your grade will be based on the following components:

Homework & Programming Assignments (HW0 + 5 additional sets): 65%
Project (Video oral presentation + written report): 15%
Q&A Design Assignment: 15%
Class Participation: 5%

Guidelines for Q&A assignment

NO late submissions will be allowed as we need to submit our final grades shortly after the corresponding deadline.

The Q&A-design assignment is to ask each student to design and submit a set of questions AND model-answers/ suggested solutions for a future 2-hr-long final examination of this course. To avoid asking trivial questions which merely test the memorization ability of the exam takers, you should assume the exam to be an open-book/open-notes exam. Your submission will be graded according to its:

ORIGINALITY and thoughtfulness of the questions, i.e., non-trivial and be able to highlight and test/promote the most important concepts/ ideas/ techniques which have been taught in our class so far.
Correctness of the suggested solutions/ model answers.
Comprehensive nature (or the lack of), i.e. your set of questions together, should cover multiple (the more, the better) key concepts/ ideas/ techniques taught in our class so far. In other words, setting 1-2 long essay questions on a couple specific topics to try to take up the entire 2-hr exam period won’t be a good choice.
Suitability of the overall set of questions for a time-limited 2-hr exam. In other words, it should be reasonable for a student to complete your proposed set of questions within the time limit.

Since the originality and thoughtfulness of the proposed questions are of key considerations, you MUST NOT copy or merely re-phrase questions found elsewhere (i.e. from similar courses elsewhere or textbooks) and submit them as your own creation. Instead, study our course materials and reference readings/ text, ask yourself which are the most important concepts you have learned from this course and then try to design the related questions for the various key concepts. The goal of your exam-paper should be to promote/ strengthen a student’s understanding of such concept. i.e. viewing your questions as training exercises for the exam taker. To enhance the comprehensive nature of your exam, in other words, be able to cover a large number of important/ key concepts, you may mix different types of questions in your exam design, e.g. i) a section of multiple-choice or True/False questions (For T/F type of questions, you MUST require students to provide not only T/F answer but also a couple of sentences to justify their answers) ; additional sections for ii) Short questions with multiple parts ; and iii) questions for competitive analysis of different approaches on solving relevant problems/ challenges discussed in the course.

Student/Faculty Expectations on Teaching and Learning

http://mobitec.ie.cuhk.edu.hk/StaffStudentExpectations.pdf

Academic Honesty

You are expected to do your own work and acknowledge the use of anyone else's words or ideas. You MUST put down in your submitted work the names of people with whom you have had discussions.

Please read the following slides carefully: AcademicHonestySlides.pdf

Refer to http://www.cuhk.edu.hk/policy/academichonesty for details

When scholastic dishonesty is suspected, the matter will be turned over to the University authority for action.

You MUST include the following signed statement in all of your submitted homework, project assignments and examinations. Submission without a signed statement will not be graded.

I declare that the assignment here submitted is original except for source material explicitly acknowledged, and that the same or related material has not been previously submitted for another course. I also acknowledge that I am aware of University policy and regulations on honesty in academic work, and of the disciplinary guidelines and procedures applicable to breaches of such policy and regulations, as contained in the website http://www.cuhk.edu.hk/policy/academichonesty/.

Acknowledgement

Thanks to Amazon Web Services, Google and Microsoft Azure for providing free computing resource support of this course

IEMS 5730
Big Data Systems and Information Processing

(Offered in 2021 Spring)