
Description

This course aims to give students an understanding of the operating principles of mainstream Big Data computing systems, together with hands-on experience in using them. Open-source platforms for Big Data processing and analytics will be discussed. Topics to be covered include:

Programming models and design patterns for mainstream Big Data computational frameworks;
System Architecture and Resource Management for Data-center-scale Computing;
System Architecture and Programming Interface of Distributed Big Data stores;
High-level Big Data Query languages and their processing systems;
Operational and Programming tools for different stages of the Big Data processing pipeline, including data collection/ingestion, serialization and migration, and workflow coordination.

Course Pre-requisite:

This course contains substantial hands-on components, which require a solid background in programming and practical experience with operating systems. IERG 4300/ENGG 4030 is an official pre-requisite.

Course Information

Lecture time and venue:

MMW LT1 ; MON 9:30am - 11:15am
LSK LT1 ; FRI 9:30am - 11:15am

Lecture time and venue (ESTR 4316):

Instructor:

Prof. Wing Cheong Lau.
- wclau [at] ie [dot] cuhk [dot] edu [dot] hk
- Office hours: MON 1:00pm - 2:00pm or by appointment (SHB 818)

Tutorial:

ERB 408 ; Tues. 18:30-19:15

Teaching Assistant:

Wu Tong
- wt017 [at] ie [dot] cuhk [dot] edu [dot] hk
- Office hours: MON 2:15pm - 3:00pm (SHB 826A or SHB 803)

Website account:

User: bigdata
Password: spring2019bigdata

Recommended Text

[HadoopAppArch] Hadoop Application Architectures 1st Edition, by Mark Grover, Ted Malaska, Jonathan Seidman and Gwen Shapira, Publisher: O’Reilly Media, July 2015.
[Hadoop] Hadoop: The Definitive Guide 4th Edition, by Tom White, Publisher: O’Reilly Media, April 2015.
[JLin] Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, Morgan and Claypool Publishers, 2010, can be freely downloaded from http://lintool.github.io/MapReduceAlgorithms/
[DataIntensive] Designing Data-Intensive Applications: The Big Ideas behind Reliable, Scalable and Maintainable Systems, Preview Edition, by Martin Kleppmann, Publisher: O'Reilly Media, 1st Edition to be published in 2016.
[StormApplied] Storm Applied, by Sean T. Allen, Matthew Jankowski and Peter Pathirana, Publisher: Manning, 2015
[BigData] Big Data: Principles and Best Practices of Scalable Realtime Data Systems, by Nathan Marz and James Warren, Publisher: Manning, 2015
[NoSQL] NoSQL Overview, Appendix A of the book titled "Graph Databases", by Ian Robinson, Jim Webber and Emil Eifrem (Can request a free copy from http://graphdatabases.com)
[MMDS] Mining of Massive Datasets (Download version 1.3) by Anand Rajaraman, Jeff Ullman and Jure Leskovec, Cambridge University Press. Latest version can be downloaded from http://i.stanford.edu/~ullman/mmds.html#latest
[LearnSpark] Learning Spark: Lightning-Fast Big Data Analysis, 1st Edition, by Karau, Konwinski, Wendell and Zaharia, Publisher: O’Reilly Media, 2015
[Spark 1] Apache® Spark™ Analytics Made Simple, http://go.databricks.com/apache-spark-analytics-made-simple-databricks
[Spark 2] Mastering Advanced Analytics with Apache Spark®, http://go.databricks.com/mastering-advanced-analytics-apache-spark
[Spark 3] Lessons for Large-Scale Machine Learning Deployments on Apache Spark, http://go.databricks.com/large-scale-machine-learning-deployments-spark-databricks
[Spark 4] Mastering Apache Spark 2.0, http://go.databricks.com/mastering-apache-spark-2.0
[CloudComputing] Cloud Computing for Science and Engineering, by Ian Foster and Dennis B. Gannon, https://cloud4scieng.org/chapters/
[KafkaBook] Neha Narkhede, Gwen Shapira, Todd Palino, Kafka: The Definitive Guide, published by O'Reilly Media, July 2017, https://book.huihoo.com/pdf/confluent-kafka-definitive-guide-complete.pdf
[KleppmannMSSS] Martin Kleppmann, Making Sense of Stream Processing, published by O'Reilly Media, Mar 2016, https://www.oreilly.com/data/free/stream-processing.csp
[Samza] Martin Kleppmann, "Apache Samza," a chapter on Apache Samza for the Encyclopedia of Big Data Technologies, March 2018, https://martin.kleppmann.com/papers/samza-encyclopedia.pdf
[EncyclopBigData] Encyclopedia of Big Data Technologies, Springer Link, First Online: April 2018. https://link.springer.com/referenceworkentry/10.1007/978-3-319-63962-8_303-1
[Streaming101] Tyler Akidau, "Streaming 101: The world beyond batch - A high-level tour of modern data-processing concepts," Aug 2015 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[Streaming102] Tyler Akidau, "Streaming 102: The world beyond batch - The what, where, when and how of unbounded data processing," Jan 2016, https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
[StreamingSys] Tyler Akidau, Slava Chernyak, Reuven Lax, Streaming Systems, published by O'Reilly Media, July 2018, http://shop.oreilly.com/product/0636920073994.do
[Flink] Paris Carbone et al, "Apache Flink: Stream and Batch Processing in a Single Engine," Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, http://asterios.katsifodimos.com/assets/publications/flink-deb.pdf
[FlinkBook1] Ellen Friedman, Kostas Tzoumas, Introduction to Apache Flink, published by O'Reilly Media, Oct 2016, online free version accessible from: https://mapr.com/ebooks/intro-to-apache-flink/
[FlinkBook2] Vasiliki Kalavri, Fabian Hueske, Stream Processing with Apache Flink, (Early Release Edition), published by O'Reilly Media, Feb 2018, https://www.oreilly.com/library/view/stream-processing-with/9781491974285/ https://info.lightbend.com/rs/558-NCX-702/images/preview-apache-flink.pdf
[Spark2018] Bill Chambers, Matei Zaharia, Spark: The Definitive Guide: Big Data Processing Made Simple (1st Edition), published by O'Reilly Media Feb 2018, http://shop.oreilly.com/product/0636920034957.do

Recommended Programming References

[SparkAnalytics] Advanced Analytics with Spark, by Sandy Ryza, Uri Laserson, Sean Owen and Josh Wills, Publisher: O’Reilly Media, April 2015
[DataAlgorithms] Data Algorithms: Recipes for Scaling Up with Hadoop and Spark, by Mahmoud Parsian, Publisher: O'Reilly Media, Aug 2015
[LearnSpark] Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, Publisher: O’Reilly Media, Feb 2015
[HBase] HBase: The Definitive Guide, by Lars George, published by O’Reilly Media
[CassandraBook] Cassandra: The Definitive Guide, by Eben Hewitt, published by O’Reilly Media
[ZooKeeper] ZooKeeper: Distributed Process Coordination, by Flavio Junqueira and Benjamin Reed, published by O’Reilly Media, 2013
[Pig] Programming Pig, by Alan Gates, published by O’Reilly Media
[Hive] Programming Hive, by Edward Capriolo, Dean Wampler, Jason Rutherglen, published by O’Reilly Media
[OpenStackOp] OpenStack Operations Guide, published by O’Reilly Media, (current-version available online at: http://docs.openstack.org/openstack-ops/content )
[OozieBook] Apache Oozie: The Workflow Scheduler for Hadoop, by Mohammad Kamrul Islam and Aravind Srinivasan, published by O'Reilly Media, 2015.
[Storm] Hortonworks Data Platform - Apache Storm Component Guide, https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_storm-component-guide/bk_storm-component-guide.pdf

Tentative Timetable

Lecture Date	Topic	Period	Recommended Readings	Additional References
Jan 7, 11	Course Admin; Resource Management for Data-center-scale Computing: Hadoop YARN, Mesos and beyond	9:30am - 11:15am	[YARN], [Mesos], [Hadoop]Ch.2-3, [CloudData], [Kubernetes1]	[Borg], [Omega], [Sparrow], [Apollo], [Mercury], [MapReduceFamilySurvey2013], [Kubernetes2]
Jan 14, 18	ZooKeeper	9:30am - 11:15am	[ZooKeeper1]	[ZooKeeper], [ZAB1], [ZAB2]
Jan 21, 25	Programming Models (beyond MapReduce) for Big Data Computing: DAG-based Computational Frameworks: Dryad, DryadLINQ, Tez	9:30am - 11:15am	[Dryad], [DryLINQ], [Tez]	-
Jan 28, Feb 1	High-level Data Query Languages for Big Data Analytics: Pig and Hive	9:30am - 11:15am	[PigLatin], [Hive1], [Hadoop]Ch.16-17	[Hive2], [Hive3], [HiveAdvances], [Pig], [Hive]
Feb 4, 8	No class	Chinese New Year Holidays
Feb 11, 15	Programming Models (beyond MapReduce) for Big Data Computing: Stream-based Processing: Storm	9:30am - 11:15am	[Storm@Twitter], [StormApplied], [Heron]	[BigData]
Feb 18, 22	Lambda Architecture, Kappa Architecture; Unified Log: Apache Kafka, Apache Samza	9:30am - 11:15am	[KafkaBook], [Samza], [KleppmannMSSS]	-
Feb 25, Mar 1	Programming Models (beyond MapReduce) for Big Data Computing: Graph-based Computing frameworks: Pregel/Giraph and GraphLab	9:30am - 11:15am	[GraphLab1], [PowerGraph]	[GraphChi]
Mar 4, 8	Programming Models (beyond MapReduce) for Big Data Computing: Spark: Spark and BDAS, Quick Tour of Scala, Spark RDDs	9:30am - 11:15am	[Spark2018]	[SparkScaling], [MapReduceVsSpark], [LearnSpark]Ch.1, Ch.10; [SparkAnalytics] Appendix A
Mar 8	-	9:30am - 10:30am	Mid-term for IERG 4330/ESTR 4316
Mar 11, 15, 18	Programming Models (beyond MapReduce) for Big Data Computing: Spark (cont'd): SparkSQL, Spark Streaming, GraphX, MLlib	9:30am - 11:15am	[GraphX], [SparkStreaming], [SparkSQL], [SparkMLlib], [LearnSpark]Ch.11	[SharkSQL], [SparkMBase], [SparkMLI]
Mar 22, 25	More Streaming Concepts: Event Time vs. Ingestion Time vs. Processing Time; Windows: Sliding vs. Tumbling vs. Session; Triggers; Loop Iteration? Lambda vs. Advanced Streaming Systems: Apache Beam, Apache Flink	9:30am - 11:15am	[StreamingSys], [FlinkBook1], [FlinkBook2], [Flink]	-
Mar 29, Apr 1, 5	No class	Instructor on conference leave on Mar 29; reading week on Apr 1 and 5
Apr 8, 12, 15	Distributed Big Data Stores; BigTable/ HBase, Dynamo, Cassandra	9:30am - 11:15am	[Dynamo], [BigTable], [Cassandra], [RealtimeHadoopFacebook], [NoSQL], [Hadoop]Ch.20	[HBase], [CassandraBook]

Course Assessment

The grade for IERG 4330 students is based on the following components:

Homework & Programming assignments (4 sets in total): 50%
Mid-term Exam: 10% (1-hour mid-term examination)
Final Exam: 40% (2-hour final examination)

Student/Faculty Expectations on Teaching and Learning

http://mobitec.ie.cuhk.edu.hk/StaffStudentExpectations.pdf

Academic Honesty

You are expected to do your own work and acknowledge the use of anyone else's words or ideas. You MUST put down in your submitted work the names of people with whom you have had discussions.

Refer to http://www.cuhk.edu.hk/policy/academichonesty for details.

When scholastic dishonesty is suspected, the matter will be turned over to the University authority for action.

You MUST include the following signed statement in all of your submitted homework, project assignments and examinations. Submission without a signed statement will not be graded.

I declare that the assignment here submitted is original except for source material explicitly acknowledged, and that the same or related material has not been previously submitted for another course. I also acknowledge that I am aware of University policy and regulations on honesty in academic work, and of the disciplinary guidelines and procedures applicable to breaches of such policy and regulations, as contained in the website http://www.cuhk.edu.hk/policy/academichonesty/.

Acknowledgement

Thanks to Amazon Web Services, Google and Microsoft Azure for providing free computing resources in support of this course.

IERG 4330 / ESTR 4316
Programming Big Data Systems

(Offered in 2019 Spring)