IEMS 5730 Make-up Sessions via Zoom
Dear all,
Attached below is the Zoom meeting info for the make-up classes of IEMS5730, to be held on:
May 7 (Thu) 2:30-5:30pm
May 11 (Mon) 2:30-5:30pm
May 14 (Thu) 2:30-5:30pm
May 18 (Mon) 2:30-5:30pm
May 21 (Thu) 2:30-5:30pm
The Zoom meeting ID is:
- Meeting ID: 960 3306 0088
Regards,
Liu Yang
IMPORTANT: Revised Course Assessment Scheme for IEMS 5730
Dear all,
Due to the cancellation of all remaining classes as well as centralized examinations by the University, we have revised the assessment scheme of IEMS5730 as follows:
Instead of 5 sets of homework, we now have only 4 sets (all already past due). Together, these homework sets will still contribute 40% of the overall course grade.
There will be no final exam for the course. Instead, you will have an additional "Q&A" writing assignment, due on Dec 17, which carries 25% of the overall grade. This assignment asks each student to design and submit a set of questions AND model answers/suggested solutions for a future 2-hr-long final examination of IEMS5730. To avoid trivial questions that merely test the exam takers' ability to memorize, you should assume the exam is an open-book/open-note exam. Your submission will be graded according to its:
a. ORIGINALITY and thoughtfulness of the questions, i.e., they should be non-trivial and able to highlight and test/promote the most important concepts/ideas/techniques taught in our class so far.
b. Correctness of the suggested solutions/ model answers.
c. Comprehensiveness (or the lack thereof), i.e., your set of questions together should cover multiple (the more, the better) key concepts/ideas/techniques taught in our class so far. In other words, setting a single MapReduce question to take up the entire 2-hr exam period won't be a good choice.
d. Suitability of the overall set of questions for a time-limited 2-hr exam. In other words, it should be reasonable for a student to complete your proposed set of questions within the 2-hr limit.
Since the originality and thoughtfulness of the proposed questions are the key considerations, you MUST NOT copy or merely adapt/re-phrase questions found elsewhere (e.g., from past papers of IEMS5730 or similar courses, or from textbooks) and submit them as your own work. Instead, study our course materials, ask yourself which are the most important concepts you have learned from this course, and then try to design a related question for each (or some) of those concepts to promote/strengthen a student's understanding of that concept (i.e., view your questions as training exercises for the exam taker).
- The Project will remain, but its weighting will be increased from 20% to 35% of the overall course grade. However, instead of giving a face-to-face oral presentation, each student will produce and submit a 20-minute video of his/her own presentation. For project topics originally taken up by a group (instead of a single student), each group member needs to prepare his/her own 20-minute video, preferably with some coordination between the members so that there is no excessive duplication, e.g. you can discuss how to divide up the papers of the topic among yourselves. Besides the video, each student will need to submit i) the PowerPoint slides for the presentation and ii) a written report (just as originally planned).
Again, students working on a group-based topic can choose to submit either:
a) a single report per group (NO MORE THAN 10 pages per person), clearly stating which student is responsible for which sections of the report;
OR
b) a self-contained report per student (NO MORE THAN 10 pages), just like all other individual projects.
All the materials, namely the video, the PowerPoint slides and the final written report, are due on Dec 30.
If you have further questions, please do not hesitate to contact me.
Best Regards,
Wing
Description
This course aims to give students an understanding of the operating principles of, and hands-on experience with, mainstream Big Data Computing systems. Open-source platforms for Big Data processing and analytics will be discussed. Topics to be covered include:
Programming models and design patterns for mainstream Big Data computational frameworks;
System Architecture and Resource Management for Data-center-scale Computing;
System Architecture and Programming Interface of Distributed Big Data stores;
High-level Big Data Query languages and their processing systems;
Operational and Programming tools for different stages of the Big Data processing pipeline including data collection/ ingestion, serialization and migration, workflow coordination.
Course Pre-requisite:
This course contains substantial hands-on components which require a solid background in programming and hands-on experience with operating systems. If you have never used a command-line interface to install/configure/manage an operating system, e.g. a Linux-based one, you will need to pick up the skills yourself and IT CAN BE VERY TIME-CONSUMING for you to complete the homework. (Students without the aforementioned background may take several tens of hours to finish EACH homework assignment.)
Course Information
Lecture time and venue:
- YIA LT4; Tue 3:30pm - 6:15pm;
Instructor:
- Prof. Wing Cheong Lau.
wclau [at] ie [dot] cuhk [dot] edu [dot] hk
- Office hours: Wed 1:00pm to 2:00pm (SHB 818)
Teaching Assistant:
LIU Yang
yangliu476730 [at] yahoo [dot] com
- Office hour: Mon 2:30pm - 3:30pm (SHB 803)
LI Yi Ming
1155107969 [at] link [dot] cuhk [dot] edu [dot] hk
- Office hour: TBD (SHB 803)
Website account:
User: bigdata
Password: fall2019bigdata
Recommended Text
[HadoopAppArch] Hadoop Application Architectures 1st Edition, by Mark Grover, Ted Malaska, Jonathan Seidman and Gwen Shapira, Publisher: O’Reilly Media, July 2015.
[Hadoop] Hadoop: The Definitive Guide 4th Edition, by Tom White, published by O'Reilly, April 2015.
[JLin] Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, Morgan and Claypool Publishers, 2010, can be freely downloaded from http://lintool.github.io/MapReduceAlgorithms/
[DataIntensive] Designing Data-Intensive Applications: The Big Ideas behind Reliable, Scalable and Maintainable Systems, Preview Edition, by Martin Kleppmann, Publisher: O'Reilly Media, 1st Edition, March 2017.
[StormApplied] Storm Applied, by Sean T. Allen, Matthew Jankowski and Peter Pathirana, Publisher: Manning, 2015
[BigData] Big Data: Principles and Best Practices of Scalable Realtime Data Systems, by Nathan Marz and James Warren, Publisher: Manning, 2015
[NoSQL] NoSQL Overview, Appendix A of the book titled "Graph Databases", by Ian Robinson, Jim Webber and Emil Eifrem (Can request a free copy from http://graphdatabases.com)
[MMDS] Mining of Massive Datasets (Download version 1.3) by Anand Rajaraman, Jeff Ullman and Jure Leskovec, Cambridge University Press. Latest version can be downloaded from http://i.stanford.edu/~ullman/mmds.html#latest
[LearnSpark] Learning Spark: Lightning-Fast Big Data Analysis, 1st Edition, by Karau, Konwinski, Wendell and Zaharia, published by O'Reilly, 2015
[Spark 1] Apache® Spark™ Analytics Made Simple, http://go.databricks.com/apache-spark-analytics-made-simple-databricks
[Spark 2] Mastering Advanced Analytics with Apache Spark®, http://go.databricks.com/mastering-advanced-analytics-apache-spark
[Spark 3] Lessons for Large-Scale Machine Learning Deployments on Apache Spark, http://go.databricks.com/large-scale-machine-learning-deployments-spark-databricks
[Spark 4] Mastering Apache Spark 2.0, http://go.databricks.com/mastering-apache-spark-2.0
[CloudComputing] Cloud Computing for Science and Engineering, by Ian Foster and Dennis B. Gannon, https://cloud4scieng.org/chapters/
[KafkaBook] Neha Narkhede, Gwen Shapira, Todd Palino, Kafka: The Definitive Guide, published by O'Reilly Media, July 2017, https://book.huihoo.com/pdf/confluent-kafka-definitive-guide-complete.pdf
[KleppmannMSSS] Martin Kleppmann, Making Sense of Stream Processing, published by O'Reilly Media, Mar 2016, https://www.oreilly.com/data/free/stream-processing.csp
[Samza] Martin Kleppmann, "Apache Samza," a chapter on Apache Samza for the Encyclopedia of Big Data Technologies, March 2018, https://martin.kleppmann.com/papers/samza-encyclopedia.pdf
[EncyclopBigData] Encyclopedia of Big Data Technologies, Springer Link, First Online: April 2018. https://link.springer.com/referenceworkentry/10.1007/978-3-319-63962-8_303-1
[Streaming101] Tyler Akidau, "Streaming 101: The world beyond batch - A high-level tour of modern data-processing concepts," Aug 2015 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[Streaming102] Tyler Akidau, "Streaming 102: The world beyond batch - The what, where, when and how of unbounded data processing," Jan 2016, https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
[StreamingSys] Tyler Akidau, Slava Chernyak, Reuven Lax, Streaming Systems, published by O'Reilly Media, July 2018, http://shop.oreilly.com/product/0636920073994.do
[Flink] Paris Carbone et al, "Apache Flink: Stream and Batch Processing in a Single Engine," Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, http://asterios.katsifodimos.com/assets/publications/flink-deb.pdf
[FlinkBook1] Ellen Friedman, Kostas Tzoumas, Introduction to Apache Flink, published by O'Reilly Media, Oct 2016, online free version accessible from: https://mapr.com/ebooks/intro-to-apache-flink/
[FlinkBook2] Vasiliki Kalavri, Fabian Hueske, Stream Processing with Apache Flink, (Early Release Edition), published by O'Reilly Media, Feb 2018, https://www.oreilly.com/library/view/stream-processing-with/9781491974285/ https://info.lightbend.com/rs/558-NCX-702/images/preview-apache-flink.pdf
[Spark2018] Bill Chambers, Matei Zaharia, Spark: The Definitive Guide: Big Data Processing Made Simple (1st Edition), published by O'Reilly Media Feb 2018, http://shop.oreilly.com/product/0636920034957.do
Recommended List of Research Papers for Reading
[MapReduce] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004.
[GoogleFileSystem] Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung, “The Google File System,” ACM SOSP 2003.
[MapReduceFamilySurvey2013] S. Sakr, A. Liu, A. G. Fayoumi, “The family of MapReduce and large-scale data processing systems,” ACM Computing Surveys (CSUR), 2013.
[YARN] V.K. Vavilapalli, A.C.Murthy, “Apache Hadoop YARN: Yet Another Resource Negotiator,” ACM Symposium on Cloud Computing (SoCC) 2013.
[Mesos] B. Hindman et al, “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center”, NSDI 2011.
[DRF] A. Ghodsi et al, “Dominant Resource Fairness: Fair Allocation of Multiple Resource Types,” NSDI 2011.
[Borg] A. Verma, L. Pedrosa, “Large-scale cluster management at Google with Borg”, Eurosys 2015
[Omega] M. Schwarzkopf, A. Konwinski, M.Abd-El-Malek, J. Wilkes, “Omega: flexible, scalable schedulers for large compute clusters,” Eurosys 2013
[Sparrow] K. Ousterhout et al, “Sparrow: Distributed, Low Latency Scheduling”, ACM SOSP 2013
[Apollo] E. Boutin et al, “Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing”, OSDI 2014
[Mercury] K. Karanasos et al, “Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters”, Usenix ATC 2015
[GraphLab1] Yucheng Low, Joseph Gonzalez et al, “GraphLab: A New Framework for Parallel Machine Learning,” UAI 2010.
[PowerGraph] Joseph Gonzalez et al, “PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs,” OSDI 2012.
[GraphChi] Aapo Kyrola, Guy Blelloch, Carlos Guestrin, “GraphChi: Large-Scale Graph Computation on Just a PC,” OSDI 2012.
[Storm@Twitter] Ankit Toshniwal et al, “Storm@Twitter,” ACM SIGMOD 2014.
[PigLatin] Christopher Olston et al, “Pig Latin: A Not-So-Foreign Language for Data Processing,” ACM SIGMOD 2008.
[Hive1] Ashish Thusoo et al, “Hive: a warehousing solution over a map-reduce framework,” VLDB 2009.
[Hive2] Ashish Thusoo et al, "Data warehousing and analytics infrastructure at facebook,” ACM SIGMOD 2010
[Hive3] Ashish Thusoo et al, “Hive - A Petabyte Scale Data Warehouse Using Hadoop,” IEEE ICDE 2010.
[HiveAdvances] Yin Huai et al, “Major Technical Advancements in Apache Hive,” ACM SIGMOD 2014.
[Dryad] Michael Isard et al, "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks,” Eurosys 2007.
[DryadLINQ] Yuan Yu et al, “DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language,” OSDI 2008.
[DryadLINQ2] Michael Isard, Yuan Yu, "Distributed Data-Parallel Computing Using a High-Level Programming Language,” ACM SIGMOD 2009
[Tez] Bikas Saha et al, "Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications,” ACM SIGMOD 2015.
[Dynamo] Giuseppe DeCandia et al, "Dynamo: Amazon's Highly Available Key-value Store," ACM SOSP 2007.
[BigTable] Fay Chang et al, “Bigtable: A Distributed Storage System for Structured Data,” OSDI 2006.
[Cassandra] Avinash Lakshman, Prashant Malik, “Cassandra - A Decentralized Structured Storage System,” ACM SIGOPS Operating Systems Review, Apr 2010.
[RealtimeHadoopFacebook] Dhruba Borthakur et al, “Apache Hadoop goes realtime at Facebook,” ACM SIGMOD 2011.
[SparkRDD] Matei Zaharia et al, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” NSDI 2012.
[Spark] Matei Zaharia et al, “Fast and Interactive Analytics over Hadoop Data with Spark,” USENIX ;login:, Aug 2012.
[Spark Streaming] Matei Zaharia et al, "Discretized streams: Fault-tolerant streaming computation at scale,” ACM SOSP 2013.
[SharkSQL] Reynold S. Xin et al, "Shark: SQL and rich analytics at scale,” ACM SIGMOD 2013
[SparkSQL] Michael Armbrust et al, “Spark SQL: Relational Data Processing in Spark,” ACM SIGMOD 2015.
[GraphX] Joseph E. Gonzalez et al, "GraphX: Graph Processing in a Distributed Dataflow Framework,” OSDI 2014.
[SparkScaling] Michael Armbrust et al, “Scaling Spark in the real world: performance and usability,” VLDB 2015.
[MapReduceVsSpark] Juwei Shi et al, "Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics," VLDB 2015.
[SparkMLbase] T. Kraska, A. Talwalkar, J.Duchi, R. Griffith, M. Franklin, M.I. Jordan, "MLbase: A Distributed Machine Learning System," In Conference on Innovative Data Systems Research (CIDR), 2013.
[SparkMLI] E. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, M. Franklin, M. I. Jordan, T. Kraska, "MLI: An API for Distributed Machine Learning," In International Conference on Data Mining, 2013.
[SparkMLlib] Xiangrui Meng et al, "MLlib: Machine learning in Apache Spark,” arXiv:1505.06807, May 2015.
[SparkNet] Philipp Moritz, Robert Nishihara, Ion Stoica, Michael Jordan, "SparkNet: Training Deep Networks on Spark,” ICLR 2016.
[Naiad] Derek G. Murray et al, "Naiad: A Timely Dataflow System,” ACM SOSP 2013.
[ZooKeeper1] P. Hunt, M. Konar, F.P. Junqueira, B. Reed, “ZooKeeper: Wait-free Coordination for Internet-scale Systems,” Usenix ATC 2010.
[ZAB1] Benjamin Reed, Flavio P. Junqueira, “A simple totally ordered broadcast protocol,” 2nd Workshop on Large-scale Distributed Systems and Middleware (LADIS), 2008.
[ZAB2] F.P. Junqueira, B.C. Reed, M. Serafini, "High-performance broadcast for primary-backup systems,” IEEE/IFIP DSN, 2011.
[Kubernetes1] Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, and John Wilkes, "Borg, Omega, and Kubernetes - Lessons Learned from Three Container-Management Systems over a decade," ACM Queue, Jan 2016.
[Kubernetes2] David Rensin, Kubernetes - Scheduling the Future at Cloud Scale, (Free eBook) published by O'Reilly 2015.
Recommended Programming References
[SparkAnalytics] Advanced Analytics with Spark, by Sandy Ryza, Uri Laserson, Sean Owen and Josh Wills, Publisher: O’Reilly Media, April 2015
[DataAlgorithms] Data Algorithms: Recipes for Scaling Up with Hadoop and Spark, by Mahmoud Parsian, Publisher: O'Reilly Media, Aug 2015
[LearnSpark] Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, Publisher: O’Reilly Media, Feb 2015
[HBase] HBase: The Definitive Guide, by Lars George, published by O’Reilly Media.
[CassandraBook] Cassandra: The Definitive Guide, by Eben Hewitt, published by O’Reilly Media.
[ZooKeeper] ZooKeeper: Distributed Process Coordination, by Flavio Junqueira and Benjamin Reed, published by O’Reilly Media, 2013
[Pig] Programming Pig, by Alan Gates, published by O’Reilly Media
[Hive] Programming Hive, by Edward Capriolo, Dean Wampler, Jason Rutherglen, published by O’Reilly Media
[OpenStackOp] OpenStack Operations Guide, published by O’Reilly Media, (current-version available online at: http://docs.openstack.org/openstack-ops/content )
[OozieBook] Apache Oozie : The Workflow Scheduler for Hadoop, by Mohammad Kamrul Islam, Aravind Srinivasan published by O'Reilly Media, 2015.
Tentative Timetable
Lecture Date | Topic | Period | Recommended Readings | Additional References
---|---|---|---|---
Sept 3 | Era of Big Data; Data-center Architecture | 3:30pm - 6:15pm | [JLin]Ch1, [DataCenter], [OpenStackOp] | -
Sept 10, 17 | Programming Models for Big Data Computing: MapReduce/Hadoop, GFS/HDFS | 3:30pm - 6:15pm | [MapReduce], [GoogleFileSystem]; [MMDS]Ch2.1-2.4, [JLin]Ch2, Ch3.1-3.4 | -
Sept 24 | Big Data Processing Stack | 3:30pm - 6:15pm | [Hadoop]Ch.2-3 | [Kubernetes1], [Kubernetes2]
Oct 1 | No class | National Day Holidays | - | -
Oct 8 | High-level Data Query Languages for Big Data Analytics: Pig and Hive | 3:30pm - 6:15pm | [PigLatin], [Hive1], [Hadoop]Ch.16-17 | [Hive2], [Hive3], [HiveAdvances], [Pig], [Hive]
Oct 15 | Programming Models (beyond MapReduce) for Big Data Computing: Stream-based Processing: Storm | 3:30pm - 6:15pm | [Storm@Twitter], [StormApplied], [Heron] | [BigData]
Oct 22 | Lambda Architecture; Kappa Architecture; Unified Log: Apache Kafka, Apache Samza | 3:30pm - 6:15pm | [KafkaBook], [Samza], [KleppmannMSSS] | -
Oct 29 | Programming Models (beyond MapReduce) for Big Data Computing: Graph-based Computing frameworks: Pregel/Giraph and GraphLab | 3:30pm - 6:15pm | [GraphLab1], [PowerGraph] | [GraphChi]
Nov 5 | Programming Models (beyond MapReduce) for Big Data Computing: Spark: Spark and BDAS, Quick Tour of Scala, Spark RDDs | 3:30pm - 6:15pm | [Spark2018] | [SparkScaling], [MapReduceVsSpark], [LearnSpark]Ch.1, Ch.10; [SparkAnalytics] Appendix A
Dec 17 | Q&A Assignment Deadline | 11:59pm | - | -
Dec 30 | Project Deadline | 11:59pm | - | -
Course Assessment
(Original) Your grade will be based on the following components:
* Homework & Programming assignments (5 sets in total): 40%
* Project: 20%
* Final Exam: 40% (2-hour final examination)
(New) Your grade will be based on the following components:
- Homework & Programming assignments (4 sets in total): 40%
- Project: 35%
- Q & A writing assignment: 25%
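As a rough illustration of how the two weightings differ, the sketch below combines component scores into an overall mark. The weights are those listed above; the 0-100 score scale, the simple linear combination, the component names and the sample scores are assumptions made for the example, not the official grading procedure.

```python
# Weights (percent of overall grade) as listed in the two schemes above.
ORIGINAL_WEIGHTS = {"homework": 40, "project": 20, "final_exam": 40}
NEW_WEIGHTS = {"homework": 40, "project": 35, "qa_assignment": 25}


def overall_grade(scores, weights):
    """Weighted average of component scores.

    Assumption: every score is on a 0-100 scale and components are
    combined linearly by their percentage weights.
    """
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in weights.items()) / 100


# Hypothetical scores for one student under the new scheme.
example_scores = {"homework": 80, "project": 90, "qa_assignment": 70}
print(overall_grade(example_scores, NEW_WEIGHTS))
```

Under the new scheme the project moves from 20% to 35% of the total, so a given change in the project score shifts the overall mark by 1.75 times as much as it did under the original scheme.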
Student/Faculty Expectations on Teaching and Learning
http://mobitec.ie.cuhk.edu.hk/StaffStudentExpectations.pdf
Academic Honesty
You are expected to do your own work and acknowledge the use of anyone else's words or ideas. You MUST put down in your submitted work the names of people with whom you have had discussions.
Refer to http://www.cuhk.edu.hk/policy/academichonesty for details.
When scholastic dishonesty is suspected, the matter will be turned over to the University authority for action.
You MUST include the following signed statement in all of your submitted homework, project assignments and examinations. Submission without a signed statement will not be graded.
I declare that the assignment here submitted is original except for source material explicitly acknowledged, and that the same or related material has not been previously submitted for another course. I also acknowledge that I am aware of University policy and regulations on honesty in academic work, and of the disciplinary guidelines and procedures applicable to breaches of such policy and regulations, as contained in the website http://www.cuhk.edu.hk/policy/academichonesty/.
Acknowledgement
Thanks to Amazon Web Services, Google and Microsoft Azure for providing free computing resources in support of this course.