References - IEMS5730 Big Data Systems and Information Processing / Spring 2024

Recommended Text

[HadoopAppArch] Hadoop Application Architectures 1st Edition, by Mark Grover, Ted Malaska, Jonathan Seidman and Gwen Shapira, Publisher: O’Reilly Media, July 2015.
[Hadoop] Hadoop: The Definitive Guide 4th Edition, by Tom White, published by O’Reilly Media, April 2015.
[JLin] Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, Morgan and Claypool Publishers, 2010, can be freely downloaded from http://lintool.github.io/MapReduceAlgorithms/
[DataIntensive] Designing Data-Intensive Applications: The Big Ideas behind Reliable, Scalable and Maintainable Systems, Preview Edition, by Martin Kleppmann, Publisher: O’Reilly Media, 1st Edition to be published in 2016.
[StormApplied] Storm Applied, by Sean T. Allen, Matthew Jankowski and Peter Pathirana, Publisher: Manning, 2015
[BigData] Big Data: Principles and Best Practices of Scalable Realtime Data Systems, by Nathan Marz and James Warren, Publisher: Manning, 2015
[NoSQL] NoSQL Overview, Appendix A of the book titled “Graph Databases”, by Ian Robinson, Jim Webber and Emil Eifrem (Can request a free copy from http://graphdatabases.com)
[MMDS] Mining of Massive Datasets (Download version 1.3) by Anand Rajaraman, Jeff Ullman and Jure Leskovec, Cambridge University Press. Latest version can be downloaded from http://i.stanford.edu/~ullman/mmds.html#latest
[LearnSpark] Learning Spark: Lightning-Fast Big Data Analysis, 1st Edition, by Karau, Konwinski, Wendell and Zaharia, published by O’Reilly Media, 2015
[LearnSpark2ndEd] Learning Spark: Lightning-Fast Big Data Analysis, 2nd Edition, by Jules S. Damji, Brooke Wenig, Tathagata Das and Denny Lee, published by O’Reilly Media, July 2020
[Spark 1] Apache® Spark™ Analytics Made Simple, http://go.databricks.com/apache-spark-analytics-made-simple-databricks
[Spark 2] Mastering Advanced Analytics with Apache Spark®, http://go.databricks.com/mastering-advanced-analytics-apache-spark
[Spark 3] Lessons for Large-Scale Machine Learning Deployments on Apache Spark, http://go.databricks.com/large-scale-machine-learning-deployments-spark-databricks
[Spark 4] Mastering Apache Spark 2.0, http://go.databricks.com/mastering-apache-spark-2.0
[CloudComputing] Cloud Computing for Science and Engineering, by Ian Foster and Dennis B. Gannon, https://cloud4scieng.org/chapters/
[KafkaBook] Neha Narkhede, Gwen Shapira, Todd Palino, Kafka: The Definitive Guide, published by O’Reilly Media, July 2017, https://book.huihoo.com/pdf/confluent-kafka-definitive-guide-complete.pdf
[KleppmannMSSS] Martin Kleppmann, Making Sense of Stream Processing, published by O’Reilly Media, Mar 2016, https://www.oreilly.com/data/free/stream-processing.csp
[Samza] Martin Kleppmann, “Apache Samza,” a chapter on Apache Samza for the Encyclopedia of Big Data Technologies, March 2018, https://martin.kleppmann.com/papers/samza-encyclopedia.pdf
[EncyclopBigData] Encyclopedia of Big Data Technologies, Springer Link, First Online: April 2018. https://link.springer.com/referenceworkentry/10.1007/978-3-319-63962-8_303-1
[Streaming101] Tyler Akidau, “Streaming 101: The world beyond batch - A high-level tour of modern data-processing concepts,” Aug 2015 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[Streaming102] Tyler Akidau, “Streaming 102: The world beyond batch - The what, where, when and how of unbounded data processing,” Jan 2016, https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
[StreamingSys] Tyler Akidau, Slava Chernyak, Reuven Lax, Streaming Systems, published by O’Reilly Media, July 2018, http://shop.oreilly.com/product/0636920073994.do
[Flink] Paris Carbone et al, “Apache Flink: Stream and Batch Processing in a Single Engine,” Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, http://asterios.katsifodimos.com/assets/publications/flink-deb.pdf
[FlinkBook1] Ellen Friedman, Kostas Tzoumas, Introduction to Apache Flink, published by O’Reilly Media, Oct 2016, online free version accessible from: https://mapr.com/ebooks/intro-to-apache-flink/
[FlinkBook2] Vasiliki Kalavri, Fabian Hueske, Stream Processing with Apache Flink, (Early Release Edition), published by O’Reilly Media, Feb 2018, https://www.oreilly.com/library/view/stream-processing-with/9781491974285/ https://info.lightbend.com/rs/558-NCX-702/images/preview-apache-flink.pdf
[Spark2018] Bill Chambers, Matei Zaharia, Spark: The Definitive Guide: Big Data Processing Made Simple (1st Edition), published by O’Reilly Media, Feb 2018, http://shop.oreilly.com/product/0636920034957.do

Recommended Programming References

[SparkAnalytics] Advanced Analytics with Spark, by Sandy Ryza, Uri Laserson, Sean Owen and Josh Wills, Publisher: O’Reilly Media, April 2015
[DataAlgorithms] Data Algorithms: Recipes for Scaling Up with Hadoop and Spark, by Mahmoud Parsian, Publisher: O’Reilly Media, Aug 2015
[LearnSpark] Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, Publisher: O’Reilly Media, Feb 2015
[HBase] HBase: The Definitive Guide, by Lars George, published by O’Reilly Media.
[CassandraBook] Cassandra: The Definitive Guide, by Eben Hewitt, published by O’Reilly Media.
[ZooKeeper] ZooKeeper: Distributed Process Coordination, by Flavio Junqueira and Benjamin Reed, published by O’Reilly Media, 2013
[Pig] Programming Pig, by Alan Gates, published by O’Reilly Media
[Hive] Programming Hive, by Edward Capriolo, Dean Wampler and Jason Rutherglen, published by O’Reilly Media
[OpenStackOp] OpenStack Operations Guide, published by O’Reilly Media, (current-version available online at: http://docs.openstack.org/openstack-ops/content )
[OozieBook] Apache Oozie: The Workflow Scheduler for Hadoop, by Mohammad Kamrul Islam and Aravind Srinivasan, published by O’Reilly Media, 2015.
[Storm] Hortonworks Data Platform - Apache Storm Component Guide, https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_storm-component-guide/bk_storm-component-guide.pdf
[KubernetesTutorial] Kubernetes Tutorial for Beginners [FULL COURSE in 4 Hours], TechWorld with Nana https://www.youtube.com/watch?v=X48VuDVv0do&t=9106s

Readings in Database Systems (commonly known as the “Red Book”)

[Red Book] Readings in Database Systems, 5th Edition, by Peter Bailis, Joseph M. Hellerstein, Michael Stonebraker, (current-version available online at: http://www.redbook.io)

Related Courses offered Elsewhere

[JLinUWaterloo] INST 767 Big Data Infrastructure, by Jimmy Lin, University of Waterloo Course INST 767, http://lintool.github.io/UMD-courses/bigdata-2015-Spring
[UIUCcs498] CS498 Cloud Computing, by Roy Campbell and Reza Farivar, UIUC.
[UPennNETS] NETS212 Scalable and Cloud Computing, by Andreas Haeberlen, UPenn
[UPennCS555] CIS455/555 Internet and Web Systems, by Andreas Haeberlen, UPenn
[CornellBirman] CS5412 Cloud Computing, by Ken Birman, Cornell
[CMUQatar] 15-319 Cloud Computing, by M. F. Sakr and M. Hammoud, CMU Qatar
[LASERsummer2013] Software for the Cloud and Big Data, 10th LASER Summer School on Software Engineering, Sept 2013, http://laser.inf.ethz.ch/2013/lectures.php
[TwitterUCB] Analyzing Big Data with Twitter, by Marti Hearst et al, UC Berkeley School of Information, Course i290, http://blogs.ischool.berkeley.edu/i290-abdt-s12/
[JLeskovecMMDS] Mining Massive Data Sets, by Jure Leskovec, Stanford Course CS246, http://www.stanford.edu/class/cs246/
[ASmolaUCB] Scalable Machine Learning, by Alex Smola, UC Berkeley Course Statistics 241B, CS281B, http://alex.smola.org/teaching/berkeley2012/
[WCohenCMU] Machine Learning with Large Datasets, by William W. Cohen, CMU Course 10-605 http://curtis.ml.cmu.edu/w/courses/index.php/Machine_Learning_with_Large_Datasets_10-605_in_Spring_2014
[HadoopMasterClass14] Lars George (https://twitter.com/larsgeorge), “Cloudera Hadoop MasterClass,” Rittman Mead BI Forum 2014, https://blog.daanalytics.nl/2014/05/08/rm-bi-forum-2014-notes-cloudera-hadoop-masterclass/
[HadoopLabs&Tutorials] http://www.coreservlets.com/hadoop-tutorial/
[UC Berkeley CS186 Introduction to Database Management Systems Course Materials]

General Big Data

[Tim Harford] Tim Harford, Big data: are we making a big mistake?
[Hammond11] Kevin Hammond, “Why Parallel Functional Programming Matters: Panel Statement”, Reliable Software Technologies, Ada-Europe 2011, LNCS Vol. 6652, 2011, http://link.springer.com/book/10.1007/978-3-642-21338-0

Infrastructure for Big Data Processing/ Cloud Computing

[DataCenter] The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, by Luiz Andre Barroso and Urs Holzle, Published by Morgan and Claypool, 2009, http://bnrg.eecs.berkeley.edu/~randy/Courses/CS294.F09/wharehousesizedcomputers.pdf
[CloudData] Siba Mohammad, Sebastian Breß, Eike Schallehn, “Cloud Data Management: A Short Overview and Comparison of Current Approaches,” 24th GI-Workshop on Foundations of Databases, May 2012. http://ceur-ws.org/Vol-850/paper_mohammad.pdf
[JupiterRising] Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network, SIGCOMM, 2015, http://www.datascienceassn.org/sites/default/files/Jupiter Rising - A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network.pdf

MapReduce and other Big Data Processing Platforms

[Cloudera] Cloudera Developer Training for Apache Hadoop, http://cloudera.com/content/cloudera/en/training/courses/developer-training.html , http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Developer_Training_for_Apache_Hadoop.pdf
[MMDSHadoopLabs] Mining Massive Data Sets: Hadoop Labs, by Daniel Templeton and Jure Leskovec, Stanford Course CS246H, http://www.stanford.edu/class/cs246h/
[PlatformsKentU] Advanced computing Platforms for Data Processing, by Ruoming Jin, Kent State University Course http://www.cs.kent.edu/~jin/Cloud12Spring/Cloud.html
[BDAS] The Berkeley Data Analytics Stack (BDAS), https://amplab.cs.berkeley.edu/software/
[Mahout] Apache Mahout: Scalable Machine Learning and Data Mining, http://mahout.apache.org
[TeraSort] TeraByte Sort on Apache Hadoop, Yahoo, http://sortbenchmark.org/YahooHadoop.pdf
[TeraSort] TeraSort using Hadoop, http://www.slideshare.net/tungld/terasort
[Kay Ousterhout] Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker and Byung-Gon Chun, Making Sense of Performance in Data Analytics Frameworks, NSDI 2015. https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-ousterhout.pdf

Mining Massive Graphs and Graph-based Processing Platforms

[PowerLaw] Zipf, Power-Laws and Pareto: A Ranking Tutorial, by L. Adamic, http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
[Pregel] G. Malewicz et al, “Pregel: A System for Large-Scale Graph Processing,” ACM SIGMOD 2010.
[GraphLab] GraphLab: Large-scale Machine Learning on Graphs, http://www.cs.cmu.edu/~ylow/thesis/thesis.pdf
[GraphLab2] Carlos Guestrin et al, “GraphLab 2: Parallel Machine Learning for Large-Scale Natural Graphs,” NIPS Big Learning Workshop 2011.
[GraphLab1] Yucheng Low, Joseph Gonzalez et al, “GraphLab: A New Framework for Parallel Machine Learning,” UAI 2010.
[PowerGraph] Joseph Gonzalez et al, “PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs,” OSDI 2012.

Data Stream Processing Algorithms

[GaroRamaUCB] CS286 Implementation of Database Systems, UC Berkeley, Minos Garofalakis, Raghu Ramakrishnan, http://db.cs.berkeley.edu/cs286sp07/
[JXu] A Tutorial on Network Data Streaming, by Jun (Jim) Xu, ACM Sigmetrics 2007, http://www.cc.gatech.edu/~jx/8803DS08/sigm07.pdf
[SmolaUCB] Stat 260 Scalable Machine Learning of UC Berkeley, by Alex Smola, CMU, http://alex.smola.org/teaching/berkeley2012/streams.html
[Heron] Maosong Fu, “Twitter Heron: Towards Extensible Streaming Engines”

High-level Big Data Query Language/ Processing Systems

Pig Cheat Sheet from Mortar: http://mortar-public-site-content.s3-website-us-east-1.amazonaws.com
Pig on Spark: https://cwiki.apache.org/confluence/display/PIG/Pig+on+Spark and its original effort - Spork: https://github.com/sigmoidanalytics/spork
Hive Cheat Sheet for SQL users from Hortonworks: http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/
A List of Subtle Differences Between HiveQL and SQL: http://spryinc.com/blog/list-subtle-differences-between-hiveql-and-sql
A VLDB 2015 tutorial on SQL-on-Hadoop Systems by Daniel Abadi et al (abstract and slides)

Big Data processing Architectures in the Real-World

[FacebookHive] Hive - A Peta-scale Data Warehouse System on Hadoop, by Ning Zhang, Data Infrastructure Team in Facebook, https://www.facebook.com/notes/facebook-engineering/hive-a-petabyte-scale-data-warehouse-using-hadoop/89508453919
[FacebookDataArch] Peta-scale Data at Facebook, by Dhruba Borthakur, XLDB Conference at Stanford University, 2012 http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf

Workflow Scheduling for Hadoop

[OozieSWEET12] Mohammad K. Islam et al, “Oozie: Towards a Scalable Workflow Management System for Hadoop”, ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies (SWEET), May 2012, current-version and SLIDES
[OozieHUG13] Mona Chitnis, “Oozie - Now and Beyond,” Hadoop User Group Sunnyvale, Oct 2013, https://www.slideshare.net/ydn/hadoop-meetup-hug-october-2013-oozie-4x
[OozieAWS16] Use Apache Oozie Workflows to Automate Apache Spark Jobs (and more!) on Amazon EMR, June 2016, https://aws.amazon.com/blogs/big-data/use-apache-oozie-workflows-to-automate-apache-spark-jobs-and-more-on-amazon-emr/
[OozieBook] Apache Oozie: The Workflow Scheduler for Hadoop, by Mohammad Kamrul Islam and Aravind Srinivasan, published by O’Reilly Media, 2015.

Good Distributed Systems Courses/ Lectures:

Prof. Martin Kleppmann, Cambridge:
- https://www.cl.cam.ac.uk/teaching/2122/ConcDisSys/
- https://www.youtube.com/playlist?list=PLeKd45zvjcDFUEv_ohr_HdUFe97RItdiB
Prof. Steve Ko, University at Buffalo:
- https://cse.buffalo.edu/~stevko/courses/cse486/spring20/schedule.html
Prof. Frans Kaashoek, MIT:
- http://nil.csail.mit.edu/6.824/2021/schedule.html
Prof. Bob Morris, MIT:
- https://www.youtube.com/playlist?list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB
Prof. Lindsey Kuper, UC Santa Cruz:
Prof. Indranil Gupta, UIUC:
- https://courses.engr.illinois.edu/cs425/fa2022/lectures.html
Prof. Lorenzo Alvisi, Cornell:
- https://www.cs.cornell.edu/courses/cs5414/2017fa/
Prof. John Ousterhout and Diego Ongaro, lectures on Paxos and Raft (part of the Raft user study):
Dr. Chris Colohan:
- http://www.distributedsystemscourse.com/
- https://www.youtube.com/@DistributedSystems
