References
Recommended Text
-
[HadoopAppArch] Hadoop Application Architectures 1st Edition, by Mark Grover, Ted Malaska, Jonathan Seidman and Gwen Shapira, Publisher: O’Reilly Media, July 2015.
-
[Hadoop] Hadoop: The Definitive Guide 4th Edition, by Tom White, published by O’Reilly Media, April 2015.
-
[JLin] Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, Morgan and Claypool Publishers, 2010, can be freely downloaded from http://lintool.github.io/MapReduceAlgorithms/
-
[DataIntensive] Designing Data-Intensive Applications: The Big Ideas behind Reliable, Scalable and Maintainable Systems, Preview Edition, by Martin Kleppmann, Publisher: O’Reilly Media, 1st Edition to be published in 2016.
-
[StormApplied] Storm Applied, by Sean T. Allen, Matthew Jankowski and Peter Pathirana, Publisher: Manning, 2015
-
[BigData] Big Data: Principles and Best Practices of Scalable Realtime Data Systems, by Nathan Marz and James Warren, Publisher: Manning, 2015
-
[NoSQL] NoSQL Overview, Appendix A of the book titled “Graph Databases”, by Ian Robinson, Jim Webber and Emil Eifrem (Can request a free copy from http://graphdatabases.com)
-
[MMDS] Mining of Massive Datasets (Download version 1.3) by Anand Rajaraman, Jeff Ullman and Jure Leskovec, Cambridge University Press. Latest version can be downloaded from http://i.stanford.edu/~ullman/mmds.html#latest
-
[LearnSpark] Learning Spark: Lightning-Fast Big Data Analysis, 1st Edition, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, published by O’Reilly Media, 2015
-
[LearnSpark2ndEd] Learning Spark: Lightning-Fast Big Data Analysis, 2nd Edition, by Jules S. Damji, Brooke Wenig, Tathagata Das and Denny Lee, published by O’Reilly Media, July 2020
-
[Spark 1] Apache® Spark™ Analytics Made Simple, http://go.databricks.com/apache-spark-analytics-made-simple-databricks
-
[Spark 2] Mastering Advanced Analytics with Apache Spark®, http://go.databricks.com/mastering-advanced-analytics-apache-spark
-
[Spark 3] Lessons for Large-Scale Machine Learning Deployments on Apache Spark, http://go.databricks.com/large-scale-machine-learning-deployments-spark-databricks
-
[Spark 4] Mastering Apache Spark 2.0, http://go.databricks.com/mastering-apache-spark-2.0
-
[CloudComputing] Cloud Computing for Science and Engineering, by Ian Foster and Dennis B. Gannon, https://cloud4scieng.org/chapters/
-
[KafkaBook] Neha Narkhede, Gwen Shapira, Todd Palino, Kafka: The Definitive Guide, published by O’Reilly Media, July 2017, https://book.huihoo.com/pdf/confluent-kafka-definitive-guide-complete.pdf
-
[KleppmannMSSS] Martin Kleppmann, Making Sense of Stream Processing, published by O’Reilly Media, Mar 2016, https://www.oreilly.com/data/free/stream-processing.csp
-
[Samza] Martin Kleppmann, “Apache Samza,” a chapter on Apache Samza for the Encyclopedia of Big Data Technologies, March 2018, https://martin.kleppmann.com/papers/samza-encyclopedia.pdf
-
[EncyclopBigData] Encyclopedia of Big Data Technologies, Springer Link, First Online: April 2018. https://link.springer.com/referenceworkentry/10.1007/978-3-319-63962-8_303-1
-
[Streaming101] Tyler Akidau, “Streaming 101: The world beyond batch - A high-level tour of modern data-processing concepts,” Aug 2015 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
-
[Streaming102] Tyler Akidau, “Streaming 102: The world beyond batch - The what, where, when and how of unbounded data processing,” Jan 2016, https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
-
[StreamingSys] Tyler Akidau, Slava Chernyak, Reuven Lax, Streaming Systems, published by O’Reilly Media, July 2018, http://shop.oreilly.com/product/0636920073994.do
-
[Flink] Paris Carbone et al, “Apache Flink: Stream and Batch Processing in a Single Engine,” Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, http://asterios.katsifodimos.com/assets/publications/flink-deb.pdf
-
[FlinkBook1] Ellen Friedman, Kostas Tzoumas, Introduction to Apache Flink, published by O’Reilly Media, Oct 2016, online free version accessible from: https://mapr.com/ebooks/intro-to-apache-flink/
-
[FlinkBook2] Vasiliki Kalavri, Fabian Hueske, Stream Processing with Apache Flink, (Early Release Edition), published by O’Reilly Media, Feb 2018, https://www.oreilly.com/library/view/stream-processing-with/9781491974285/ https://info.lightbend.com/rs/558-NCX-702/images/preview-apache-flink.pdf
-
[Spark2018] Bill Chambers, Matei Zaharia, Spark: The Definitive Guide: Big Data Processing Made Simple (1st Edition), published by O’Reilly Media, Feb 2018, http://shop.oreilly.com/product/0636920034957.do
Recommended List of Research Papers for Reading
-
[MapReduce] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004.
-
[GoogleFileSystem] Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung, “The Google File System,” ACM SOSP 2003.
-
[MapReduceFamilySurvey2013] S. Sakr, A. Liu, A. G. Fayoumi, “The Family of MapReduce and Large-Scale Data Processing Systems,” ACM Computing Surveys (CSUR), 2013.
-
[YARN] V.K. Vavilapalli, A.C. Murthy, “Apache Hadoop YARN: Yet Another Resource Negotiator,” ACM Symposium on Cloud Computing (SoCC) 2013.
-
[Mesos] B. Hindman et al, “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center”, NSDI 2011.
-
[DRF] A. Ghodsi et al, “Dominant Resource Fairness: Fair Allocation of Multiple Resource Types,” NSDI 2011.
-
[Borg] A. Verma, L. Pedrosa, “Large-scale cluster management at Google with Borg”, Eurosys 2015
-
[BorgOmegaK8s] Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, John Wilkes, “Borg, Omega and Kubernetes,” ACM Queue Magazine, Vol. 14, No. 1, Jan 2016, https://dl.acm.org/doi/10.1145/2898442.2898444
-
[Omega] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, J. Wilkes, “Omega: flexible, scalable schedulers for large compute clusters,” Eurosys 2013
-
[Sparrow] K. Ousterhout et al, “Sparrow: Distributed, Low Latency Scheduling”, ACM SOSP 2013
-
[Apollo] E. Boutin et al, “Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing”, OSDI 2014
-
[Mercury] K. Karanasos et al, “Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters”, Usenix ATC 2015
-
[GraphLab1] Yucheng Low, Joseph Gonzalez et al, “GraphLab: A New Framework for Parallel Machine Learning,” UAI 2010.
-
[PowerGraph] Joseph Gonzalez et al, “PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs,” OSDI 2012.
-
[GraphChi] Aapo Kyrola, Guy Blelloch, Carlos Guestrin, “GraphChi: Large-Scale Graph Computation on Just a PC,” OSDI 2012.
-
[Storm@Twitter] Ankit Toshniwal et al, “Storm@Twitter,” ACM SIGMOD 2014.
-
[PigLatin] Christopher Olston et al, “Pig Latin: A Not-So-Foreign Language for Data Processing,” ACM SIGMOD 2008.
-
[Hive1] Ashish Thusoo et al, “Hive: a warehousing solution over a map-reduce framework,” VLDB 2009.
-
[Hive2] Ashish Thusoo et al, “Data warehousing and analytics infrastructure at facebook,” ACM SIGMOD 2010
-
[Hive3] Ashish Thusoo et al, “Hive - A Petabyte Scale Data Warehouse Using Hadoop,” IEEE ICDE 2010.
-
[HiveAdvances] Yin Huai et al, “Major Technical Advancements in Apache Hive,” ACM SIGMOD 2014.
-
[Dryad] Michael Isard et al, “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks,” Eurosys 2007.
-
[DryadLINQ] Yuan Yu et al, “DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language,” OSDI 2008.
-
[DryadLINQ2] Michael Isard, Yuan Yu, “Distributed Data-Parallel Computing Using a High-Level Programming Language,” ACM SIGMOD 2009
-
[Tez] Bikas Saha et al, “Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications,” ACM SIGMOD 2015.
-
[Dynamo] Giuseppe DeCandia et al, “Dynamo: Amazon’s Highly Available Key-value Store,” ACM SOSP 2007.
-
[BigTable] Fay Chang et al, “Bigtable: A Distributed Storage System for Structured Data,” OSDI 2006.
-
[Cassandra] Avinash Lakshman, Prashant Malik, “Cassandra - A Decentralized Structured Storage System,” ACM SIGOPS Operating Systems Review, Apr 2010.
-
[RealtimeHadoopFacebook] Dhruba Borthakur et al, “Apache Hadoop goes realtime at Facebook,” ACM SIGMOD 2011.
-
[SparkRDD] Matei Zaharia et al, “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” NSDI 2012.
-
[Spark] Matei Zaharia et al, “Fast and Interactive Analytics over Hadoop Data with Spark,” Usenix ;login:, Aug 2012.
-
[Spark Streaming] Matei Zaharia et al, “Discretized streams: Fault-tolerant streaming computation at scale,” ACM SOSP 2013.
-
[SharkSQL] Reynold S. Xin et al, “Shark: SQL and rich analytics at scale,” ACM SIGMOD 2013
-
[SparkSQL] Michael Armbrust et al, “Spark SQL: Relational Data Processing in Spark,” ACM SIGMOD 2015.
-
[GraphX] Joseph E. Gonzalez et al, “GraphX: Graph Processing in a Distributed Dataflow Framework,” OSDI 2014.
-
[SparkScaling] Michael Armbrust et al, “Scaling Spark in the real world: performance and usability,” VLDB 2015.
-
[MapReduceVsSpark] Juwei Shi et al, “Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics,” VLDB 2015.
-
[SparkMLbase] T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. Franklin, M.I. Jordan, “MLbase: A Distributed Machine Learning System,” Conference on Innovative Data Systems Research (CIDR), 2013.
-
[SparkMLI] E. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, M. Franklin, M.I. Jordan, T. Kraska, “MLI: An API for Distributed Machine Learning,” International Conference on Data Mining, 2013.
-
[SparkMLlib] Xiangrui Meng et al, “MLlib: Machine learning in Apache Spark,” arXiv:1505.06807, May 2015.
-
[SparkNet] Philipp Moritz, Robert Nishihara, Ion Stoica, Michael Jordan, “SparkNet: Training Deep Networks on Spark,” ICLR 2016.
-
[Naiad] Derek G. Murray et al, “Naiad: A Timely Dataflow System,” ACM SOSP 2013.
-
[ZooKeeper1] P. Hunt, M. Konar, F.P. Junqueira, B. Reed, “ZooKeeper: Wait-free Coordination for Internet-scale Systems,” Usenix ATC 2010.
-
[ZAB1] Benjamin Reed, Flavio P. Junqueira, “A simple totally ordered broadcast protocol,” 2nd Workshop on Large-scale Distributed Systems and Middleware (LADIS), 2008.
-
[ZAB2] F.P. Junqueira, B.C. Reed, M. Serafini, “High-performance broadcast for primary-backup systems,” IEEE/IFIP DSN, 2011.
-
[Kubernetes1] Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, and John Wilkes, “Borg, Omega, and Kubernetes - Lessons Learned from Three Container-Management Systems over a decade,” ACM Queue, Jan 2016.
-
[Kubernetes2] David Rensin, Kubernetes - Scheduling the Future at Cloud Scale, (Free eBook) published by O’Reilly 2015.
Recommended Programming References
-
[SparkAnalytics] Advanced Analytics with Spark, by Sandy Ryza, Uri Laserson, Sean Owen and Josh Wills, Publisher: O’Reilly Media, April 2015
-
[DataAlgorithms] Data Algorithms: Recipes for Scaling Up with Hadoop and Spark, by Mahmoud Parsian, Publisher: O’Reilly Media, Aug 2015
-
[LearnSpark] Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, Publisher: O’Reilly Media, Feb 2015
-
[HBase] HBase: The Definitive Guide, by Lars George, published by O’Reilly Media.
-
[CassandraBook] Cassandra: The Definitive Guide, by Eben Hewitt, published by O’Reilly Media.
-
[ZooKeeper] ZooKeeper: Distributed Process Coordination, by Flavio Junqueira and Benjamin Reed, published by O’Reilly Media, 2013
-
[Pig] Programming Pig, by Alan Gates, published by O’Reilly Media
-
[Hive] Programming Hive, by Edward Capriolo, Dean Wampler, Jason Rutherglen, published by O’Reilly Media
-
[OpenStackOp] OpenStack Operations Guide, published by O’Reilly Media (current version available online at: http://docs.openstack.org/openstack-ops/content)
-
[OozieBook] Apache Oozie: The Workflow Scheduler for Hadoop, by Mohammad Kamrul Islam and Aravind Srinivasan, published by O’Reilly Media, 2015.
-
[Storm] Hortonworks Data Platform - Apache Storm Component Guide, https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_storm-component-guide/bk_storm-component-guide.pdf
-
[KubernetesTutorial] Kubernetes Tutorial for Beginners [FULL COURSE in 4 Hours], TechWorld with Nana https://www.youtube.com/watch?v=X48VuDVv0do&t=9106s
Readings in Database Systems (commonly known as the “Red Book”)
- [Red Book] Readings in Database Systems, 5th Edition, by Peter Bailis, Joseph M. Hellerstein, Michael Stonebraker, (current-version available online at: http://www.redbook.io)
Related Courses offered Elsewhere
- [JLinUWaterloo] Big Data Infrastructure, by Jimmy Lin, University of Maryland Course INST 767, http://lintool.github.io/UMD-courses/bigdata-2015-Spring
- [UIUCcs498] CS498 Cloud Computing, by Roy Campbell and Reza Farivar, UIUC.
- [UPennNETS] NETS212 Scalable and Cloud Computing, by Andreas Haeberlen, UPenn
- [UPennCS555] CIS455/555 Internet and Web Systems, by Andreas Haeberlen, UPenn
- [CornellBirman] CS5412 Cloud Computing, by Ken Birman, Cornell
- [CMUQatar] 15-319 Cloud Computing, by M. F. Sakr and M. Hammoud, CMU Qatar
- [LASERsummer2013] Software for the Cloud and Big Data, 10th LASER Summer School on Software Engineering, Sept 2013, http://laser.inf.ethz.ch/2013/lectures.php
- [TwitterUCB] Analyzing Big Data with Twitter, by Marti Hearst et al, UC Berkeley School of Information, Course i290, http://blogs.ischool.berkeley.edu/i290-abdt-s12/
- [JLeskovecMMDS] Mining Massive Data Sets, by Jure Leskovec, Stanford Course CS246, http://www.stanford.edu/class/cs246/
- [ASmolaUCB] Scalable Machine Learning, by Alex Smola, UC Berkeley Course Statistics 241B, CS281B, http://alex.smola.org/teaching/berkeley2012/
- [WCohenCMU] Machine Learning with Large Datasets, by William W. Cohen, CMU Course 10-605, http://curtis.ml.cmu.edu/w/courses/index.php/Machine_Learning_with_Large_Datasets_10-605_in_Spring_2014
- [HadoopMasterClass14] Lars George (https://twitter.com/larsgeorge), “Cloudera Hadoop MasterClass,” Rittman Mead BI Forum 2014, https://blog.daanalytics.nl/2014/05/08/rm-bi-forum-2014-notes-cloudera-hadoop-masterclass/
- [HadoopLabs&Tutorials] http://www.coreservlets.com/hadoop-tutorial/
- [UC Berkeley CS186 Introduction to Database Management Systems Course Materials]
- (Fall 2020) Offered by Prof. Alvin Cheung
- (Fall 2017) Offered by Prof. Joe Hellerstein
- (Spring 2020) Offered by Prof. Josh Hug and Prof. Michael Ball
- Lecture Notes and other Course Materials from Public Google Drive (Fall 2020 and Spring 2020)
- Lecture Notes and other Course Materials from Public Google Drive (Fall 2017)
- Lecture Videos from YouTube
General Big Data
- [Tim Harford] Tim Harford, “Big data: are we making a big mistake?”
- [Hammond11] Kevin Hammond, “Why Parallel Functional Programming Matters: Panel Statement”, Reliable Software Technologies, Ada-Europe 2011, LNCS Vol. 6652, 2011, http://link.springer.com/book/10.1007/978-3-642-21338-0
Infrastructure for Big Data Processing/ Cloud Computing
- [DataCenter] The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, by Luiz Andre Barroso and Urs Holzle, published by Morgan and Claypool, 2009, http://bnrg.eecs.berkeley.edu/~randy/Courses/CS294.F09/wharehousesizedcomputers.pdf
- [CloudData] Siba Mohammad, Sebastian Breß, Eike Schallehn, “Cloud Data Management: A Short Overview and Comparison of Current Approaches,” 24th GI-Workshop on Foundations of Databases, May 2012, http://ceur-ws.org/Vol-850/paper_mohammad.pdf
- [JupiterRising] Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network, SIGCOMM, 2015, http://www.datascienceassn.org/sites/default/files/Jupiter Rising - A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network.pdf
MapReduce and other Big Data Processing Platforms
- [Cloudera] Cloudera Developer Training for Apache Hadoop, http://cloudera.com/content/cloudera/en/training/courses/developer-training.html, http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Developer_Training_for_Apache_Hadoop.pdf
- [MMDSHadoopLabs] Mining Massive Data Sets: Hadoop Labs, by Daniel Templeton and Jure Leskovec, Stanford Course CS246H, http://www.stanford.edu/class/cs246h/
- [PlatformsKentU] Advanced Computing Platforms for Data Processing, by Ruoming Jin, Kent State University Course, http://www.cs.kent.edu/~jin/Cloud12Spring/Cloud.html
- [BDAS] The Berkeley Data Analytics Stack (BDAS), https://amplab.cs.berkeley.edu/software/
- [Mahout] Apache Mahout: Scalable Machine Learning and Data Mining, http://mahout.apache.org
- [TeraSort] TeraByte Sort on Apache Hadoop, Yahoo, http://sortbenchmark.org/YahooHadoop.pdf
- [TeraSort] TeraSort using Hadoop, http://www.slideshare.net/tungld/terasort
- [Kay Ousterhout] Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker and Byung-Gon Chun, “Making Sense of Performance in Data Analytics Frameworks,” NSDI 2015, https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-ousterhout.pdf
Mining Massive Graphs and Graph-based Processing Platforms
- [PowerLaw] Zipf, Power-Laws and Pareto: A Ranking Tutorial, by L. Adamic, http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
- [Pregel] G. Malewicz et al, “Pregel: A System for Large-Scale Graph Processing,” ACM SIGMOD 2010.
- [GraphLab] GraphLab: Large-scale Machine Learning on Graphs, http://www.cs.cmu.edu/~ylow/thesis/thesis.pdf
- [GraphLab2] Carlos Guestrin et al, “GraphLab 2: Parallel Machine Learning for Large-Scale Natural Graphs,” NIPS Big Learning Workshop 2011.
- [GraphLab1] Yucheng Low, Joseph Gonzalez et al, “GraphLab: A New Framework for Parallel Machine Learning,” UAI 2010.
- [PowerGraph] Joseph Gonzalez et al, “PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs,” OSDI 2012.
Data Stream Processing Algorithms
- [GaroRamaUCB] CS286 Implementation of Database Systems, UC Berkeley, Minos Garofalakis, Raghu Ramakrishnan, http://db.cs.berkeley.edu/cs286sp07/
- [JXu] A Tutorial on Network Data Streaming, by Jun (Jim) Xu, ACM Sigmetrics 2007, http://www.cc.gatech.edu/~jx/8803DS08/sigm07.pdf
- [SmolaUCB] Stat 260 Scalable Machine Learning of UC Berkeley, by Alex Smola, CMU, http://alex.smola.org/teaching/berkeley2012/streams.html
- [Heron] Maosong Fu, “Twitter Heron: Towards Extensible Streaming Engines”
High-level Big Data Query Language/ Processing Systems
- Pig Cheat Sheet from Mortar: http://mortar-public-site-content.s3-website-us-east-1.amazonaws.com
- Pig on Spark: https://cwiki.apache.org/confluence/display/PIG/Pig+on+Spark and its original effort - Spork: https://github.com/sigmoidanalytics/spork
- Hive Cheat Sheet for SQL users from Hortonworks: http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/
- A List of Subtle Differences Between HiveQL and SQL: http://spryinc.com/blog/list-subtle-differences-between-hiveql-and-sql
- A VLDB 2015 tutorial on SQL-on-Hadoop Systems by Daniel Abadi et al: Abstract ; Slides
Big Data Processing Architectures in the Real World
- [FacebookHive] Hive - A Peta-scale Data Warehouse System on Hadoop, by Ning Zhang, Data Infrastructure Team in Facebook, https://www.facebook.com/notes/facebook-engineering/hive-a-petabyte-scale-data-warehouse-using-hadoop/89508453919
- [FacebookDataArch] Peta-scale Data at Facebook, by Dhruba Borthakur, XLDB Conference at Stanford University, 2012 http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf
Workflow Scheduling for Hadoop
- [OozieSWEET12] Mohammad K. Islam et al, “Oozie: Towards a Scalable Workflow Management System for Hadoop”, ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies (SWEET), May 2012, current-version and SLIDES
- [OozieHUG13] Mona Chitnis, “Oozie - Now and Beyond,” Hadoop User Group Sunnyvale, Oct 2013, https://www.slideshare.net/ydn/hadoop-meetup-hug-october-2013-oozie-4x
- [OozieAWS16] Use Apache Oozie Workflows to Automate Apache Spark Jobs (and more!) on Amazon EMR, June 2016, https://aws.amazon.com/blogs/big-data/use-apache-oozie-workflows-to-automate-apache-spark-jobs-and-more-on-amazon-emr/
- [OozieBook] Apache Oozie: The Workflow Scheduler for Hadoop, by Mohammad Kamrul Islam and Aravind Srinivasan, published by O’Reilly Media, 2015.