• [MapReduce] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004.

  • [GoogleFileSystem] Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung, “The Google File System,” ACM SOSP 2003.

  • [MapReduceFamilySurvey2013] S. Sakr, A. Liu, A. G. Fayoumi, “The family of MapReduce and large-scale data processing systems,” ACM Computing Surveys (CSUR), 2013.

  • [YARN] V.K. Vavilapalli, A.C. Murthy et al, “Apache Hadoop YARN: Yet Another Resource Negotiator,” ACM Symposium on Cloud Computing (SoCC) 2013.

  • [Mesos] B. Hindman et al, “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center”, NSDI 2011.

  • [DRF] A. Ghodsi et al, “Dominant Resource Fairness: Fair Allocation of Multiple Resource Types,” NSDI 2011.

  • [Borg] A. Verma, L. Pedrosa et al, “Large-scale cluster management at Google with Borg,” Eurosys 2015.

  • [BorgOmegaK8s] Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, John Wilkes, “Borg, Omega and Kubernetes,” ACM Queue Magazine, Vol. 14, No. 1, Jan 2016, https://dl.acm.org/doi/10.1145/2898442.2898444

  • [Omega] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, J. Wilkes, “Omega: flexible, scalable schedulers for large compute clusters,” Eurosys 2013.

  • [Sparrow] K. Ousterhout et al, “Sparrow: Distributed, Low Latency Scheduling”, ACM SOSP 2013

  • [Apollo] E. Boutin et al, “Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing”, OSDI 2014

  • [Mercury] K. Karanasos et al, “Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters”, Usenix ATC 2015

  • [GraphLab1] Yucheng Low, Joseph Gonzalez et al, “GraphLab: A New Framework for Parallel Machine Learning,” UAI 2010.

  • [PowerGraph] Joseph Gonzalez et al, “PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs,” OSDI 2012.

  • [GraphChi] Aapo Kyrola, Guy Blelloch, Carlos Guestrin, “GraphChi: Large-Scale Graph Computation on Just a PC,” OSDI 2012.

  • [Storm@Twitter] Ankit Toshniwal et al, “Storm@Twitter,” ACM SIGMOD 2014.

  • [PigLatin] Christopher Olston et al, “Pig Latin: A Not-So-Foreign Language for Data Processing,” ACM SIGMOD 2008.

  • [Hive1] Ashish Thusoo et al, “Hive: a warehousing solution over a map-reduce framework,” VLDB 2009.

  • [Hive2] Ashish Thusoo et al, “Data warehousing and analytics infrastructure at facebook,” ACM SIGMOD 2010

  • [Hive3] Ashish Thusoo et al, “Hive - A Petabyte Scale Data Warehouse Using Hadoop,” IEEE ICDE 2010.

  • [HiveAdvances] Yin Huai et al, “Major Technical Advancements in Apache Hive,” ACM SIGMOD 2014.

  • [Dryad] Michael Isard et al, “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks,” Eurosys 2007.

  • [DryadLINQ] Yuan Yu et al, “DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language,” OSDI 2008.

  • [DryadLINQ2] Michael Isard, Yuan Yu, “Distributed Data-Parallel Computing Using a High-Level Programming Language,” ACM SIGMOD 2009

  • [Tez] Bikas Saha et al, “Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications,” ACM SIGMOD 2015.

  • [Dynamo] Giuseppe DeCandia et al, “Dynamo: Amazon’s Highly Available Key-value Store,” ACM SOSP 2007.

  • [BigTable] Fay Chang et al, “Bigtable: A Distributed Storage System for Structured Data,” OSDI 2006.

  • [Cassandra] Avinash Lakshman, Prashant Malik, “Cassandra - A Decentralized Structured Storage System,” ACM SIGOPS Operating Systems Review, Apr 2010.

  • [RealtimeHadoopFacebook] Dhruba Borthakur et al, “Apache Hadoop goes realtime at Facebook,” ACM SIGMOD 2011.

  • [SparkRDD] Matei Zaharia et al, “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” NSDI 2012.

  • [Spark] Matei Zaharia et al, “Fast and Interactive Analytics over Hadoop Data with Spark,” Usenix ;login Aug 2012.

  • [Spark Streaming] Matei Zaharia et al, “Discretized streams: Fault-tolerant streaming computation at scale,” ACM SOSP 2013.

  • [SharkSQL] Reynold S. Xin et al, “Shark: SQL and rich analytics at scale,” ACM SIGMOD 2013

  • [SparkSQL] Michael Armbrust et al, “Spark SQL: Relational Data Processing in Spark,” ACM SIGMOD 2015.

  • [GraphX] Joseph E. Gonzalez et al, “GraphX: Graph Processing in a Distributed Dataflow Framework,” OSDI 2014.

  • [SparkScaling] Michael Armbrust et al, “Scaling Spark in the real world: performance and usability,” VLDB 2015.

  • [MapReduceVsSpark] Juwei Shi et al, “Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics,” VLDB 2015.

  • [SparkMLbase] T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. Franklin, M.I. Jordan, “MLbase: A Distributed Machine Learning System,” In Conference on Innovative Data Systems Research (CIDR), 2013.

  • [SparkMLI] E. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, M. Franklin, M. I. Jordan, T. Kraska, “MLI: An API for Distributed Machine Learning,” In International Conference on Data Mining, 2013.

  • [SparkMLlib] Xiangrui Meng et al, “MLlib: Machine learning in Apache Spark,” arXiv:1505.06807, May 2015.

  • [SparkNet] Philipp Moritz, Robert Nishihara, Ion Stoica, Michael Jordan, “SparkNet: Training Deep Networks on Spark,” ICLR 2016.

  • [Naiad] Derek G. Murray et al, “Naiad: A Timely Dataflow System,” ACM SOSP 2013.

  • [ZooKeeper1] P. Hunt, M. Konar, F.P. Junqueira, B. Reed, “ZooKeeper: Wait-free Coordination for Internet-scale Systems,” Usenix ATC 2010.

  • [ZAB1] Benjamin Reed, Flavio P. Junqueira, “A simple totally ordered broadcast protocol,” 2nd Workshop on Large-scale Distributed Systems and Middleware (LADIS), 2008.

  • [ZAB2] F.P. Junqueira, B.C. Reed, M. Serafini, “High-performance broadcast for primary-backup systems,” IEEE/IFIP DSN, 2011.

  • [Kubernetes1] Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, and John Wilkes, “Borg, Omega, and Kubernetes - Lessons Learned from Three Container-Management Systems over a decade,” ACM Queue, Jan 2016.

  • [Kubernetes2] David Rensin, Kubernetes - Scheduling the Future at Cloud Scale, (Free eBook) published by O’Reilly 2015.

  • [SparkAnalytics] Advanced Analytics with Spark, by Sandy Ryza, Uri Laserson, Sean Owen and Josh Wills, Publisher: O’Reilly Media, April 2015

  • [DataAlgorithms] Data Algorithms: Recipes for Scaling Up with Hadoop and Spark, by Mahmoud Parsian, Publisher: O’Reilly Media, Aug 2015

  • [LearnSpark] Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, Publisher: O’Reilly Media, Feb 2015

  • [HBase] HBase: The Definitive Guide, by Lars George, published by O’Reilly Media.

  • [CassandraBook] Cassandra: The Definitive Guide, by Eben Hewitt, published by O’Reilly Media.

  • [ZooKeeper] ZooKeeper: Distributed Process Coordination, by Flavio Junqueira and Benjamin Reed, published by O’Reilly Media, 2013

  • [Pig] Programming Pig, by Alan Gates, published by O’Reilly Media

  • [Hive] Programming Hive, by Edward Capriolo, Dean Wampler, Jason Rutherglen, published by O’Reilly Media

  • [OpenStackOp] OpenStack Operations Guide, published by O’Reilly Media, (current-version available online at: http://docs.openstack.org/openstack-ops/content )

  • [OozieBook] Apache Oozie : The Workflow Scheduler for Hadoop, by Mohammad Kamrul Islam, Aravind Srinivasan published by O’Reilly Media, 2015.

  • [Storm] Hortonworks Data Platform - Apache Storm Component Guide, https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_storm-component-guide/bk_storm-component-guide.pdf

  • [KubernetesTutorial] Kubernetes Tutorial for Beginners [FULL COURSE in 4 Hours], TechWorld with Nana https://www.youtube.com/watch?v=X48VuDVv0do&t=9106s

Readings in Database Systems (commonly known as the “Red Book”)

  • [Red Book] Readings in Database Systems, 5th Edition, by Peter Bailis, Joseph M. Hellerstein, Michael Stonebraker, (current-version available online at: http://www.redbook.io)

General Big Data

Infrastructure for Big Data Processing/ Cloud Computing

MapReduce and other Big Data Processing Platforms

Mining Massive Graphs and Graph-based Processing Platforms

  • [PowerLaw] Zipf, Power-Laws and Pareto: A Ranking Tutorial, by L. Adamic, http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
  • [Pregel] G. Malewicz et al, “Pregel: A System for Large-Scale Graph Processing,” ACM SIGMOD 2010.
  • [GraphLab] GraphLab: Large-scale Machine Learning on Graphs, http://www.cs.cmu.edu/~ylow/thesis/thesis.pdf
  • [GraphLab2] Carlos Guestrin et al, “GraphLab 2: Parallel Machine Learning for Large-Scale Natural Graphs,” NIPS Big Learning Workshop 2011.
  • [GraphLab1] Yucheng Low, Joseph Gonzalez et al, “GraphLab: A New Framework for Parallel Machine Learning,” UAI 2010.
  • [PowerGraph] Joseph Gonzalez et al, “PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs,” OSDI 2012.

Data Stream Processing Algorithms

High-level Big Data Query Language/ Processing Systems

Big Data processing Architectures in the Real-World

Workflow Scheduling for Hadoop

Good Distributed Systems Courses/ Lectures

  • By Prof. Martin Kleppmann, Cambridge: WEB Video

  • By Prof. Steve Ko, University at Buffalo: WEB

  • By Prof. Frans Kaashoek, MIT: WEB
    • the schedule contains links to UNLISTED YouTube videos by Prof. Frans Kaashoek
    • alternatively, Prof. Bob Morris of MIT also offered a different version of the course: Video
  • By Prof. Lindsey Kuper, UC Santa Cruz: WEB
  • By Prof. Indranil Gupta of UIUC: WEB

  • By Prof. Lorenzo Alvisi of Cornell: WEB

  • By Prof. John Ousterhout and Diego Ongaro, lectures on Paxos and Raft (part of the Raft user study):
  • By Dr. Chris Colohan, WEB Video