References
Highly Recommended Textbooks

[MMDS] Mining of Massive Datasets (3rd edition) by Anand Rajaraman, Jeff Ullman and Jure Leskovec, Cambridge University Press. The latest version can be downloaded from the MMDS Homepage or Mining of Massive Datasets.pdf

[JLin] Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, Morgan and Claypool Publishers, 2010. Can be freely downloaded from http://lintool.github.io/MapReduceAlgorithms/

[CBishop] Pattern Recognition and Machine Learning by Christopher M. Bishop, Published by Springer Science+Business Media, 2007.

[MLE/MAP] Estimating Probabilities: MLE and MAP http://www.cs.cmu.edu/~tom/mlbook/Joint_MLE_MAP.pdf

[HTF] The Elements of Statistical Learning, 2nd Edition, by Trevor Hastie, Robert Tibshirani, Jerome H. Friedman, Published by Springer, 2009. Ebook version can be downloaded from: http://link.springer.com/book/10.1007/9780387848587 via a CUHK IP address

[JWHT] An Introduction to Statistical Learning with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Published by Springer, 2013. Ebook version can be downloaded from: http://link.springer.com/book/10.1007/9781461471387 via a CUHK IP address

[PCA] Principal Component Analysis, 2nd Edition, by I.T. Jolliffe, Published by Springer, 2002. Ebook version can be downloaded from: http://cda.psych.uiuc.edu/statistical_learning_course/Jolliffe%20I.%20Principal%20Component%20Analysis%20(2ed.,%20Springer,%202002)(518s)MVsa.pdf

[ShaliziADAEPV] Cosma Rohilla Shalizi, “Advanced Data Analysis from an Elementary Point of View”, Cambridge University Press, 2014. Draft available for download from: http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

[ShaprioStockman] Shapiro and Stockman, Computer Vision, 2000, Chapter 4.2-4.9, https://courses.cs.washington.edu/courses/cse576/book/ch4.pdf

[Blum] Blum, Avrim, John Hopcroft, and Ravindran Kannan. “Foundations of Data Science.” (2017): https://www.cs.cornell.edu/jeh/book%20no%20so;utions%20March%202019.pdf
Additional References
General Readings
 [Top10] X. Wu et al., “Top 10 Algorithms in Data Mining,” Knowledge and Information Systems (2008) 14:1-37, also available at: http://www.cs.umd.edu/~samir/498/10Algorithms08.pdf
The Netflix Challenge

Blending 101 http://pragmatictheory.blogspot.hk/2008/07/blending101.html

Lessons from the Netflix Prize, by Yehuda Koren (slides)

Lessons from the Netflix Prize Challenge http://kdd.org/exploration_files/6Netflix1.pdf

Factorization meets the neighborhood: a multifaceted collaborative filtering model http://dl.acm.org/citation.cfm?id=1401944

Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=0c953e630cccd64d2ad5fbb09f08425e7f82b7a3

Modeling Relationships at Multiple Scales to Improve Accuracy of Large Recommender Systems http://dl.acm.org/ft_gateway.cfm?id=1281206

Matrix Factorization Techniques For Recommender Systems https://datajobs.com/datasciencerepo/RecommenderSystems[Netflix].pdf

Collaborative Filtering with Temporal Dynamics http://www.cc.gatech.edu/~zha/CSE8801/CF/kddfp074koren.pdf

The BellKor 2008 Solution to the Netflix Prize https://assetpdf.scinapse.io/prod/54392637/54392637.pdf

The BellKor solution to the Netflix Prize http://pzs.dstu.dp.ua/DataMining/recom/bibl/ProgressPrize2007_KorBell.pdf

Try This at Home by Simon Funk http://sifter.org/~simon/journal/20061211.html

Matrix Factorization: A Simple Tutorial and Implementation in Python http://www.quuxlabs.com/blog/2010/09/matrixfactorizationasimpletutorialandimplementationinpython/
Similar/Relevant Courses offered Elsewhere

[ANg] Machine Learning by Andrew Ng, Stanford CS229 Course Notes, https://cs229.stanford.edu/

[ChakDataStream] CS49 Data Stream Algorithms, Amit Chakrabarti, Dartmouth College, Fall 2011, http://www.cs.dartmouth.edu/~ac/Teach/CS49Fall11/Notes/lecnotes.pdf

[JLeskovecMMDS] Mining Massive Data Sets, by Jure Leskovec, Stanford Course CS246, http://www.stanford.edu/class/cs246/

[ASmolaUCB] Scalable Machine Learning, by Alex Smola, UC Berkeley Course Statistics 241B, CS281B, http://alex.smola.org/teaching/berkeley2012/

[TwitterUCB] Analyzing Big Data with Twitter, by Marti Hearst et al, UC Berkeley School of Information, Course i290, http://blogs.ischool.berkeley.edu/i290abdts12/

[TellAvivU13] Edith Cohen, Amos Fiat, Haim Kaplan, Paula Ta-Shma, Tova Milo, CS 0368.3239, Leveraging Big Data, Fall 2013/2014, TAU (Tel Aviv University) http://www.cohenwang.com/edith/bigdataclass2013

[JLinUMD] Data-Intensive Information Processing Applications, by Jimmy Lin, University of Maryland Course INFM718G/CMSC838G, http://lintool.github.io/UMDcourses/bigdata2010Spring/syllabus.html

[WCohenCMU] Machine Learning with Large Datasets, by William W. Cohen, CMU Course 10-605, http://curtis.ml.cmu.edu/w/courses/index.php/Machine_Learning_with_Large_Datasets_10605_in_Spring_2014

[ASmolaCMU] Introduction to Machine Learning, by Alex Smola, CMU Course 10-701, http://alex.smola.org/teaching/cmu201310701x/index.html

[ShaliziADAEPV] Cosma Rohilla Shalizi, “Advanced Data Analysis from an Elementary Point of View”, Cambridge University Press, 2014. Draft available for download from: http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

[AMoore] Statistical Data Mining Tutorials – Tutorial Slides by Andrew W. Moore, https://www.cs.cmu.edu/~./awm/tutorials/list.html

[HadoopHacking] Guide for Happy Hadoop Hacking, http://curtis.ml.cmu.edu/w/courses/index.php/Guide_for_Happy_Hadoop_Hacking

[Strange10] Gilbert Strang, Linear Algebra, 2010.
Cloud Computing

[DataCenter] The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition, by Luiz André Barroso and Urs Hölzle, Published by Morgan and Claypool, 2013, http://web.eecs.umich.edu/~mosharaf/Readings/DCComputer.pdf

[CloudData] Siba Mohammad, Sebastian Breß, Eike Schallehn, “Cloud Data Management: A Short Overview and Comparison of Current Approaches,” 24th GI-Workshop on Foundations of Databases, May 2012. https://ceurws.org/Vol850/paper_mohammad.pdf

[Tim Harford] Tim Harford, Big data: are we making a big mistake?
Data Stream Algorithms
 [ChakDataStream] CS49 Data Stream Algorithms, Amit Chakrabarti, Dartmouth College, Fall 2011, http://www.cs.dartmouth.edu/~ac/Teach/CS49Fall11/Notes/lecnotes.pdf
 [AggaDataStream] Charu C. Aggarwal (Ed.), Data Stream Models and Algorithms, Springer 2007.
 [PIndykStreaming] Streaming etc., by Piotr Indyk, a short course given at Rice University, 2009, http://people.csail.mit.edu/indyk/Rice/
 [PIndykSvH] Sketching via Hashing: from Heavy Hitters to Compressive Sensing to Sparse Fourier Transform, by Piotr Indyk, slides and writeup, PODS 2013.
 [IndykPhDOpen12] Piotr Indyk, MIT, “Data Stream Algorithms,” Open lectures for PhD Students in Computer Science 2012, http://phdopen.mimuw.edu.pl/index.php?page=z11w3#zal
 [JXu] A Tutorial on Network Data Streaming, by Jun (Jim) Xu, ACM Sigmetrics 2007, http://www.cc.gatech.edu/~jx/8803DS08/sigm07.pdf
 [GGR] Querying and Mining Data Streams – You Only Get One Look: A Tutorial by Minos Garofalakis, Johannes Gehrke, Rajeev Rastogi, VLDB 2002.
 [GaroRamaUCB] CS286 Implementation of Database Systems, UC Berkeley, Minos Garofalakis, Raghu Ramakrishnan, http://db.cs.berkeley.edu/cs286sp07/
 [SmolaUCB] Stat 260 Scalable Machine Learning of UC Berkeley, by Alex Smola, CMU, http://alex.smola.org/teaching/berkeley2012/streams.html
 [FM85] Probabilistic Counting Algorithms for Data Base Applications, Philippe Flajolet and G. Nigel Martin, Journal of Computer and System Sciences (JCSS), 1985.
 [DF03] Loglog counting of large cardinalities, M. Durand and P. Flajolet, European Symposium on Algorithms 2003
 [FFGM07] HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm, P. Flajolet, Éric Fusy, O. Gandouet, and F. Meunier, Conference on Analysis of Algorithms, 2007
 [Whang 90] A linear-time probabilistic counting algorithm for database applications, K.-Y. Whang, B. T. Vander-Zanden, and H. M. Taylor, ACM Transactions on Database Systems (TODS), 1990
 [AMS96] The space complexity of approximating the frequency moments, Noga Alon, Yossi Matias and Mario Szegedy, ACM STOC 1996, JCSS 1999
 [CH08] G. Cormode, M. Hadjieleftheriou, “Finding Frequent Items in Data Streams,” VLDB 2008
 [CM05] What’s hot and what’s not: tracking most frequent items dynamically, Graham Cormode and S. Muthukrishnan, ACM TODS ’05
 [ACHPWY12] Agarwal, Cormode, Huang, Phillips, Wei, and Yi, Mergeable Summaries, PODS 2012.
 [SpaceBound] Space-optimal Heavy Hitters with Strong Error Bounds, Radu Berinde, Piotr Indyk, Graham Cormode, Martin J. Strauss, TODS ’10, http://dimacs.rutgers.edu/~graham/pubs/papers/countersj.pdf
 [BloomSurvey] Bloom Filter survey by Broder & Mitzenmacher http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
 [CountMin] Count-Min sketch https://sites.google.com/site/countminsketch/
MapReduce and other Big Data Processing Platforms

[MMDSHadoopLabs] Mining Massive Data Sets: Hadoop Labs, by Daniel Templeton and Jure Leskovec, Stanford Course CS246H, http://snap.stanford.edu/class/cs2462017/cs246h.html

[PlatformsKentU] Advanced computing Platforms for Data Processing, by Ruoming Jin, Kent State University Course http://www.cs.kent.edu/~jin/Cloud12Spring/Cloud.html
 [BDAS] The Berkeley Data Analytics Stack (BDAS), https://amplab.cs.berkeley.edu/software/
 [TeraSort] TeraByte Sort on Apache Hadoop, http://sortbenchmark.org/YahooHadoop.pdf
Mining Massive Graphs and Graph-based Processing Platforms
 [PowerLaw] Zipf, Power-Laws and Pareto: A Ranking Tutorial, by L. Adamic, https://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
 [Pregel] G. Malewicz et al., “Pregel: A System for Large-Scale Graph Processing,” ACM SIGMOD 2010.
 [GraphLab2] Carlos Guestrin et al., “GraphLab 2: Parallel Machine Learning for Large-Scale Natural Graphs,” NIPS Big Learning Workshop 2011.
 [GraphLab1] Yucheng Low, Joseph Gonzalez et al, “GraphLab: A New Framework for Parallel Machine Learning,” UAI 2010.
 [PowerGraph] Joseph Gonzalez et al., “PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs,” OSDI 2012.
Locality-Sensitive Hashing

[PIndykLSH] Locality-Sensitive Hashing (LSH) Algorithm and Implementation (E2LSH), by Piotr Indyk, http://web.mit.edu/andoni/www/LSH/index.html

[ZG] Reza Bosagh Zadeh and Ashish Goel, “Dimension Independent Similarity Computation,” version 4, May 2013. http://arxiv.org/abs/1206.2082
Dimension Reduction

[SVD] Notes on Singular Value Decomposition by Edo Liberty of Yahoo/Yale, which contain proofs that (i) an SVD exists and is unique for any matrix, and (ii) the SVD gives the best low-rank approximation

[GuruswamiKannan] Notes on “Singular Value Decomposition” (Chapter 4 of http://www.cs.cmu.edu/~venkatg/teaching/CStheoryinfoage/hopcroftkannanfeb2012.pdf), from CMU Course 15-496/15-859X: Computer Science Theory for Information Age http://www.cs.cmu.edu/~venkatg/teaching/CStheoryinfoage/, Spring 2012.

[POWERMETHOD] “Power Method to compute Singular/ Eigenvalues and Eigenvectors”
 [SVDPCA] Rasmus Elsborg Madsen, Lars Kai Hansen and Ole Winther, “Singular Value Decomposition and Principal Component Analysis,” Feb 2004, http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=4000
Please note that there are a few typos in this note, e.g.:
 On the 1st line of page 4, the eigenvectors should be called the principal AXES or principal DIRECTIONS, NOT principal components.
 3 lines above Eq. 13 on page 4, it should be “n ≫ m” instead of “n ≪ m”.
 This note also uses a different convention for the original input matrix (as compared to our class notes): here, each row of the original input data matrix corresponds to a FEATURE (or Attribute) whereas each column corresponds to a datapoint.

[LSaul] Nonlinear Dimension Reduction – a Tutorial by Lawrence Saul, NIPS 2005, http://www.robots.ox.ac.uk/~cvrg/michaelmas2007/nips05_nldr.pdf
 [ANgCS229PCA] Machine Learning by Andrew Ng, Stanford CS229 Course Notes on PCA: https://cs229.stanford.edu/notes2020spring/cs229notes10.pdf
 [2DSVD] Two-Dimensional Singular Value Decomposition (2DSVD) for 2D Maps and Images
Recommendation Systems

[Netflix09] Yehuda Koren, Robert Bell and Chris Volinsky, “Matrix Factorization Techniques for Recommender Systems,” IEEE Computer, August 2009.

[KorenTalk] Yehuda Koren, “Chasing $1,000,000: How We Won the Netflix Progress Prize,” pages 4 to 12

[Mahout] Apache Mahout: Scalable Machine Learning and Data Mining, http://mahout.apache.org
Gradient Descent
 [Pedregosa18] Fabian Pedregosa, “A birds-eye view of optimization algorithms”, November 2018.
 This webpage provides a nice, interactive visualization of how GD and SGD behave under different settings, e.g. learning rate/step size.
 [Sra18] Suvrit Sra, Lecture 25: Stochastic Gradient Descent, 2018.
 Prof. Sra gave a nice one-dimensional example (starting at 22:50 of 53:03) to illustrate why SGD works so well at the beginning stage: it moves in the right direction towards the optimal point even though only ONE random datapoint is used to “compute” the required direction of movement.
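
 The intuition can be checked numerically with a minimal sketch. The objective below, a one-dimensional least-squares sum f(x) = 1/2 * sum_i (a_i*x - b_i)^2, along with the synthetic data, step size, and function names, are assumptions chosen for illustration and are not taken from the lecture. Each term has its own minimizer b_i/a_i. When x lies outside the interval spanned by those minimizers, every single-sample gradient has the same sign as the full gradient, so one random datapoint already points the right way. Inside that interval the signs disagree, which is where SGD starts to fluctuate.

```python
import random

def make_data(n=20, seed=0):
    # Synthetic data (an assumption for illustration): a_i in [1, 2], b_i in [-1, 1].
    rng = random.Random(seed)
    a = [rng.uniform(1.0, 2.0) for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return a, b

def grad_i(x, a_i, b_i):
    # Gradient of the single term 0.5 * (a_i * x - b_i)**2.
    return a_i * (a_i * x - b_i)

def full_grad(x, a, b):
    # Gradient of the full objective: the sum of the per-term gradients.
    return sum(grad_i(x, ai, bi) for ai, bi in zip(a, b))

def sign_agreement(x, a, b):
    # Fraction of single-sample gradients whose sign matches the full gradient.
    g = full_grad(x, a, b)
    agree = sum(1 for ai, bi in zip(a, b) if grad_i(x, ai, bi) * g > 0)
    return agree / len(a)

def sgd(x0, a, b, lr=0.05, steps=200, seed=1):
    # Plain SGD: each step uses ONE randomly chosen datapoint.
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        i = rng.randrange(len(a))
        x -= lr * grad_i(x, a[i], b[i])
    return x

a, b = make_data()
minimizers = [bi / ai for ai, bi in zip(a, b)]
lo, hi = min(minimizers), max(minimizers)

far = hi + 5.0            # well outside the interval of per-term minimizers
mid = (lo + hi) / 2.0     # inside the interval
x_final = sgd(10.0, a, b)

print("agreement far from the optimum:", sign_agreement(far, a, b))
print("agreement inside the interval: ", sign_agreement(mid, a, b))
print("SGD iterate after 200 steps:   ", x_final, "interval:", (lo, hi))
```

 With this setup the agreement far from the optimum is exactly 1.0 by construction, while inside the interval it is strictly below 1.0. The SGD iterate started at x = 10 ends up in or very near the interval, after which a constant step size keeps it bouncing among the per-term minimizers rather than settling.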