References
Textbooks/Recommended Readings
- [LLM] Large Language Models: A Deep Dive. Bridging Theory and Practice. Uday Kamath, et al. Free access via CUHK IP: https://link.springer.com/book/10.1007/978-3-031-65647-7
- [ORFM] On the Opportunities and Risks of Foundation Models. Rishi Bommasani, et al. https://arxiv.org/abs/2108.07258, https://crfm.stanford.edu/assets/report.pdf
- [SparksAGI] Sparks of Artificial General Intelligence: Early experiments with GPT-4. Sébastien Bubeck, et al. https://arxiv.org/abs/2303.12712
- [GPT4TechReport] GPT-4 Technical Report. OpenAI, et al. https://arxiv.org/abs/2303.08774
- [GPT4SysCard] GPT-4 System Card. OpenAI. https://cdn.openai.com/papers/gpt-4-system-card.pdf
- [8LLM] Eight Things to Know about Large Language Models. Samuel R. Bowman. https://arxiv.org/abs/2304.00612
- [ZeRO] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Samyam Rajbhandari, et al. https://arxiv.org/abs/1910.02054
- [ScalingNLM] Scaling Laws for Neural Language Models. Jared Kaplan, et al. https://arxiv.org/abs/2001.08361
- [ScalingAGM] Scaling Laws for Autoregressive Generative Modeling. Tom Henighan, et al. https://arxiv.org/abs/2010.14701
- [FewShot] Language Models are Few-Shot Learners. Tom B. Brown, et al. https://arxiv.org/abs/2005.14165
- [Attention] Attention Is All You Need. Ashish Vaswani, et al. https://arxiv.org/abs/1706.03762
- [EmergentAbi] Emergent Abilities of Large Language Models. Jason Wei, et al. https://arxiv.org/abs/2206.07682
- [IncontextBayesian] An Explanation of In-context Learning as Implicit Bayesian Inference. Sang Michael Xie, et al. https://arxiv.org/abs/2111.02080
- [FM4Decision] Foundation Models for Decision Making: Problems, Methods, and Opportunities. Sherry Yang, et al. https://arxiv.org/abs/2303.04129
- [PredicFutrueAI] Does One Large Model Rule Them All? Prediction on the Future AI Ecosystem. Maithra Raghu, et al. https://maithraraghu.com/blog/2023/does-one-model-rule-them-all/
- [ALM] Augmented Language Models: a Survey. Grégoire Mialon, et al. https://arxiv.org/abs/2302.07842
- [DL] Deep Learning. Ian Goodfellow, et al. MIT Press. https://mitpress.mit.edu/9780262035613/deep-learning/
- [RL] Reinforcement Learning: An Introduction. Richard S. Sutton, et al. MIT Press. https://mitpress.mit.edu/9780262039246/reinforcement-learning/
Course Material
Overview of Foundation Models in AI; Pre-Transformer Era Deep Learning Models
- [LLMintro] The Busy Person's Intro to LLMs, by Andrej Karpathy. https://www.youtube.com/watch?v=zjkBMFhNj_g&t=418s
- [Word2Vec] Efficient Estimation of Word Representations in Vector Space. Tomas Mikolov, et al. https://arxiv.org/pdf/1301.3781
- [Word2Vec-Note] Note on Word2Vec, Stanford CS224n. https://web.stanford.edu/class/cs224n/readings/cs224n_winter2023_lecture1_notes_draft.pdf
- [Word2Vec-Explained] Word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. Yoav Goldberg, Omer Levy. https://arxiv.org/abs/1402.3722
- [seq2seqNMT] Sequence to Sequence Learning with Neural Networks. Ilya Sutskever, et al. https://arxiv.org/pdf/1409.3215.pdf
- [BERT] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, et al. https://arxiv.org/abs/1810.04805
- [illustrated-BERT] The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning). https://jalammar.github.io/illustrated-bert/
- [NeurProb] A Neural Probabilistic Language Model. Yoshua Bengio, et al. https://dl.acm.org/doi/10.5555/944919.944966
- [NeurNetwork] Background on Neural Networks (course series from Andrej Karpathy). https://github.com/karpathy/nn-zero-to-hero
- [ScalingInterpre] Scaling Laws and Interpretability of Learning from Repeated Data. Danny Hernandez, et al. https://arxiv.org/abs/2205.10487
The Transformer Architecture
- [VisualGPT] But what is a GPT? Visual intro to Transformers. Deep Learning, Chapter 5. https://www.3blue1brown.com/lessons/gpt
- [VisualAtt] Visualizing Attention, a Transformer's Heart. Deep Learning, Chapter 6. https://www.3blue1brown.com/lessons/attention
- [VisualMLP] How might LLMs store facts. Deep Learning, Chapter 7. https://www.3blue1brown.com/lessons/mlp
- [VisualTrans] Big Techday 24: Visualizing Transformers. Grant Sanderson. https://www.youtube.com/watch?v=rZTV_1TcVyY
- [buildGPT] Let's build GPT: from scratch, in code, spelled out. By Andrej Karpathy. https://www.youtube.com/watch?v=kCc8FmEb1nY&t=3286s
- [illustrated-Transformer] The Illustrated Transformer. https://jalammar.github.io/illustrated-transformer/
- [WhatTransformer] But What are Transformers? Prof. Jia-Bin Huang of UMD. https://www.youtube.com/watch?v=rcWMRA9E5RI
- [RotaryPosiEmbed] How Rotary Position Embedding Supercharges Modern LLMs. Prof. Jia-Bin Huang of UMD. https://www.youtube.com/watch?v=SMBkImDWOyQ
- [SLP] Large Language Models. Speech and Language Processing, Chapter 10. https://web.stanford.edu/~jurafsky/slp3/10.pdf
Efficient Transformer Architectures
- [Sys4FM] Christopher Ré, "Systems for Foundation Models, and Foundation Models for Systems," keynote for NeurIPS 2023. https://nips.cc/virtual/2023/invited-talk/73990
- [MoD] Mixture of Experts (MoE) paradigm and the Switch Transformer. William Fedus, talk for Stanford CS25 V1, Fall 2021. https://www.youtube.com/watch?v=U8J32Z3qV8s
- [SwitchTransformers] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. William Fedus, Barret Zoph, Noam Shazeer, Jan 2021. https://arxiv.org/abs/2101.03961
- [Mistral7B] Mistral 7B. Albert Q. Jiang, et al., Oct 2023. https://arxiv.org/pdf/2310.06825
- [MixtralExperts] Mixtral of Experts. Albert Q. Jiang, et al., Jan 2024. https://arxiv.org/abs/2401.04088
- [DemystifyingMoE] Demystifying Mixtral of Experts. Albert Q. Jiang, talk for Stanford CS25 V4, April 2024. https://www.youtube.com/watch?v=RcJ1YXHLv5o
- [DeepSeekMoE] DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. Damai Dai, et al., Jan 2024. https://arxiv.org/abs/2401.06066
- [LongContextExtension] A Controlled Study on Long Context Extension and Generalization in LLMs. Yi Lu, et al. https://arxiv.org/pdf/2409.12181; video by Prof. Alexander (Sasha) Rush of Cornell: https://www.youtube.com/watch?v=dc4chADushM
- [FlashAttention] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Tri Dao, et al. https://arxiv.org/abs/2205.14135
Beyond Transformers: Alternative Architectures for LLMs
- [SSM] On the Tradeoffs of State Space Models. Talk by Prof. Albert Gu (author of Mamba) at the Simons Institute: https://simons.berkeley.edu/talks/albert-gu-carnegie-mellon-university-2024-09-27; same talk at ICML 2024: https://icml.cc/virtual/2024/39087
- [Post-TransformerArch] 2024 in Post-Transformer Architectures: State Space Models, RWKV [Latent Space LIVE! @ NeurIPS 2024]. https://www.youtube.com/watch?v=LPe6iC73lrc, https://www.latent.space/p/2024-post-transformers
- [Mamba] Mamba, Mamba-2 and Post-Transformer Architectures for Generative AI, with Albert Gu. https://www.youtube.com/watch?v=yceNl9C6Ir0
- [introRWKV&WorldToken] Beyond Transformers - Intro to RWKV Architecture & The World Tokenizer. Eugene Cheah & Harrison Vanderbyl, Dec 2023. https://www.youtube.com/watch?v=I-HMKky7Qsw
- [Eagle7B&RWKVv5] Performance of Eagle-7B, an RWKV-v5 Architecture, Jan 2024. https://www.youtube.com/watch?v=gHdRgfmAVIw
- [RWKVv7] Latest model as of Feb 2025: RWKV-v7 (Goose). https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v7/rwkv_v7_demo.py
Data Preparation and Pretraining of Foundation Models
- [Data] Data for Training FMs, Stanford CS324. https://stanford-cs324.github.io/winter2022/lectures/data/
- [DataCentric] Foundation Models are Entering their Data-Centric Era. https://hazyresearch.stanford.edu/blog/2022-10-11-datacentric-fms
- [C4] Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. Jesse Dodge, et al. https://arxiv.org/pdf/2104.08758
- [PretrainCS224n] Pretraining, Stanford CS224n. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture9-pretraining.pdf
- [LLaMa] LLaMA: Open and Efficient Foundation Language Models. Hugo Touvron, et al. https://arxiv.org/abs/2302.13971
Infrastructure for Parallel/Distributed Training of Foundation Models
- [Megatron] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. Mohammad Shoeybi, et al. https://arxiv.org/abs/1909.08053
- [DeepSpeedZeRO] DeepSpeed and ZeRO. https://www.deepspeed.ai/tutorials/zero/
- [OPT] Training Process, Sec. 2.5 of OPT: Open Pre-trained Transformer Language Models. Susan Zhang, et al. https://arxiv.org/abs/2205.01068; also see the full logbook of training.
Emerging Behavior and Capabilities; Adaptation and Alignment of Foundation Models for New Tasks and Data Domains
- [UnderstandInContext] How does in-context learning work? A framework for understanding the differences from traditional supervised learning. https://ai.stanford.edu/blog/understanding-incontext/
- [ZeroShot] Finetuned Language Models Are Zero-Shot Learners. Jason Wei, et al. https://arxiv.org/abs/2109.01652
- [UnsuperMulti] Language Models are Unsupervised Multitask Learners. Alec Radford, et al. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- [CoT] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Jason Wei, et al. https://arxiv.org/abs/2201.11903
- [HumanFeedback] Training language models to follow instructions with human feedback. Long Ouyang, et al. https://arxiv.org/abs/2203.02155
- [PromptEngg] Prompt Engineering. https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/
- [LLaMa2] Llama 2: Open Foundation and Fine-Tuned Chat Models. Hugo Touvron, et al. https://arxiv.org/abs/2307.09288
- [IntrepreTrans] Interpretability with Transformers, by Chris Olah. https://www.youtube.com/live/pC4zRb_5noQ?t=26s
- [LoRA] LoRA: Low-Rank Adaptation of Large Language Models. Edward J. Hu, et al. https://arxiv.org/abs/2106.09685
- [MEND] Fast Model Editing at Scale. Eric Mitchell, et al. https://arxiv.org/abs/2110.11309
Teach LLMs to Reason
- [CoTw/oPrompt] Chain-of-Thought Reasoning Without Prompting. Xuezhi Wang, et al. https://arxiv.org/abs/2402.10200
- [LLMSelfCorrect] Large Language Models Cannot Self-Correct Reasoning Yet. Jie Huang, et al. https://arxiv.org/abs/2310.01798
- [PremiseOrder] Premise Order Matters in Reasoning with Large Language Models. Xinyun Chen, et al. https://arxiv.org/abs/2402.08939
- [CoTSerialProb] Chain of Thought Empowers Transformers to Solve Inherently Serial Problems. Zhiyuan Li, et al. https://arxiv.org/abs/2402.12875
- [SuperAlign] Introducing Superalignment. https://openai.com/index/introducing-superalignment/
- [LearnReason] Learning to Reason, Insights from Language Modeling, by Noah Goodman. https://www.youtube.com/watch?v=YR9EztOF0R8
- [MisguidAtt] Misguided Attention. https://github.com/cpldcpu/MisguidedAttention
- [MemoReason] On Memorization of Large Language Models in Logical Reasoning. Chulin Xie, et al. https://arxiv.org/abs/2410.23123
Augmentation for Foundation Models
- [RAG] Retrieval-Augmented Generation for Large Language Models: A Survey. Yunfan Gao, et al. https://arxiv.org/abs/2312.10997
- [GraphRAG] GraphRAG. https://www.microsoft.com/en-us/research/project/graphrag/
- [GraphFoundaModel] Towards Graph Foundation Models: A Survey and Beyond. Jiawei Liu, et al. https://arxiv.org/pdf/2310.11829
Applications of Foundation Models; Beyond LLMs
- [ProgSynthesis] Program Synthesis with Large Language Models. Jacob Austin, et al. https://arxiv.org/pdf/2108.07732
- [AlphaCode] Competition-level code generation with AlphaCode. Yujia Li, et al. https://www.science.org/doi/full/10.1126/science.abq1158
- [Codex] Evaluating Large Language Models Trained on Code. Mark Chen, et al. https://arxiv.org/abs/2107.03374
- [ImageTrans] Image Transformer. Niki Parmar, et al. https://arxiv.org/pdf/1802.05751
- [TransferableVisual] Learning Transferable Visual Models From Natural Language Supervision. Alec Radford, et al. https://arxiv.org/abs/2103.00020
- [Text2Image] Zero-Shot Text-to-Image Generation. Aditya Ramesh, et al. https://arxiv.org/abs/2102.12092
- [TransferLimit] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Colin Raffel, et al. https://colinraffel.com/publications/jmlr2020exploring.pdf
- [Flamingo] Flamingo: a Visual Language Model for Few-Shot Learning. Jean-Baptiste Alayrac, et al. https://arxiv.org/abs/2204.14198
- [DecisionTrans] Decision Transformer: Reinforcement Learning via Sequence Modeling. Lili Chen, et al. https://arxiv.org/abs/2106.01345, https://www.youtube.com/watch?v=83QN9S-0I84, https://www.youtube.com/watch?v=w4Bw8WYL8Ps
- [GR00T] Project GR00T: A Blueprint for Generalist Robotics. Jim Fan, NVIDIA. https://www.youtube.com/live/Qhxr0uVT2zs
- [Voyager] Voyager: An Open-Ended Embodied Agent with Large Language Models. Guanzhi Wang, et al. https://voyager.minedojo.org/
- [Eureka] Eureka: Human-Level Reward Design via Coding Large Language Models. Jason Ma, et al. https://eureka-research.github.io
- [DrEureka] DrEureka: Language Model Guided Sim-To-Real Transfer. Jason Ma, et al. https://eureka-research.github.io/dr-eureka/
LLM Agents and Agentic Frameworks
- [ReAct] ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao, et al. https://arxiv.org/abs/2210.03629
- [CompoundAI] The Shift from Models to Compound AI Systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
- [FutureCompoundAI] Large Language Models Get the Hype, but Compound Systems Are the Future of AI. https://www.youtube.com/watch?v=vRTcE19M-KE&t=959s&ab_channel=StanfordOnline
- [GoogleAgent] "Agents" whitepaper from Google, which describes Google's approach to supporting agentic applications. https://drive.google.com/file/d/1oEjiRCTbd54aSdB_eEe3UShxLBWK9xkt/view
- [MicrosoftAutoGen] Microsoft's open-source framework AutoGen. http://llmagents-learning.org/slides/autogen.pdf, https://arxiv.org/pdf/2308.08155, https://microsoft.github.io/autogen/0.2/
- [Smolagents] Smolagents: simple agents that write actions in code. This framework emphasizes asking LLM agents to generate code to drive their actions. https://huggingface.co/docs/smolagents/index
Paradigm Shift in the Development and Adoption of AI Applications/Systems due to Foundation Models
- [DSPy] DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. Omar Khattab, et al. https://arxiv.org/abs/2310.03714, https://dspy.ai/, https://github.com/stanfordnlp/dspy, http://llmagents-learning.org/slides/dspy_lec.pdf, https://www.youtube.com/live/JEMYuzrKLUw
- [LangChain] LangChain for LLM Application Development. https://learn.deeplearning.ai/courses/langchain/lesson/1/introduction, https://github.com/afondiel/LangChain-For-LLM-Application-Dev-DeepLearningAI?tab=readme-ov-file
- [OpenAPIPrompt] Best practices for prompt engineering with the OpenAI API. https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api
AI Safety and Security
- [MLSys79] AI Safety, RLHF, and Self-Supervision. Jared Kaplan, Stanford MLSys #79. https://www.youtube.com/watch?v=fqC3D-zNJUM&ab_channel=StanfordMLSysSeminars
- [ConstitutionalAI] Constitutional AI: Harmlessness from AI Feedback. Yuntao Bai, et al. https://arxiv.org/pdf/2212.08073
- [AgentSafety] Towards Building Safe & Trustworthy AI Agents and A Path for Science- and Evidence-based AI Policy. https://rdi.berkeley.edu/llm-agents/assets/dawn-agent-safety.pdf