References
Textbooks/Recommended Readings
- [LLM] Large Language Models: A Deep Dive. Bridging Theory and Practice. Uday Kamath, et al. Free access via CUHK IP: https://link.springer.com/book/10.1007/978-3-031-65647-7
- [ORFM] On the Opportunities and Risks of Foundation Models. Rishi Bommasani, et al. https://arxiv.org/abs/2108.07258, https://crfm.stanford.edu/assets/report.pdf
- [SparksAGI] Sparks of Artificial General Intelligence: Early experiments with GPT-4. Sébastien Bubeck, et al. https://arxiv.org/abs/2303.12712
- [GPT4TechReport] GPT-4 Technical Report. OpenAI, et al. https://arxiv.org/abs/2303.08774
- [GPT4SysCard] GPT-4 System Card. OpenAI. https://cdn.openai.com/papers/gpt-4-system-card.pdf
- [8LLM] Eight Things to Know about Large Language Models. Samuel R. Bowman. https://arxiv.org/abs/2304.00612
- [ZeRO] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Samyam Rajbhandari, et al. https://arxiv.org/abs/1910.02054
- [ScalingNLM] Scaling Laws for Neural Language Models. Jared Kaplan, et al. https://arxiv.org/abs/2001.08361
- [ScalingAGM] Scaling Laws for Autoregressive Generative Modeling. Tom Henighan, et al. https://arxiv.org/abs/2010.14701
- [FewShot] Language Models are Few-Shot Learners. Tom B. Brown, et al. https://arxiv.org/abs/2005.14165
- [Attention] Attention Is All You Need. Ashish Vaswani, et al. https://arxiv.org/abs/1706.03762
- [EmergentAbi] Emergent Abilities of Large Language Models. Jason Wei, et al. https://arxiv.org/abs/2206.07682
- [IncontextBayesian] An Explanation of In-context Learning as Implicit Bayesian Inference. Sang Michael Xie, et al. https://arxiv.org/abs/2111.02080
- [FM4Decision] Foundation Models for Decision Making: Problems, Methods, and Opportunities. Sherry Yang, et al. https://arxiv.org/abs/2303.04129
- [PredicFutrueAI] Does One Large Model Rule Them All? Prediction on the Future AI Ecosystem. Maithra Raghu, et al. https://maithraraghu.com/blog/2023/does-one-model-rule-them-all/
- [ALM] Augmented Language Models: a Survey. Grégoire Mialon, et al. https://arxiv.org/abs/2302.07842
- [DL] Deep Learning. Ian Goodfellow, et al. MIT Press. https://mitpress.mit.edu/9780262035613/deep-learning/
- [RL] Reinforcement Learning: An Introduction. Richard S. Sutton, et al. MIT Press. https://mitpress.mit.edu/9780262039246/reinforcement-learning/
Course Material
Overview of Foundation Models in AI; Pre-Transformer Era Deep Learning Models
- [LLMintro] The Busy Person's Intro to LLMs, by Andrej Karpathy. https://www.youtube.com/watch?v=zjkBMFhNj_g&t=418s
- [Word2Vec] Efficient Estimation of Word Representations in Vector Space. Tomas Mikolov, et al. https://arxiv.org/pdf/1301.3781
- [Word2Vec-Note] Note on Word2Vec, Stanford CS224n. https://web.stanford.edu/class/cs224n/readings/cs224n_winter2023_lecture1_notes_draft.pdf
- [Word2Vec-Explained] Word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. Yoav Goldberg, Omer Levy. https://arxiv.org/abs/1402.3722
- [seq2seqNMT] Sequence to Sequence Learning with Neural Networks. Ilya Sutskever, et al. https://arxiv.org/pdf/1409.3215.pdf
- [BERT] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, et al. https://arxiv.org/abs/1810.04805
- [illustrated-BERT] The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning). https://jalammar.github.io/illustrated-bert/
- [NeurProb] A Neural Probabilistic Language Model. Yoshua Bengio, et al. https://dl.acm.org/doi/10.5555/944919.944966
- [NeurNetwork] Background on Neural Networks (course series from Andrej Karpathy). https://github.com/karpathy/nn-zero-to-hero
- [ScalingInterpre] Scaling Laws and Interpretability of Learning from Repeated Data. Danny Hernandez, et al. https://arxiv.org/abs/2205.10487
The Transformer Architecture
- [VisualGPT] But what is a GPT? Visual intro to Transformers. Deep Learning, Chapter 5. https://www.3blue1brown.com/lessons/gpt
- [VisualAtt] Visualizing Attention, a Transformer's Heart. Deep Learning, Chapter 6. https://www.3blue1brown.com/lessons/attention
- [VisualMLP] How might LLMs store facts. Deep Learning, Chapter 7. https://www.3blue1brown.com/lessons/mlp
- [VisualTrans] Big Techday 24: Visualizing Transformers. Grant Sanderson. https://www.youtube.com/watch?v=rZTV_1TcVyY
- [buildGPT] Let's build GPT: from scratch, in code, spelled out. By Andrej Karpathy. https://www.youtube.com/watch?v=kCc8FmEb1nY&t=3286s
- [illustrated-Transformer] The Illustrated Transformer. https://jalammar.github.io/illustrated-transformer/
- [WhatTransformer] But What are Transformers? Prof. Jia-Bin Huang of UMD. https://www.youtube.com/watch?v=rcWMRA9E5RI
- [RotaryPosiEmbed] How Rotary Position Embedding Supercharges Modern LLMs. Prof. Jia-Bin Huang of UMD. https://www.youtube.com/watch?v=SMBkImDWOyQ
- [SLP] Large Language Models. Speech and Language Processing, Chapter 10. https://web.stanford.edu/~jurafsky/slp3/10.pdf
Efficient Transformer Architectures
- [Sys4FM] Christopher Ré, "Systems for Foundation Models, and Foundation Models for Systems," keynote for NeurIPS 2023. https://nips.cc/virtual/2023/invited-talk/73990
- [MoD] Mixture of Experts (MoE) paradigm and the Switch Transformer. William Fedus, talk for Stanford CS25 V1, Fall 2021. https://www.youtube.com/watch?v=U8J32Z3qV8s
- [SwitchTransformers] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. William Fedus, Barret Zoph, Noam Shazeer, Jan 2021. https://arxiv.org/abs/2101.03961
- [Mistral7B] Mistral 7B. Albert Q. Jiang, et al., Oct 2023. https://arxiv.org/pdf/2310.06825
- [MixtralExperts] Mixtral of Experts. Albert Q. Jiang, et al., Jan 2024. https://arxiv.org/abs/2401.04088
- [DemystifyingMoE] Demystifying Mixtral of Experts. Albert Q. Jiang, talk for Stanford CS25 V4, April 2024. https://www.youtube.com/watch?v=RcJ1YXHLv5o
- [DeepSeekMoE] DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. Damai Dai, et al., Jan 2024. https://arxiv.org/abs/2401.06066
- [LongContextExtension] A Controlled Study on Long Context Extension and Generalization in LLMs. Yi Lu, et al. https://arxiv.org/pdf/2409.12181; video by Prof. Alexander (Sasha) Rush of Cornell: https://www.youtube.com/watch?v=dc4chADushM
- [FlashAttention] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Tri Dao, et al. https://arxiv.org/abs/2205.14135
Beyond Transformers: Alternative Architectures for LLMs
- [SSM] On the Tradeoffs of State Space Models. Talk by Prof. Albert Gu (author of Mamba) at the Simons Institute: https://simons.berkeley.edu/talks/albert-gu-carnegie-mellon-university-2024-09-27; same talk at ICML 2024: https://icml.cc/virtual/2024/39087
- [Post-TransformerArch] 2024 in Post-Transformer Architectures: State Space Models, RWKV [Latent Space LIVE! @ NeurIPS 2024]. https://www.youtube.com/watch?v=LPe6iC73lrc, https://www.latent.space/p/2024-post-transformers
- [Mamba] Mamba, Mamba-2 and Post-Transformer Architectures for Generative AI, with Albert Gu. https://www.youtube.com/watch?v=yceNl9C6Ir0
- [introRWKV&WorldToken] Beyond Transformers - Intro to RWKV Architecture & The World Tokenizer. Eugene Cheah & Harrison Vanderbyl, Dec 2023. https://www.youtube.com/watch?v=I-HMKky7Qsw
- [Eagle7B&RWKVv5] Performance of Eagle-7B, an RWKV-v5 Architecture, Jan 2024. https://www.youtube.com/watch?v=gHdRgfmAVIw
- [RWKVv7] Latest model as of Feb 2025: RWKV-v7 (Goose). https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v7/rwkv_v7_demo.py
Data Preparation and Pretraining of Foundation Models
- [Data] Data for Training FMs, Stanford CS324. https://stanford-cs324.github.io/winter2022/lectures/data/
- [DataCentric] Foundation Models are Entering their Data-Centric Era. https://hazyresearch.stanford.edu/blog/2022-10-11-datacentric-fms
- [C4] Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. Jesse Dodge, et al. https://arxiv.org/pdf/2104.08758
- [PretrainCS224n] Pretraining, Stanford CS224n. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture9-pretraining.pdf
- [LLaMa] LLaMA: Open and Efficient Foundation Language Models. Hugo Touvron, et al. https://arxiv.org/abs/2302.13971
Infrastructure for Parallel/Distributed Training of Foundation Models
- [Megatron] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. Mohammad Shoeybi, et al. https://arxiv.org/abs/1909.08053
- [DeepSpeedZeRO] DeepSpeed and ZeRO. https://www.deepspeed.ai/tutorials/zero/
- [OPT] Training Process, Sec. 2.5 of OPT: Open Pre-trained Transformer Language Models. Susan Zhang, et al. https://arxiv.org/abs/2205.01068; also see the full logbook of training.
Emerging Behavior and Capabilities; Adaptation and Alignment of Foundation Models for New Tasks and Data Domains
- [UnderstandInContext] How does in-context learning work? A framework for understanding the differences from traditional supervised learning. https://ai.stanford.edu/blog/understanding-incontext/
- [ZeroShot] Finetuned Language Models Are Zero-Shot Learners. Jason Wei, et al. https://arxiv.org/abs/2109.01652
- [UnsuperMulti] Language Models are Unsupervised Multitask Learners. Alec Radford, et al. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- [CoT] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Jason Wei, et al. https://arxiv.org/abs/2201.11903
- [HumanFeedback] Training language models to follow instructions with human feedback. Long Ouyang, et al. https://arxiv.org/abs/2203.02155
- [PromptEngg] Prompt Engineering. https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/
- [LLaMa2] Llama 2: Open Foundation and Fine-Tuned Chat Models. Hugo Touvron, et al. https://arxiv.org/abs/2307.09288
- [IntrepreTrans] Interpretability with Transformers, by Chris Olah. https://www.youtube.com/live/pC4zRb_5noQ?t=26s
- [LoRA] LoRA: Low-Rank Adaptation of Large Language Models. Edward J. Hu, et al. https://arxiv.org/abs/2106.09685
- [MEND] Fast Model Editing at Scale. Eric Mitchell, et al. https://arxiv.org/abs/2110.11309
Teach LLMs to Reason
- [CoTw/oPrompt] Chain-of-Thought Reasoning Without Prompting. Xuezhi Wang, et al. https://arxiv.org/abs/2402.10200
- [LLMSelfCorrect] Large Language Models Cannot Self-Correct Reasoning Yet. Jie Huang, et al. https://arxiv.org/abs/2310.01798
- [PremiseOrder] Premise Order Matters in Reasoning with Large Language Models. Xinyun Chen, et al. https://arxiv.org/abs/2402.08939
- [CoTSerialProb] Chain of Thought Empowers Transformers to Solve Inherently Serial Problems. Zhiyuan Li, et al. https://arxiv.org/abs/2402.12875
- [SuperAlign] Introducing Superalignment. https://openai.com/index/introducing-superalignment/
- [LearnReason] Learning to Reason, Insights from Language Modeling, by Noah Goodman. https://www.youtube.com/watch?v=YR9EztOF0R8
- [MisguidAtt] Misguided Attention. https://github.com/cpldcpu/MisguidedAttention
- [MemoReason] On Memorization of Large Language Models in Logical Reasoning. Chulin Xie, et al. https://arxiv.org/abs/2410.23123
Augmentation for Foundation Models
- [RAG] Retrieval-Augmented Generation for Large Language Models: A Survey. Yunfan Gao, et al. https://arxiv.org/abs/2312.10997
- [GraphRAG] GraphRAG. https://www.microsoft.com/en-us/research/project/graphrag/
- [GraphFoundaModel] Towards Graph Foundation Models: A Survey and Beyond. Jiawei Liu, et al. https://arxiv.org/pdf/2310.11829
Applications of Foundation Models; Beyond LLMs
- [ProgSynthesis] Program Synthesis with Large Language Models. Jacob Austin, et al. https://arxiv.org/pdf/2108.07732
- [AlphaCode] Competition-level code generation with AlphaCode. Yujia Li, et al. https://www.science.org/doi/full/10.1126/science.abq1158
- [Codex] Evaluating Large Language Models Trained on Code. Mark Chen, et al. https://arxiv.org/abs/2107.03374
- [ImageTrans] Image Transformer. Niki Parmar, et al. https://arxiv.org/pdf/1802.05751
- [TransferableVisual] Learning Transferable Visual Models From Natural Language Supervision. Alec Radford, et al. https://arxiv.org/abs/2103.00020
- [Text2Image] Zero-Shot Text-to-Image Generation. Aditya Ramesh, et al. https://arxiv.org/abs/2102.12092
- [TransferLimit] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Colin Raffel, et al. https://colinraffel.com/publications/jmlr2020exploring.pdf
- [Flamingo] Flamingo: a Visual Language Model for Few-Shot Learning. Jean-Baptiste Alayrac, et al. https://arxiv.org/abs/2204.14198
- [DecisionTrans] Decision Transformer: Reinforcement Learning via Sequence Modeling. Lili Chen, et al. https://arxiv.org/abs/2106.01345, https://www.youtube.com/watch?v=83QN9S-0I84, https://www.youtube.com/watch?v=w4Bw8WYL8Ps
- [GR00T] Project GR00T: A Blueprint for Generalist Robotics. Jim Fan, NVIDIA. https://www.youtube.com/live/Qhxr0uVT2zs
- [Voyager] Voyager: An Open-Ended Embodied Agent with Large Language Models. Guanzhi Wang, et al. https://voyager.minedojo.org/
- [Eureka] Eureka: Human-Level Reward Design via Coding Large Language Models. Jason Ma, et al. https://eureka-research.github.io
- [DrEureka] DrEureka: Language Model Guided Sim-To-Real Transfer. Jason Ma, et al. https://eureka-research.github.io/dr-eureka/
LLM Agents and Agentic Frameworks
- [ReAct] ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao, et al. https://arxiv.org/abs/2210.03629
- [CompoundAI] The Shift from Models to Compound AI Systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
- [FutureCompoundAI] Large Language Models Get the Hype, but Compound Systems Are the Future of AI. https://www.youtube.com/watch?v=vRTcE19M-KE&t=959s&ab_channel=StanfordOnline
- [GoogleAgent] "Agents" whitepaper from Google, which describes Google's approach to supporting agentic applications. https://drive.google.com/file/d/1oEjiRCTbd54aSdB_eEe3UShxLBWK9xkt/view
- [MicrosoftAutoGen] Microsoft's open-source framework AutoGen. http://llmagents-learning.org/slides/autogen.pdf, https://arxiv.org/pdf/2308.08155, https://microsoft.github.io/autogen/0.2/
- [Smolagents] Smolagents: simple agents that write actions in code. This framework emphasizes asking LLM agents to generate code to drive their actions. https://huggingface.co/docs/smolagents/index
Paradigm Shift in the Development and Adoption of AI Applications/Systems due to Foundation Models
- [DSPy] DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. Omar Khattab, et al. https://arxiv.org/abs/2310.03714, https://dspy.ai/, https://github.com/stanfordnlp/dspy, http://llmagents-learning.org/slides/dspy_lec.pdf, https://www.youtube.com/live/JEMYuzrKLUw
- [LangChain] LangChain for LLM Application Development. https://learn.deeplearning.ai/courses/langchain/lesson/1/introduction, https://github.com/afondiel/LangChain-For-LLM-Application-Dev-DeepLearningAI?tab=readme-ov-file
- [OpenAPIPrompt] Best practices for prompt engineering with the OpenAI API. https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api
AI Safety and Security
- [MLSys79] AI Safety, RLHF, and Self-Supervision. Jared Kaplan, Stanford MLSys #79. https://www.youtube.com/watch?v=fqC3D-zNJUM&ab_channel=StanfordMLSysSeminars
- [ConstitutionalAI] Constitutional AI: Harmlessness from AI Feedback. Yuntao Bai, et al. https://arxiv.org/pdf/2212.08073
- [AgentSafety] Towards Building Safe & Trustworthy AI Agents and A Path for Science- and Evidence-based AI Policy. https://rdi.berkeley.edu/llm-agents/assets/dawn-agent-safety.pdf