CS336: Language Modeling from Scratch

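This intensive 5-unit course offers a comprehensive, hands-on deep dive into the development of large language models (LLMs). Modeled on low-level systems courses, it goes beyond theoretical concepts and requires students to build a Transformer model from scratch. Across five large assignments, students work through the full model development lifecycle: implementing architecture components (tokenizer, optimizer), optimizing system performance (benchmarking and distributed training), managing data pipelines (cleaning and filtering), studying scaling laws, and performing model alignment (SFT and RL). It is a highly demanding, engineering-focused course. Participants must have solid Python programming skills, familiarity with PyTorch, and a strong foundation in calculus, linear algebra, probability, and machine learning. Students complete a large amount of coding with minimal scaffolding. The course emphasizes a deep understanding of the low-level principles of LLM engineering and favors manual implementation over calling external code or using AI assistance, to keep the learning rigorous. The systems-performance assignments require GPU compute.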

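Stanford's "CS336: Language Modeling from Scratch" is a highly regarded, rigorous course that teaches students how to build and optimize large language models (LLMs) from the ground up. It provides detailed video lectures and carefully designed assignments, and covers advanced techniques such as distributed training and alignment in depth. Discussion on Hacker News highlights the course's value for deep learning practitioners, though many commenters also note that it demands a large time commitment and has a steep learning curve. Despite the strict requirements, participants report a great sense of accomplishment from verifying every stage of the LLM pipeline by hand. Hardware requirements are a major topic of discussion. While the course staff provides enrolled students with high-end Blackwell GPUs, remote learners can also follow along on local hardware (an NVIDIA RTX 2060 or better, or an M-series Mac) or on affordable cloud servers. Teaching assistants actively encourage outside students to give feedback via GitHub and have acknowledged the need for better guidance on environment setup and memory management. The course is widely recommended for learners with a basic machine learning background who want a solid technical and engineering foundation in modern AI.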

Content

What is this course about?

Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm in which a single general-purpose system addresses a range of downstream tasks. As the fields of artificial intelligence (AI), machine learning (ML), and NLP continue to grow, a deep understanding of language models becomes essential for scientists and engineers alike. This course is designed to provide students with a comprehensive understanding of language models by walking them through the entire process of developing their own. Drawing inspiration from operating systems courses that create an entire operating system from scratch, we will lead students through every aspect of language model creation, including data collection and cleaning for pre-training, transformer model construction, model training, and evaluation before deployment.

Prerequisites

Proficiency in Python
The majority of class assignments will be in Python. Unlike most other AI classes, students will be given minimal scaffolding. The amount of code you will write will be at least an order of magnitude greater than for other classes. Therefore, being proficient in Python and software engineering is paramount.
Experience with deep learning and systems optimization
A significant part of the course will involve making neural language models run quickly and efficiently on GPUs across multiple machines. We expect students to have strong familiarity with PyTorch and to know basic systems concepts such as the memory hierarchy.
College Calculus, Linear Algebra (e.g. MATH 51, CME 100)
You should be comfortable understanding matrix/vector notation and operations.
Basic Probability and Statistics (e.g. CS 109 or equivalent)
You should know the basics of probabilities, Gaussian distributions, mean, standard deviation, etc.
Machine Learning (e.g. CS221, CS229, CS230, CS124, CS224N)
You should be comfortable with the basics of machine learning and deep learning.

Note that this is a 5-unit class. This is a very implementation-heavy class, so please allocate enough time for it.

Coursework

Assignments

Assignment 1: Basics
- Implement all of the components (tokenizer, model architecture, optimizer) necessary to train a standard Transformer language model.
- Train a minimal language model.
Assignment 2: Systems
- Profile and benchmark the model and layers from Assignment 1 using advanced tools, and optimize attention with your own Triton implementation of FlashAttention2.
- Build a memory-efficient, distributed version of the Assignment 1 model training code.
Assignment 3: Scaling
- Understand the function of each component of the Transformer.
- Query a training API to fit a scaling law to project model scaling.
Assignment 4: Data
- Convert raw Common Crawl dumps into usable pretraining data.
- Perform filtering and deduplication to improve model performance.
Assignment 5: Alignment and Reasoning RL
- Apply supervised finetuning and reinforcement learning to train LMs to reason when solving math problems.
- Optional Part 2: implement and apply safety alignment methods such as DPO.
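As a rough illustration of the scaling-law fit mentioned under Assignment 3, the sketch below fits a power law of the form loss = a * compute^(-b) by ordinary least squares in log-log space. The functional form, the function name `fit_power_law`, and the data points are assumptions made for illustration only; the data is synthetic (generated from a known power law), and the assignment handout may call for a different parameterization or fitting procedure.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) via least squares on (log compute, log loss).

    Hypothetical helper for illustration; returns the tuple (a, b).
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form slope and intercept of a simple linear regression.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic points generated from a known power law (not real measurements).
compute = [1e15, 1e16, 1e17, 1e18]
loss = [20.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)

# Extrapolate the fitted curve to a larger compute budget.
predicted_loss = a * 1e19 ** (-b)
```

Fitting in log-log space turns the power law into a straight line, so the exponent is simply the negated slope. On real training runs the points will be noisy, and extrapolations far beyond the measured range should be treated with caution.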

All (currently tentative) deadlines are listed in the schedule.

GPU compute for self-study

If you are following along at home, you can access GPU compute from a cloud provider to complete the assignments.

Here are a few options (public pricing for a single B200 GPU on March 28, 2026):

For convenience and to save money, we recommend debugging correctness of your implementation on CPU first and then using GPU(s) (with the count recommended in the assignments) for completing training runs (A1, A4, A5) or benchmarking GPU operations (A2).
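One way to follow the CPU-first workflow described above is to route device selection through a single switch, so the same script runs on a laptop while debugging and on a GPU for the real run. The sketch below is a minimal example assuming PyTorch as the framework; the function name `pick_device` and its `prefer_gpu` flag are hypothetical and not part of the course materials.

```python
import importlib.util


def pick_device(prefer_gpu: bool) -> str:
    """Return a device string: "cpu" while debugging, "cuda" for real runs.

    Hypothetical helper. Falls back to "cpu" when PyTorch is not installed
    or when no CUDA device is visible.
    """
    if not prefer_gpu:
        return "cpu"
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


# Debug correctness on CPU first, then flip the flag for training or benchmarking.
debug_device = pick_device(prefer_gpu=False)
run_device = pick_device(prefer_gpu=True)
```

The returned string can be passed wherever PyTorch accepts a device, for example `model.to(run_device)`.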

Honor code

Like all other classes at Stanford, we take the student Honor Code seriously. Please respect the following policies:

Collaboration: Study groups are allowed, but students must understand and complete their own assignments, and hand in one assignment per student. If you worked in a group, please put the names of the members of your study group at the top of your assignment. Please ask if you have any questions about the collaboration policy.
AI tools: Prompting LLMs such as ChatGPT is permitted for low-level programming questions or high-level conceptual questions about language models, but using them directly to solve the problem is prohibited. We strongly encourage you to disable AI autocomplete (e.g., Cursor Tab, GitHub Copilot) in your IDE when completing assignments (though non-AI autocomplete, e.g., autocompleting function names, is totally fine). We have found that AI autocomplete makes it much harder to engage deeply with the content. See the AI policy.
Existing code: Implementations of many of the things you will implement exist online. The handouts we give you will be self-contained, so you will not need to consult third-party code to produce your own implementation. You should therefore not look at any existing code unless otherwise specified in the handouts.

Submitting coursework

All coursework is submitted via Gradescope by the deadline. Do not submit your coursework via email.
If anything goes wrong, please ask a question in Slack or contact a course assistant.
You can submit as many times as you'd like until the deadline: we will only grade the last submission.
Partial work is better than not submitting any work.

Late days

Each student has 6 late days to use. A late day extends the deadline by 24 hours.
You can use up to 3 late days per assignment.
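The arithmetic of the late-day policy above can be sketched as follows. The two limits come from the policy text; the function name `extended_deadline`, its parameters, and the example date are hypothetical and used for illustration only.

```python
from datetime import datetime, timedelta

TOTAL_LATE_DAYS = 6       # per student, from the policy above
MAX_PER_ASSIGNMENT = 3    # per assignment, from the policy above


def extended_deadline(deadline: datetime, days_requested: int,
                      days_already_used: int) -> datetime:
    """Return the new deadline, or raise ValueError if the request breaks a limit.

    Hypothetical helper: each late day extends the deadline by 24 hours.
    """
    if days_requested < 0 or days_already_used < 0:
        raise ValueError("late-day counts must be non-negative")
    if days_requested > MAX_PER_ASSIGNMENT:
        raise ValueError("at most 3 late days per assignment")
    if days_already_used + days_requested > TOTAL_LATE_DAYS:
        raise ValueError("not enough late days remaining")
    return deadline + timedelta(hours=24 * days_requested)


# Example with a made-up deadline: two late days move it back by 48 hours.
original = datetime(2026, 4, 14, 23, 59)
new_deadline = extended_deadline(original, days_requested=2, days_already_used=0)
```

Under these limits, a student who has already used 4 late days could apply at most 2 more to any later assignment.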

Regrade requests

If you believe that the course staff made an objective error in grading, you may submit a regrade request on Gradescope within 3 days after the grades are released.

Sponsor

We would like to thank Modal for sponsoring compute for this class.