Toy implementations of the 30 foundational papers recommended by Ilya Sutskever

Original link: https://github.com/pageman/sutskever-30-implementations

This repository provides NumPy implementations of the 30 foundational deep learning papers recommended by Ilya Sutskever, a collection he described as covering 90% of what matters in deep learning. Each implementation prioritizes educational clarity: it avoids deep learning frameworks, uses synthetic data, and provides extensive visualizations and explanations. The papers are grouped into foundational concepts (RNNs, LSTMs, pruning), architectures and mechanisms (Transformers, ResNets, GNNs), advanced topics (VAEs, Neural Turing Machines), and theory and meta-learning (the MDL principle, Kolmogorov complexity). Highlights include implementations of character-level RNNs, LSTMs, AlexNet, ResNet, Transformers, and more recent advances such as retrieval-augmented generation (RAG). The project also digs into theoretical ideas such as Kolmogorov complexity and irreversibility, using the "Coffee Automaton" paper as an example. The code is designed for interactive learning through Jupyter notebooks, with a beginner-friendly learning path that guides readers through the core concepts. For anyone seeking a deeper understanding of deep learning fundamentals without the complexity of large frameworks, this is a valuable resource. All 30 papers are now complete!

Original article

Comprehensive toy implementations of the 30 foundational papers recommended by Ilya Sutskever

This repository contains detailed, educational implementations of the papers from Ilya Sutskever's famous reading list - the collection he told John Carmack would teach you "90% of what matters" in deep learning.

Progress: 30/30 papers (100%) - COMPLETE! 🎉

Each implementation:

  • ✅ Uses only NumPy (no deep learning frameworks) for educational clarity
  • ✅ Includes synthetic/bootstrapped data for immediate execution
  • ✅ Provides extensive visualizations and explanations
  • ✅ Demonstrates core concepts from each paper
  • ✅ Runs in Jupyter notebooks for interactive learning
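# Clone the repository (assuming the standard GitHub URL from the project link above)
git clone https://github.com/pageman/sutskever-30-implementations.git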
# Navigate to the directory
cd sutskever-30-implementations

# Install dependencies
pip install numpy matplotlib scipy

# Run any notebook
jupyter notebook 02_char_rnn_karpathy.ipynb

Foundational Concepts (Papers 1-5)

| # | Paper | Notebook | Key Concepts |
|---|-------|----------|--------------|
| 1 | The First Law of Complexodynamics | 01_complexity_dynamics.ipynb | Entropy, Complexity Growth, Cellular Automata |
| 2 | The Unreasonable Effectiveness of RNNs | 02_char_rnn_karpathy.ipynb | Character-level models, RNN basics, Text generation |
| 3 | Understanding LSTM Networks | 03_lstm_understanding.ipynb | Gates, Long-term memory, Gradient flow |
| 4 | RNN Regularization | 04_rnn_regularization.ipynb | Dropout for sequences, Variational dropout |
| 5 | Keeping Neural Networks Simple | 05_neural_network_pruning.ipynb | MDL principle, Weight pruning, 90%+ sparsity |

Architectures & Mechanisms (Papers 6-15)

| # | Paper | Notebook | Key Concepts |
|---|-------|----------|--------------|
| 6 | Pointer Networks | 06_pointer_networks.ipynb | Attention as pointer, Combinatorial problems |
| 7 | ImageNet/AlexNet | 07_alexnet_cnn.ipynb | CNNs, Convolution, Data augmentation |
| 8 | Order Matters: Seq2Seq for Sets | 08_seq2seq_for_sets.ipynb | Set encoding, Permutation invariance, Attention pooling |
| 9 | GPipe | 09_gpipe.ipynb | Pipeline parallelism, Micro-batching, Re-materialization |
| 10 | Deep Residual Learning (ResNet) | 10_resnet_deep_residual.ipynb | Skip connections, Gradient highways |
| 11 | Dilated Convolutions | 11_dilated_convolutions.ipynb | Receptive fields, Multi-scale |
| 12 | Neural Message Passing (GNNs) | 12_graph_neural_networks.ipynb | Graph networks, Message passing |
| 13 | Attention Is All You Need | 13_attention_is_all_you_need.ipynb | Transformers, Self-attention, Multi-head |
| 14 | Neural Machine Translation | 14_bahdanau_attention.ipynb | Seq2seq, Bahdanau attention |
| 15 | Identity Mappings in ResNet | 15_identity_mappings_resnet.ipynb | Pre-activation, Gradient flow |

Advanced Topics (Papers 16-22)

| # | Paper | Notebook | Key Concepts |
|---|-------|----------|--------------|
| 16 | Relational Reasoning | 16_relational_reasoning.ipynb | Relation networks, Pairwise functions |
| 17 | Variational Lossy Autoencoder | 17_variational_autoencoder.ipynb | VAE, ELBO, Reparameterization trick |
| 18 | Relational RNNs | 18_relational_rnn.ipynb | Relational memory, Multi-head self-attention, Manual backprop (~1100 lines) |
| 19 | The Coffee Automaton | 19_coffee_automaton.ipynb | Irreversibility, Entropy, Arrow of time, Landauer's principle |
| 20 | Neural Turing Machines | 20_neural_turing_machine.ipynb | External memory, Differentiable addressing |
| 21 | Deep Speech 2 (CTC) | 21_ctc_speech.ipynb | CTC loss, Speech recognition |
| 22 | Scaling Laws | 22_scaling_laws.ipynb | Power laws, Compute-optimal training |

Theory & Meta-Learning (Papers 23-30)

| # | Paper | Notebook | Key Concepts |
|---|-------|----------|--------------|
| 23 | MDL Principle | 23_mdl_principle.ipynb | Information theory, Model selection, Compression |
| 24 | Machine Super Intelligence | 24_machine_super_intelligence.ipynb | Universal AI, AIXI, Solomonoff induction, Intelligence measures, Self-improvement |
| 25 | Kolmogorov Complexity | 25_kolmogorov_complexity.ipynb | Compression, Algorithmic randomness, Universal prior |
| 26 | CS231n: CNNs for Visual Recognition | 26_cs231n_cnn_fundamentals.ipynb | Image classification pipeline, kNN/Linear/NN/CNN, Backprop, Optimization, Babysitting neural nets |
| 27 | Multi-token Prediction | 27_multi_token_prediction.ipynb | Multiple future tokens, Sample efficiency, 2-3x faster |
| 28 | Dense Passage Retrieval | 28_dense_passage_retrieval.ipynb | Dual encoders, MIPS, In-batch negatives |
| 29 | Retrieval-Augmented Generation | 29_rag.ipynb | RAG-Sequence, RAG-Token, Knowledge retrieval |
| 30 | Lost in the Middle | 30_lost_in_middle.ipynb | Position bias, Long context, U-shaped curve |

These implementations cover the most influential papers and demonstrate core deep learning concepts:

  1. 02_char_rnn_karpathy.ipynb - Character-level RNN

    • Build RNN from scratch
    • Understand backpropagation through time
    • Generate text
  2. 03_lstm_understanding.ipynb - LSTM Networks

    • Implement forget/input/output gates
    • Visualize gate activations
    • Compare with vanilla RNN
  3. 04_rnn_regularization.ipynb - RNN Regularization

    • Variational dropout for RNNs
    • Proper dropout placement
    • Training improvements
  4. 05_neural_network_pruning.ipynb - Network Pruning & MDL

    • Magnitude-based pruning
    • Iterative pruning with fine-tuning
    • 90%+ sparsity with minimal loss
    • Minimum Description Length principle
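To give a taste of the pruning notebook above, here is a minimal NumPy sketch of magnitude-based pruning on a random weight matrix; the matrix, sparsity level, and threshold logic are illustrative, not code taken from the notebook (which adds iterative pruning and fine-tuning):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))           # toy stand-in for a trained layer's weights

sparsity = 0.9                            # prune 90% of weights by magnitude
k = int(sparsity * W.size)
threshold = np.partition(np.abs(W).ravel(), k)[k]   # k-th smallest |w| sets the cutoff
mask = np.abs(W) >= threshold             # keep only the largest-magnitude weights
W_pruned = W * mask

print(f"sparsity achieved: {1 - mask.mean():.2%}")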
  1. 07_alexnet_cnn.ipynb - CNNs & AlexNet

    • Convolutional layers from scratch
    • Max pooling and ReLU
    • Data augmentation techniques
  2. 10_resnet_deep_residual.ipynb - ResNet

    • Skip connections solve degradation
    • Gradient flow visualization
    • Identity mapping intuition
  3. 15_identity_mappings_resnet.ipynb - Pre-activation ResNet

    • Pre-activation vs post-activation
    • Better gradient flow
    • Training 1000+ layer networks
  4. 11_dilated_convolutions.ipynb - Dilated Convolutions

    • Multi-scale receptive fields
    • No pooling required
    • Semantic segmentation
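To make the skip-connection idea concrete, here is a minimal NumPy sketch of a pre-activation residual block on a toy vector; the dimensions and random weights are invented for illustration, and the notebooks build and train the full networks:

import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Pre-activation residual block: y = x + W2 @ relu(W1 @ relu(x))."""
    h = W1 @ relu(x)
    h = W2 @ relu(h)
    return x + h          # identity shortcut: gradients flow through unchanged

d = 16
x = rng.normal(size=d)
W1 = rng.normal(size=(d, d)) * 0.1
W2 = rng.normal(size=(d, d)) * 0.1
y = residual_block(x, W1, W2)
print(y.shape)            # (16,) -- same shape as the input, thanks to the skip path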
  1. 14_bahdanau_attention.ipynb - Neural Machine Translation

    • Original attention mechanism
    • Seq2seq with alignment
    • Attention visualization
  2. 13_attention_is_all_you_need.ipynb - Transformers

    • Scaled dot-product attention
    • Multi-head attention
    • Positional encoding
    • Foundation of modern LLMs
  3. 06_pointer_networks.ipynb - Pointer Networks

    • Attention as selection
    • Combinatorial optimization
    • Variable output size
  4. 08_seq2seq_for_sets.ipynb - Seq2Seq for Sets

    • Permutation-invariant set encoder
    • Read-Process-Write architecture
    • Attention over unordered elements
    • Sorting and set operations
    • Comparison: order-sensitive vs order-invariant
  5. 09_gpipe.ipynb - GPipe Pipeline Parallelism

    • Model partitioning across devices
    • Micro-batching for pipeline utilization
    • F-then-B schedule (forward all, backward all)
    • Re-materialization (gradient checkpointing)
    • Bubble time analysis
    • Training models larger than single-device memory
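The attention notebooks above all build on the same primitive; here is a minimal NumPy sketch of scaled dot-product attention on random toy queries, keys, and values (the shapes are chosen only for illustration):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)         # each query's distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.sum(axis=-1))            # (4, 8); rows of attn sum to 1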
  1. 12_graph_neural_networks.ipynb - Graph Neural Networks

    • Message passing framework
    • Graph convolutions
    • Molecular property prediction
  2. 16_relational_reasoning.ipynb - Relation Networks

    • Pairwise relational reasoning
    • Visual QA
    • Permutation invariance
  3. 18_relational_rnn.ipynb - Relational RNN

    • LSTM with relational memory
    • Multi-head self-attention across memory slots
    • Architecture demonstration (forward pass)
    • Sequential reasoning tasks
    • Section 11: Manual backpropagation implementation (~1100 lines)
    • Complete gradient computation for all components
    • Gradient checking with numerical verification
  4. 20_neural_turing_machine.ipynb - Memory-Augmented Networks

    • Content & location addressing
    • Differentiable read/write
    • External memory
  5. 21_ctc_speech.ipynb - CTC Loss & Speech Recognition

    • Connectionist Temporal Classification
    • Alignment-free training
    • Forward algorithm
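As a small illustration of the message-passing framework above, here is one round of mean-aggregation graph convolution on a toy 4-node graph; the adjacency matrix, feature sizes, and weights are invented for the example:

import numpy as np

rng = np.random.default_rng(0)

# Toy undirected graph: 4 nodes, edges (0-1, 1-2, 2-3)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                      # add self-loops so each node keeps its own state

H = rng.normal(size=(4, 5))                # node features (4 nodes, 5 features each)
W = rng.normal(size=(5, 5)) * 0.1          # shared message/update weights

deg = A_hat.sum(axis=1, keepdims=True)
H_next = np.maximum(0.0, (A_hat @ H) / deg @ W)   # aggregate neighbours, transform, ReLU
print(H_next.shape)                        # (4, 5): updated node embeddings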
  1. 17_variational_autoencoder.ipynb - VAE
    • Generative modeling
    • ELBO loss
    • Latent space visualization
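A minimal sketch of the reparameterization trick used in the VAE notebook; the mean and log-variance below are random placeholders rather than real encoder outputs:

import numpy as np

rng = np.random.default_rng(0)

# Pretend these came from an encoder network for a batch of 8 inputs, latent dim 2
mu = rng.normal(size=(8, 2))
log_var = rng.normal(size=(8, 2))

# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)
eps = rng.standard_normal(size=mu.shape)
z = mu + np.exp(0.5 * log_var) * eps       # differentiable w.r.t. mu and log_var

# KL(q(z|x) || N(0, I)) term of the ELBO, per example
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
print(z.shape, kl.shape)                   # (8, 2) (8,)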
  1. 27_multi_token_prediction.ipynb - Multi-Token Prediction

    • Predict multiple future tokens
    • 2-3x sample efficiency
    • Speculative decoding
    • Faster training & inference
  2. 28_dense_passage_retrieval.ipynb - Dense Retrieval

    • Dual encoder architecture
    • In-batch negatives
    • Semantic search
  3. 29_rag.ipynb - Retrieval-Augmented Generation

    • RAG-Sequence vs RAG-Token
    • Combining retrieval + generation
    • Knowledge-grounded outputs
  4. 30_lost_in_middle.ipynb - Long Context Analysis

    • Position bias in LLMs
    • U-shaped performance curve
    • Document ordering strategies
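To illustrate the dual-encoder idea above, here is a minimal NumPy sketch of in-batch negatives: each query's positive passage sits on the diagonal of the similarity matrix, and the other passages in the batch act as negatives (the embeddings are random stand-ins for encoder outputs):

import numpy as np

rng = np.random.default_rng(0)

B, d = 4, 32                               # batch size, embedding dimension
q = rng.normal(size=(B, d))                # "query encoder" outputs (random stand-ins)
p = rng.normal(size=(B, d))                # "passage encoder" outputs; row i matches query i

scores = q @ p.T                           # (B, B) dot-product similarities
scores -= scores.max(axis=1, keepdims=True)
log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))

# In-batch negative loss: softmax cross-entropy with the diagonal as the correct passage
loss = -log_probs[np.arange(B), np.arange(B)].mean()
print(f"in-batch negative loss: {loss:.3f}")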
  1. 22_scaling_laws.ipynb - Scaling Laws

    • Power law relationships
    • Compute-optimal training
    • Performance prediction
  2. 23_mdl_principle.ipynb - Minimum Description Length

    • Information-theoretic model selection
    • Compression = Understanding
    • MDL vs AIC/BIC comparison
    • Neural network architecture selection
    • MDL-based pruning (connects to Paper 5)
    • Kolmogorov complexity preview
  3. 25_kolmogorov_complexity.ipynb - Kolmogorov Complexity

    • K(x) = shortest program generating x
    • Randomness = Incompressibility
    • Algorithmic probability (Solomonoff)
    • Universal prior for induction
    • Connection to Shannon entropy
    • Occam's Razor formalized
    • Theoretical foundation for ML
  4. 24_machine_super_intelligence.ipynb - Universal Artificial Intelligence

    • Formal theory of intelligence (Legg & Hutter)
    • Psychometric g-factor and universal intelligence Υ(π)
    • Solomonoff induction for sequence prediction
    • AIXI: Theoretically optimal RL agent
    • Monte Carlo AIXI (MC-AIXI) approximation
    • Kolmogorov complexity estimation
    • Intelligence measurement across environments
    • Recursive self-improvement dynamics
    • Intelligence explosion scenarios
    • 6 sections: from psychometrics to superintelligence
    • Connects Papers #23 (MDL), #25 (Kolmogorov), #8 (DQN)
  5. 01_complexity_dynamics.ipynb - Complexity & Entropy

    • Cellular automata (Rule 30)
    • Entropy growth
    • Irreversibility (basic introduction)
  6. 19_coffee_automaton.ipynb - The Coffee Automaton (Deep Dive)

    • Comprehensive exploration of irreversibility
    • Coffee mixing and diffusion processes
    • Entropy growth and coarse-graining
    • Phase space and Liouville's theorem
    • Poincaré recurrence theorem (will unmix after e^N time!)
    • Maxwell's demon and Landauer's principle
    • Computational irreversibility (one-way functions, hashing)
    • Information bottleneck in machine learning
    • Biological irreversibility (life and the 2nd law)
    • Arrow of time: fundamental vs emergent
    • 10 comprehensive sections exploring irreversibility across all scales
  7. 26_cs231n_cnn_fundamentals.ipynb - CS231n: Vision from First Principles

    • Complete vision pipeline in pure NumPy
    • k-Nearest Neighbors baseline
    • Linear classifiers (SVM and Softmax)
    • Optimization (SGD, Momentum, Adam, learning rate schedules)
    • 2-layer neural networks with backpropagation
    • Convolutional layers (conv, pool, ReLU)
    • Complete CNN architecture (Mini-AlexNet)
    • Visualization techniques (filters, saliency maps)
    • Transfer learning principles
    • Babysitting tips (sanity checks, hyperparameter tuning, monitoring)
    • 10 sections covering entire CS231n curriculum
    • Ties together Papers #7 (AlexNet), #10 (ResNet), #11 (Dilated Conv)
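Finally, the scaling-laws notebook above boils down to fitting straight lines in log-log space; here is a minimal sketch on synthetic data (the exponent and constant are illustrative, not values from the paper):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic "loss vs. model size" data following L(N) = c * N^(-alpha), plus noise
alpha_true, c_true = 0.08, 10.0            # invented constants for the toy example
N = np.logspace(6, 9, 20)                  # model sizes from 1e6 to 1e9 parameters
L = c_true * N ** (-alpha_true) * np.exp(rng.normal(scale=0.01, size=N.shape))

# A power law is a straight line in log-log space: log L = log c - alpha * log N
slope, intercept = np.polyfit(np.log(N), np.log(L), deg=1)
print(f"fitted alpha ~ {-slope:.3f}, fitted c ~ {np.exp(intercept):.2f}")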
sutskever-30-implementations/
├── README.md                           # This file
├── PROGRESS.md                         # Implementation progress tracking
├── IMPLEMENTATION_TRACKS.md            # Detailed tracks for all 30 papers
│
├── 01_complexity_dynamics.ipynb        # Entropy & complexity
├── 02_char_rnn_karpathy.ipynb         # Vanilla RNN
├── 03_lstm_understanding.ipynb         # LSTM gates
├── 04_rnn_regularization.ipynb         # Dropout for RNNs
├── 05_neural_network_pruning.ipynb     # Pruning & MDL
├── 06_pointer_networks.ipynb           # Attention pointers
├── 07_alexnet_cnn.ipynb               # CNNs & AlexNet
├── 08_seq2seq_for_sets.ipynb          # Permutation-invariant sets
├── 09_gpipe.ipynb                     # Pipeline parallelism
├── 10_resnet_deep_residual.ipynb      # Residual connections
├── 11_dilated_convolutions.ipynb       # Multi-scale convolutions
├── 12_graph_neural_networks.ipynb      # Message passing GNNs
├── 13_attention_is_all_you_need.ipynb # Transformer architecture
├── 14_bahdanau_attention.ipynb         # Original attention
├── 15_identity_mappings_resnet.ipynb   # Pre-activation ResNet
├── 16_relational_reasoning.ipynb       # Relation networks
├── 17_variational_autoencoder.ipynb   # VAE
├── 18_relational_rnn.ipynb             # Relational RNN
├── 19_coffee_automaton.ipynb           # Irreversibility deep dive
├── 20_neural_turing_machine.ipynb     # External memory
├── 21_ctc_speech.ipynb                # CTC loss
├── 22_scaling_laws.ipynb              # Empirical scaling
├── 23_mdl_principle.ipynb             # MDL & compression
├── 24_machine_super_intelligence.ipynb # Universal AI & AIXI
├── 25_kolmogorov_complexity.ipynb     # K(x) & randomness
├── 26_cs231n_cnn_fundamentals.ipynb    # Vision from first principles
├── 27_multi_token_prediction.ipynb     # Multi-token prediction
├── 28_dense_passage_retrieval.ipynb    # Dense retrieval
├── 29_rag.ipynb                       # RAG architecture
└── 30_lost_in_middle.ipynb            # Long context analysis

All 30 papers implemented! (100% complete!) 🎉

Beginner Track (Start here!)

  1. Character RNN (02_char_rnn_karpathy.ipynb) - Learn basic RNNs
  2. LSTM (03_lstm_understanding.ipynb) - Understand gating mechanisms
  3. CNNs (07_alexnet_cnn.ipynb) - Computer vision fundamentals
  4. ResNet (10_resnet_deep_residual.ipynb) - Skip connections
  5. VAE (17_variational_autoencoder.ipynb) - Generative models
  1. RNN Regularization (04_rnn_regularization.ipynb) - Better training
  2. Bahdanau Attention (14_bahdanau_attention.ipynb) - Attention basics
  3. Pointer Networks (06_pointer_networks.ipynb) - Attention as selection
  4. Seq2Seq for Sets (08_seq2seq_for_sets.ipynb) - Permutation invariance
  5. CS231n (26_cs231n_cnn_fundamentals.ipynb) - Complete vision pipeline (kNN → CNNs)
  6. GPipe (09_gpipe.ipynb) - Pipeline parallelism for large models
  7. Transformers (13_attention_is_all_you_need.ipynb) - Modern architecture
  8. Dilated Convolutions (11_dilated_convolutions.ipynb) - Receptive fields
  9. Scaling Laws (22_scaling_laws.ipynb) - Understanding scale
  1. Pre-activation ResNet (15_identity_mappings_resnet.ipynb) - Architecture details
  2. Graph Neural Networks (12_graph_neural_networks.ipynb) - Graph learning
  3. Relation Networks (16_relational_reasoning.ipynb) - Relational reasoning
  4. Neural Turing Machines (20_neural_turing_machine.ipynb) - External memory
  5. CTC Loss (21_ctc_speech.ipynb) - Speech recognition
  6. Dense Retrieval (28_dense_passage_retrieval.ipynb) - Semantic search
  7. RAG (29_rag.ipynb) - Retrieval-augmented generation
  8. Lost in the Middle (30_lost_in_middle.ipynb) - Long context analysis
  1. MDL Principle (23_mdl_principle.ipynb) - Model selection via compression
  2. Kolmogorov Complexity (25_kolmogorov_complexity.ipynb) - Randomness & information
  3. Complexity Dynamics (01_complexity_dynamics.ipynb) - Entropy & emergence
  4. Coffee Automaton (19_coffee_automaton.ipynb) - Deep dive into irreversibility

Key Insights from the Sutskever 30

  • RNN → LSTM: Gating solves vanishing gradients
  • Plain Networks → ResNet: Skip connections enable depth
  • RNN → Transformer: Attention enables parallelization
  • Fixed vocab → Pointers: Output can reference input
  • Attention: Differentiable selection mechanism
  • Residual Connections: Gradient highways
  • Gating: Learned information flow control
  • External Memory: Separate storage from computation
  • Scaling Laws: Performance predictably improves with scale
  • Regularization: Dropout, weight decay, data augmentation
  • Optimization: Gradient clipping, learning rate schedules
  • Compute-Optimal: Balance model size and training data
  • Information Theory: Compression, entropy, MDL
  • Complexity: Kolmogorov complexity, power laws
  • Generative Modeling: VAE, ELBO, latent spaces
  • Memory: Differentiable data structures
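As a compact illustration of the gating insight above, here is a single LSTM cell step in NumPy; the weights are random placeholders, and notebook 03 builds and trains the full model:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: forget/input/output gates control what the cell state keeps."""
    z = W @ np.concatenate([x, h_prev]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates in (0, 1)
    g = np.tanh(g)                                 # candidate cell update
    c = f * c_prev + i * g                         # gated memory update
    h = o * np.tanh(c)                             # gated output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
W = rng.normal(size=(4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
print(h.shape, c.shape)    # (16,) (16,)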

Implementation Philosophy

These implementations deliberately avoid PyTorch/TensorFlow for the sake of:

  • Deeper understanding: see what frameworks abstract away
  • Educational clarity: no magic; every operation is explicit
  • Core concepts: focus on algorithms, not framework APIs
  • Transferable knowledge: the principles apply to any framework

Each notebook generates its own synthetic data, which allows:

  • Immediate execution: no dataset downloads required
  • Controlled experiments: behavior is easy to study on simple cases
  • Concept focus: the data never obscures the algorithm
  • Rapid iteration: modify and re-run instantly
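For example, a notebook that needs sequence data might synthesize something like the following copy task; this particular generator is illustrative rather than lifted from any specific notebook:

import numpy as np

def make_copy_task(n_examples=128, seq_len=10, vocab_size=8, seed=0):
    """Toy dataset: the target is simply the input sequence repeated back."""
    rng = np.random.default_rng(seed)
    x = rng.integers(1, vocab_size, size=(n_examples, seq_len))   # tokens 1..vocab_size-1
    y = x.copy()                                                   # copy task: output == input
    return x, y

x, y = make_copy_task()
print(x.shape, y.shape)    # (128, 10) (128, 10) -- ready to train on immediately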

Build on These Implementations

After understanding the core concepts, try:

  1. Scale up: Implement in PyTorch/JAX for real datasets
  2. Combine techniques: E.g., ResNet + Attention
  3. Modern variants:
    • RNN → GRU → Transformer
    • VAE → β-VAE → VQ-VAE
    • ResNet → ResNeXt → EfficientNet
  4. Applications: Apply to real problems

The Sutskever 30 points toward:

  • Scaling (bigger models, more data)
  • Efficiency (sparse models, quantization)
  • Capabilities (reasoning, multi-modal)
  • Understanding (interpretability, theory)

See IMPLEMENTATION_TRACKS.md for full citations and links

  • Stanford CS231n: Convolutional Neural Networks
  • Stanford CS224n: NLP with Deep Learning
  • MIT 6.S191: Introduction to Deep Learning

These implementations are educational and can be improved! Consider:

  • Adding more visualizations
  • Implementing missing papers
  • Improving explanations
  • Finding bugs
  • Adding comparisons with framework implementations

If you use these implementations in your work or teaching:

@misc{sutskever30implementations,
  title={Sutskever 30: Complete Implementation Suite},
  author={Paul "The Pageman" Pajo, pageman@gmail.com},
  year={2025},
  note={Educational implementations of Ilya Sutskever's recommended reading list, inspired by https://papercode.vercel.app/}
}

Educational use. See individual papers for original research citations.

  • Ilya Sutskever: For curating this essential reading list
  • Paper authors: For their foundational contributions
  • Community: For making these ideas accessible

Latest Additions (December 2025)

Recently Implemented (21 new papers!)

  • Paper 4: RNN Regularization (variational dropout)
  • Paper 5: Neural Network Pruning (MDL, 90%+ sparsity)
  • Paper 7: AlexNet (CNNs from scratch)
  • Paper 8: Seq2Seq for Sets (permutation invariance, attention pooling)
  • Paper 9: GPipe (pipeline parallelism, micro-batching, re-materialization)
  • Paper 19: The Coffee Automaton (deep dive into irreversibility, entropy, Landauer's principle)
  • Paper 26: CS231n (complete vision pipeline: kNN → CNN, all in NumPy)
  • Paper 11: Dilated Convolutions (multi-scale)
  • Paper 12: Graph Neural Networks (message passing)
  • Paper 14: Bahdanau Attention (original attention)
  • Paper 15: Identity Mappings ResNet (pre-activation)
  • Paper 16: Relational Reasoning (relation networks)
  • Paper 18: Relational RNNs (relational memory + Section 11: manual backprop ~1100 lines)
  • Paper 21: Deep Speech 2 (CTC loss)
  • Paper 23: MDL Principle (compression, model selection, connects to Papers 5 & 25)
  • Paper 24: Machine Super Intelligence (Universal AI, AIXI, Solomonoff induction, intelligence measures, recursive self-improvement)
  • Paper 25: Kolmogorov Complexity (randomness, algorithmic probability, theoretical foundation)
  • Paper 27: Multi-Token Prediction (2-3x sample efficiency)
  • Paper 28: Dense Passage Retrieval (dual encoders)
  • Paper 29: RAG (retrieval-augmented generation)
  • Paper 30: Lost in the Middle (long context)

Quick Reference: Implementation Complexity

Can Implement in an Afternoon

  • ✅ Character RNN
  • ✅ LSTM
  • ✅ ResNet
  • ✅ Simple VAE
  • ✅ Dilated Convolutions
  • ✅ Transformer
  • ✅ Pointer Networks
  • ✅ Graph Neural Networks
  • ✅ Relation Networks
  • ✅ Neural Turing Machine
  • ✅ CTC Loss
  • ✅ Dense Retrieval
  • ✅ Full RAG system
  • ⚠️ Large-scale experiments
  • ⚠️ Hyperparameter optimization

"If you really learn all of these, you'll know 90% of what matters today." - Ilya Sutskever

Happy learning! 🚀
