Comprehensive toy implementations of the 30 foundational papers recommended by Ilya Sutskever
This repository contains detailed, educational implementations of the papers from Ilya Sutskever's famous reading list - the collection he told John Carmack would teach you "90% of what matters" in deep learning.
Progress: 30/30 papers (100%) - COMPLETE! 🎉
Each implementation:
- ✅ Uses only NumPy (no deep learning frameworks) for educational clarity
- ✅ Includes synthetic/bootstrapped data for immediate execution
- ✅ Provides extensive visualizations and explanations
- ✅ Demonstrates core concepts from each paper
- ✅ Runs in Jupyter notebooks for interactive learning
```bash
# Navigate to the directory
cd sutskever-30-implementations

# Install dependencies
pip install numpy matplotlib scipy

# Run any notebook
jupyter notebook 02_char_rnn_karpathy.ipynb
```

| # | Paper | Notebook | Key Concepts |
|---|---|---|---|
| 1 | The First Law of Complexodynamics | ✅ 01_complexity_dynamics.ipynb | Entropy, Complexity Growth, Cellular Automata |
| 2 | The Unreasonable Effectiveness of RNNs | ✅ 02_char_rnn_karpathy.ipynb | Character-level models, RNN basics, Text generation |
| 3 | Understanding LSTM Networks | ✅ 03_lstm_understanding.ipynb | Gates, Long-term memory, Gradient flow |
| 4 | RNN Regularization | ✅ 04_rnn_regularization.ipynb | Dropout for sequences, Variational dropout |
| 5 | Keeping Neural Networks Simple | ✅ 05_neural_network_pruning.ipynb | MDL principle, Weight pruning, 90%+ sparsity |
| 6 | Pointer Networks | ✅ 06_pointer_networks.ipynb | Attention as pointer, Combinatorial problems |
| 7 | ImageNet/AlexNet | ✅ 07_alexnet_cnn.ipynb | CNNs, Convolution, Data augmentation |
| 8 | Order Matters: Seq2Seq for Sets | ✅ 08_seq2seq_for_sets.ipynb | Set encoding, Permutation invariance, Attention pooling |
| 9 | GPipe | ✅ 09_gpipe.ipynb | Pipeline parallelism, Micro-batching, Re-materialization |
| 10 | Deep Residual Learning (ResNet) | ✅ 10_resnet_deep_residual.ipynb | Skip connections, Gradient highways |
| 11 | Dilated Convolutions | ✅ 11_dilated_convolutions.ipynb | Receptive fields, Multi-scale |
| 12 | Neural Message Passing (GNNs) | ✅ 12_graph_neural_networks.ipynb | Graph networks, Message passing |
| 13 | Attention Is All You Need | ✅ 13_attention_is_all_you_need.ipynb | Transformers, Self-attention, Multi-head |
| 14 | Neural Machine Translation | ✅ 14_bahdanau_attention.ipynb | Seq2seq, Bahdanau attention |
| 15 | Identity Mappings in ResNet | ✅ 15_identity_mappings_resnet.ipynb | Pre-activation, Gradient flow |
| 16 | Relational Reasoning | ✅ 16_relational_reasoning.ipynb | Relation networks, Pairwise functions |
| 17 | Variational Lossy Autoencoder | ✅ 17_variational_autoencoder.ipynb | VAE, ELBO, Reparameterization trick |
| 18 | Relational RNNs | ✅ 18_relational_rnn.ipynb | Relational memory, Multi-head self-attention, Manual backprop (~1100 lines) |
| 19 | The Coffee Automaton | ✅ 19_coffee_automaton.ipynb | Irreversibility, Entropy, Arrow of time, Landauer's principle |
| 20 | Neural Turing Machines | ✅ 20_neural_turing_machine.ipynb | External memory, Differentiable addressing |
| 21 | Deep Speech 2 (CTC) | ✅ 21_ctc_speech.ipynb | CTC loss, Speech recognition |
| 22 | Scaling Laws | ✅ 22_scaling_laws.ipynb | Power laws, Compute-optimal training |
| 23 | MDL Principle | ✅ 23_mdl_principle.ipynb | Information theory, Model selection, Compression |
| 24 | Machine Super Intelligence | ✅ 24_machine_super_intelligence.ipynb | Universal AI, AIXI, Solomonoff induction, Intelligence measures, Self-improvement |
| 25 | Kolmogorov Complexity | ✅ 25_kolmogorov_complexity.ipynb | Compression, Algorithmic randomness, Universal prior |
| 26 | CS231n: CNNs for Visual Recognition | ✅ 26_cs231n_cnn_fundamentals.ipynb | Image classification pipeline, kNN/Linear/NN/CNN, Backprop, Optimization, Babysitting neural nets |
| 27 | Multi-token Prediction | ✅ 27_multi_token_prediction.ipynb | Multiple future tokens, Sample efficiency, 2-3x faster |
| 28 | Dense Passage Retrieval | ✅ 28_dense_passage_retrieval.ipynb | Dual encoders, MIPS, In-batch negatives |
| 29 | Retrieval-Augmented Generation | ✅ 29_rag.ipynb | RAG-Sequence, RAG-Token, Knowledge retrieval |
| 30 | Lost in the Middle | ✅ 30_lost_in_middle.ipynb | Position bias, Long context, U-shaped curve |
These implementations cover the most influential papers and demonstrate core deep learning concepts:
- `02_char_rnn_karpathy.ipynb` - Character-level RNN (minimal NumPy sketch below)
  - Build RNN from scratch
  - Understand backpropagation through time
  - Generate text
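
For a feel of the core update, here is a minimal sketch of a vanilla RNN step in NumPy. The sizes, random weights, and variable names are arbitrary choices for illustration, not the notebook's exact code:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 27, 32          # toy sizes, chosen arbitrarily

# Parameters of a vanilla RNN: h_t = tanh(Wxh x_t + Whh h_{t-1} + bh)
Wxh = rng.normal(0, 0.01, (hidden_size, vocab_size))
Whh = rng.normal(0, 0.01, (hidden_size, hidden_size))
Why = rng.normal(0, 0.01, (vocab_size, hidden_size))
bh, by = np.zeros(hidden_size), np.zeros(vocab_size)

def rnn_step(x_idx, h_prev):
    """One time step: one-hot input -> new hidden state -> next-char distribution."""
    x = np.zeros(vocab_size); x[x_idx] = 1.0               # one-hot encode the character
    h = np.tanh(Wxh @ x + Whh @ h_prev + bh)               # recurrent update
    logits = Why @ h + by
    probs = np.exp(logits - logits.max()); probs /= probs.sum()   # softmax
    return h, probs

h = np.zeros(hidden_size)
for ch in [0, 3, 7]:                                       # feed a toy character sequence
    h, p = rnn_step(ch, h)
print("next-char distribution sums to", p.sum())
```

Sampling from `probs` at each step and feeding the result back in is all text generation amounts to.
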
- `03_lstm_understanding.ipynb` - LSTM Networks (gate-update sketch below)
  - Implement forget/input/output gates
  - Visualize gate activations
  - Compare with vanilla RNN
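
A compact sketch of the standard LSTM cell update, to make the gate equations concrete (toy sizes and random weights; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 8, 16                       # arbitrary toy sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on the concatenation [x_t, h_{t-1}]
Wf, Wi, Wo, Wc = (rng.normal(0, 0.1, (n_hid, n_in + n_hid)) for _ in range(4))
bf, bi, bo, bc = (np.zeros(n_hid) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    """Standard LSTM update: gates decide what to forget, write, and expose."""
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z + bf)              # forget gate
    i = sigmoid(Wi @ z + bi)              # input gate
    o = sigmoid(Wo @ z + bo)              # output gate
    c_tilde = np.tanh(Wc @ z + bc)        # candidate cell content
    c = f * c_prev + i * c_tilde          # additive cell update (the gradient highway)
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):
    h, c = lstm_step(rng.normal(size=n_in), h, c)
print(h.shape, c.shape)
```
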
- `04_rnn_regularization.ipynb` - RNN Regularization
  - Variational dropout for RNNs
  - Proper dropout placement
  - Training improvements
- `05_neural_network_pruning.ipynb` - Network Pruning & MDL (pruning sketch below)
  - Magnitude-based pruning
  - Iterative pruning with fine-tuning
  - 90%+ sparsity with minimal loss
  - Minimum Description Length principle
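
Magnitude pruning itself is a few lines; a minimal sketch (the weight matrix here is just random noise standing in for a trained layer):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(256, 256))                  # stand-in for a trained weight matrix

def magnitude_prune(W, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` fraction are zero."""
    k = int(sparsity * W.size)
    threshold = np.sort(np.abs(W), axis=None)[k - 1] if k > 0 else -np.inf
    mask = np.abs(W) > threshold
    return W * mask, mask

W_pruned, mask = magnitude_prune(W, sparsity=0.9)
print(f"sparsity: {1 - mask.mean():.2%}")        # ~90% of weights removed
```

In the notebook this is wrapped in an iterative prune-then-fine-tune loop, which is what lets accuracy survive the high sparsity.
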
- `07_alexnet_cnn.ipynb` - CNNs & AlexNet (convolution sketch below)
  - Convolutional layers from scratch
  - Max pooling and ReLU
  - Data augmentation techniques
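
A naive single-channel convolution makes the sliding-window idea explicit. This is a didactic sketch (no im2col, no batching), not the notebook's exact layer:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Naive valid 2D convolution (really cross-correlation, as in most DL libraries)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            out[i, j] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(3)
img = rng.normal(size=(8, 8))
edge = np.array([[1., 0., -1.]] * 3)            # a simple vertical-edge filter
feat = np.maximum(conv2d(img, edge), 0)         # convolution followed by ReLU
print(feat.shape)                               # (6, 6)
```
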
- `10_resnet_deep_residual.ipynb` - ResNet (residual-block sketch below)
  - Skip connections solve degradation
  - Gradient flow visualization
  - Identity mapping intuition
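
The key idea fits in one function: the block computes a correction F(x) and adds it to the identity path. A toy sketch with random weights (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 16
W1, W2 = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d))

def residual_block(x):
    """y = ReLU(x + F(x)): the block only has to learn a residual correction to the identity."""
    f = np.maximum(W1 @ x, 0)       # first layer + ReLU
    f = W2 @ f                      # second layer
    return np.maximum(x + f, 0)     # skip connection, then nonlinearity

x = rng.normal(size=d)
y = x
for _ in range(10):                 # stack blocks; the identity path keeps the signal alive
    y = residual_block(y)
print(np.linalg.norm(y))
```
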
- `15_identity_mappings_resnet.ipynb` - Pre-activation ResNet
  - Pre-activation vs post-activation
  - Better gradient flow
  - Training 1000+ layer networks
- `11_dilated_convolutions.ipynb` - Dilated Convolutions (dilation sketch below)
  - Multi-scale receptive fields
  - No pooling required
  - Semantic segmentation
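
A 1D sketch shows how dilation grows the receptive field exponentially while the kernel stays tiny (toy signal and kernel, illustrative only):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """1D convolution whose taps are `dilation` samples apart, enlarging the receptive field."""
    k = len(kernel)
    span = (k - 1) * dilation + 1                      # effective receptive field of this layer
    out = np.zeros(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

signal = np.sin(np.linspace(0, 6 * np.pi, 64))
kernel = np.array([0.25, 0.5, 0.25])
for d in (1, 2, 4, 8):                                 # exponentially growing dilation
    y = dilated_conv1d(signal, kernel, d)
    print(f"dilation {d}: receptive field {(len(kernel) - 1) * d + 1}, output length {len(y)}")
```
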
- `14_bahdanau_attention.ipynb` - Neural Machine Translation
  - Original attention mechanism
  - Seq2seq with alignment
  - Attention visualization
- `13_attention_is_all_you_need.ipynb` - Transformers (attention sketch below)
  - Scaled dot-product attention
  - Multi-head attention
  - Positional encoding
  - Foundation of modern LLMs
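
Scaled dot-product attention is the heart of the Transformer and takes only a few lines of NumPy. Sizes and projection matrices below are arbitrary toy choices:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of every query to every key
    weights = softmax(scores, axis=-1)     # each query's distribution over positions
    return weights @ V, weights

rng = np.random.default_rng(5)
seq_len, d_model = 6, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(0, 0.1, (d_model, d_model)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape, attn.sum(axis=-1))        # each row of attention weights sums to 1
```

Multi-head attention simply runs several of these in parallel on lower-dimensional projections and concatenates the results.
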
- `06_pointer_networks.ipynb` - Pointer Networks
  - Attention as selection
  - Combinatorial optimization
  - Variable output size
- `08_seq2seq_for_sets.ipynb` - Seq2Seq for Sets
  - Permutation-invariant set encoder
  - Read-Process-Write architecture
  - Attention over unordered elements
  - Sorting and set operations
  - Comparison: order-sensitive vs order-invariant
- `09_gpipe.ipynb` - GPipe Pipeline Parallelism (bubble-time example below)
  - Model partitioning across devices
  - Micro-batching for pipeline utilization
  - F-then-B schedule (forward all, backward all)
  - Re-materialization (gradient checkpointing)
  - Bubble time analysis
  - Training models larger than single-device memory
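
The bubble-time analysis is easy to check numerically. A tiny example assuming the commonly cited bubble fraction of (K-1)/(M+K-1) for an F-then-B schedule with K pipeline stages and M micro-batches:

```python
def bubble_fraction(K, M):
    """Fraction of idle 'bubble' time for K pipeline stages and M micro-batches."""
    return (K - 1) / (M + K - 1)

for M in (1, 4, 8, 32):
    print(f"K=4 stages, M={M:>2} micro-batches -> bubble ≈ {bubble_fraction(4, M):.0%}")
```

More micro-batches shrink the bubble, which is why GPipe splits each mini-batch before feeding the pipeline.
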
- `12_graph_neural_networks.ipynb` - Graph Neural Networks (message-passing sketch below)
  - Message passing framework
  - Graph convolutions
  - Molecular property prediction
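
A minimal message-passing step on a toy graph with random features. The mean-pooling aggregation and names are illustrative choices, not the notebook's exact code:

```python
import numpy as np

rng = np.random.default_rng(6)
n_nodes, d = 5, 4
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)           # toy undirected graph
H = rng.normal(size=(n_nodes, d))                       # initial node features
W = rng.normal(0, 0.1, (d, d))

def message_passing_step(A, H, W):
    """Each node aggregates (mean-pools) its neighbours' features, then applies a shared update."""
    A_hat = A + np.eye(len(A))                          # include self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    messages = (A_hat @ H) / deg                        # mean over the neighbourhood
    return np.maximum(messages @ W, 0)                  # shared linear update + ReLU

for _ in range(3):                                      # three rounds of message passing
    H = message_passing_step(A, H, W)
graph_embedding = H.mean(axis=0)                        # permutation-invariant readout
print(graph_embedding.shape)
```
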
- `16_relational_reasoning.ipynb` - Relation Networks
  - Pairwise relational reasoning
  - Visual QA
  - Permutation invariance
- `18_relational_rnn.ipynb` - Relational RNN
  - LSTM with relational memory
  - Multi-head self-attention across memory slots
  - Architecture demonstration (forward pass)
  - Sequential reasoning tasks
  - Section 11: Manual backpropagation implementation (~1100 lines)
  - Complete gradient computation for all components
  - Gradient checking with numerical verification
- `20_neural_turing_machine.ipynb` - Memory-Augmented Networks (content-addressing sketch below)
  - Content & location addressing
  - Differentiable read/write
  - External memory
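
Content-based addressing is the NTM's core trick: a soft, differentiable lookup over memory rows. A small sketch with a random memory matrix (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 10, 6                                  # memory slots x slot width
memory = rng.normal(size=(N, M))

def content_addressing(memory, key, beta):
    """Soft read weights from cosine similarity between a key and every memory row."""
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sim)
    return w / w.sum()                        # softmax sharpened by the focus parameter beta

key = memory[3] + 0.1 * rng.normal(size=M)    # a noisy copy of row 3 as the query
w = content_addressing(memory, key, beta=5.0)
read_vector = w @ memory                      # differentiable, weighted read
print("most attended slot:", w.argmax())      # should usually be slot 3
```
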
- `21_ctc_speech.ipynb` - CTC Loss & Speech Recognition (alignment-collapse sketch below)
  - Connectionist Temporal Classification
  - Alignment-free training
  - Forward algorithm
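
The many-to-one mapping at the heart of CTC is simple: merge repeated symbols, then drop blanks. The loss marginalizes over all frame-level paths that collapse to the target:

```python
def ctc_collapse(path, blank=0):
    """CTC's many-to-one mapping: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

# Two different frame-level alignments map to the same label sequence [1, 2, 2]
print(ctc_collapse([1, 1, 0, 2, 0, 2, 2]))   # -> [1, 2, 2]
print(ctc_collapse([0, 1, 2, 2, 0, 2, 0]))   # -> [1, 2, 2]
```
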
- `17_variational_autoencoder.ipynb` - VAE (reparameterization sketch below)
  - Generative modeling
  - ELBO loss
  - Latent space visualization
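
The two VAE-specific pieces, sketched in NumPy: the reparameterization trick and the closed-form KL term of the ELBO. The "encoder outputs" here are just random numbers for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)

def reparameterize(mu, log_var):
    """z = mu + sigma * eps keeps sampling differentiable w.r.t. the encoder outputs."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)), the regularizer inside the ELBO."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu, log_var = rng.normal(size=4), rng.normal(size=4)   # pretend encoder outputs
z = reparameterize(mu, log_var)
print("sample z:", z)
print("KL term:", kl_to_standard_normal(mu, log_var))
```

The ELBO is then the reconstruction log-likelihood minus this KL term.
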
- `27_multi_token_prediction.ipynb` - Multi-Token Prediction
  - Predict multiple future tokens
  - 2-3x sample efficiency
  - Speculative decoding
  - Faster training & inference
- `28_dense_passage_retrieval.ipynb` - Dense Retrieval (in-batch-negatives sketch below)
  - Dual encoder architecture
  - In-batch negatives
  - Semantic search
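
In-batch negatives are the cheapest part of DPR to demonstrate: every other passage in the batch serves as a negative for free. Random vectors stand in for the BERT encoders here (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(10)
batch, dim = 4, 8
Q = rng.normal(size=(batch, dim))     # query embeddings (one per question)
P = rng.normal(size=(batch, dim))     # passage embeddings; P[i] is the positive for Q[i]

scores = Q @ P.T                      # every query scored against every passage in the batch
scores -= scores.max(axis=1, keepdims=True)
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Contrastive loss: each query should rank its own passage above the in-batch negatives
loss = -np.log(probs[np.arange(batch), np.arange(batch)]).mean()
print("contrastive loss:", round(loss, 3))
```
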
- `29_rag.ipynb` - Retrieval-Augmented Generation
  - RAG-Sequence vs RAG-Token
  - Combining retrieval + generation
  - Knowledge-grounded outputs
- `30_lost_in_middle.ipynb` - Long Context Analysis
  - Position bias in LLMs
  - U-shaped performance curve
  - Document ordering strategies
- `22_scaling_laws.ipynb` - Scaling Laws (power-law fit sketch below)
  - Power law relationships
  - Compute-optimal training
  - Performance prediction
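
Fitting a power law is just linear regression in log-log space. The constants below are toy values chosen for illustration, not the paper's fitted coefficients:

```python
import numpy as np

# Synthetic loss-vs-parameters data following L(N) = a * N^(-alpha), plus noise
rng = np.random.default_rng(9)
N = np.logspace(5, 9, 20)                       # model sizes from 1e5 to 1e9 parameters
true_alpha, a = 0.076, 30.0                     # toy constants for illustration
L = a * N**(-true_alpha) * np.exp(0.01 * rng.normal(size=N.size))

# A power law is a straight line in log-log space: log L = log a - alpha * log N
slope, intercept = np.polyfit(np.log(N), np.log(L), deg=1)
print(f"recovered exponent alpha ≈ {-slope:.3f} (true {true_alpha})")
```
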
- `23_mdl_principle.ipynb` - Minimum Description Length
  - Information-theoretic model selection
  - Compression = Understanding
  - MDL vs AIC/BIC comparison
  - Neural network architecture selection
  - MDL-based pruning (connects to Paper 5)
  - Kolmogorov complexity preview
- `25_kolmogorov_complexity.ipynb` - Kolmogorov Complexity (compression-estimate sketch below)
  - K(x) = shortest program generating x
  - Randomness = Incompressibility
  - Algorithmic probability (Solomonoff)
  - Universal prior for induction
  - Connection to Shannon entropy
  - Occam's Razor formalized
  - Theoretical foundation for ML
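
K(x) itself is uncomputable, but any real compressor gives a computable upper bound, which is the standard practical proxy. A minimal sketch using the standard-library zlib:

```python
import os
import zlib

def compression_complexity(s: bytes) -> int:
    """A computable upper-bound proxy for K(x): size of the zlib-compressed string."""
    return len(zlib.compress(s, level=9))

structured = b"ab" * 500          # highly regular: a short program ("print 'ab' 500 times")
random_ish = os.urandom(1000)     # incompressible: its shortest description is roughly itself

print("structured:", compression_complexity(structured), "bytes")   # far below 1000
print("random-ish:", compression_complexity(random_ish), "bytes")   # close to (or above) 1000
```
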
- `24_machine_super_intelligence.ipynb` - Universal Artificial Intelligence
  - Formal theory of intelligence (Legg & Hutter)
  - Psychometric g-factor and universal intelligence Υ(π)
  - Solomonoff induction for sequence prediction
  - AIXI: theoretically optimal RL agent
  - Monte Carlo AIXI (MC-AIXI) approximation
  - Kolmogorov complexity estimation
  - Intelligence measurement across environments
  - Recursive self-improvement dynamics
  - Intelligence explosion scenarios
  - 6 sections: from psychometrics to superintelligence
  - Connects Papers #23 (MDL), #25 (Kolmogorov), #8 (DQN)
- `01_complexity_dynamics.ipynb` - Complexity & Entropy (Rule 30 sketch below)
  - Cellular automata (Rule 30)
  - Entropy growth
  - Irreversibility (basic introduction)
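
Rule 30 is a one-liner once the neighbourhood logic is written down; a minimal sketch with a crude per-row entropy measure (the entropy estimator here is an illustrative choice, not the notebook's exact metric):

```python
import numpy as np

def rule30(width=101, steps=50):
    """Elementary cellular automaton Rule 30, started from a single live cell."""
    row = np.zeros(width, dtype=np.uint8)
    row[width // 2] = 1
    history = [row.copy()]
    for _ in range(steps):
        left, right = np.roll(row, 1), np.roll(row, -1)
        row = left ^ (row | right)        # Rule 30: new cell = left XOR (centre OR right)
        history.append(row.copy())
    return np.array(history)

grid = rule30()
# Binary entropy of each row as a crude complexity measure: it rises as the pattern spreads
p = grid.mean(axis=1)
entropy = -(p * np.log2(p + 1e-12) + (1 - p) * np.log2(1 - p + 1e-12))
print(grid.shape, entropy[:5].round(3), "...", entropy[-1].round(3))
```
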
- `19_coffee_automaton.ipynb` - The Coffee Automaton (Deep Dive)
  - Comprehensive exploration of irreversibility
  - Coffee mixing and diffusion processes
  - Entropy growth and coarse-graining
  - Phase space and Liouville's theorem
  - Poincaré recurrence theorem (will unmix after e^N time!)
  - Maxwell's demon and Landauer's principle
  - Computational irreversibility (one-way functions, hashing)
  - Information bottleneck in machine learning
  - Biological irreversibility (life and the 2nd law)
  - Arrow of time: fundamental vs emergent
  - 10 comprehensive sections exploring irreversibility across all scales
- `26_cs231n_cnn_fundamentals.ipynb` - CS231n: Vision from First Principles
  - Complete vision pipeline in pure NumPy
  - k-Nearest Neighbors baseline
  - Linear classifiers (SVM and Softmax)
  - Optimization (SGD, Momentum, Adam, learning rate schedules)
  - 2-layer neural networks with backpropagation
  - Convolutional layers (conv, pool, ReLU)
  - Complete CNN architecture (Mini-AlexNet)
  - Visualization techniques (filters, saliency maps)
  - Transfer learning principles
  - Babysitting tips (sanity checks, hyperparameter tuning, monitoring)
  - 10 sections covering entire CS231n curriculum
  - Ties together Papers #7 (AlexNet), #10 (ResNet), #11 (Dilated Conv)
```
sutskever-30-implementations/
├── README.md # This file
├── PROGRESS.md # Implementation progress tracking
├── IMPLEMENTATION_TRACKS.md # Detailed tracks for all 30 papers
│
├── 01_complexity_dynamics.ipynb # Entropy & complexity
├── 02_char_rnn_karpathy.ipynb # Vanilla RNN
├── 03_lstm_understanding.ipynb # LSTM gates
├── 04_rnn_regularization.ipynb # Dropout for RNNs
├── 05_neural_network_pruning.ipynb # Pruning & MDL
├── 06_pointer_networks.ipynb # Attention pointers
├── 07_alexnet_cnn.ipynb # CNNs & AlexNet
├── 08_seq2seq_for_sets.ipynb # Permutation-invariant sets
├── 09_gpipe.ipynb # Pipeline parallelism
├── 10_resnet_deep_residual.ipynb # Residual connections
├── 11_dilated_convolutions.ipynb # Multi-scale convolutions
├── 12_graph_neural_networks.ipynb # Message passing GNNs
├── 13_attention_is_all_you_need.ipynb # Transformer architecture
├── 14_bahdanau_attention.ipynb # Original attention
├── 15_identity_mappings_resnet.ipynb # Pre-activation ResNet
├── 16_relational_reasoning.ipynb # Relation networks
├── 17_variational_autoencoder.ipynb # VAE
├── 18_relational_rnn.ipynb # Relational RNN
├── 19_coffee_automaton.ipynb # Irreversibility deep dive
├── 20_neural_turing_machine.ipynb # External memory
├── 21_ctc_speech.ipynb # CTC loss
├── 22_scaling_laws.ipynb # Empirical scaling
├── 23_mdl_principle.ipynb # MDL & compression
├── 24_machine_super_intelligence.ipynb # Universal AI & AIXI
├── 25_kolmogorov_complexity.ipynb # K(x) & randomness
├── 26_cs231n_cnn_fundamentals.ipynb # Vision from first principles
├── 27_multi_token_prediction.ipynb # Multi-token prediction
├── 28_dense_passage_retrieval.ipynb # Dense retrieval
├── 29_rag.ipynb # RAG architecture
└── 30_lost_in_middle.ipynb # Long context analysis
```
All 30 papers implemented! (100% complete!) 🎉
- Character RNN (`02_char_rnn_karpathy.ipynb`) - Learn basic RNNs
- LSTM (`03_lstm_understanding.ipynb`) - Understand gating mechanisms
- CNNs (`07_alexnet_cnn.ipynb`) - Computer vision fundamentals
- ResNet (`10_resnet_deep_residual.ipynb`) - Skip connections
- VAE (`17_variational_autoencoder.ipynb`) - Generative models

- RNN Regularization (`04_rnn_regularization.ipynb`) - Better training
- Bahdanau Attention (`14_bahdanau_attention.ipynb`) - Attention basics
- Pointer Networks (`06_pointer_networks.ipynb`) - Attention as selection
- Seq2Seq for Sets (`08_seq2seq_for_sets.ipynb`) - Permutation invariance
- CS231n (`26_cs231n_cnn_fundamentals.ipynb`) - Complete vision pipeline (kNN → CNNs)
- GPipe (`09_gpipe.ipynb`) - Pipeline parallelism for large models
- Transformers (`13_attention_is_all_you_need.ipynb`) - Modern architecture
- Dilated Convolutions (`11_dilated_convolutions.ipynb`) - Receptive fields
- Scaling Laws (`22_scaling_laws.ipynb`) - Understanding scale

- Pre-activation ResNet (`15_identity_mappings_resnet.ipynb`) - Architecture details
- Graph Neural Networks (`12_graph_neural_networks.ipynb`) - Graph learning
- Relation Networks (`16_relational_reasoning.ipynb`) - Relational reasoning
- Neural Turing Machines (`20_neural_turing_machine.ipynb`) - External memory
- CTC Loss (`21_ctc_speech.ipynb`) - Speech recognition
- Dense Retrieval (`28_dense_passage_retrieval.ipynb`) - Semantic search
- RAG (`29_rag.ipynb`) - Retrieval-augmented generation
- Lost in the Middle (`30_lost_in_middle.ipynb`) - Long context analysis

- MDL Principle (`23_mdl_principle.ipynb`) - Model selection via compression
- Kolmogorov Complexity (`25_kolmogorov_complexity.ipynb`) - Randomness & information
- Complexity Dynamics (`01_complexity_dynamics.ipynb`) - Entropy & emergence
- Coffee Automaton (`19_coffee_automaton.ipynb`) - Deep dive into irreversibility
- RNN → LSTM: Gating solves vanishing gradients
- Plain Networks → ResNet: Skip connections enable depth
- RNN → Transformer: Attention enables parallelization
- Fixed vocab → Pointers: Output can reference input
- Attention: Differentiable selection mechanism
- Residual Connections: Gradient highways
- Gating: Learned information flow control
- External Memory: Separate storage from computation
- Scaling Laws: Performance predictably improves with scale
- Regularization: Dropout, weight decay, data augmentation
- Optimization: Gradient clipping, learning rate schedules
- Compute-Optimal: Balance model size and training data
- Information Theory: Compression, entropy, MDL
- Complexity: Kolmogorov complexity, power laws
- Generative Modeling: VAE, ELBO, latent spaces
- Memory: Differentiable data structures
These implementations deliberately avoid PyTorch/TensorFlow, for a few reasons:
- Deeper understanding: see what frameworks abstract away
- Educational clarity: no magic; every operation is explicit
- Core concepts: focus on algorithms, not framework APIs
- Transferable knowledge: the principles apply to any framework
Each notebook generates its own synthetic data, which gives you:
- Immediate execution: no dataset downloads required
- Controlled experiments: understand behavior on simple cases
- Concept focus: the data doesn't obscure the algorithm
- Rapid iteration: modify and re-run instantly
After understanding the core concepts, try:
- Scale up: Implement in PyTorch/JAX for real datasets
- Combine techniques: E.g., ResNet + Attention
- Modern variants:
  - RNN → GRU → Transformer
  - VAE → β-VAE → VQ-VAE
  - ResNet → ResNeXt → EfficientNet
- Applications: Apply to real problems
The Sutskever 30 points toward:
- Scaling (bigger models, more data)
- Efficiency (sparse models, quantization)
- Capabilities (reasoning, multi-modal)
- Understanding (interpretability, theory)
See IMPLEMENTATION_TRACKS.md for full citations and links
- Stanford CS231n: Convolutional Neural Networks
- Stanford CS224n: NLP with Deep Learning
- MIT 6.S191: Introduction to Deep Learning
These implementations are educational and can be improved! Consider:
- Adding more visualizations
- Implementing missing papers
- Improving explanations
- Finding bugs
- Adding comparisons with framework implementations
If you use these implementations in your work or teaching:
```bibtex
@misc{sutskever30implementations,
  title={Sutskever 30: Complete Implementation Suite},
  author={Paul "The Pageman" Pajo, pageman@gmail.com},
  year={2025},
  note={Educational implementations of Ilya Sutskever's recommended reading list, inspired by https://papercode.vercel.app/}
}
```

Educational use. See individual papers for original research citations.
- Ilya Sutskever: For curating this essential reading list
- Paper authors: For their foundational contributions
- Community: For making these ideas accessible
- ✅ Paper 4: RNN Regularization (variational dropout)
- ✅ Paper 5: Neural Network Pruning (MDL, 90%+ sparsity)
- ✅ Paper 7: AlexNet (CNNs from scratch)
- ✅ Paper 8: Seq2Seq for Sets (permutation invariance, attention pooling)
- ✅ Paper 9: GPipe (pipeline parallelism, micro-batching, re-materialization)
- ✅ Paper 19: The Coffee Automaton (deep dive into irreversibility, entropy, Landauer's principle)
- ✅ Paper 26: CS231n (complete vision pipeline: kNN → CNN, all in NumPy)
- ✅ Paper 11: Dilated Convolutions (multi-scale)
- ✅ Paper 12: Graph Neural Networks (message passing)
- ✅ Paper 14: Bahdanau Attention (original attention)
- ✅ Paper 15: Identity Mappings ResNet (pre-activation)
- ✅ Paper 16: Relational Reasoning (relation networks)
- ✅ Paper 18: Relational RNNs (relational memory + Section 11: manual backprop ~1100 lines)
- ✅ Paper 21: Deep Speech 2 (CTC loss)
- ✅ Paper 23: MDL Principle (compression, model selection, connects to Papers 5 & 25)
- ✅ Paper 24: Machine Super Intelligence (Universal AI, AIXI, Solomonoff induction, intelligence measures, recursive self-improvement)
- ✅ Paper 25: Kolmogorov Complexity (randomness, algorithmic probability, theoretical foundation)
- ✅ Paper 27: Multi-Token Prediction (2-3x sample efficiency)
- ✅ Paper 28: Dense Passage Retrieval (dual encoders)
- ✅ Paper 29: RAG (retrieval-augmented generation)
- ✅ Paper 30: Lost in the Middle (long context)
- ✅ Character RNN
- ✅ LSTM
- ✅ ResNet
- ✅ Simple VAE
- ✅ Dilated Convolutions
- ✅ Transformer
- ✅ Pointer Networks
- ✅ Graph Neural Networks
- ✅ Relation Networks
- ✅ Neural Turing Machine
- ✅ CTC Loss
- ✅ Dense Retrieval
- ✅ Full RAG system
- ⚠️ Large-scale experiments
- ⚠️ Hyperparameter optimization
"If you really learn all of these, you'll know 90% of what matters today." - Ilya Sutskever
Happy learning! 🚀