A minimal 2×2 systolic-array TPU-style matrix-multiply unit, implemented in SystemVerilog and deployed on FPGA.
This project implements a complete TPU architecture including:
- 2×2 systolic array (4 processing elements)
- Full post-MAC pipeline (accumulator, activation, normalization, quantization)
- UART-based host interface
- Multi-layer MLP inference capability
- FPGA deployment on Basys3 (Xilinx Artix-7)
Resource Usage (Basys3 XC7A35T):

- LUTs: ~1,000 (5% utilization)
- Flip-Flops: ~1,000 (3% utilization)
- DSP48E1: 8 slices
- BRAM: ~10-15 blocks
- Estimated Gate Count: ~25,000 gates
Contents:
- Project Overview
- Quick Start
- Simulation & Testing
- FPGA Build & Deployment
- Running Inference
- Project Structure
- Architecture Details
- Open Source Tooling (Yosys/nextpnr)
TinyTinyTPU is an educational implementation of Google's TPU architecture, scaled down to a 2×2 systolic array. It demonstrates:
- Systolic Array Architecture: Data flows horizontally (activations) and vertically (partial sums)
- Diagonal Wavefront Weight Loading: Staggered weight capture for proper systolic timing
- Full MLP Pipeline: Weight FIFO → MMU → Accumulator → Activation → Normalization → Quantization
- Multi-Layer Inference: Supports sequential layer processing with double-buffered activations
This is a minimal, educational-scale TPU designed for:
- Learning TPU architecture principles
- Understanding systolic array dataflow
- FPGA prototyping and experimentation
- Small-scale ML inference (2×2 matrices)
For production workloads, scale up the array size (e.g., 256×256 like Google TPU v1).
For Simulation:
- Verilator 5.022 or later
- Python 3.8+
- cocotb
- GTKWave or Surfer (for waveform viewing)
For FPGA Build:
- Xilinx Vivado 2020.1 or later (for Basys3)
- OR Yosys + nextpnr (open source alternative, see Open Source Tooling)
For Running Inference:
- Basys3 FPGA board
- USB cable for programming
- Python 3.8+ with pyserial
# Clone the repository
git clone <repository-url>
cd tinytinyTPU-co
# Set up simulation environment
cd sim
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
All simulation commands must be run from the sim/ directory:
cd sim
# Run all tests
make test
# Run all tests with waveform generation
make test WAVES=1
# Run specific module tests
make test_pe
make test_mmu
make test_mlp
make test_uart
make test_tpu_system
# Run with waveforms
make test_pe WAVES=1
| Test File | Module | Coverage |
|---|---|---|
| test_pe.py | Processing Element | Reset, MAC operations, weight capture |
| test_mmu.py | 2×2 Systolic Array | Weight loading, matrix multiply |
| test_weight_fifo.py | Weight FIFO | Push/pop, wraparound |
| test_dual_weight_fifo.py | Dual Weight FIFO | Column independence, skew timing |
| test_accumulator.py | Accumulator | Alignment, buffering, accumulate/overwrite modes |
| test_activation_func.py | Activation Function | ReLU positive/negative/zero cases |
| test_normalizer.py | Normalizer | Gain, bias, shift operations |
| test_activation_pipeline.py | Activation Pipeline | Full pipeline, saturation handling |
| test_mlp_integration.py | MLP Top | Multi-layer MLP inference |
| test_uart_controller.py | UART Controller | Command parsing, response generation |
| test_tpu_system.py | TPU Top | End-to-end system integration |
# List available waveforms
make waves
# Open specific waveform
make waves MODULE=pe
make waves MODULE=mmu
make waves MODULE=mlp_top
Basys3 Pinout:
- UART RX (B18): Receives commands from PC
- UART TX (A18): Sends responses to PC
- Clock: 100 MHz (onboard oscillator)
- Reset: Center button (BTNC, U18)
- LEDs: Status display (see fpga/README.md for LED modes)
UART Settings:
- Baud Rate: 115200
- Data Bits: 8
- Parity: None
- Stop Bits: 1
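The settings above map directly onto pyserial's constructor arguments. The following is a minimal sketch, not the project's driver: the names `UART_SETTINGS` and `open_tpu_port` are hypothetical, and the 1-second read timeout is an assumption.

```python
# Serial settings taken from the UART Settings list (115200 baud, 8N1).
UART_SETTINGS = {
    "baudrate": 115200,
    "bytesize": 8,   # same value as serial.EIGHTBITS
    "parity": "N",   # same value as serial.PARITY_NONE
    "stopbits": 1,   # same value as serial.STOPBITS_ONE
}


def open_tpu_port(port, timeout=1.0):
    """Open the board's serial port, e.g. '/dev/ttyUSB0' or 'COM3'.

    pyserial is imported lazily so the settings can be inspected
    on a machine that does not have it installed.
    """
    import serial  # provided by the pyserial package

    return serial.Serial(port, timeout=timeout, **UART_SETTINGS)
```

Calling `open_tpu_port('/dev/ttyUSB0')` returns a `serial.Serial` object whose `write()` and `read()` methods can carry the byte-level commands described in the protocol section.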
The project includes a Python driver for communicating with the FPGA:
cd host
# Basic inference demo
python3 inference_demo.py
# Gesture recognition demo (requires trained model)
python3 gesture_demo.py
# Interactive test
python3 test_tpu_driver.py
The inference_demo.py script demonstrates:
- Loading weights into the TPU
- Loading input activations
- Executing inference
- Reading results
Example Usage:
from tpu_driver import TPUDriver
# Connect to FPGA (adjust port as needed)
tpu = TPUDriver('/dev/ttyUSB0') # Linux
# tpu = TPUDriver('COM3') # Windows
# Load 2×2 weight matrix
weights = [[1, 2], [3, 4]]
tpu.write_weights(weights)
# Load 2×2 activation matrix
activations = [[5, 6], [7, 8]]
tpu.write_activations(activations)
# Execute inference
tpu.execute()
# Read results
result = tpu.read_result()
print(f"Result: {result}")
The gesture_demo.py script implements a simple gesture classifier:
- Trains a 2-layer MLP on mouse movement data
- Classifies gestures as "Horizontal" or "Vertical"
- Real-time inference on FPGA
Running the Demo:
cd host
python3 gesture_demo.py
Model Training:
cd model
python3 train.py
# Generates: gesture_model.json
The TPU uses a simple byte-based UART protocol:
Commands:
- 0x01: Write Weight (4 bytes: W00, W01, W10, W11)
- 0x02: Write Activation (4 bytes: A00, A01, A10, A11)
- 0x03: Execute (start inference)
- 0x04: Read Result (returns 4 bytes: acc0[31:0])
- 0x05: Read Result Column 1 (returns 4 bytes: acc1[31:0])
- 0x06: Read Status (returns 1 byte: state[3:0] | cycle_cnt[3:0])
See host/tpu_driver.py for full protocol implementation.
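To make the framing concrete, here is a minimal sketch of how the commands above could be encoded and decoded on the host. It is not taken from host/tpu_driver.py. The function names are hypothetical, and three details are assumptions the command list does not state: payload values are signed int8, the 32-bit accumulator arrives least-significant byte first, and the status byte carries `state` in its high nibble.

```python
import struct

# Opcodes as given in the command list.
CMD_WRITE_WEIGHT = 0x01
CMD_WRITE_ACT = 0x02
CMD_EXECUTE = 0x03
CMD_READ_RESULT = 0x04
CMD_READ_RESULT_COL1 = 0x05
CMD_READ_STATUS = 0x06


def encode_matrix_cmd(opcode, matrix):
    """Build opcode + 4 payload bytes in row-major order (M00, M01, M10, M11)."""
    flat = [value for row in matrix for value in row]
    if len(flat) != 4:
        raise ValueError("expected a 2x2 matrix")
    # Assumption: elements are signed int8, sent as two's-complement bytes.
    if any(not -128 <= value <= 127 for value in flat):
        raise ValueError("values must fit in int8")
    return bytes([opcode] + [value & 0xFF for value in flat])


def decode_result(payload):
    """Decode a 4-byte accumulator reply.

    Assumption: signed 32-bit value, least-significant byte first.
    """
    (value,) = struct.unpack("<i", payload)
    return value


def decode_status(byte):
    """Split the status byte into its two 4-bit fields.

    Assumption: state occupies the high nibble, cycle_cnt the low nibble.
    """
    return {"state": (byte >> 4) & 0xF, "cycle_cnt": byte & 0xF}
```

For example, `encode_matrix_cmd(CMD_WRITE_WEIGHT, [[1, 2], [3, 4]])` yields the five bytes `01 01 02 03 04`. If the hardware uses a different byte order or nibble layout, only the two decode helpers need to change.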
tinytinyTPU-co/
├── rtl/ # SystemVerilog RTL source files
│ ├── pe.sv # Processing Element (MAC unit)
│ ├── mmu.sv # 2×2 Matrix Multiply Unit (systolic array)
│ ├── weight_fifo.sv # Single-column weight FIFO
│ ├── dual_weight_fifo.sv # Dual-column weight FIFO with skew
│ ├── accumulator.sv # Top-level accumulator
│ ├── accumulator_align.sv # Column alignment logic
│ ├── accumulator_mem.sv # Double-buffered accumulator memory
│ ├── activation_func.sv # ReLU/ReLU6 activation
│ ├── normalizer.sv # Gain/bias/shift normalization
│ ├── loss_block.sv # L1 loss computation
│ ├── activation_pipeline.sv # Full post-accumulator pipeline
│ ├── unified_buffer.sv # Ready/valid output FIFO
│ ├── mlp_top.sv # Top-level MLP integration
│ ├── tpu_bridge.sv # UART-to-MLP bridge
│ ├── uart_controller.sv # UART command processor
│ ├── uart_rx.sv # UART receiver
│ ├── uart_tx.sv # UART transmitter
│ └── tpu_top.sv # Complete TPU system
│
├── sim/ # Simulation environment
│ ├── Makefile # Build and test automation
│ ├── requirements.txt # Python dependencies
│ ├── tests/ # cocotb Python testbenches
│ │ ├── test_pe.py
│ │ ├── test_mmu.py
│ │ ├── test_weight_fifo.py
│ │ ├── test_dual_weight_fifo.py
│ │ ├── test_accumulator.py
│ │ ├── test_activation_func.py
│ │ ├── test_normalizer.py
│ │ ├── test_activation_pipeline.py
│ │ ├── test_mlp_integration.py
│ │ ├── test_uart_controller.py
│ │ └── test_tpu_system.py
│ └── waves/ # Generated VCD waveforms
│
├── fpga/ # FPGA deployment files
│ ├── basys3_top.sv # Top-level FPGA wrapper
│ ├── basys3.xdc # Pin constraints
│ ├── build_vivado.tcl # Automated build script
│ ├── basys3_top.bit # Generated bitstream
│ └── README.md # FPGA-specific documentation
│
├── host/ # Python host interface
│ ├── tpu_driver.py # TPU communication driver
│ ├── tpu_compiler.py # Model compilation utilities
│ ├── inference_demo.py # Basic inference demo
│ ├── gesture_demo.py # Gesture recognition demo
│ └── test_tpu_driver.py # Driver unit tests
│
├── model/ # ML model training
│ ├── train.py # Model training script
│ └── gesture_model.json # Trained model (JSON format)
│
└── README.md # This file
PE00 -> PE01 Activations flow horizontally (right)
| |
PE10 -> PE11 Partial sums flow vertically (down)
| |
acc0 acc1 Outputs to accumulator
Weight Loading (Diagonal Wavefront):
- Cycle 0: W10 → col0, no capture
- Cycle 1: W00 → col0 (capture), W11 → col1 (no capture)
- Cycle 2: W01 → col1 (capture)
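The three-cycle schedule above follows a simple rule: each column pushes its bottom-row weight first, column 1 starts one cycle after column 0, and the capture happens on a column's last push. The sketch below reproduces that schedule in software. The function name and tuple layout are hypothetical, and it models only the 2×2 case described here.

```python
def weight_load_schedule(weights):
    """Return {cycle: [(column, label, value, captured), ...]} for a 2x2 matrix.

    Column c starts c cycles late and emits rows bottom-up, so the
    top-row weight arrives last and is the push that triggers capture.
    """
    n = 2
    schedule = {}
    for col in range(n):
        for step in range(n):
            row = n - 1 - step          # bottom row is pushed first
            cycle = col + step          # column skew of one cycle per column
            captured = step == n - 1    # capture on the final push of the column
            entry = (col, f"W{row}{col}", weights[row][col], captured)
            schedule.setdefault(cycle, []).append(entry)
    return schedule


if __name__ == "__main__":
    for cycle, pushes in sorted(weight_load_schedule([[1, 2], [3, 4]]).items()):
        print(cycle, pushes)
```

Printing the schedule for a sample matrix makes it easy to compare against the weight FIFO signals in a waveform viewer.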
Activation Flow:
- Row 0: A00 → PE00 → PE01
- Row 1: A10 → PE10 → PE11 (with 1-cycle skew)
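A small cycle-level model can help show why the 1-cycle skew on row 1 is needed. The sketch below assumes each PE registers both its activation output and its partial-sum output every cycle, and that weights are already captured. It is an illustrative model, not a transcription of pe.sv or mmu.sv, and `systolic_2x2` and `row_streams` are hypothetical names.

```python
def systolic_2x2(weights, row_streams):
    """Simulate a 2x2 weight-stationary array cycle by cycle.

    row_streams[r] is the sequence of activations fed into array row r.
    Returns [col0_results, col1_results], where result t of column c is
    sum over r of row_streams[r][t] * weights[r][c].
    """
    n = 2
    steps = len(row_streams[0])
    act = [[0] * n for _ in range(n)]    # registered activation outputs
    psum = [[0] * n for _ in range(n)]   # registered partial-sum outputs
    outputs = [[] for _ in range(n)]

    for cycle in range(steps + 2 * (n - 1)):
        new_act = [[0] * n for _ in range(n)]
        new_psum = [[0] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                if c == 0:
                    t = cycle - r  # row r enters r cycles late (the skew)
                    a_in = row_streams[r][t] if 0 <= t < steps else 0
                else:
                    a_in = act[r][c - 1]  # activation from the PE to the left
                p_in = psum[r - 1][c] if r > 0 else 0  # partial sum from above
                new_act[r][c] = a_in
                new_psum[r][c] = p_in + a_in * weights[r][c]
        act, psum = new_act, new_psum

        # Column c finishes input t one cycle per row and one per column later.
        for c in range(n):
            t = cycle - (n - 1) - c
            if 0 <= t < steps:
                outputs[c].append(psum[n - 1][c])
    return outputs
```

In this model column 1 produces each result one cycle after column 0, which is the misalignment the accumulator's column alignment logic has to absorb. Removing the `cycle - r` skew makes row 1's activations arrive before the matching partial sums from row 0, and the sums come out wrong.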
- Weight FIFO: Stores weights, outputs with column skew
- MMU (Systolic Array): Matrix multiply-accumulate
- Accumulator: Aligns columns, double-buffered storage
- Activation Pipeline:
  - Activation function (ReLU/ReLU6)
  - Normalization (gain, bias, and shift)
  - Quantization (int8 with saturation)
- Unified Buffer: Output FIFO with ready/valid handshaking
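The post-accumulator stages can be modeled in a few lines of integer arithmetic. This is a sketch under stated assumptions rather than a description of the RTL: the order of operations inside the normalizer (multiply by gain, arithmetic right shift, then add bias) is assumed, ReLU6 is assumed to clamp the raw integer at 6, and all function names are hypothetical.

```python
def relu(x):
    """ReLU: negative values become zero."""
    return max(0, x)


def relu6(x):
    """ReLU6: clamp to the range [0, 6] (assumed to act on the raw integer)."""
    return min(max(0, x), 6)


def normalize(x, gain=1, bias=0, shift=0):
    """Assumed order: scale by gain, arithmetic right shift, then add bias."""
    return ((x * gain) >> shift) + bias


def quantize_int8(x):
    """Saturate to the signed 8-bit range instead of wrapping around."""
    return max(-128, min(127, x))


def post_mac(acc, gain=1, bias=0, shift=0, use_relu6=False):
    """Chain the stages: activation -> normalization -> int8 quantization."""
    activated = relu6(acc) if use_relu6 else relu(acc)
    return quantize_int8(normalize(activated, gain, bias, shift))
```

Saturation matters because a 32-bit accumulator value such as 300 would wrap to 44 if it were simply truncated to 8 bits, while `quantize_int8(300)` clamps it to 127.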
The MLP controller manages sequential layer processing:
State Machine:
IDLE → LOAD_WEIGHT → LOAD_ACT → COMPUTE → DRAIN → TRANSFER → NEXT_LAYER → WAIT_WEIGHTS → ...
- Double Buffering: Activations ping-pong between buffers for layer-to-layer transfer
- Weight Loading: Weights loaded per layer via UART
- Pipeline Overlap: While layer N drains, layer N+1 weights can be loaded
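The double-buffering idea can be illustrated in software: each layer reads from one buffer and writes into the other, then the roles swap. The sketch below is a host-side analogy under assumptions. `run_layers` and `dense_relu_int8` are hypothetical names, and the layer function stands in for the MMU and activation pipeline without modeling the controller's states.

```python
def dense_relu_int8(x, weights):
    """One layer: vector-matrix product, ReLU, then clamp at the int8 maximum."""
    cols = len(weights[0])
    sums = [sum(x[r] * weights[r][c] for r in range(len(x))) for c in range(cols)]
    return [max(0, min(127, s)) for s in sums]


def run_layers(x, layer_weights, layer_fn=dense_relu_int8):
    """Run layers in sequence using two ping-pong activation buffers."""
    buffers = [list(x), [0] * len(x)]
    src = 0  # index of the buffer holding the current layer's input
    for weights in layer_weights:
        dst = 1 - src
        buffers[dst][:] = layer_fn(buffers[src], weights)  # write the other buffer
        src = dst  # the freshly written buffer feeds the next layer
    return buffers[src]
```

Because a layer never overwrites the buffer it is reading from, the next layer's input is complete before it is consumed, which is the property the hardware's buffer swap preserves.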
While Vivado is the standard toolchain for Xilinx FPGAs, open-source alternatives exist:
- Yosys: Synthesis (RTL → netlist)
- nextpnr: Place & Route (netlist → bitstream)
Installation (Ubuntu/Debian):
# Install Yosys
sudo apt-get install yosys
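# Install nextpnr (for Xilinx 7-series)
# Requires building from source. The repository and CMake flags needed for
# 7-series support can differ between nextpnr versions, so check the nextpnr
# documentation before running the commands below.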
git clone https://github.com/YosysHQ/nextpnr.git
cd nextpnr
cmake . -DARCH=xilinx
make -j$(nproc)
sudo make install
Installation (macOS):
brew install yosys
# nextpnr requires manual build
Step 1: Synthesis (Yosys)
cd fpga
# Create synthesis script
cat > synth.ys << 'EOF'
# Read RTL files
read_verilog -sv ../rtl/pe.sv
read_verilog -sv ../rtl/mmu.sv
read_verilog -sv ../rtl/weight_fifo.sv
read_verilog -sv ../rtl/dual_weight_fifo.sv
read_verilog -sv ../rtl/accumulator_align.sv
read_verilog -sv ../rtl/accumulator_mem.sv
read_verilog -sv ../rtl/accumulator.sv
read_verilog -sv ../rtl/activation_func.sv
read_verilog -sv ../rtl/normalizer.sv
read_verilog -sv ../rtl/loss_block.sv
read_verilog -sv ../rtl/activation_pipeline.sv
read_verilog -sv ../rtl/unified_buffer.sv
read_verilog -sv ../rtl/mlp_top.sv
read_verilog -sv ../rtl/uart_rx.sv
read_verilog -sv ../rtl/uart_tx.sv
read_verilog -sv ../rtl/uart_controller.sv
read_verilog -sv ../rtl/tpu_bridge.sv
read_verilog -sv ../rtl/tpu_top.sv
read_verilog -sv basys3_top.sv
# Set top module
hierarchy -top basys3_top
# Synthesize
synth_xilinx -top basys3_top -family xc7
# Write netlist
write_verilog basys3_top_synth.v
write_json basys3_top.json
EOF
# Run synthesis
yosys synth.ys
Step 2: Place & Route (nextpnr)
# Place and route the netlist and write a FASM file
nextpnr-xilinx \
--xdc basys3.xdc \
--json basys3_top.json \
--write basys3_top_routed.json \
--fasm basys3_top.fasm
# Generate the bitstream from the FASM file
# Note: fasm2bit conversion may require Xilinx tools or open-source alternatives
The project includes a TCL script for automated Vivado builds:
cd fpga
# Build bitstream (synthesis + implementation + bitgen)
vivado -mode batch -source build_vivado.tcl
# Expected build time: 5-10 minutes
# Output: basys3_top.bit
Build Script Details:
- Creates Vivado project: vivado_project/tinytinyTPU_basys3
- Synthesizes all RTL files from ../rtl/
- Implements design with timing constraints
- Generates bitstream: basys3_top.bit
- Creates reports: utilization, timing, DRC
Resource Utilization (Post-Implementation):
- Check vivado_project/tinytinyTPU_basys3.runs/impl_1/utilization_post_impl.rpt
- Check vivado_project/tinytinyTPU_basys3.runs/impl_1/timing_summary_post_impl.rpt
Via Vivado Hardware Manager (GUI):
- Connect Basys3 board via USB
- Open Vivado
- Open Hardware Manager
- Auto-connect to target
- Program with basys3_top.bit
Via Command Line:
vivado -mode tcl
open_hw_manager
connect_hw_server
open_hw_target
set_property PROGRAM.FILE {basys3_top.bit} [get_hw_devices xc7a35t_0]
program_hw_devices [get_hw_devices xc7a35t_0]
Via OpenOCD (Alternative):
# If using OpenOCD with Digilent cable
openocd -f interface/ftdi/digilent_jtag_hs3.cfg -f target/xc7a35t.cfg
# Then load the bitstream with OpenOCD or another JTAG programming tool
Current Status:
- Yosys synthesis works well for most SystemVerilog constructs
- nextpnr supports Xilinx 7-series but may have timing/routing challenges
- Bitstream generation (fasm2bit) may require Xilinx tools or open-source alternatives
Recommendations:
- For development: Use Vivado for reliable builds
- For open-source exploration: Use Yosys for synthesis, verify with Vivado
- For production: Stick with Vivado until open-source toolchain matures
Future Work:
- Create automated Yosys/nextpnr build script
- Document fasm2bit conversion process
- Benchmark open-source vs. Vivado results
Verilator Errors:
- Ensure Verilator 5.022+ is installed
- Check SystemVerilog syntax (use make lint)
Test Failures:
- Run with WAVES=1 to generate waveforms for debugging
- Check sim/test_output.log for detailed error messages
Synthesis Errors:
- Check that the RTL files are in the rtl/ directory
- Verify SystemVerilog syntax (Vivado may be stricter than Verilator)
Timing Violations:
- Check timing_summary_post_impl.rpt
- May need to add pipeline stages or reduce clock frequency
Place & Route Failures:
- Check utilization reports
- Verify constraints in basys3.xdc
UART Not Working:
- Verify COM port: ls /dev/ttyUSB* (Linux) or Device Manager (Windows)
- Check baud rate: 115200
- Verify TX/RX pins in constraints file
LEDs Not Responding:
- Check bitstream programmed correctly
- Verify reset button (center button)
- Check switch settings for LED modes (see fpga/README.md)
Contributions welcome! Areas for improvement:
- Additional test coverage
- Performance optimizations
- Documentation improvements
- Open-source toolchain support
- Larger array sizes
- Inspired by Google's TPU architecture (thank you Cliff and Richard for your time!)
- The boys from the TinyTPU team!!
- Edmund and the Yosys / Symbiotic EDA crew
- Stanford FAF for the support, funding, and community!
- Princeton ECE Dept for the Basys 3 to play around with :)