TinyTinyTPU: A 2×2 Systolic-Array TPU-Style Matrix-Multiply Unit Deployed on FPGA

A minimal 2×2 systolic-array TPU-style matrix-multiply unit, implemented in SystemVerilog and deployed on FPGA.

This project implements a complete TPU architecture including:

2×2 systolic array (4 processing elements)
Full post-MAC pipeline (accumulator, activation, normalization, quantization)
UART-based host interface
Multi-layer MLP inference capability
FPGA deployment on Basys3 (Xilinx Artix-7)

Resource Usage (Basys3 XC7A35T):

LUTs: ~1,000 (5% utilization)
Flip-Flops: ~1,000 (3% utilization)
DSP48E1: 8 slices
BRAM: ~10-15 blocks
Estimated Gate Count: ~25,000 gates

Project Overview
Quick Start
Simulation & Testing
FPGA Build & Deployment
Running Inference
Project Structure
Architecture Details
Open Source Tooling (Yosys/nextpnr)

TinyTinyTPU is an educational implementation of Google's TPU architecture, scaled down to a 2×2 systolic array. It demonstrates:

Systolic Array Architecture: Data flows horizontally (activations) and vertically (partial sums)
Diagonal Wavefront Weight Loading: Staggered weight capture for proper systolic timing
Full MLP Pipeline: Weight FIFO → MMU → Accumulator → Activation → Normalization → Quantization
Multi-Layer Inference: Supports sequential layer processing with double-buffered activations

This is a minimal, educational-scale TPU designed for:

Learning TPU architecture principles
Understanding systolic array dataflow
FPGA prototyping and experimentation
Small-scale ML inference (2×2 matrices)

For production workloads, scale up the array size (e.g., 256×256 like Google TPU v1).

For Simulation:

Verilator 5.022 or later
Python 3.8+
cocotb
GTKWave or Surfer (for waveform viewing)

For FPGA Build:

Xilinx Vivado 2020.1 or later (for Basys3)
OR Yosys + nextpnr (open source alternative, see Open Source Tooling)

For Running Inference:

Basys3 FPGA board
USB cable for programming
Python 3.8+ with pyserial

# Clone the repository
git clone <repository-url>
cd tinytinyTPU-co

# Set up simulation environment
cd sim
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

All simulation commands must be run from the sim/ directory:

cd sim

# Run all tests
make test

# Run all tests with waveform generation
make test WAVES=1

# Run specific module tests
make test_pe
make test_mmu
make test_mlp
make test_uart
make test_tpu_system

# Run with waveforms
make test_pe WAVES=1

| Test File | Module | Coverage |
|---|---|---|
| `test_pe.py` | Processing Element | Reset, MAC operations, weight capture |
| `test_mmu.py` | 2×2 Systolic Array | Weight loading, matrix multiply |
| `test_weight_fifo.py` | Weight FIFO | Push/pop, wraparound |
| `test_dual_weight_fifo.py` | Dual Weight FIFO | Column independence, skew timing |
| `test_accumulator.py` | Accumulator | Alignment, buffering, accumulate/overwrite modes |
| `test_activation_func.py` | Activation Function | ReLU positive/negative/zero cases |
| `test_normalizer.py` | Normalizer | Gain, bias, shift operations |
| `test_activation_pipeline.py` | Activation Pipeline | Full pipeline, saturation handling |
| `test_mlp_integration.py` | MLP Top | Multi-layer MLP inference |
| `test_uart_controller.py` | UART Controller | Command parsing, response generation |
| `test_tpu_system.py` | TPU Top | End-to-end system integration |

# List available waveforms
make waves

# Open specific waveform
make waves MODULE=pe
make waves MODULE=mmu
make waves MODULE=mlp_top

Basys3 Pinout:

UART RX (B18): Receives commands from PC
UART TX (A18): Sends responses to PC
Clock: 100 MHz (onboard oscillator)
Reset: Center button (BTNC, U18)
LEDs: Status display (see fpga/README.md for LED modes)

UART Settings:

Baud Rate: 115200
Data Bits: 8
Parity: None
Stop Bits: 1
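
The settings above fix the link timing. The following is a small arithmetic sketch, not taken from the RTL: it assumes the UART bit timer is derived from the 100 MHz board clock with an integer divider, and the function names are illustrative only.

```python
# Timing arithmetic for a 115200 baud, 8N1 link clocked from 100 MHz.
# Assumption: the bit timer is an integer clock divider; the real RTL may differ.
CLOCK_HZ = 100_000_000          # Basys3 onboard oscillator
BAUD = 115_200
BITS_PER_FRAME = 1 + 8 + 0 + 1  # start + data + parity (none) + stop


def clocks_per_bit(clock_hz: int = CLOCK_HZ, baud: int = BAUD) -> int:
    """Nearest integer number of clock cycles in one UART bit period."""
    return round(clock_hz / baud)


def baud_error_percent(clock_hz: int = CLOCK_HZ, baud: int = BAUD) -> float:
    """Relative error of the baud rate produced by the integer divider."""
    actual = clock_hz / clocks_per_bit(clock_hz, baud)
    return 100.0 * (actual - baud) / baud


def bytes_per_second(baud: int = BAUD) -> float:
    """Peak payload rate: each byte costs one 10-bit frame."""
    return baud / BITS_PER_FRAME


print(clocks_per_bit())      # 868
print(bytes_per_second())    # 11520.0
```

With a divider of 868 the baud error stays well under 0.01%, so an integer divider from the 100 MHz clock is comfortably within UART tolerance.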

The project includes a Python driver for communicating with the FPGA:

cd host

# Basic inference demo
python3 inference_demo.py

# Gesture recognition demo (requires trained model)
python3 gesture_demo.py

# Interactive test
python3 test_tpu_driver.py

The inference_demo.py script demonstrates:

Loading weights into the TPU
Loading input activations
Executing inference
Reading results

Example Usage:

from tpu_driver import TPUDriver

# Connect to FPGA (adjust port as needed)
tpu = TPUDriver('/dev/ttyUSB0')  # Linux
# tpu = TPUDriver('COM3')         # Windows

# Load 2×2 weight matrix
weights = [[1, 2], [3, 4]]
tpu.write_weights(weights)

# Load 2×2 activation matrix
activations = [[5, 6], [7, 8]]
tpu.write_activations(activations)

# Execute inference
tpu.execute()

# Read results
result = tpu.read_result()
print(f"Result: {result}")

The gesture_demo.py script implements a simple gesture classifier:

Trains a 2-layer MLP on mouse movement data
Classifies gestures as "Horizontal" or "Vertical"
Real-time inference on FPGA

Running the Demo:

cd host
python3 gesture_demo.py

Model Training:

cd model
python3 train.py
# Generates: gesture_model.json

The TPU uses a simple byte-based UART protocol:

Commands:

0x01: Write Weight (4 bytes: W00, W01, W10, W11)
0x02: Write Activation (4 bytes: A00, A01, A10, A11)
0x03: Execute (start inference)
0x04: Read Result (column 0; returns 4 bytes: acc0[31:0])
0x05: Read Result Column 1 (returns 4 bytes: acc1[31:0])
0x06: Read Status (returns 1 byte: state[3:0] | cycle_cnt[3:0])

See host/tpu_driver.py for full protocol implementation.
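
As a rough illustration of the command list above, the sketch below builds command frames and decodes replies in plain Python. Several details are assumptions, because the list does not specify them: payload bytes are treated as signed int8, the 32-bit accumulator is decoded as little-endian, and the status byte is assumed to carry `state` in the high nibble and `cycle_cnt` in the low nibble. The helper names are hypothetical and are not the API of `host/tpu_driver.py`.

```python
import struct

# Opcodes as listed in the protocol description.
CMD_WRITE_WEIGHT = 0x01
CMD_WRITE_ACT = 0x02
CMD_EXECUTE = 0x03
CMD_READ_RESULT = 0x04
CMD_READ_RESULT_COL1 = 0x05
CMD_READ_STATUS = 0x06


def matrix_frame(opcode: int, m) -> bytes:
    """Opcode byte followed by M00, M01, M10, M11.

    Assumption: each element is sent as a signed 8-bit value.
    struct.pack raises struct.error if a value is outside -128..127.
    """
    (m00, m01), (m10, m11) = m
    return bytes([opcode]) + struct.pack("4b", m00, m01, m10, m11)


def decode_acc(raw: bytes, byteorder: str = "little") -> int:
    """Decode a 4-byte accumulator reply as a signed 32-bit integer.

    Assumption: little-endian byte order; pass byteorder="big" otherwise.
    """
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    return int.from_bytes(raw, byteorder, signed=True)


def decode_status(status: int):
    """Split the status byte into (state, cycle_cnt).

    Assumption: state occupies bits 7:4 and cycle_cnt bits 3:0.
    """
    return (status >> 4) & 0xF, status & 0xF


print(matrix_frame(CMD_WRITE_WEIGHT, [[1, 2], [3, 4]]).hex())  # 0101020304
```

Check the byte order and nibble layout against `host/tpu_driver.py` before relying on these decoders.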

tinytinyTPU-co/
├── rtl/                          # SystemVerilog RTL source files
│   ├── pe.sv                     # Processing Element (MAC unit)
│   ├── mmu.sv                    # 2×2 Matrix Multiply Unit (systolic array)
│   ├── weight_fifo.sv            # Single-column weight FIFO
│   ├── dual_weight_fifo.sv       # Dual-column weight FIFO with skew
│   ├── accumulator.sv            # Top-level accumulator
│   ├── accumulator_align.sv      # Column alignment logic
│   ├── accumulator_mem.sv        # Double-buffered accumulator memory
│   ├── activation_func.sv        # ReLU/ReLU6 activation
│   ├── normalizer.sv             # Gain/bias/shift normalization
│   ├── loss_block.sv             # L1 loss computation
│   ├── activation_pipeline.sv    # Full post-accumulator pipeline
│   ├── unified_buffer.sv         # Ready/valid output FIFO
│   ├── mlp_top.sv                # Top-level MLP integration
│   ├── tpu_bridge.sv             # UART-to-MLP bridge
│   ├── uart_controller.sv        # UART command processor
│   ├── uart_rx.sv                # UART receiver
│   ├── uart_tx.sv                # UART transmitter
│   └── tpu_top.sv                # Complete TPU system
│
├── sim/                          # Simulation environment
│   ├── Makefile                  # Build and test automation
│   ├── requirements.txt          # Python dependencies
│   ├── tests/                    # cocotb Python testbenches
│   │   ├── test_pe.py
│   │   ├── test_mmu.py
│   │   ├── test_weight_fifo.py
│   │   ├── test_dual_weight_fifo.py
│   │   ├── test_accumulator.py
│   │   ├── test_activation_func.py
│   │   ├── test_normalizer.py
│   │   ├── test_activation_pipeline.py
│   │   ├── test_mlp_integration.py
│   │   ├── test_uart_controller.py
│   │   └── test_tpu_system.py
│   └── waves/                    # Generated VCD waveforms
│
├── fpga/                         # FPGA deployment files
│   ├── basys3_top.sv             # Top-level FPGA wrapper
│   ├── basys3.xdc                # Pin constraints
│   ├── build_vivado.tcl          # Automated build script
│   ├── basys3_top.bit            # Generated bitstream
│   └── README.md                 # FPGA-specific documentation
│
├── host/                         # Python host interface
│   ├── tpu_driver.py             # TPU communication driver
│   ├── tpu_compiler.py           # Model compilation utilities
│   ├── inference_demo.py         # Basic inference demo
│   ├── gesture_demo.py           # Gesture recognition demo
│   └── test_tpu_driver.py        # Driver unit tests
│
├── model/                        # ML model training
│   ├── train.py                  # Model training script
│   └── gesture_model.json        # Trained model (JSON format)
│
└── README.md                     # This file

PE00 -> PE01    Activations flow horizontally (right)
  |       |     
PE10 -> PE11    Partial sums flow vertically (down)
  |       |
acc0    acc1    Outputs to accumulator

Weight Loading (Diagonal Wavefront):

Cycle 0: W10 → col0, no capture
Cycle 1: W00 → col0 (capture), W11 → col1 (no capture)
Cycle 2: W01 → col1 (capture)
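
The three cycles listed above fit a simple rule: each column emits its weights bottom row first, and column c starts c cycles after column 0. The sketch below generates that schedule. The rule is inferred from the 2×2 listing and is only assumed to extend to larger arrays; it does not model when a PE captures a weight.

```python
def weight_schedule(n: int = 2):
    """Return {cycle: [(column, weight_label), ...]} for an n×n array.

    Assumed rule (inferred from the 2×2 case): weight W[row][col] is
    presented to column `col` at cycle col + (n - 1 - row).
    """
    schedule = {}
    for col in range(n):
        for row in range(n):
            cycle = col + (n - 1 - row)
            schedule.setdefault(cycle, []).append((col, f"W{row}{col}"))
    return {cycle: sorted(items) for cycle, items in sorted(schedule.items())}


for cycle, items in weight_schedule(2).items():
    print(cycle, items)
# 0 [(0, 'W10')]
# 1 [(0, 'W00'), (1, 'W11')]
# 2 [(1, 'W01')]
```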

Activation Flow:

Row 0: A00 → PE00 → PE01
Row 1: A10 → PE10 → PE11 (with 1-cycle skew)
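
To make the dataflow concrete, here is a cycle-level Python model of a weight-stationary array. It is a sketch under stated assumptions, not a transcription of `mmu.sv`: each PE is assumed to register its activation and partial sum for one cycle, row r of the input is delayed by r cycles, and column c therefore produces its result c cycles after column 0. `streams[r]` is the sequence of values fed into PE row r; the name is illustrative.

```python
def systolic_matmul(streams, weights):
    """Simulate an n×n weight-stationary systolic array.

    streams[r][k] is the k-th value fed into PE row r (before skew).
    Returns out[k][c] = sum over r of streams[r][k] * weights[r][c].
    """
    n = len(weights)
    steps = len(streams[0])
    act = [[0] * n for _ in range(n)]    # activation registers (flow right)
    psum = [[0] * n for _ in range(n)]   # partial-sum registers (flow down)
    out = [[0] * n for _ in range(steps)]

    for t in range(steps + 2 * (n - 1)):
        new_act = [[0] * n for _ in range(n)]
        new_psum = [[0] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                if c == 0:
                    k = t - r  # row r is skewed by r cycles
                    a_in = streams[r][k] if 0 <= k < steps else 0
                else:
                    a_in = act[r][c - 1]
                p_in = psum[r - 1][c] if r > 0 else 0
                new_act[r][c] = a_in
                new_psum[r][c] = p_in + a_in * weights[r][c]
        act, psum = new_act, new_psum

        # Column c finishes input k at the end of cycle k + (n - 1) + c.
        for c in range(n):
            k = t - (n - 1) - c
            if 0 <= k < steps:
                out[k][c] = psum[n - 1][c]
    return out


weights = [[1, 2], [3, 4]]
streams = [[5, 6], [7, 8]]   # row 0 receives 5 then 6; row 1 receives 7 then 8
print(systolic_matmul(streams, weights))  # [[26, 38], [30, 44]]
```

The per-column output delay in this model is the reason a column-alignment stage is needed before results can be stored as complete rows.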

Weight FIFO: Stores weights, outputs with column skew
MMU (Systolic Array): Matrix multiply-accumulate
Accumulator: Aligns columns, double-buffered storage
Activation Pipeline:
- Activation function (ReLU/ReLU6)
- Normalization (gain, bias, shift)
- Quantization (int8 with saturation)
Unified Buffer: Output FIFO with ready/valid handshaking
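
The post-accumulator stages can be sketched in a few lines of Python. The ordering inside the normalizer is an assumption made for illustration (multiply by gain, arithmetic right shift, then add bias), as is applying the ReLU6 clamp at an unscaled value of 6; check `normalizer.sv` and `activation_func.sv` for the actual behavior.

```python
def relu(x: int) -> int:
    return max(0, x)


def relu6(x: int) -> int:
    # Assumption: the clamp value 6 is applied without fixed-point scaling.
    return min(max(0, x), 6)


def normalize(x: int, gain: int, bias: int, shift: int) -> int:
    # Assumed ordering: gain multiply, arithmetic right shift, bias add.
    return ((x * gain) >> shift) + bias


def quantize_int8(x: int) -> int:
    """Saturate to the signed 8-bit range instead of wrapping."""
    return max(-128, min(127, x))


def post_mac(acc: int, gain: int = 1, bias: int = 0, shift: int = 0,
             use_relu6: bool = False) -> int:
    """Accumulator value -> activation -> normalization -> int8."""
    activated = relu6(acc) if use_relu6 else relu(acc)
    return quantize_int8(normalize(activated, gain, bias, shift))


print(post_mac(26, gain=3, bias=-2, shift=1))  # 37
print(post_mac(300))                           # 127 (saturated)
print(post_mac(-5))                            # 0 (ReLU)
```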

The MLP controller manages sequential layer processing:

State Machine:
IDLE → LOAD_WEIGHT → LOAD_ACT → COMPUTE → DRAIN → TRANSFER → NEXT_LAYER → WAIT_WEIGHTS → ...

Double Buffering: Activations ping-pong between buffers for layer-to-layer transfer
Weight Loading: Weights loaded per layer via UART
Pipeline Overlap: While layer N drains, layer N+1 weights can be loaded
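
The double-buffering idea can be illustrated in software with two activation buffers that swap roles after every layer. This sketch models only the ping-pong transfer. It assumes each layer computes activations × weights followed by ReLU, and it leaves out normalization, quantization, and the controller states; `run_layers` is a hypothetical name.

```python
def run_layers(x, layers):
    """Run square layers back to back using two ping-pong buffers.

    x: n×n input activations (list of rows).
    layers: list of n×n weight matrices, applied in order.
    """
    n = len(x)
    buffers = [[row[:] for row in x], [[0] * n for _ in range(n)]]
    sel = 0  # index of the buffer holding the current layer's input
    for w in layers:
        src, dst = buffers[sel], buffers[1 - sel]
        for i in range(n):
            for j in range(n):
                acc = sum(src[i][k] * w[k][j] for k in range(n))
                dst[i][j] = max(0, acc)  # ReLU only, for brevity
        sel = 1 - sel  # the output buffer becomes the next layer's input
    return buffers[sel]


x = [[1, 2], [3, 4]]
print(run_layers(x, [[[-1, 0], [0, 1]]]))                    # [[0, 2], [0, 4]]
print(run_layers(x, [[[1, 0], [0, 1]], [[1, -1], [0, 1]]]))  # [[1, 1], [3, 1]]
```

Because a layer never writes into the buffer it is reading from, no intermediate copy is needed between layers.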

Open Source Tooling (Yosys/nextpnr)

While Vivado is the standard toolchain for Xilinx FPGAs, open-source alternatives exist:

Yosys: Synthesis (RTL → netlist)
nextpnr: Place & Route (netlist → routed design, exported as FASM for bitstream generation)

Installation (Ubuntu/Debian):

# Install Yosys
sudo apt-get install yosys

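# Install nextpnr-xilinx (for Xilinx 7-series)
# Requires building from source. The -DARCH=xilinx target is provided by the
# nextpnr-xilinx fork; see its README for the chip database steps.
git clone --recursive https://github.com/gatecat/nextpnr-xilinx.git
cd nextpnr-xilinx
cmake . -DARCH=xilinx
make -j$(nproc)
sudo make install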

Installation (macOS):

brew install yosys
# nextpnr requires manual build

Building with Yosys/nextpnr

Step 1: Synthesis (Yosys)

cd fpga

# Create synthesis script
cat > synth.ys << 'EOF'
# Read RTL files
read_verilog -sv ../rtl/pe.sv
read_verilog -sv ../rtl/mmu.sv
read_verilog -sv ../rtl/weight_fifo.sv
read_verilog -sv ../rtl/dual_weight_fifo.sv
read_verilog -sv ../rtl/accumulator_align.sv
read_verilog -sv ../rtl/accumulator_mem.sv
read_verilog -sv ../rtl/accumulator.sv
read_verilog -sv ../rtl/activation_func.sv
read_verilog -sv ../rtl/normalizer.sv
read_verilog -sv ../rtl/loss_block.sv
read_verilog -sv ../rtl/activation_pipeline.sv
read_verilog -sv ../rtl/unified_buffer.sv
read_verilog -sv ../rtl/mlp_top.sv
read_verilog -sv ../rtl/uart_rx.sv
read_verilog -sv ../rtl/uart_tx.sv
read_verilog -sv ../rtl/uart_controller.sv
read_verilog -sv ../rtl/tpu_bridge.sv
read_verilog -sv ../rtl/tpu_top.sv
read_verilog -sv basys3_top.sv

# Set top module
hierarchy -top basys3_top

# Synthesize
synth_xilinx -top basys3_top -family xc7

# Write netlist
write_verilog basys3_top_synth.v
write_json basys3_top.json
EOF

# Run synthesis
yosys synth.ys

Step 2: Place & Route (nextpnr)

# Place and route; writes the routed design (JSON) and a FASM file
nextpnr-xilinx \
    --xdc basys3.xdc \
    --json basys3_top.json \
    --write basys3_top_routed.json \
    --fasm basys3_top.fasm

# Generate bitstream: converting the FASM file to a .bit file is a separate step
# that requires Xilinx tools or an open-source flow (for example Project X-Ray's
# fasm2frames and xc7frames2bit)

The project includes a TCL script for automated Vivado builds:

cd fpga

# Build bitstream (synthesis + implementation + bitgen)
vivado -mode batch -source build_vivado.tcl

# Expected build time: 5-10 minutes
# Output: basys3_top.bit

Build Script Details:

Creates Vivado project: vivado_project/tinytinyTPU_basys3
Synthesizes all RTL files from ../rtl/
Implements design with timing constraints
Generates bitstream: basys3_top.bit
Creates reports: utilization, timing, DRC

Resource Utilization (Post-Implementation):

Check vivado_project/tinytinyTPU_basys3.runs/impl_1/utilization_post_impl.rpt
Check vivado_project/tinytinyTPU_basys3.runs/impl_1/timing_summary_post_impl.rpt

Via Vivado Hardware Manager (GUI):

Connect Basys3 board via USB
Open Vivado
Open Hardware Manager
Auto-connect to target
Program with basys3_top.bit

Via Command Line:

vivado -mode tcl
open_hw_manager
connect_hw_server
open_hw_target
set_property PROGRAM.FILE {basys3_top.bit} [get_hw_devices xc7a35t_0]
program_hw_devices [get_hw_devices xc7a35t_0]

Via OpenOCD (Alternative):

# If using OpenOCD with a Digilent cable (config file names vary between OpenOCD releases)
openocd -f interface/ftdi/digilent_jtag_hs3.cfg -f target/xc7a35t.cfg
# Then load the bitstream with OpenOCD's pld commands, e.g. "pld load 0 basys3_top.bit"

Limitations & Considerations

Current Status:

Yosys synthesis works well for most SystemVerilog constructs
nextpnr supports Xilinx 7-series but may have timing/routing challenges
Bitstream generation (fasm2bit) may require Xilinx tools or open-source alternatives

Recommendations:

For development: Use Vivado for reliable builds
For open-source exploration: Use Yosys for synthesis, verify with Vivado
For production: Stick with Vivado until open-source toolchain matures

Future Work:

Create automated Yosys/nextpnr build script
Document fasm2bit conversion process
Benchmark open-source vs. Vivado results

Verilator Errors:

Ensure Verilator 5.022+ is installed
Check SystemVerilog syntax (use make lint)

Test Failures:

Run with WAVES=1 to generate waveforms for debugging
Check sim/test_output.log for detailed error messages

Synthesis Errors:

Check RTL files are in rtl/ directory
Verify SystemVerilog syntax (Vivado may be stricter than Verilator)

Timing Violations:

Check timing_summary_post_impl.rpt
May need to add pipeline stages or reduce clock frequency

Place & Route Failures:

Check utilization reports
Verify constraints in basys3.xdc

UART Not Working:

Verify the serial port: ls /dev/ttyUSB* (Linux) or Device Manager (Windows)
Check baud rate: 115200
Verify TX/RX pins in constraints file

LEDs Not Responding:

Check bitstream programmed correctly
Verify reset button (center button)
Check switch settings for LED modes (see fpga/README.md)

Contributions welcome! Areas for improvement:

Additional test coverage
Performance optimizations
Documentation improvements
Open-source toolchain support
Larger array sizes

Inspired by Google's TPU architecture (thank you Cliff and Richard for your time!)
The boys from the TinyTPU team!!
Edmund and the Yosys / Symbiotic EDA crew
Stanford FAF for the support, funding, and community!
Princeton ECE Dept for the Basys 3 to play around with :)