Talos: Hardware accelerator for deep convolutional neural networks

Original link: https://talos.wtf/

## The Talos Architecture: Overcoming FPGA Constraints

The development of the Talos architecture was largely shaped by the physical limits of the DE1-SoC FPGA. An initial attempt at a fully parallel CNN and maxpool implementation failed because it exceeded the FPGA's logic and routing capacity. This forced a strategic shift: the priority became *fitting* the design onto the chip rather than chasing raw speed.

The core solution was **time-multiplexing**: a single CNN module and a single maxpool module, cycled four times, once per convolution kernel, under the control of a finite state machine. This roughly halved the logic footprint relative to the parallel approach. A further optimization **fused the maxpool and fully connected layers**, eliminating intermediate data storage and further reducing resource usage.

Finally, storing the neuron weights in on-chip **ROM blocks** rather than distributed logic significantly reduced routing congestion and overall resource utilization. A "**priming**" mechanism was implemented to manage ROM latency and ensure that computations always operate on valid data.

These iterative adjustments, each driven by the FPGA's constraints, ultimately produced a functional and efficient Talos architecture, demonstrating that in hardware design, fitting the solution to the available resources is essential.


Original Article

Talos went through a series of architecture evolutions. Working with low-level digital design and FPGAs isn't just about getting the math right; it's about getting the math right within the hard physical constraints of the DE1-SoC. The FPGA has a fixed number of logic array blocks, a fixed amount of memory, and a fixed routing fabric. You can't negotiate with it. Every architectural tweak was forced by those limits.

The First Attempt: Brute Force

Our first attempt was a not-so-genius brute force: running all four cnn and maxpool instances, one for each kernel, simultaneously in parallel. Logically, this is the fastest possible approach. In practice, however, it blew up the DE1-SoC, consuming nearly 4× the available LABs on the chip and making the design too big for the fitter to physically route. We also initially had 10 instances of a neuron module, each with a massive port connecting directly to the maxpool outputs. The sheer width of that bus created severe routing congestion, and Quartus threw fitter errors before we even got to timing analysis. The design was simply too big to put on a chip as small as the Cyclone V.

The beauty of constraints is that they force you to think: about why something doesn't work, and whether the approach itself is wrong. In software, you can often brute-force your way through and worry about optimizing later. In hardware, however, if it doesn't fit, it doesn't ship.

The Pivot: Time vs Memory

Hardware forces you to choose: a design is either insanely fast or takes a whole lot of circuitry. The tradeoff between speed and area is fundamental, and if the design doesn't fit on the chip, no matter how fast it is, it's useless. Keeping the overall memory footprint in mind while squeezing every cycle out at the module level, we settled on a time-multiplexed architecture. Instead of four parallel instances, we use a single cnn module and a single maxpool module and run them consecutively four times, once for each kernel. This is the architecture Talos ships with.

This is handled by a finite state machine in the inference module that cycles through the following states:

Inference FSM — Time-Multiplexed Architecture

S_IDLE: cnn_en = 0, mp_en = 0, complete = 0; leaves on enable.
S_CLEAR: clear_accum ← 1, ker_sel ← 0.
S_CNN: cnn_en ← 1, kernel = ker_bus[ker_sel]; leaves on cnn_complete.
S_POOL: mp_en ← 1, pass_sel = ker_sel; leaves on mp_complete.
S_GAP: ker_sel ← ker_sel + 1, cnn_en ← 0, mp_en ← 0; loops back to S_CNN while ker_sel < 3, otherwise moves to S_DONE.
S_DONE: complete ← 1, neurons[0:9] → Q16.16.
State transitions for the time-multiplexed inference control

It starts by setting clear_accum high in S_CLEAR to reset all 10 neuron accumulators. Then, for each pass, the state first changes to S_CNN, which sets cnn_en high, starting the cnn module and running the convolution with the kernel selected by ker_sel. Once cnn_complete goes high, indicating that all kernel operations have finished, the state moves to S_POOL, where it sets mp_en high and runs the maxpool module for that pass. After mp_complete goes high, it hits S_GAP, increments ker_sel to move to the next kernel, resets the internal buses, and loops back to S_CNN. The neuron accumulators are never cleared between passes, so they keep accumulating across all four runs, which is exactly how the weighted sum across the 676 inputs of the fully connected layer is supposed to work. Once ker_sel hits 3, indicating all four kernels have been processed, and the final pass completes, the state goes to S_DONE and sets complete high.
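The effect of this control flow can be sketched behaviorally. This is a Python model of what the FSM computes, not the shipped Verilog; `convolve`, `maxpool`, and the `weights` layout are illustrative stand-ins for the hardware modules and ROMs.

```python
def run_inference(convolve, maxpool, weights, num_kernels=4, num_neurons=10):
    """Behavioral model of the time-multiplexed inference loop.

    weights[n][ker_sel][i] is a hypothetical layout: neuron n's weight for
    the i-th pooled value produced during the pass for kernel ker_sel.
    """
    # S_CLEAR: clear_accum resets all neuron accumulators exactly once
    neurons = [0] * num_neurons

    for ker_sel in range(num_kernels):
        # S_CNN then S_POOL: one kernel per pass through the shared modules
        pooled = maxpool(convolve(ker_sel))
        for n in range(num_neurons):
            for i, value in enumerate(pooled):
                neurons[n] += value * weights[n][ker_sel][i]
        # S_GAP: ker_sel increments; accumulators deliberately persist,
        # building the fully connected layer's weighted sum across passes

    return neurons  # S_DONE: complete goes high
```

The key property the model captures is that the accumulators survive the S_GAP transitions, so the four passes together compute the same weighted sum a fully parallel design would.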

This approach alone reduced the LAB (Logic Array Block) footprint to almost half of the initial design, showing that we were indeed on the right path.

RTL layout mapping on the FPGA fabric after time-multiplexing optimizations

Fusing Maxpool and the Fully Connected Layer

With the time-multiplexing architecture in place, we had one more key question to consider: how do the pooled results get from maxpool to the fully connected layer? The naive answer is to store them and feed the FC layer once all four passes are done. We did exactly this at first, but the design was still too big to fit on the Cyclone V.

In the end, it all came down to the math. Logically, the fully connected layer is just multiply-accumulate and there's no reason to hold onto the activations at all. So we sort of cheated our way (it worked!) and fused the two modules together. The moment maxpool computes a pooled value, it immediately multiplies it against all 10 neuron weights and accumulates directly into the neuron registers. No bus, no extra device resource usage.

Cutting Resource Usage Further: Weights in ROM

Even with a single CNN and maxpool instance, the fully connected layer weights were still a problem. While the design was able to synthesize within the LAB limit of the DE1-SoC, the fitter was unable to perform optimized routing due to design congestion. Storing all 676 weights per neuron as port arrays meant Quartus had to synthesize them into distributed logic, creating massive routing overhead and pushing device utilization a bit too high.

The fix was relatively simple: move every neuron's weights into M10K ROM blocks using Altera's altsyncram IP, initialized from .mif files at synthesis time. Each of the 10 neurons gets its own ROM, all sharing a single address bus. This single change dropped overall on-device resource utilization to roughly a third of what it was, turning a design that couldn't fit into one that routed cleanly with timing to spare.
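As an illustration of the initialization step, here is a sketch of how a weight ROM's .mif (Memory Initialization File) could be generated offline. This script and the assumption that weights are stored as 32-bit Q16.16 words are ours for illustration; it is not the project's actual tooling.

```python
def q16_16(value):
    """Convert a float to a 32-bit two's-complement Q16.16 word."""
    return int(round(value * (1 << 16))) & 0xFFFFFFFF

def write_mif(weights, width=32):
    """Emit one neuron's weights in the standard .mif text format."""
    lines = [
        f"DEPTH = {len(weights)};",
        f"WIDTH = {width};",
        "ADDRESS_RADIX = HEX;",
        "DATA_RADIX = HEX;",
        "CONTENT BEGIN",
    ]
    for addr, w in enumerate(weights):
        lines.append(f"    {addr:X} : {q16_16(w):08X};")
    lines.append("END;")
    return "\n".join(lines)
```

One file per neuron matches the one-ROM-per-neuron layout: ten .mif files, each indexed by the single shared address bus.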

Priming the Pipeline

One of the subtle complexities of hardware design is latency management. When our control logic requests a weight from the on-chip ROM, the data doesn't appear instantly. There is a one-cycle delay.

If the arithmetic unit tries to use the data immediately, it will calculate garbage. To solve this, we implemented a "priming" mechanism. The state machine issues the read address, waits (primes) for one cycle to allow the ROM to access the data, and only then enables the multiply-accumulate unit. This ensures that the math is always performed on valid data.
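The timing can be shown with a toy model: the ROM registers its address, so data lags the request by one cycle, and a single prime cycle realigns the MAC with valid data. This is a Python illustration of the mechanism, not the actual RTL.

```python
class SyncROM:
    """Synchronous ROM model: output reflects the PREVIOUS cycle's address."""

    def __init__(self, contents):
        self.contents = contents
        self._addr_reg = 0          # registered address: the latency source

    def clock(self, addr):
        """One rising edge: return data for the previously latched address."""
        data = self.contents[self._addr_reg]
        self._addr_reg = addr
        return data

def primed_mac(rom, inputs):
    """MAC loop with one prime cycle so every product uses valid data."""
    acc = 0
    last = len(inputs) - 1
    rom.clock(0)                    # prime: issue address 0, discard output
    for i in range(len(inputs)):
        # data arriving now corresponds to address i, issued last cycle;
        # meanwhile we issue the next address to keep the pipeline full
        weight = rom.clock(min(i + 1, last))
        acc += inputs[i] * weight
    return acc
```

Dropping the prime call in the model misaligns every input with its weight, which is exactly the "calculating garbage" failure mode described above.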

[Waveform: CLK, ADDR (A0, A1, A2), DATA_OUT (D(A0), D(A1), D(A2)), PRIME — data lags the address by one cycle; the prime state waits for data validity.]
Cycle-accurate waveform trace of the memory prime mechanism resolving ROM latency