Talos went through a series of architecture evolutions. Working with low-level digital design and FPGAs isn't just about getting the math right. It is about getting the math right within the hard physical constraints of the DE1-SoC. The FPGA has a fixed number of Logic Array Blocks (LABs), a fixed amount of memory, and a fixed routing fabric. You can't negotiate with it. Every architectural tweak was forced by those limits.
The First Attempt: Brute Force
Our first attempt was a not-so-genius brute force: running all four cnn and maxpool instances, one per kernel, simultaneously in parallel. Logically, this is the fastest possible approach. In practice, however, it blew past the DE1-SoC's capacity, consuming nearly 4× the available LABs on the chip and making the design too big for the fitter to physically route. We also initially had 10 instances of a neuron module, each with a massive port connecting directly to the maxpool outputs. The sheer width of that bus created severe routing congestion, and Quartus threw fitter errors before we even got to timing analysis. The design was simply too big to put on a chip as small as the Cyclone V.
The beauty of constraints is that they force you to think: about why something doesn't work, and whether the approach itself is wrong. In software, you can often brute force your way through and worry about optimizing later. In hardware, if it doesn't fit, it doesn't ship.
The Pivot: Time vs Memory
Hardware forces you to choose: a design is either insanely fast or it eats a whole lot of circuitry. This speed-versus-area tradeoff is fundamental, because no matter how fast a design is, if it doesn't fit on the chip, it's useless. Keeping the overall memory footprint in mind while squeezing every cycle out at the module level, we decided to use a time-multiplexed architecture. Instead of four parallel instances, we used only one cnn module and one maxpool module, and ran them consecutively four times, once for each kernel. This is the architecture Talos ships with.
This is handled by a finite state machine in the inference module that cycles through the following states:
Inference FSM — Time-Multiplexed Architecture
[Interactive FSM diagram: live readout of state (S_IDLE), ker_sel (0), and pass (1/4)]
It starts by driving clear_accum high in S_CLEAR to reset all 10 neuron accumulators. For each pass, the FSM then moves to S_CNN, which drives cnn_en high and runs the convolution with the kernel selected by ker_sel. Once cnn_complete goes high, indicating the convolution for that kernel is done, the state moves to S_POOL, drives mp_en high, and runs the maxpool module for that pass. After mp_complete goes high, the FSM hits S_GAP, increments ker_sel to select the next kernel, resets the internal buses, and loops back to S_CNN. The neuron accumulators are never cleared between passes, so they keep accumulating across all four runs, which is exactly how the weighted sum over the fully connected layer's 676 inputs is supposed to work. Once ker_sel hits 3, indicating all four kernels have been processed, and the final pass completes, the state moves to S_DONE and drives complete high.
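The control flow above can be sketched in SystemVerilog. The signal names follow the text, but the module boundary, state encodings, reset style, and widths are all assumptions, not the shipped implementation:

```verilog
// Hypothetical sketch of the inference FSM; signal names follow the text,
// but encodings, reset style, and port widths are illustrative assumptions.
module inference_fsm (
    input  logic       clk, rst, start,
    input  logic       cnn_complete, mp_complete,
    output logic [1:0] ker_sel,
    output logic       clear_accum, cnn_en, mp_en, complete
);
    typedef enum logic [2:0] {S_IDLE, S_CLEAR, S_CNN, S_POOL, S_GAP, S_DONE} state_t;
    state_t state;

    always_ff @(posedge clk or posedge rst) begin
        if (rst) begin
            state   <= S_IDLE;
            ker_sel <= 2'd0;
        end else begin
            case (state)
                S_IDLE:  if (start)        state <= S_CLEAR;
                S_CLEAR:                   state <= S_CNN;   // accumulators cleared
                S_CNN:   if (cnn_complete) state <= S_POOL;  // convolution for ker_sel
                S_POOL:  if (mp_complete)  state <= S_GAP;   // maxpool for this pass
                S_GAP: begin                                 // internal buses reset here
                    if (ker_sel == 2'd3) state <= S_DONE;    // all four passes done
                    else begin
                        ker_sel <= ker_sel + 2'd1;           // next kernel
                        state   <= S_CNN;
                    end
                end
                S_DONE:  ;                                   // hold until reset
                default: state <= S_IDLE;
            endcase
        end
    end

    assign clear_accum = (state == S_CLEAR);
    assign cnn_en      = (state == S_CNN);
    assign mp_en       = (state == S_POOL);
    assign complete    = (state == S_DONE);
endmodule
```

Note that clear_accum is asserted only in S_CLEAR, never in S_GAP, which is what lets the accumulators carry their running sums across all four kernel passes.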
This approach alone cut the LAB (Logic Array Block) footprint to almost half of the initial design, a strong sign that we were on the right path.
Fusing Maxpool and the Fully Connected Layer
With the time-multiplexing architecture in place, we had one more key question to answer: how do the pooled results get from maxpool to the fully connected layer? The naive answer is to store them and feed the FC layer once all four passes are done. We did exactly this at first, but the design was still too big to fit on the Cyclone V.
In the end, it all came down to the math. Logically, the fully connected layer is just multiply-accumulate, so there's no reason to hold onto the activations at all. So we sort of cheated our way through (it worked!) and fused the two modules together. The moment maxpool computes a pooled value, it is immediately multiplied by the corresponding weight in each of the 10 neurons and accumulated directly into the neuron registers. No intermediate buffer, no wide bus, no extra device resources.
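In SystemVerilog, the fused update reduces to a single clocked process. This is a minimal sketch under assumed names and widths (accum, pool_valid, weight_q, and the bit widths are all illustrative, not the actual design):

```verilog
// Hypothetical sketch of the fused maxpool/FC accumulation: the instant a
// pooled value is flagged valid, it is multiplied by the matching weight of
// every neuron and added into that neuron's running sum. No activation
// buffer ever exists. All names and widths are illustrative assumptions.
logic signed [31:0] accum        [10];  // one running sum per output neuron
logic signed [15:0] pooled_value;       // current maxpool result
logic signed [7:0]  weight_q     [10];  // this input's weight, one per neuron
logic               pool_valid, clear_accum;

always_ff @(posedge clk) begin
    if (clear_accum) begin
        for (int n = 0; n < 10; n++) accum[n] <= '0;  // once, in S_CLEAR
    end else if (pool_valid) begin
        for (int n = 0; n < 10; n++)
            accum[n] <= accum[n] + pooled_value * weight_q[n];
    end
end
```

Because the accumulators persist across all four kernel passes, each of the 676 pooled inputs contributes to every neuron exactly once, and the full dot product falls out for free.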
Cutting Resource Usage Further: Weights in ROM
Even with a single CNN and maxpool instance, the fully connected layer weights were still a problem. The design synthesized within the LAB limit of the DE1-SoC, but the fitter was unable to perform optimized routing due to congestion. Storing all 676 weights per neuron as port arrays meant Quartus had to synthesize them into distributed logic, creating massive routing overhead and pushing device utilization too high.
The fix was relatively simple: move every neuron's weights into M10K ROM blocks using Altera's altsyncram IP, initialized from .mif files at synthesis time. Each of the 10 neurons gets its own ROM, all sharing a single address bus. This single change dropped overall on-device resource utilization to roughly a third of what it was, turning a design that couldn't fit into one that routed cleanly with timing to spare.
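A per-neuron ROM instantiation looks roughly like the following. The altsyncram ports and parameters shown are real, but the specific widths, depth, and .mif file name here are assumptions for illustration:

```verilog
// Hypothetical altsyncram instantiation for one neuron's weight ROM,
// initialized from a .mif at synthesis time. Width, depth, and file
// name are illustrative assumptions, not the actual design values.
altsyncram #(
    .operation_mode ("ROM"),
    .width_a        (8),              // weight width (assumed)
    .widthad_a      (10),             // 2^10 = 1024 >= 676 addresses
    .numwords_a     (676),
    .outdata_reg_a  ("UNREGISTERED"), // one-cycle latency: address is registered
    .init_file      ("neuron0_weights.mif")
) neuron0_rom (
    .clock0    (clk),
    .address_a (weight_addr),         // single address bus shared by all 10 ROMs
    .q_a       (weight_q[0])
);
```

Ten of these, one per neuron, all driven by the same weight_addr, map the weights onto M10K blocks instead of LAB logic, which is what freed up the fabric for routing.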
Priming the Pipeline
One of the subtle complexities of hardware design is latency management. When our control logic requests a weight from the on-chip ROM, the data doesn't appear instantly. There is a one-cycle delay.
If the arithmetic unit tries to use the data immediately, it will calculate garbage. To solve this, we implemented a "priming" mechanism. The state machine issues the read address, waits (primes) for one cycle to allow the ROM to access the data, and only then enables the multiply-accumulate unit. This ensures that the math is always performed on valid data.
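The priming sequence can be sketched as a three-state read loop. The state and signal names here are assumptions; only the one-cycle-wait idea comes from the design:

```verilog
// Hypothetical sketch of the priming handshake: issue the ROM address,
// burn one cycle while the M10K registers it, then let the MAC consume
// valid data. State and signal names are illustrative assumptions.
always_ff @(posedge clk) begin
    mac_en <= 1'b0;                     // pulse mac_en for one cycle at a time
    case (rd_state)
        R_ADDR: begin
            weight_addr <= next_addr;   // present read address to all ROMs
            rd_state    <= R_PRIME;
        end
        R_PRIME: rd_state <= R_MAC;     // one-cycle wait: ROM output not valid yet
        R_MAC: begin
            mac_en   <= 1'b1;           // weight_q now holds valid data
            rd_state <= R_ADDR;         // move on to the next address
        end
    endcase
end
```

A fully pipelined version could overlap the next address issue with the current multiply-accumulate, but the explicit prime state keeps the control logic easy to reason about.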