(comments)

Original link: https://news.ycombinator.com/item?id=41391822

In this thread, users describe experiments with PIO and DMA on the Raspberry Pi Pico, a compact microcontroller board. They highlight the benefits of these techniques for Fast Ethernet communication, noting that packets can be handled at line speed without burdening the main CPU. One commenter points out that the Pico is built around the RP2040 system-on-chip (SoC), which is where the PIO feature lives. On the receive side, a per-packet interrupt is used to finalize each received packet, a pattern that has historically made systems slower at handling small voice packets at line speed than large, MTU-sized file-transfer packets. One user wishes for a pair of DMA sniffers so the checksum could be computed during the transfer itself, reducing the processing power consumed; another notes this can currently be done with PIO, though there is interest in making the process more efficient. The thread discusses potential applications of the Pico's capabilities, such as acting as an Ethernet controller for another, less capable microcontroller, and combining multiple Picos for cooperative purposes. It also explores using these capabilities to build a lightweight HTTP server or a logic analyzer. Finally, users share findings on transfer rate versus system clock speed: throughput appears to rise non-linearly with clock frequency, from 1.38 Mbit/s at 100 MHz to 65.4 Mbit/s at 200 MHz. They expect the new RP2350 SoC to improve performance.

Related articles

Original text


I just started playing around with PIO and DMA on a Pico, and it’s really fun just how much you can do on the chip without invoking the main CPU. For context, PIO is a mini-language you can program at the edge of the chip that can directly respond to and write to external IO. DMA allows you to tell the chip to send a signal based on data in memory, and can be programmed to loop or interrupt to limit re-invoking. The linked repo uses these heavily for its fast Ethernet communication.



"the Pico includes an RP2040 which is where the PIO runs" to me sounds like it implies either

- The original Pico was not built around the RP2040 as its central part ("includes" sounds to me like it was an addition)

- The Pico 2 includes an RP2040 (in addition to the RP2350) which runs PIO

Neither of which are true. I'm guessing some other people had a similar reaction.



> receive side uses a per-packet interrupt to finalize a received packet

This has kept much faster systems from processing packets at line speed. A classic example: standard Gigabit network cards and contemporary CPUs were not able to process VoIP packets (which are tiny) at line speed, while they could easily download files (which are basically MTU-sized packets) at line speed.



Fortunately, the receive ISR isn't cracking packets, just calculating a checksum and passing the packet on to lwIP. I wish there were two DMA sniffers, so that the checksum could be calculated by the DMA engine(s), as that's where a lot of processor time is spent (even with a table-driven CRC routine).
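For a sense of the per-byte work that a software checksum implies, here is a minimal table-driven CRC-32 (the polynomial used by Ethernet's FCS) in Python; the repo's actual routine is in C and may be structured differently.

```python
# Minimal table-driven CRC-32 (the reflected polynomial used by
# Ethernet's frame check sequence). A sketch of the per-byte work a
# software checksum routine performs; the linked repo's actual
# implementation may differ.

POLY = 0xEDB88320  # reflected CRC-32 polynomial

# Precompute the 256-entry lookup table once.
TABLE = []
for byte in range(256):
    crc = byte
    for _ in range(8):
        crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
    TABLE.append(crc)

def crc32(data: bytes) -> int:
    """One table lookup plus a shift and an XOR per input byte."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF
```

Even in this best case, every payload byte costs a lookup, a shift, and an XOR on the CPU; a DMA sniffer does the equivalent in hardware as the data streams past.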



You can do it using PIO. I did that to emulate a Memory Stick slave on the RP2040: one PIO state machine plus two DMA channels with chained descriptors. The XOR is achieved by bouncing data through any IO register you don't need, written via its atomic-XOR alias (the datasheet maps this at a 0x1000 offset from the register's address).
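For reference, the RP2040 datasheet ("Atomic Register Access") places each peripheral register's atomic aliases at fixed address offsets; a small sketch of the layout:

```python
# RP2040 atomic register-access alias offsets, per the datasheet's
# "Atomic Register Access" section. A write to an alias performs a
# hardware read-modify-write, which is what lets a DMA channel XOR
# data through a spare IO register without the CPU touching it.

ALIAS_XOR = 0x1000   # write -> reg ^= wdata
ALIAS_SET = 0x2000   # write -> reg |= wdata
ALIAS_CLR = 0x3000   # write -> reg &= ~wdata

def xor_alias(reg_addr: int) -> int:
    """Address to target for an atomic XOR write to reg_addr."""
    return reg_addr + ALIAS_XOR
```

Note that these aliases cover the APB peripherals; the SIO block instead exposes its own dedicated SET/CLR/XOR registers.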



Luckily the RP2040 has a dual-core CPU, so one core can be dedicated entirely to receiving the interrupts, passing them to user code on the other core via a FIFO or whatever else you fancy.



Why is the transfer rate non-linear with respect to the system clock? At 100 MHz the rate is 1.38 Mbit/s, and at 200 MHz it is 65.4 Mbit/s.



I expect the RP2350 to perform much better in this scenario! At the minimum, one of the DMA channels should be eliminated, and I'm hoping the CRC calculation will get faster.



I see some examples showing this can be used as a lightweight HTTP daemon.

Is there enough room to have it control the ethernet port for another weaker or perhaps more powerful microcontroller?

Can you combine multiple picos with one being the ethernet stack and another that modifies certain packets?

Are there any other interesting things that can be done?



> Is there enough room to have it control the ethernet port for another weaker or perhaps more powerful microcontroller?

Well, there is a whole unused core and plenty of built-in SRAM. Seems like a good way to get an open-source version of the Wiznet chips [1]. It could support full protocol offloading like Wiznet's, or act as a lower-level raw packet sender/receiver like the ENC424J600.

[1] https://docs.wiznet.io/Product/iEthernet



I just quickly tried to fit the whole RP2040 + Ethernet PHY into the WIZ850io form factor (mainly because I've already used that module in some projects), and I haven't yet been able to make it fit without using the more expensive JLCPCB features like buried vias. It would be very cool to have, though, since the W5500 really needs an update.



I'm unable to respond to your deeper comment, but I don't see any issue at all with this. Your concern about the vias doesn't make sense, as you can just tent the vias anywhere you're worried about shorts. I'm 100% certain you can fit both chips, all passives, etc. in this form factor. If the flash size is a concern, the RP2350 (the successor to the RP2040) has integrated flash in some of its packages. Or just use a chip-scale (or similar) flash instead of the one normally used in RP2040 designs.



A 4-layer board in that form factor should be pretty doable with no fancy features like blind vias. The RP2040 and W5500 are the same size, and Ethernet PHYs can be found in roughly 3x3 mm packages or even smaller. There should be about 20x25 mm of usable space in that module form factor (even conservatively, say 18x23 mm).

I don't have the time to give it a shot myself, but I could try to help if needed.



The issue is more the space needed by all the passives, the crystal, and the massive flash chip. I can just about make it fit, but then the PHY needs some vias to the center pad for ground, and that's always right where my Ethernet jack sits on the other side of the board.



Make a package that has an RP2350 mounted on a microSD card and you've got a NAS that nobody will ever find.

Back when I was doing a dumb-server/smart-client desktop environment, something like this would have been pretty cool. It needed a tiny API to save files, but the bulk of the environment worked as a static server.



This all already exists: the Raspberry Pi Zero 2 W. The board is slightly bigger than a Pico but has a full-blown Linux system, a 4-core arm64 CPU, 512 MB of RAM, an SD card slot, and Wi-Fi (no Ethernet, though add-ons are available). Or you could use a larger Pi.



Very impressive!

It would be interesting to see a short writeup of what kind of magic was required to achieve this, as there have been multiple failed attempts before this.

I'm also curious about the performance boost from 2.81Mbit/link failure at 150MHz to 65.4Mbit/31.4Mbit at 200MHz. That doesn't sound like basic processor bottlenecks, but rather some kind of catastrophic breakdown at a lower level? Does it just occasionally completely fail to lock onto an incoming clock signal or something?



I did some further investigating - it's apparently due to not having enough setup time on the RX pio SM. Even though the PIO clocking is fixed at 100 MHz, there are CRC errors at the lower system clocks. I tried changing the delay in the PIO instruction that starts the RX sampling, but that only made things worse (as expected). Also tried disabling the synchronizers with no improvement.



Usually I can grok the significance of almost any item on HN that catches my eye, but here I'm at a loss. Can someone explain why this matters?

As far as I can tell, someone has figured out how to send Ethernet packets at a relatively high rate using hardware with very limited CPU. Cool, but what can you _do with that_? If the RPi Pico has the juice to run interesting network _application-level traffic_ at line rate it's more intriguing, but I doubt that anyone's going to claim that can serve web traffic at line rate on this device, for example.

What am I missing?



RP2040/2350 are IO monsters. You could for example make a logic analyzer that transfers logic data through ethernet.

This "very limited" microcontroller has two cores. Both of them can execute about 25 instructions per byte for generating "application-level traffic". You could definitely saturate a 100 Mbps connection with just one core.
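A quick sanity check on that budget, assuming a 300 MHz overclock (the clock mentioned elsewhere in the thread) and a saturated 100 Mbit/s link:

```python
# Back-of-the-envelope instruction budget per byte of Ethernet
# traffic. Assumes a 300 MHz core clock and a fully saturated
# 100 Mbit/s link.

core_clock_hz = 300_000_000
link_bps = 100_000_000

bytes_per_second = link_bps / 8          # 12.5 MB/s of line-rate data
cycles_per_byte = core_clock_hz / bytes_per_second

print(f"{cycles_per_byte:.0f} cycles per byte per core")
```

That works out to 24 cycles per byte per core, in line with the "about 25 instructions" figure.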



Now that you mention it, I think I would like to see a logic analyzer that does just that. No buffering, just straight up shovel the data to a mac address, or even IP address, and be done with it (maybe lose a few frames here and there). Let the PC worry about what to do with it, like triggers etc.

Should be cheap, right? Though 1Gbit version might still be expensive..



How is this different from the cheap Saleae clones available now? Just swap Ethernet in for USB and that's how they already work: a cheap IC with nothing but an ADC and a USB PHY samples and sends as fast as it can.



It's quite popular in the retro-computing scene, for example, to bring these old machines into the 21st century, with modern microcontrollers being used to add peripheral support.

For example, the Oric-1/Atmos computers recently got a project called "LOCI" which adds USB support to the 40-year old computer[1], by using an RP2040's PIO capabilities to interface the 8-bit DATA bus with a microcontroller capable of acting as the 'gateway' to all of the devices on the USB peripheral bus.

This is amazing, frankly.

And now, being able to do Ethernet in such a simple way means that hundreds of retro-computing platforms could be put on the Internet with relative ease ..

[1] - https://forum.defence-force.org/viewtopic.php?t=2593&sid=2d3...



> Achieves 94.9 Mbit/sec when Pico is overclocked to 300 MHz, as measured by iperf

Is this an effective rate, or just the reflection of a hardware limit?



A 1500 byte (octet) MTU frame is 1538 bytes “on the wire”.

7 byte preamble

1 byte SFD

6 byte dst MAC

6 byte src MAC

2 byte ethertype or length

46-1500 bytes of payload (ignoring “Jumbo” frames and 802.1q tags)

4 byte CRC

12 byte IFG (which is silence, but still counts for time on the wire)

Add it up and you have 1538 bytes “on the wire”.

TCP overhead for IPv4 is 20 bytes for IP(v4) (no options) and 20 bytes for TCP (again, no options).

So 1460 bytes of data for 1538 bytes on the wire. 1460/1538 = 0.949284

So for 100M Ethernet, 94.9284Mbps is “perfect”.
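The arithmetic above is easy to check mechanically; here is a Python transcription of the same overhead table:

```python
# Per-frame on-the-wire overhead for a 1500-byte MTU Ethernet frame,
# reproducing the arithmetic above.

overhead = {
    "preamble": 7,
    "SFD": 1,
    "dst MAC": 6,
    "src MAC": 6,
    "ethertype/length": 2,
    "CRC": 4,
    "IFG": 12,   # inter-frame gap: silence, but still time on the wire
}

mtu = 1500
wire_bytes = mtu + sum(overhead.values())   # 1538 bytes "on the wire"
tcp_payload = mtu - 20 - 20                 # minus IPv4 and TCP headers, no options
efficiency = tcp_payload / wire_bytes

print(wire_bytes, tcp_payload, f"{100 * efficiency:.4f}%")
```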



“Line rate” is not “fill the link with TCP”. Line rate is “fill the link with 84 octet (including all overhead) frames.”

For 100M Ethernet this requires 148,809 packets per second.

Edit: for 1538 octet frames, one need only process 8,127 packets per second.
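The packet rates quoted here follow directly from the on-the-wire frame sizes (84 octets minimum, 1538 at full MTU, and about 9038 for a 9000-byte jumbo payload, assuming the same 38 octets of per-frame overhead):

```python
# Packets per second needed to fill a 100 Mbit/s link at various
# on-the-wire frame sizes (all sizes include preamble, CRC, and IFG).

link_bps = 100_000_000

def pps(wire_octets: int) -> int:
    """Whole frames per second that fit on the link."""
    return link_bps // (wire_octets * 8)

print(pps(84))    # minimum-size frames: 148809 pps
print(pps(1538))  # full-MTU frames:       8127 pps
print(pps(9038))  # 9000-byte jumbo:       1383 pps
```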



"Line rate" is "fill the 100Mbit link with 100 million bits each second". Of course the overhead is included in that, since the overhead also goes over the wire



I'm many years removed from such topics, but I don't remember this being the case; moreover, specs for network equipment were (and are) given in pps, with the details usually stating 2-3 packet-size categories. I'd be interested in a reference for what you wrote.



https://www.fmad.io/blog/what-is-10g-line-rate

As the article calls it, the gold standard. If a device is capable of forwarding/switching packets at line rate at the smallest packet size on all interfaces simultaneously, you don't have to think too much about its performance when designing your network. I haven't worked much with hardware for a few years, but it was common that Cisco switches were not capable of this.



That 8200, for example, is capable of line rate at the smallest packet size, so that IMIX marketing is kind of useless. When evaluating these kinds of devices, this is what matters.

IMIX makes sense for devices that are not capable of small-packet line rate, like firewalls, where bandwidth is much more costly and needs to be sized appropriately.



I don't have any Cisco core routers, nor have I personally tested any, but the document I provided found that their Q200 ASIC (in the 8000 series) required at least 170-byte frames to hit line rate:

> Both DUTs can achieve line rate performance on all ports with an NDR of 170 Bytes for the 88-LC0-36FH-M line card and 215 Bytes for 8201-32FH router. Same values were observed for both IPv4 and IPv6 traffic. This exceeds all real-life deployments requirements regardless of position in the network.

The 9000 series analysis reports something like 400B packets to hit line rate.

Fundamentally, everyone has to scale their internal bus width and clock rate to hit the headline numbers, always at the cost of small frame performance.



This is a lazy definition and won’t get you past “Go” when making network equipment. Why not use 9000 byte “Jumbo” frames? You’ll only need to process 1,383 packets per second to fill the link!



Back in the day, in the x86 world, there was this "rule of thumb" that you needed about 1GHz of CPU speed to saturate a 1Gbit network link. So a server with four 2GHz CPUs could saturate eight 1gbit links and still be somewhat useful.

This was AFAIR based on empirical knowledge, nothing scientific.

So a Pi Pico running at 300MHz pushing 100Mbit is something that is not totally unexpected, if you consider the low-power, low-cost CPU design in a Pi Pico (and the fact that you have to push the bits manually on the wire).

It's still a nice feat that they pulled this off!
