过早优化有时也很有趣

过早优化有时也很有趣
Premature Optimization Is Fun Sometimes

原始链接: https://invlpg.com/posts/2025-06-19-premature-optimization.html

本文通过一个简单的连通性监控系统，探讨了 C 语言结构体优化的细微之处。作者从一个用于存储 Ping 数据的简单 `struct` 开始，通过迭代优化来减小内存占用。这一过程涉及多个技术难点： 1. **数据精简：** 通过使用标签联合体（tagged union）并降低时间精度（从纳秒降至 100 微秒）来节省空间。 2. **处理填充：** 解决了因内存对齐要求和填充字节导致位域（bitfields）无法减小结构体大小这一常见陷阱。 3. **语义优化：** 用 4 位滚动计数器代替 32 位的 `in_addr_t` 来唯一标识源地址，成功将结构体压缩至单个内存页内。 4. **指令优化：** 通过重新排列位域以确保最佳的 CPU 访问方式（使用 `ldrh` 指令），并运用逻辑翻转技巧（使用 `not_received` 代替 `received`），使编译器能够省略不必要的掩码指令。尽管作者承认该应用本身并无内存限制，但这一练习突显了对数据布局和编译器行为的深度掌控如何能最小化内存使用并最大化 CPU 效率——这既是出于对性能的追求，也是出于学术性的好奇。

Hacker News 最新 | 往日 | 评论 | 提问 | 展示 | 招聘 | 提交登录过早优化有时也很有趣 (invlpg.com) 5 分 | 由 throawayonthe 发布于 1 小时前 | 隐藏 | 往日 | 收藏 | 讨论 | 帮助准则 | 常见问题 | 列表 | API | 安全 | 法律 | 申请 YC | 联系搜索：

invlpg – Premature Optimization is Fun Sometimes

A colleague of mine was recently discussing a connectivity monitoring system he is working on with me. It’s nothing fancy, just sending ICMP Echo Requests to a couple of different servers, and monitoring latency and dropped packet averages over 1-minute, 5-minute, and 15-minute periods. Up came the topic of how this data should be stored, the natural thought was a 512 entry ring buffer, containing entries like the following:

struct ping_timestamp {
    uint64_t sent_ns;       // when the packed was sent (unit: ns)
    uint64_t received_ns;   // when the packet was received (unit: ns)
    in_addr_t source_addr;  // source address
    uint16_t seq_no;        // echo request sequence number
    bool received;          // has the request been received?
};

struct ping_timestamp pings_rb[512];

// ...

printf("%zu\n", sizeof(pings_rb));

// 12288

struct ping_timestamp_2 {
    union {
        uint64_t sent_ts;       // unit: 100μs
        uint64_t elapsed_ts;    // unit: 100μs
    };
    in_addr_t source_addr;
    uint16_t seq_no;
    bool received;
};

// ...

printf("%zu\n", sizeof(pings_rb));

// 8192

Not bad, we’ve shaved off an entire page. We can still do better.

Nanosecond precision? In our case, ping times are measured in the tens or hundreds, or even thousands of milliseconds. We don’t need to keep all of those extra bits around. If we change the unit from nanosecond to 100 microsecond increments (0.1ms), then 43-bits is sufficient for us to keep track of pings for up to 20-years. 20-years is a bit excessive still, but it doesn’t hurt to be at least a little bit future proof.

And received? 8-bit for a true/false value seems altogether too much. The answer: bitfields.

struct ping_timestamp_3 {
    uint64_t sent_or_elapsed_ts: 43;
    uint64_t received: 1;
    uint64_t seq_no: 16;
    in_addr_t source_addr;
};

// ...

printf("%zu\n", sizeof(pings_rb));

// 8192

Wait, what? Why haven’t we saved any space?

The answer is struct padding. The layout of ping_timestamp_2 looks like this:

     0       8      16      24      32      40      48      56      64...
     +-------+-------+-------+-------+-------+-------+-------+-------+
     |                         sent/elapsed                          |
     +-------+-------+-------+-------+-------+-------+-------+-------+
 ...64      72      80      88      96      104     112     120     128
     +-------+-------+-------+-------+-------+-------+-------+-------+
     |         source_address        |    seq_no     | recv? |  pad  |
     +---------------------------------------------------------------+

Where the padding byte at the end is to ensure alignment requirements. ping_timestamp_3 on the other end, looks like this:

     0       8      16      24      32      40  R?  48      56      64...
     +-------+-------+-------+-------+-------+-------+-------+-------+
     |                 sent/elapsed            ||     seq_no     | P |
     +-------+-------+-------+-------+-------+-------+-------+-------+
 ...64      72      80      88      96      104     112     120     128
     +-------+-------+-------+-------+-------+-------+-------+-------+
     |         source_address        |            padding            |
     +---------------------------------------------------------------+

So our optimization there didn’t actually save any space. We’re wasting 36-bits of padding. Is there any way we can somehow do better?

We keep track of the source address due to frequent changes while our product is in operation (on a mobile data network). When the address changes, we also reset the sequence number for reasons that aren’t relevant to the current topic. We have seen, in the past, packets with differing source addreses but identical sequence numbers due to be processed by our application at the same time (the joys of asynchronous programming), so the source address serves to disambiguate these changes.

But there’s another way to disambiguate.

An ICMP echo request has a 16-bit long identifier field to allow applications to identify which echo request packets were sent by them. Its value is completely arbitray. On Linux iputils ping sets it to getpid() & 0xFFFF; on OpenBSD a random number is used instead.

Although it’s 16-bits long, we don’t actually need to use the full 16-bits. There’s 4 free bits left in the first 8-bytes of our ping_timestamp_3. Our thought was to use a rolling 4-bit counter, that is increased whenever our source address changes (this is monitored elsewhere in the application), allowing us to uniquely identify which source address the packet came from.

Our final struct looks like this:

struct ping_timestamp {
    uint64_t elapsed_or_sent_ts : 43;
    uint64_t received : 1;
    uint64_t counter: 4;
    uint64_t seq_no: 16;
};

// ...

printf("%zu\n", sizeof(pings_rb));

// 4096

Much better. A whole 8-kilobytes of savings, and down to a single page of data. You may have noticed that I have changed the order of the fields slightly. This is to line seq_no up on a 16-bit boundary, so that loading it is a single ldrh instruction rather than require a shift. Similarly, reading from elapsed_or_sent_ts only requires a mask.

In the end, this was a completely pointless exercise. Our application isn’t remotely memory constrained.

But it was fun.

received bit only requires a shift instruction rather than a shift and a mask:

struct ping_timestamp {
    uint64_t elapsed_or_sent_ts : 43;
    uint64_t counter: 4;
    uint64_t received : 1;
    uint64_t seq_no: 16;
};

There’s a slight “issue” with the above code: received is now much cheaper to access, at the expense of counter needing to mask out the received bit.

But we can fix this! We only ever read counter when received is true, i.e. 1. If received were zero, and we could tell the compiler to assume that it were zero, then no mask would be necessary.

The solution? Flip the meaning of the received bit.

struct ping_timestamp {
    uint64_t elapsed_or_sent_ts : 43;
    uint64_t counter: 4;
    uint64_t not_received : 1;
    uint64_t seq_no: 16;
};

then the compiler is able to completely elide the mask.

过早优化有时也很有趣 Premature Optimization Is Fun Sometimes

过早优化有时也很有趣
Premature Optimization Is Fun Sometimes