(comments)

Original link: https://news.ycombinator.com/item?id=43727743

Ashvardanian shares the findings of a month spent on C++ performance research, covering topics from micro-optimizations up to distributed systems. The work is compiled into a GitHub repository that extends his earlier Google Benchmark tutorial and explores the viability of coroutines, SIMD intrinsics versus assembly, AVX-512 scatter/gather performance, CPU versus GPU Tensor Core comparisons, the cost of memory accesses, standard-library performance bottlenecks, error-handling overhead, lazy-evaluation trade-offs, meta-programming use cases, Linux kernel bypass with io_uring versus POSIX sockets, the Networking TS and heterogeneous executors, and the propagation of stateful allocators. Key observations include: compilers can vectorize small matrix multiplications effectively; Nvidia Tensor Core behavior varies widely across generations; driven by the AI wave, CPUs and GPUs are converging in matrix-multiplication performance; scalar sine approximations can be much faster than standard implementations; CTRE can outperform regular-expression engines; and the performance gap between DPDK/SPDK and io_uring is narrowing. The repository is packed with links to related resources, and some examples have been ported to Rust and Python. While pointer tagging and secure enclaves remain elusive, the author invites feedback on High-Level Synthesis versus hand-written VHDL/Verilog for FPGAs and other performance-related topics.


Original post
Show HN: Less Slow C++ (github.com/ashvardanian)
31 points by ashvardanian 1 hour ago | 2 comments
Earlier this year, I took a month to reexamine my coding habits and rethink some past design choices. I hope to rewrite and improve my FOSS libraries this year, and I needed answers to a few questions first. Perhaps some of these questions will resonate with others in the community, too.

  - Are coroutines viable for high-performance work?
  - Should I use SIMD intrinsics for clarity or drop to assembly for easier library distribution?
  - Has hardware caught up with vectorized scatter/gather in AVX-512 & SVE?
  - How do secure enclaves & pointer tagging differ on Intel, Arm, & AMD?
  - What's the throughput gap between CPU and GPU Tensor Cores (TCs)?
  - How costly are misaligned memory accesses & split-loads, and what gains do non-temporal loads/stores offer?
  - Which parts of the standard library hit performance hardest?
  - How do error-handling strategies compare overhead-wise?
  - What's the compile-time vs. run-time trade-off for lazily evaluated ranges?
  - What practical, non-trivial use cases exist for meta-programming?
  - How challenging is Linux Kernel bypass with io_uring vs. POSIX sockets?
  - How close are we to effectively using Networking TS or heterogeneous Executors in C++?
  - What are best practices for propagating stateful allocators in nested containers, and which libraries support them?
These questions span from micro-kernel optimizations (nanoseconds) to distributed systems (micro/millisecond latencies). Rather than tackling them all in one post, I compiled my explorations into a repository—extending my previous Google Benchmark tutorial (<https://ashvardanian.com/posts/google-benchmark>)—to serve as a sandbox for performance experimentation.
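
For context, every experiment in the repository follows the same Google Benchmark shape. A minimal illustrative harness looks roughly like the one below; the sum_array kernel is a hypothetical placeholder for this post, not code from the repo.

    // Minimal Google Benchmark harness in the spirit of the repository.
    // `sum_array` is a hypothetical placeholder kernel, shown for illustration only.
    #include <benchmark/benchmark.h>
    #include <numeric>
    #include <vector>

    static float sum_array(std::vector<float> const &data) {
        return std::accumulate(data.begin(), data.end(), 0.0f);
    }

    static void bm_sum_array(benchmark::State &state) {
        std::vector<float> data(static_cast<std::size_t>(state.range(0)), 1.0f);
        for (auto _ : state) {
            float result = sum_array(data);
            benchmark::DoNotOptimize(result); // keep the compiler from discarding the work
        }
        state.SetItemsProcessed(state.iterations() * state.range(0));
    }

    BENCHMARK(bm_sum_array)->Range(1 << 10, 1 << 20);
    BENCHMARK_MAIN();

Build with -O3 and link against the benchmark library (plus pthread on Linux).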

Some fun observations:

  - Compilers now vectorize 3x3x3 and 4x4x4 single/double precision multiplications well! The smaller one is ~60% slower despite 70% fewer operations, outperforming my vanilla SSE/AVX and coming within 10% of AVX-512. (A toy kernel of that shape is sketched right after this list.)
  - Nvidia TCs vary dramatically across generations in numeric types, throughput, tile shapes, thread synchronization (thread/quad-pair/warp/warp-groups), and operand storage. Post-Volta, manual PTX is often needed (as intrinsics lag), though the new TileIR (introduced at GTC) promises improvements for dense linear algebra kernels.
  - The AI wave drives CPUs and GPUs to converge in mat-mul throughput & programming complexity. It took me a day to debug TMM register initialization, and SME is equally odd. Sierra Forest packs 288 cores/socket, and AVX10.2 drops 256-bit support for 512-bit... I wonder if discrete Intel GPUs are even needed, given CPU advances?
  - In common floating-point ranges, scalar sine approximations can be up to 40x faster than standard implementations, even without SIMD. It's a bit hand-wavy, though; I wish more projects documented error bounds and had 1 & 3.5 ULP variants like Sleef. (A toy polynomial version is sketched after this list.)
  - Meta-programming tools like CTRE can outperform typical RegEx engines by 5x and simplify building parsers compared to hand-crafted FSMs. (A small CTRE example follows after this list.)
  - DPDK/SPDK and io_uring were once clearly distinct in complexity and performance, but the gap is narrowing. While pre-5.5 io_uring can boost UDP throughput by 4x on loopback IO, newer zero-copy and concurrency optimizations remain challenging.
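
To make the matrix-multiplication point concrete: the kind of kernel compilers auto-vectorize well is just a fully unrolled, fixed-size triple loop. An illustrative sketch (not the repository's exact code):

    // Fixed-size 4x4 single-precision matrix multiplication.
    // With -O3 and -march=native, GCC and Clang typically vectorize this plain
    // triple loop; no intrinsics or assembly required.
    #include <array>
    #include <cstddef>

    using matrix4x4_t = std::array<std::array<float, 4>, 4>;

    inline matrix4x4_t multiply(matrix4x4_t const &a, matrix4x4_t const &b) noexcept {
        matrix4x4_t c {};
        for (std::size_t i = 0; i != 4; ++i)
            for (std::size_t k = 0; k != 4; ++k)     // i-k-j order keeps the inner loop contiguous
                for (std::size_t j = 0; j != 4; ++j)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }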
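Similarly, a scalar sine approximation can be as small as a short odd-power polynomial evaluated with Horner's scheme. The toy version below assumes the input is already reduced to roughly [-pi/2, pi/2] and makes no accuracy guarantees; real libraries like Sleef use minimax coefficients and document error bounds in ULPs.

    // Illustrative 7th-order Taylor approximation of sin(x) for |x| <= pi/2.
    // A toy only: production code needs range reduction and tuned coefficients.
    inline float sine_approx(float x) noexcept {
        float const x2 = x * x;
        // Horner's scheme over x - x^3/3! + x^5/5! - x^7/7!
        return x * (1.0f + x2 * (-1.0f / 6.0f + x2 * (1.0f / 120.0f + x2 * (-1.0f / 5040.0f))));
    }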
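And for the CTRE point: because the pattern is a template argument, the matcher is generated at compile time rather than interpreted at run time. A small sketch, assuming C++20 and the single-header ctre.hpp from github.com/hanickadot/compile-time-regular-expressions:

    // Parsing "YYYY-MM-DD" with CTRE: no std::regex, no hand-written FSM.
    #include <ctre.hpp>
    #include <optional>
    #include <string_view>

    struct date_t { std::string_view year, month, day; };

    constexpr std::optional<date_t> parse_date(std::string_view input) noexcept {
        if (auto [whole, year, month, day] =
                ctre::match<"([0-9]{4})-([0-9]{2})-([0-9]{2})">(input); whole)
            return date_t {year, month, day};
        return std::nullopt;
    }

    static_assert(parse_date("2025-04-17").has_value()); // usable in constant expressions, too
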
The repository is loaded with links to favorite CppCon lectures, GitHub snippets, and tech blog posts. Recognizing that many high-level concepts are handled differently across languages, I've also started porting examples to Rust & Python in separate repos. Coroutines look bad everywhere :(

Overall, this research project was rewarding! Most questions found answers in code — except pointer tagging and secure enclaves, which still elude me in public cloud. I'd love to hear from others, especially on comparing High-Level Synthesis for small matrix multiplications on FPGAs versus hand-written VHDL/Verilog for integral types. Let me know if you have ideas for other cool, obscure topics to cover!

This is only from a cursory look, but it all looks really concise and practical - thank you for putting it all together!


Awesome, thanks.

Does C++ have a good async ('coroutine') story for io_uring yet?
