(comments)

Original link: https://news.ycombinator.com/item?id=43727743

Ashvardanian shares the findings of a month spent on C++ performance research, covering topics from micro-optimizations up to distributed systems. The work is compiled into a GitHub repository that extends his earlier Google Benchmark tutorial and explores the viability of coroutines, SIMD intrinsics versus assembly, AVX-512 scatter/gather performance, CPU versus GPU Tensor Core comparisons, the cost of memory accesses, standard-library performance bottlenecks, error-handling overhead, lazy-evaluation trade-offs, meta-programming use cases, Linux kernel bypass with io_uring versus POSIX sockets, the Networking TS and heterogeneous executors, and the propagation of stateful allocators. Key observations include: compilers can vectorize small matrix multiplications effectively; Nvidia Tensor Core behavior varies widely across generations; driven by the AI wave, CPUs and GPUs are converging in matrix-multiplication performance; scalar sine approximations can be much faster than standard implementations; CTRE can outperform regular-expression engines; and the performance gap between DPDK/SPDK and io_uring is narrowing. The repository is packed with links to related resources, and some examples have been ported to Rust and Python. While pointer tagging and secure enclaves remain elusive, the author invites feedback on High-Level Synthesis versus hand-written VHDL/Verilog for FPGAs and other performance-related topics.


Original post
Show HN: Less Slow C++ (github.com/ashvardanian)
31 points by ashvardanian 1 hour ago | 2 comments
Earlier this year, I took a month to reexamine my coding habits and rethink some past design choices. I hope to rewrite and improve my FOSS libraries this year, and I needed answers to a few questions first. Perhaps some of these questions will resonate with others in the community, too.

  - Are coroutines viable for high-performance work?
  - Should I use SIMD intrinsics for clarity or drop to assembly for easier library distribution?
  - Has hardware caught up with vectorized scatter/gather in AVX-512 & SVE?
  - How do secure enclaves & pointer tagging differ on Intel, Arm, & AMD?
  - What's the throughput gap between CPU and GPU Tensor Cores (TCs)?
  - How costly are misaligned memory accesses & split-loads, and what gains do non-temporal loads/stores offer?
  - Which parts of the standard library hit performance hardest?
  - How do error-handling strategies compare overhead-wise?
  - What's the compile-time vs. run-time trade-off for lazily evaluated ranges?
  - What practical, non-trivial use cases exist for meta-programming?
  - How challenging is Linux Kernel bypass with io_uring vs. POSIX sockets?
  - How close are we to effectively using Networking TS or heterogeneous Executors in C++?
  - What are best practices for propagating stateful allocators in nested containers, and which libraries support them?
These questions span from micro-kernel optimizations (nanoseconds) to distributed systems (micro/millisecond latencies). Rather than tackling them all in one post, I compiled my explorations into a repository—extending my previous Google Benchmark tutorial (<https://ashvardanian.com/posts/google-benchmark>)—to serve as a sandbox for performance experimentation.
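
For context, every experiment in the repository follows the same Google Benchmark shape. A minimal illustrative harness looks roughly like the one below; the sum_array kernel is a hypothetical placeholder for this post, not code from the repo.

    // Minimal Google Benchmark harness in the spirit of the repository.
    // `sum_array` is a hypothetical placeholder kernel, shown for illustration only.
    #include <benchmark/benchmark.h>
    #include <numeric>
    #include <vector>

    static float sum_array(std::vector<float> const &data) {
        return std::accumulate(data.begin(), data.end(), 0.0f);
    }

    static void bm_sum_array(benchmark::State &state) {
        std::vector<float> data(static_cast<std::size_t>(state.range(0)), 1.0f);
        for (auto _ : state) {
            float result = sum_array(data);
            benchmark::DoNotOptimize(result); // keep the compiler from discarding the work
        }
        state.SetItemsProcessed(state.iterations() * state.range(0));
    }

    BENCHMARK(bm_sum_array)->Range(1 << 10, 1 << 20);
    BENCHMARK_MAIN();

Build with -O3 and link against the benchmark library (plus pthread on Linux).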

Some fun observations:

  - Compilers now vectorize 3x3x3 and 4x4x4 single/double precision multiplications well! The smaller one is ~60% slower despite 70% fewer operations, outperforming my vanilla SSE/AVX and coming within 10% of AVX-512. (A toy kernel of that shape is sketched right after this list.)
  - Nvidia TCs vary dramatically across generations in numeric types, throughput, tile shapes, thread synchronization (thread/quad-pair/warp/warp-groups), and operand storage. Post-Volta, manual PTX is often needed (as intrinsics lag), though the new TileIR (introduced at GTC) promises improvements for dense linear algebra kernels.
  - The AI wave drives CPUs and GPUs to converge in mat-mul throughput & programming complexity. It took me a day to debug TMM register initialization, and SME is equally odd. Sierra Forest packs 288 cores/socket, and AVX10.2 drops 256-bit support for 512-bit... I wonder if discrete Intel GPUs are even needed, given CPU advances?
  - In common floating-point ranges, scalar sine approximations can be up to 40x faster than standard implementations, even without SIMD. It's a bit hand-wavy, though; I wish more projects documented error bounds and had 1 & 3.5 ULP variants like Sleef. (A toy polynomial version is sketched after this list.)
  - Meta-programming tools like CTRE can outperform typical RegEx engines by 5x and simplify building parsers compared to hand-crafted FSMs. (A small CTRE example follows after this list.)
  - DPDK/SPDK and io_uring were once clearly distinct in complexity and performance, but the gap is narrowing. While pre-5.5 io_uring can boost UDP throughput by 4x on loopback IO, newer zero-copy and concurrency optimizations remain challenging.
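
To make the matrix-multiplication point concrete: the kind of kernel compilers auto-vectorize well is just a fully unrolled, fixed-size triple loop. An illustrative sketch (not the repository's exact code):

    // Fixed-size 4x4 single-precision matrix multiplication.
    // With -O3 and -march=native, GCC and Clang typically vectorize this plain
    // triple loop; no intrinsics or assembly required.
    #include <array>
    #include <cstddef>

    using matrix4x4_t = std::array<std::array<float, 4>, 4>;

    inline matrix4x4_t multiply(matrix4x4_t const &a, matrix4x4_t const &b) noexcept {
        matrix4x4_t c {};
        for (std::size_t i = 0; i != 4; ++i)
            for (std::size_t k = 0; k != 4; ++k)     // i-k-j order keeps the inner loop contiguous
                for (std::size_t j = 0; j != 4; ++j)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }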
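Similarly, a scalar sine approximation can be as small as a short odd-power polynomial evaluated with Horner's scheme. The toy version below assumes the input is already reduced to roughly [-pi/2, pi/2] and makes no accuracy guarantees; real libraries like Sleef use minimax coefficients and document error bounds in ULPs.

    // Illustrative 7th-order Taylor approximation of sin(x) for |x| <= pi/2.
    // A toy only: production code needs range reduction and tuned coefficients.
    inline float sine_approx(float x) noexcept {
        float const x2 = x * x;
        // Horner's scheme over x - x^3/3! + x^5/5! - x^7/7!
        return x * (1.0f + x2 * (-1.0f / 6.0f + x2 * (1.0f / 120.0f + x2 * (-1.0f / 5040.0f))));
    }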
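And for the CTRE point: because the pattern is a template argument, the matcher is generated at compile time rather than interpreted at run time. A small sketch, assuming C++20 and the single-header ctre.hpp from github.com/hanickadot/compile-time-regular-expressions:

    // Parsing "YYYY-MM-DD" with CTRE: no std::regex, no hand-written FSM.
    #include <ctre.hpp>
    #include <optional>
    #include <string_view>

    struct date_t { std::string_view year, month, day; };

    constexpr std::optional<date_t> parse_date(std::string_view input) noexcept {
        if (auto [whole, year, month, day] =
                ctre::match<"([0-9]{4})-([0-9]{2})-([0-9]{2})">(input); whole)
            return date_t {year, month, day};
        return std::nullopt;
    }

    static_assert(parse_date("2025-04-17").has_value()); // usable in constant expressions, too
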
The repository is loaded with links to favorite CppCon lectures, GitHub snippets, and tech blog posts. Recognizing that many high-level concepts are handled differently across languages, I've also started porting examples to Rust & Python in separate repos. Coroutines look bad everywhere :(

Overall, this research project was rewarding! Most questions found answers in code — except pointer tagging and secure enclaves, which still elude me in public cloud. I'd love to hear from others, especially on comparing High-Level Synthesis for small matrix multiplications on FPGAs versus hand-written VHDL/Verilog for integral types. Let me know if you have ideas for other cool, obscure topics to cover!

This is only from a cursory look, but it all looks really concise and practical - thank you for putting it all together!


Awesome, thanks.

Does C++ have a good async ('coroutine') story for io_uring yet?
