Show HN: Less Slow C++

Original link: https://github.com/ashvardanian/less_slow.cpp

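This C++ benchmark repository is a practical guide to performance-oriented software design. It provides examples of efficient C and C++ code that use C++20 features and de facto standard libraries imported via CMake. The repository highlights common performance pitfalls and explores optimizations beyond basic compiler flags, covering topics from micro-kernels to parallel algorithms, coroutines, and polymorphism. It includes techniques such as faster trigonometry, custom ranges for lazy logic, and optimized JSON handling. It also examines how branching, recursion, and exceptions affect performance. The suite uses Google Benchmark and demonstrates its advanced features for profiling and performance analysis. It also touches on hardware-specific optimizations, including assembly kernels for CPUs and CUDA for Nvidia GPUs, and discusses the differences between GPU architectures and programming models. It introduces concepts related to encrypted enclaves and their latency as well. The repository contains installation instructions for Linux, MacOS, and Windows, along with instructions for building and running the benchmarks, controlling the output, and using performance counters.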

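Ashvardanian shared findings from a month of C++ performance research, focusing on micro-optimizations for distributed systems. The research is compiled in a GitHub repository that extends a Google Benchmark tutorial and explores topics such as the practicality of coroutines, SIMD versus assembly, AVX-512 scatter/gather performance, CPU versus GPU Tensor Cores, memory access costs, standard library bottlenecks, error handling overhead, lazy evaluation trade-offs, metaprogramming use cases, bypassing the Linux kernel with io_uring versus POSIX sockets, the Networking TS and heterogeneous executors, and the propagation of stateful allocators. Key observations include: compilers vectorize small matrix multiplications effectively; Nvidia Tensor Core performance varies widely between generations; driven by the AI wave, CPUs and GPUs are converging in matrix multiplication performance; scalar sine approximations can be much faster than standard implementations; CTRE can outperform regular expression engines; and the performance gap between DPDK/SPDK and io_uring is narrowing. The repository links to related resources, and some examples have been ported to Rust and Python. While pointer tagging and secure enclaves remain elusive, the author is asking for input on High-Level Synthesis versus hand-written VHDL/Verilog for FPGAs and on other performance-related topics.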

Original text

The benchmarks in this repository don't aim to cover every topic entirely, but they help form a mindset and intuition for performance-oriented software design. It also provides an example of using some non-STL but de facto standard libraries in C++, importing them via CMake, and compiling from source. For higher-level abstractions and languages, check out less_slow.rs and less_slow.py.

Much modern code suffers from common pitfalls, such as bugs, security vulnerabilities, and performance bottlenecks. University curricula often teach outdated concepts, while bootcamps oversimplify crucial software development principles.

Less Slow C++

This repository offers practical examples of writing efficient C and C++ code. It leverages C++20 features and is designed primarily for GCC and Clang compilers on Linux, though it may work on other platforms. The topics range from basic micro-kernels executing in a few nanoseconds to more complex constructs involving parallel algorithms, coroutines, and polymorphism. Some of the highlights include:

  • 100x cheaper random inputs?! Discover how input generation sometimes costs more than the algorithm.
  • 40x faster trigonometry: Speed-up standard library functions like std::sin in just 3 lines of code.
  • 4x faster lazy-logic with custom std::ranges and iterators!
  • Compiler optimizations beyond -O3: Learn about less obvious flags and techniques for another 2x speedup.
  • Multiplying matrices? Check how a 3x3x3 GEMM can be 70% slower than 4x4x4, despite 60% fewer ops.
  • Scaling AI? Measure the gap between theoretical ALU throughput and your BLAS.
  • How many if conditions are too many? Test your CPU's branch predictor with just 10 lines of code.
  • Prefer recursion to iteration? Measure the depth at which your algorithm will SEGFAULT.
  • Why avoid exceptions? Take std::error_code or std::variant-like wrappers?
  • Scaling to many cores? Learn how to use OpenMP, Intel's oneTBB, or your custom thread pool.
  • How to handle JSON avoiding memory allocations? Is it easier with C++ 20 or old-school C 99 tools?
  • How to properly use STL's associative containers with custom keys and transparent comparators?
  • How to beat a hand-written parser with consteval RegEx engines?
  • Is the pointer size really 64 bits and how to exploit pointer-tagging?
  • How many packets is UDP dropping and how to serve web requests in io_uring from user-space?
  • Scatter and Gather for 50% faster vectorized disjoint memory operations.
  • Intel's oneAPI vs Nvidia's CCCL? What's so special about <thrust> and <cub>?
  • CUDA C++, PTX Intermediate Representations, and SASS, and how do they differ from CPU code?
  • How to choose between intrinsics, inline asm, and separate .S files for your performance-critical code?
  • Tensor Cores & Memory differences on CPUs, and Volta, Ampere, Hopper, and Blackwell GPUs!
  • How coding FPGA differs from GPU and what is High-Level Synthesis, Verilog, and VHDL? 🔜 #36
  • What are Encrypted Enclaves and what's the latency of Intel SGX, AMD SEV, and ARM Realm? 🔜 #31
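
As a taste of the "faster trigonometry" item above, here is a minimal sketch of a polynomial sine approximation. It is not the repository's code: the names `sin_maclaurin` and `max_abs_error` are hypothetical, and the three-term Maclaurin series is only one of several ways to trade accuracy for speed. The approximation is usable only for small arguments, and any speedup over `std::sin` depends on your compiler, flags, and hardware, so measure before relying on it.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not the repository's code: a three-term Maclaurin
// polynomial for sin(x), namely x - x^3/6 + x^5/120. Accurate only near zero.
inline double sin_maclaurin(double x) noexcept {
    double const x2 = x * x;
    double const x3 = x2 * x;
    double const x5 = x3 * x2;
    return x - x3 / 6.0 + x5 / 120.0;
}

// Largest absolute error against std::sin over [-limit, limit],
// sampled at `steps + 1` evenly spaced points.
inline double max_abs_error(double limit, int steps) {
    assert(steps > 0);
    double worst = 0.0;
    for (int i = 0; i <= steps; ++i) {
        double const x = -limit + 2.0 * limit * i / steps;
        worst = std::fmax(worst, std::fabs(sin_maclaurin(x) - std::sin(x)));
    }
    return worst;
}
```

Calling `max_abs_error` with a few candidate limits is a quick way to check whether the error is acceptable for your input range before swapping the approximation into a hot loop. The error grows quickly with the magnitude of the argument, so inputs outside a narrow interval need range reduction first.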

To read, jump to the less_slow.cpp source file and read the code snippets and comments. Follow the instructions below to run the code in your environment and compare it to the comments as you read through the source.

The project aims to be compatible with GCC, Clang, and MSVC compilers on Linux, MacOS, and Windows. That said, to cover the broadest functionality, using GCC on Linux is recommended:

  • If you are on Windows, it's recommended that you set up a Linux environment using WSL.
  • If you are on MacOS, consider using the non-native distribution of Clang from Homebrew or MacPorts.
  • If you are on Linux, make sure to install CMake and a recent version of GCC or Clang compilers to support C++20 features.

If you are familiar with C++ and want to review code and measurements as you read, you can clone the repository and execute the following commands.

git clone https://github.com/ashvardanian/less_slow.cpp.git # Clone the repository
cd less_slow.cpp                                            # Change the directory

pip install cmake --upgrade                                 # PyPI has a newer version of CMake
sudo apt-get install -y build-essential g++                 # Install default build tools
sudo apt-get install -y pkg-config liburing-dev             # Install liburing for kernel-bypass
sudo apt-get install -y libopenblas-base                    # Install numerics libraries

cmake -B build_release -D CMAKE_BUILD_TYPE=Release          # Generate the build files
cmake --build build_release --config Release                # Build the project
build_release/less_slow                                     # Run the benchmarks

The build will pull and compile several third-party dependencies from the source:

  • Google's Benchmark is used for profiling.
  • Intel's oneTBB is used as the Parallel STL backend.
  • Meta's libunifex is used for senders & executors.
  • Eric Niebler's range-v3 replaces std::ranges.
  • Victor Zverovich's fmt replaces std::format.
  • Ash Vardanian's StringZilla replaces std::string.
  • Hana Dusíková's CTRE replaces std::regex.
  • Niels Lohmann's json is used for JSON deserialization.
  • Yaoyuan Guo's yyjson for faster JSON processing.
  • Google's Abseil replaces STL's associative containers.
  • Lewis Baker's cppcoro implements C++20 coroutines.
  • Jens Axboe's liburing to simplify Linux kernel-bypass.
  • Chris Kohlhoff's ASIO as a networking TS extension.
  • Nvidia's CCCL for GPU-accelerated algorithms.
  • Nvidia's CUTLASS for GPU-accelerated Linear Algebra.

To control the output or run specific benchmarks, use the following flags:

build_release/less_slow --benchmark_format=json             # Output in JSON format
build_release/less_slow --benchmark_out=results.json        # Save the results to a file instead of `stdout`
build_release/less_slow --benchmark_filter=std_sort         # Run only benchmarks containing `std_sort` in their name

To enhance stability and reproducibility, disable Simultaneous Multi-Threading (SMT) on your CPU and use the --benchmark_enable_random_interleaving=true flag, which shuffles and interleaves benchmarks as described in the Google Benchmark documentation.

build_release/less_slow --benchmark_enable_random_interleaving=true

Google Benchmark supports User-Requested Performance Counters through libpfm. Note that collecting these may require sudo privileges.

sudo build_release/less_slow --benchmark_enable_random_interleaving=true --benchmark_format=json --benchmark_perf_counters="CYCLES,INSTRUCTIONS"

Alternatively, use the Linux perf tool for performance counter collection:

sudo perf stat taskset 0xEFFFEFFFEFFFEFFFEFFFEFFFEFFFEFFF build_release/less_slow --benchmark_enable_random_interleaving=true --benchmark_filter=super_sort

The primary file of this repository is less_slow.cpp, which holds the CPU-side code. Several other files cover hardware-specific optimizations:

$ tree .
.
├── CMakeLists.txt          # Build & assembly instructions for all files
├── less_slow.cpp           # Primary CPU-side benchmarking code with the majority of examples
├── less_slow_amd64.S       # Hand-written Assembly kernels for 64-bit x86 CPUs
├── less_slow_aarch64.S     # Hand-written Assembly kernels for 64-bit Arm CPUs
├── less_slow.cu            # CUDA C++ examples for parallel algorithms for Nvidia GPUs
├── less_slow_sm70.ptx      # Hand-written PTX IR kernels for Nvidia Volta GPUs
└── less_slow_sm90a.ptx     # Hand-written PTX IR kernels for Nvidia Hopper GPUs

Educational content without memes?! Come on!

Google Benchmark Functionality

This benchmark suite uses most of the features provided by Google Benchmark. If you write a lot of benchmarks and would rather not go through the full User Guide, here is a condensed list of the most useful features:

  • ->Args({x, y}) - Pass multiple arguments to parameterized benchmarks
  • BENCHMARK() - Register a basic benchmark function
  • BENCHMARK_CAPTURE() - Create variants of benchmarks with different captured values
  • Counter::kAvgThreads - Specify thread-averaged counters
  • DoNotOptimize() - Prevent compiler from optimizing away operations
  • ClobberMemory() - Force memory synchronization
  • ->Complexity(oNLogN) - Specify and validate algorithmic complexity
  • ->SetComplexityN(n) - Set input size for complexity calculations
  • ->ComputeStatistics("max", ...) - Calculate custom statistics across runs
  • ->Iterations(n) - Control exact number of iterations
  • ->MinTime(n) - Set minimum benchmark duration
  • ->MinWarmUpTime(n) - To warm up the data caches
  • ->Name("...") - Assign custom benchmark names
  • ->Range(start, end) - Profile for a range of input sizes
  • ->RangeMultiplier(n) - Set multiplier between range values
  • ->ReportAggregatesOnly() - Show only aggregated statistics
  • state.counters["name"] - Create custom performance counters
  • state.PauseTiming(), ResumeTiming() - Control timing measurement
  • state.SetBytesProcessed(n) - Record number of bytes processed
  • state.SkipWithError() - Skip benchmark with error message
  • ->Threads(n) - Run benchmark with specified number of threads
  • ->Unit(kMicrosecond) - Set time unit for reporting
  • ->UseRealTime() - Measure real time instead of CPU time
  • ->UseManualTime() - To feed custom timings for GPU and IO benchmarks
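
To make the purpose of `DoNotOptimize()` more concrete, here is a standard-library-only sketch of the underlying idea. It is not Google Benchmark's implementation and does not use its API: `keep_alive` and `nanoseconds_per_call` are hypothetical names, and the empty inline assembly statement is an assumption that holds for GCC and Clang, with a simple volatile read as a fallback elsewhere. The barrier tells the compiler that the value is observed, so the computation producing it cannot be discarded as dead code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical stand-in for a DoNotOptimize-style barrier.
template <typename T>
inline void keep_alive(T const &value) noexcept {
#if defined(__GNUC__) || defined(__clang__)
    // An empty asm statement that claims to read `value` and touch memory,
    // so the compiler has to materialize the value before this point.
    asm volatile("" : : "r,m"(value) : "memory");
#else
    // Fallback for other compilers: a volatile read of the first byte.
    static_cast<void>(*reinterpret_cast<char const volatile *>(&value));
#endif
}

// Runs `body` exactly `iterations` times and returns the average
// wall-clock nanoseconds per call, keeping every result observable.
template <typename Body>
double nanoseconds_per_call(Body &&body, std::size_t iterations) {
    assert(iterations > 0);
    auto const start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i != iterations; ++i) keep_alive(body());
    auto const stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> const elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

A call such as `nanoseconds_per_call([] { return 2 + 2; }, 1000000)` yields a rough per-call estimate. A real harness adds warm-up, repetitions, and statistics on top, which is what the features listed above provide.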