Show HN: Single-Header Profiler for C++17

Original link: https://github.com/DmitriBogdanov/UTL/blob/master/docs/module_profiler.md

`utl::profiler` is a single-header, localized profiling solution for C++ code that uses simple macros to measure execution time. It builds a call graph and prints a formatted table for every thread, making it easy to identify performance bottlenecks. Key features include: macros for profiling scopes, expressions, and code segments; customizable style options; a thread-safe global profiler object; automatic or manual uploading of results from threads; and a formatting function for generating report strings. For example, `UTL_PROFILER_SCOPE("label")` measures the time spent inside a scope. The library optimizes performance by representing the call graph with integer IDs and a dense matrix. By using linearly growing IDs, it reduces the cost of common operations (outside of time measurement) to simple integer array lookups. Overhead can be lowered further by using CPU-counter intrinsics or by disabling profiling entirely.

DmitriBogdanov presents a single-header C++17 profiling library for quickly measuring the runtime of code segments. It addresses the common need to understand the performance impact of specific pieces of code, especially when optimizations are enabled and suitable tooling is unavailable. The library uses macros such as `PROFILE("Loop 1")` to automatically record the runtime of the following expression and print the results in a table. Unlike simpler macro-based solutions, it supports recursive code and handles threads. It uses macro-generated thread-local variables to efficiently associate call-sites with integer IDs, which makes it possible to build the call graph in contiguous arrays. This approach minimizes overhead by relying on array lookups and lock-free thread support. The author believes the resulting library is unique and highly usable, and is seeking feedback from the Hacker News community on the project's design and functionality.

Original article


utl::profiler is a single-include solution for localized profiling. It features simple macros to measure how much time is taken by a certain scope / expression / code segment. The profiler automatically builds a call graph for all profiled functions and prints a nicely formatted table for every thread. See examples.

Key features:

  • Macros for profiling scopes, expressions, and code segments
  • Customizable style options
  • A thread-safe global profiler object
  • Automatic or manual uploading of results from threads
  • A formatting function for generating report strings

Below is an output example from profiling a JSON parser:

// Profiling macros
UTL_PROFILER_SCOPE(label);

UTL_PROFILER(label);

UTL_PROFILER_BEGIN(segment, label);
UTL_PROFILER_END(segment);

// Style options
struct Style {
    std::size_t indent = 2;
    bool        color  = true;

    double cutoff_red    = 0.40; // > 40% of total runtime
    double cutoff_yellow = 0.20; // > 20% of total runtime
    double cutoff_gray   = 0.01; // <  1% of total runtime
};

// Global profiler object
struct Profiler {
    void print_at_exit(bool value) noexcept;
    
    void upload_this_thread();
    
    std::string format_results(const Style& style = Style{});
};

inline Profiler profiler;
UTL_PROFILER_SCOPE(label);

Attaches profiler to the current scope.

If the profiled scope was entered at any point of the program, a per-thread call graph will be built for all profiled segments upon exiting main().

Note 1: label is a string literal name that will be shown in the results table.

Note 2: Automatic printing on exit can be disabled.

UTL_PROFILER(label);

Attaches profiler to the scope of the following expression.

Convenient for profiling individual loops / function calls / if-branches, etc.

UTL_PROFILER_BEGIN(segment, label);
UTL_PROFILER_END(segment);

Attaches profiler to the code section between two BEGIN/END macros with the same segment label.

struct Style {
    std::size_t indent = 2;
    bool        color  = true;

    double cutoff_red    = 0.40; // > 40% of total runtime
    double cutoff_yellow = 0.20; // > 20% of total runtime
    double cutoff_gray   = 0.01; // <  1% of total runtime
};

A struct with formatting settings for Profiler::format_results().

void Profiler::print_at_exit(bool value) noexcept;

Sets whether profiling results should be automatically printed after exiting from main(). true by default.

Note: This and all other profiler object methods are thread-safe.

void Profiler::upload_this_thread();

Uploads profiling results from the current thread to the profiler object.

Can be used to upload results from detached threads. Otherwise, results are automatically uploaded once the thread joins another one.

std::string Profiler::format_results(const Style& style = Style{});

Formats profiling results to a string using given style options.

inline Profiler profiler;

Global profiler object.

Note

Online compiler explorers can be somewhat unreliable when it comes to sleep & time measurement precision.

[ Run this code ]

using namespace std::chrono_literals;

void computation_1() { std::this_thread::sleep_for(300ms); }
void computation_2() { std::this_thread::sleep_for(200ms); }
void computation_3() { std::this_thread::sleep_for(400ms); }
void computation_4() { std::this_thread::sleep_for(600ms); }
void computation_5() { std::this_thread::sleep_for(100ms); }

// ...

// Profile a scope
UTL_PROFILER_SCOPE("Computation 1 - 5");
computation_1();
computation_2();

// Profile an expression
UTL_PROFILER("Computation 3") computation_3();

// Profile a code segment
UTL_PROFILER_BEGIN(comp_45, "Computation 4 - 5");
computation_4();
computation_5();
UTL_PROFILER_END(comp_45);

Output:

[ Run this code ]

void recursive(int depth = 0) {
    if (depth > 4) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        return;
    }
    
    UTL_PROFILER("1st recursion branch") recursive(depth + 1);
    UTL_PROFILER("2nd recursion branch") recursive(depth + 2);
}

// ...

recursive();

Output:

Profiling parallel section

Note

In this example we will use utl::parallel to represent a parallel section concisely.

[ Run this code ]

using namespace utl;
using namespace std::chrono_literals;

// Run loop on the main thread
UTL_PROFILER("Single-threaded loop")
for (int i = 0; i < 30; ++i) std::this_thread::sleep_for(10ms);

// Run the same loop on 3 threads
parallel::set_thread_count(3);

UTL_PROFILER("Multi-threaded loop")
parallel::for_loop(parallel::IndexRange{0, 30}, [](int low, int high){
    UTL_PROFILER("Worker thread loop")
    for (int i = low; i < high; ++i) std::this_thread::sleep_for(10ms);
});

parallel::set_thread_count(0);

Output:

Profiling detached threads & uploading results

Note

In this example we will use utl::parallel to represent a detached section concisely.

[ Run this code ]

using namespace utl;
using namespace std::chrono_literals;

parallel::set_thread_count(2);

// Detached task
UTL_PROFILER("Uploading task 1")
parallel::task([]{
    UTL_PROFILER("Detached task 1: part 1") std::this_thread::sleep_for(700ms);
});

// Detached task with explicit result upload
UTL_PROFILER("Uploading task 2")
parallel::task([]{
    UTL_PROFILER("Detached task 2: part 1") std::this_thread::sleep_for(50ms);
    UTL_PROFILER("Detached task 2: part 2") std::this_thread::sleep_for(50ms);

    // Manually upload results to the main thread,
    // otherwise results get collected once the thread joins
    profiler::profiler.upload_this_thread();

    UTL_PROFILER("Detached task 2: part 3") std::this_thread::sleep_for(500ms);
});

// Wait a little so the 2nd task has time to reach manual upload
UTL_PROFILER("Waiting for task 2 to be partially done")
std::this_thread::sleep_for(200ms);

// Format results explicitly
profiler::profiler.print_at_exit(false);

std::cout << profiler::profiler.format_results();

Output:

Custom style & exporting results to a file

[ Run this code ]

using namespace utl;
using namespace std::chrono_literals;

// Profile something
UTL_PROFILER("Loop")
for (int i = 0; i < 10; ++i) {
    UTL_PROFILER("1st half of the loop") std::this_thread::sleep_for(10ms);
    UTL_PROFILER("2nd half of the loop") std::this_thread::sleep_for(10ms);
}

// Disable automatic printing
profiler::profiler.print_at_exit(false);

// Disable colors, remove indent, format to string
profiler::Style style;
style.color  = false;
style.indent = 0;

const std::string results = profiler::profiler.format_results(style);

// Export to file & console
std::ofstream("profiling_results.txt") << results;
std::cout                              << results;

Output:

-------------------- UTL PROFILING RESULTS ---------------------

# Thread [main] (reuse 0) (running) (runtime -> 201.81 ms)
 - 99.99%  | 201.79 ms |                 Loop | example.cpp:8, main()  |
 - 49.91%  | 100.73 ms | 1st half of the loop | example.cpp:10, main() |
 - 50.07%  | 101.04 ms | 2nd half of the loop | example.cpp:11, main() |

Reducing overhead with x86 intrinsics

By far the most significant part of profiling overhead comes from calls to std::chrono::steady_clock::now().

It is possible to significantly reduce that overhead by using CPU-counter intrinsics. To do so, simply define the UTL_PROFILER_USE_INTRINSICS_FOR_FREQUENCY macro with the needed frequency:

#define UTL_PROFILER_USE_INTRINSICS_FOR_FREQUENCY 3.3e9 // 3.3 GHz (AMD Ryzen 5 5600H)
#include "UTL/profiler.hpp"                             // will now use 'rdtsc' for timestamps

This is exceedingly helpful when profiling code on a hot path. Below are a few benchmarks showcasing the difference on particular hardware:

======= USING std::chrono ========

| relative |               ms/op |                op/s |    err% |     total | benchmark
|---------:|--------------------:|--------------------:|--------:|----------:|:----------
|   100.0% |                3.46 |              289.22 |    0.1% |      0.44 | `Runtime without profiling`
|    53.9% |                6.41 |              155.90 |    0.3% |      0.77 | `Theoretical best std::chrono profiler`
|    52.2% |                6.62 |              151.07 |    0.2% |      0.80 | `UTL_PROFILER()`

// very light workload - just 8 computations of 'std::cos()' per 2 time measurements, difficult to
// time and sensitive to overhead; here profiled code is ~2x slower than the non-profiled workload

====== USING __rdtsc() ======

| relative |               ms/op |                op/s |    err% |     total | benchmark
|---------:|--------------------:|--------------------:|--------:|----------:|:----------
|   100.0% |                3.50 |              286.11 |    0.6% |      0.43 | `Runtime without profiling`
|    86.3% |                4.05 |              247.01 |    0.2% |      0.49 | `Theoretical best __rdtsc() profiler`
|    73.7% |                4.74 |              210.97 |    0.3% |      0.57 | `UTL_PROFILER()`

// notable reduction in profiling overhead

Note

Here "theoretical best" refers to a hypothetical profiler that requires zero operations aside from measuring the time at two points — before and after entering the code segment.

To disable any profiling code from interfering with the program, simply define UTL_PROFILER_DISABLE before including the header:

#define UTL_PROFILER_DISABLE
#include "UTL/profiler.hpp"
// - the header is now stripped of any and all code and only provides no-op mocks of the public API,
//   this means effectively no impact on compile times
// - 'profiler.format_results()' now returns "<profiling is disabled>"

A simple & naive way to construct a call graph would be to build a tree of nodes using std::unordered_map<std::string, Node> with the call-site as a key. Such an approach, however, makes the overhead of tree expansion & traversal extremely high, rendering the profiler useless for small tasks.

This library uses a number of thread_local variables (created by macros) to correlate call-sites with integer IDs and reduces tree traversal logic to traversing a "network" of indices encoded as a dense $M \times N$ matrix, where $M$ is the number of call-sites visited by this thread and $N$ is the number of nodes in the call graph.

There are some additional details & arrays, but the bottom line is that by associating everything we can with linearly growing IDs and delaying the "heavy" work as much as possible until thread destruction / formatting, we can reduce almost all common operations outside of time measurement to trivial integer array lookups.

This way, the cost of re-entry into existing call graph nodes (aka the fast path taken most of the time) is reduced to a single array lookup & a branch that gets predicted correctly most of the time.

New call-site entries & new node creation are rare slow paths: they only happen during call-graph expansion and contribute very little to the runtime outside of measuring very deep recursion. By using a std::vector-like allocation strategy for both rows & columns, reallocation can be made amortized $O(1)$.

Memory overhead of profiling is mostly defined by the aforementioned call graph matrix. For example, on a thread that encounters 20 profiling macros and creates 100 call-graph nodes, the memory overhead will be about 8 kB. For a thread that encounters 100 profiling macros and creates 500 call-graph nodes, the memory overhead will be about 0.2 MB.

It is possible to further reduce memory overhead (down to 4 kB and 0.1 MB) by defining the UTL_PROFILER_USE_SMALL_IDS macro before the include:

#define UTL_PROFILER_USE_SMALL_IDS
#include "UTL/profiler.hpp"

This switches the implementation to 16-bit IDs, which limits the maximum number of nodes to 65535. For most practical purposes this should be more than enough, as most machines will hit a stack overflow long before the call graph reaches such depth.

Almost all profiling is lock-free; there are only 3 points at which the implementation needs to lock a mutex:

  • When creating a new thread
  • When joining a thread
  • When manually calling profiler.upload_this_thread()

The entire public API is thread-safe.
