Reproducible C++ builds by logging Git hashes

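## Tracking code versions in experiment output

The author regularly hits the same problem in research: working out exactly which version of the code produced a given output file, especially while an algorithm is under rapid development with many tweaks and configuration options. Logging a version number is easy once software is properly deployed and versioned, but not at this stage. The solution is to embed the Git commit hash in the log file. A script uses `git rev-parse HEAD` to generate a C++ header that defines a `GIT_COMMIT_HASH` macro holding the current commit’s hash. The script is wired into the CMake build, so the hash is compiled into the program. To handle uncommitted changes (“dirty” builds), the script appends “-dirty” to the hash, and a runtime check warns the user when the program was built from uncommitted code. Improvements are possible, such as recording *which* files are dirty or including diffs, but the current system gives enough traceability for personal research: the code state behind any given output is easy to identify.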

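## Reproducible C++ builds: summary

The Hacker News discussion centres on reproducible C++ builds, meaning that the same source code always yields exactly the same binary. The original post describes a simple system that records the Git hash at compile time to track the state of the codebase, so that results can be reproduced later. Commenters were quick to point out that this is only a *first step*: logging the Git hash helps with traceability, but it does not guarantee bit-for-bit identical builds. Several tools were suggested for a more robust solution: `git describe` offers a simpler way to capture commit information, while Nix and Guix provide comprehensive dependency and environment management. The conversation highlights how complex truly reproducible builds are. They involve more than source code: toolchain versions and external libraries have to be controlled, and non-deterministic elements such as timestamps have to be eliminated. The end goal is a traceable “bill of materials” for every build, which makes debugging and verification reliable. Many users shared their experience of reaching this level of control with tools such as ClearCase, Bazel, and virtualization.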

Original article

November 14, 2025

Sometimes I am in the difficult situation where I have written a program which writes some kind of output to disk, and I want to remember which version of my program produced this output. This is really common for me at the moment due to my research, which always seems to involve a lot of trial and error algorithm design. I think that similar problems exist in all kinds of other areas, but particularly during rapid development, because once software has been properly deployed and versioned it’s quite trivial to just put the version number in the logs.

For a slightly more, but not very, concrete example: I’m working on an algorithm implementation right now. I won’t say too much about the details yet, but it takes a number of configuration options. These, of course, I can quite easily write to the log file. The program also has lots of implementation details that can be tweaked, really dozens of things I could change, and I keep coming up with new ideas I want to try. This means that I end up with a folder full of outputs generated by code that probably doesn’t even exist anymore, and which I wouldn’t be able to reproduce purely by running the current version of the program with whatever configuration options are specified in the log file.

This also isn’t the first time I’ve had a very similar problem. I assume (hope) it’s not just me, so I thought I’d write up the solution I came up with.

Git commit hashes

As you likely know, you can identify git commits by their hashes, which are long strings of hexadecimal digits, such as b5a994c260105b7cc979aead986532b51c37df75. Specifically, they are 40 characters long, and are the result of hashing the commit object (which records the file tree, the parent commits, the author, and the commit message) with SHA-1.

My idea is pretty simple: make the program write the current commit’s hash to the log file. Then, given any log file, I can see the commit used to generate it, and go back in the git history to see exactly what my code was doing at that point.
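To make the "given any log file" step concrete, here is a minimal sketch of pulling the hash back out of a log line. It assumes the log contains a line of the form `git_commit_hash: <hash>`, which is the format used later in this post; the function name `extract_commit_hash` is made up for illustration, and the code needs C++17.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: pulls the commit hash out of one log line of the
// assumed form "git_commit_hash: <40 hex digits>[-dirty]".
std::optional<std::string> extract_commit_hash(std::string_view line) {
    constexpr std::string_view prefix = "git_commit_hash: ";
    constexpr std::size_t sha1_hex_length = 40;

    if (line.substr(0, prefix.size()) != prefix) {
        return std::nullopt;  // not the line we are looking for
    }
    line.remove_prefix(prefix.size());
    if (line.size() < sha1_hex_length) {
        return std::nullopt;  // too short to hold a full hash
    }
    const std::string_view hash = line.substr(0, sha1_hex_length);
    for (const char c : hash) {
        if (!std::isxdigit(static_cast<unsigned char>(c))) {
            return std::nullopt;  // a SHA-1 hash is hexadecimal only
        }
    }
    return std::string(hash);  // any "-dirty" suffix is dropped
}
```

The returned string can then be handed to `git checkout` or `git show` to get back to the code that produced the output.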

Basic Implementation

How to integrate this into the logs? A super easy but incorrect approach would be to invoke git from my program directly, retrieve the hash of the current commit, and write it to the log file. This doesn’t work though, because that will give us the git commit state at runtime, whereas we want to know what commit was used when compiling the code.

What we actually need to do is integrate the commit hash into the build system. Since I’m writing my code in C++, the natural way to implement compile-time information like this is to #define it, so let’s start by writing a script which builds a C++ header file to do just that:

```bash
#!/usr/bin/bash
# Print a C++ header that defines the current commit hash as a string literal.
commit_hash=$(git rev-parse HEAD)
echo "#pragma once"
echo "#define GIT_COMMIT_HASH \"${commit_hash}\""
```

Fairly simple: we’re just defining a macro GIT_COMMIT_HASH with a string literal of whatever git rev-parse HEAD says, which will be the hash of the currently checked-out commit. I will say, there’s probably a “proper” C++ way to define compile-time literals like this, with “proper” type checking or something, but #define is good enough.
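For the curious, one possible "proper" version is a thin typed wrapper around the macro. This is only a sketch, assuming C++17; the `build_info` namespace and `starts_with_sha1` are names invented here, and the fallback `#define` exists only so the snippet compiles without the generated header.

```cpp
#include <cstddef>
#include <string_view>

// Stand-in so the sketch compiles on its own; in the real build this macro
// comes from the generated git_info.h.
#ifndef GIT_COMMIT_HASH
#define GIT_COMMIT_HASH "b5a994c260105b7cc979aead986532b51c37df75"
#endif

// Typed, namespaced constant (names are illustrative, not from the post).
namespace build_info {
inline constexpr std::string_view commit_hash = GIT_COMMIT_HASH;

// True if the first 40 characters are lowercase hexadecimal digits.
constexpr bool starts_with_sha1(std::string_view s) {
    if (s.size() < 40) return false;
    for (std::size_t i = 0; i < 40; ++i) {
        const char c = s[i];
        const bool is_hex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
        if (!is_hex) return false;
    }
    return true;
}

// Fails the build if the generated header is empty or malformed, for
// example because the script ran outside a git repository.
static_assert(starts_with_sha1(commit_hash),
              "GIT_COMMIT_HASH is not a commit hash");
}  // namespace build_info
```

The `static_assert` is the main gain over the bare macro: a broken header stops the build instead of silently producing logs with an empty hash.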

The final step is to run this script on every build. For reference (because I always forget), CMAKE_BINARY_DIR is the top-level build directory, which for me is build/ and is where I run cmake from; and CMAKE_SOURCE_DIR is the root of the cmake project, i.e. where the top-level CMakeLists.txt is. I appended the following to my CMakeLists.txt:

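```cmake
# Regenerate git_info.h in the build directory on every build.
add_custom_target(git_info ALL
  COMMAND scripts/gen_git_info.sh > ${CMAKE_BINARY_DIR}/git_info.h
  WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
  COMMENT "Generating git info header"
)

# Build the header before my_program, and let the compiler find it.
add_dependencies(my_program git_info)
include_directories(${CMAKE_BINARY_DIR})
```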

This target just tells cmake that I want to run that specified command, which runs our script and writes the output to a new header file in the build directory. Since I’m writing the header file to the build directory, I have to then add that as an include directory for my program. Of course, if you’re using a plain makefile, you need some other method. It’s probably even simpler, maybe make a phony target for running the script and producing git_info.h, and make it a dependency.

From C++ it’s then very simple:

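```cpp
#include <iostream>
#include <string>
#include "git_info.h"  // generated into the build directory by the script

// Somewhere early in main():
std::string git_info = GIT_COMMIT_HASH;
std::cout << "git_commit_hash: " << git_info << "\n";
```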

Nice! In my particular program, I redirect stdout to my log file, so this is sufficient for me…

…Almost.

Uncommitted Code?

Most of you will have noticed by now that there is a problem here. It is assumed, or hoped, that I’m always compiling from committed code. This definitely isn’t always the case during rapid development, but I can definitely constrain myself to only run “proper” experiments using code that I’ve actually committed. Still, I don’t want to confuse myself by incorrectly thinking that some code from a “not proper” experiment is compiled from a certain commit directly.

The fix I chose is the simplest possible option: I will append “-dirty” to the commit hash if the compiled code has not been committed:

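```bash
#!/usr/bin/bash
commit_hash=$(git rev-parse HEAD)
# Compare the working tree against HEAD so that staged changes count as dirty
# too (plain `git diff` only sees unstaged ones). Untracked files are ignored.
dirty=$(git diff --quiet HEAD || echo "-dirty")
echo "#pragma once"
echo "#define GIT_COMMIT_HASH \"${commit_hash}${dirty}\""
```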

And just to make everything extra explicit (since I don’t want to forget to commit code if I want to run a “proper” experiment), I can add the following:

```cpp
// std::string::ends_with requires C++20.
if (git_info.ends_with("dirty")) {
  std::cout << "note: you're running a build with non-committed changes, "
               "which may limit reproducibility\n";
}
```

The way my code works, this cout runs before stdout starts being redirected to a log file, so I can see this warning on the command-line, allowing me to quickly stop and recompile if I want to. Or I can just run it anyway, if I don’t care.
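If the project has to build as C++17, where `std::string::ends_with` is not available, the same check can be written by hand. A minimal sketch, with `is_dirty_build` being a name invented here; unlike the check above, it matches the full "-dirty" marker including the hyphen.

```cpp
#include <string_view>

// Hypothetical helper: true if `hash` ends with the "-dirty" marker that the
// script appends. Works in C++17, where std::string::ends_with is unavailable.
bool is_dirty_build(std::string_view hash) {
    constexpr std::string_view marker = "-dirty";
    return hash.size() >= marker.size() &&
           hash.compare(hash.size() - marker.size(), marker.size(), marker) == 0;
}
```

It would be called as `is_dirty_build(GIT_COMMIT_HASH)` in place of the `ends_with` test.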

Improvements

This system works pretty nicely for me. It doesn’t have to be that professional, because it’s just a research project. I doubt anyone else will look at the log files, let alone the code. However, it definitely could be improved.

Mainly, it would be nice to actually record which files are dirty, in the case of a dirty build. This would again mean defining a new macro, this time holding a list of those files. It could even be defined so that it initialises a C++ vector, for easy printing!
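On the C++ side, that idea might look roughly like the sketch below. Everything here is an assumption: the script in this post does not emit a `GIT_DIRTY_FILES` macro, the file names are placeholders, and `format_dirty_files` is a made-up helper.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for a macro the generator script could emit (hypothetical; the
// script in this post does not produce it). An empty list would be `{}`.
#ifndef GIT_DIRTY_FILES
#define GIT_DIRTY_FILES {"src/solver.cpp", "include/solver.h"}
#endif

// The brace list initialises a vector directly.
const std::vector<std::string> git_dirty_files = GIT_DIRTY_FILES;

// Formats the list as one log line, e.g. "git_dirty_files: a.cpp, b.h".
std::string format_dirty_files(const std::vector<std::string>& files) {
    std::ostringstream out;
    out << "git_dirty_files:";
    for (std::size_t i = 0; i < files.size(); ++i) {
        out << (i == 0 ? " " : ", ") << files[i];
    }
    return out.str();
}
```

A script generating this macro would have to escape quotes and backslashes in file names so that the result is still a valid list of string literals.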

On a similar note, we only care about changes to source code. My repository has a few other files, like some Python scripts to plot the output, and also some configuration files for other things. If those are changed, I don’t want to claim that it’s a dirty build. Working around this would be a bit more work, but if I did have a list of dirty files, I could just check if any of those are in the src/ or include/ directories, for example.
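Given such a list, the directory check is only a few lines. This is a sketch under assumptions: the function names are invented, the paths are taken to be relative to the repository root, and the directory names are the src/ and include/ examples from the paragraph above.

```cpp
#include <string>
#include <string_view>
#include <vector>

// True if `path` lies under one of the directories that feed the compiler.
// The directory names follow the post's example; adjust to your layout.
bool is_source_path(std::string_view path) {
    constexpr std::string_view source_dirs[] = {"src/", "include/"};
    for (const std::string_view dir : source_dirs) {
        if (path.substr(0, dir.size()) == dir) return true;
    }
    return false;
}

// A build counts as dirty only if at least one modified file is source code.
bool is_dirty_source_build(const std::vector<std::string>& dirty_files) {
    for (const std::string& file : dirty_files) {
        if (is_source_path(file)) return true;
    }
    return false;
}
```

With this, a modified plotting script on its own would no longer trigger the warning, while any change under src/ or include/ still would.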

Finally, it could save even richer information, like diffs of the dirty files (so that I could reproduce dirty builds), and even library version numbers.
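Some of that richer information is available without any script at all, because compilers predefine macros describing themselves. The sketch below logs the compiler and language standard; `build_environment_summary` is a name invented here, and it covers only the toolchain, not third-party libraries.

```cpp
#include <sstream>
#include <string>

// Returns a one-line description of the compiler and language standard,
// built only from macros the compiler predefines.
std::string build_environment_summary() {
    std::ostringstream out;
#if defined(__clang__)
    out << "compiler: clang " << __clang_major__ << "." << __clang_minor__;
#elif defined(__GNUC__)
    out << "compiler: gcc " << __GNUC__ << "." << __GNUC_MINOR__;
#elif defined(_MSC_VER)
    out << "compiler: msvc " << _MSC_VER;
#else
    out << "compiler: unknown";
#endif
    out << ", __cplusplus: " << __cplusplus;
    return out.str();
}
```

Many libraries expose version macros in their headers that could be appended in the same way, though which ones exist depends on the library.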

But for now, this is good enough.