How to Slow Down a Program? And Why It Can Be Useful

Original link: https://stefan-marr.de/2025/08/how-to-slow-down-a-program/

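## Surprisingly Useful Slowdown: Summary

Traditional performance research focuses on *speeding up* programs, but deliberately slowing a program down can bring unexpected benefits. The technique helps detect race conditions (by changing instruction timing), simulate speedups to evaluate an optimization before implementing it (with tools such as Coz), and assess the accuracy of profilers. Current slowdown methods are usually coarse-grained, for example pausing threads or inserting simple operations. This research explores *fine-grained* slowdown by inserting instructions directly into the basic blocks of x86 code. The challenge is to find instructions that slow down execution consistently *without* being optimized away by the CPU. Experiments on an Intel Core i5 processor suggest that `NOP` (no operation) and `MOV regX, regX` (moving a register to itself) are the most reliable candidates. These instructions introduce a predictable slowdown of about 100% while preserving the program's performance behavior as observed by a profiler. The work is a step towards more precise developer tools that use slowdown at the machine-code level for debugging and optimization. Further research will explore the implications and applications of the technique.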

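A Hacker News discussion explores the surprising usefulness of *slowing down* programs. A link to the article describing the idea in detail (stefan-marr.de) was shared. Users discussed that even `NOP` (no operation) instructions can noticeably slow down execution, possibly because of pipeline constraints or architectural requirements that ensure the instruction is actually processed. One commenter noted that without such an enforced delay, NOP instructions inside a loop could be skipped, which would make them ineffective. The conversation also touched on historical examples: the "C64 Snail" for the Commodore 64 and the "Turbo Button" on early PCs. Interestingly, the Turbo Button was *originally* designed to *slow down* newer machines so that they matched the speed of older models and stayed compatible with old software, before it became a mostly decorative feature.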

Original text

Most research on programming language performance asks a variation of a single question: how can we make some specific program faster? Sometimes we may even investigate how we can use less memory. This means a lot of research focuses solely on reducing the amount of resources needed to achieve some computational goal.

So, why on earth might we be interested in slowing down programs then?

## Slowing Down Programs is Surprisingly Useful!

Making programs slower can be useful to find race conditions, to simulate speedups, and to assess how accurate profilers are.

To detect race conditions, we may want to use an approach similar to fuzzing. Instead of exploring a program’s implementation by varying its input, we can explore different instruction interleavings, thread or event schedules, by slowing down program parts to change timings. This approach allows us to identify concurrency bugs and is used by CHESS, WAFFLE, and NACD.
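To make the idea concrete, here is a minimal Python sketch of how an injected delay can expose a lost update. It is not taken from CHESS, WAFFLE, or NACD; the function name `run_with_delay` and the delay value are illustrative assumptions, and the delay is placed by hand rather than explored systematically.

```python
import threading
import time


def run_with_delay(delay_s: float) -> int:
    """Run two unsynchronized increments and return the final counter value."""
    state = {"counter": 0}

    def increment() -> None:
        seen = state["counter"]      # read the shared value
        if delay_s > 0:
            time.sleep(delay_s)      # injected slowdown widens the race window
        state["counter"] = seen + 1  # write, possibly based on a stale read

    threads = [threading.Thread(target=increment) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]


if __name__ == "__main__":
    print("without delay:", run_with_delay(0.0))
    print("with delay:   ", run_with_delay(0.2))
```

With the delay in place, both threads typically read the counter before either one writes it back, so one increment is lost. Without the delay, the same bug exists but rarely shows up, which is why changing timings is a useful way to search for it.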

The Coz profiler is an example of how slowing down programs can be used to simulate speedup. With Coz, we can estimate whether an optimization is beneficial before implementing it. Coz simulates it by slowing down all other program parts. The part we think might be optimizable stays at the same speed it was before, but is now virtually sped up, which allows us to see whether it gives enough of a benefit to justify a perhaps lengthy optimization project.
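The arithmetic behind this can be sketched in a few lines of Python. This is a deliberately simplified, sequential model and not Coz's actual algorithm, which works on threads and samples; `virtual_speedup`, `t_part`, and `t_rest` are hypothetical names introduced for illustration.

```python
def virtual_speedup(t_part: float, t_rest: float, factor: float) -> float:
    """Estimate the run time if `t_part` were `factor` times faster.

    Instead of speeding up the part, slow down everything else by `factor`
    and then scale the measured total back down by the same factor.
    """
    slowed_total = t_part + factor * t_rest  # what we would actually measure
    return slowed_total / factor             # equals t_part / factor + t_rest


if __name__ == "__main__":
    baseline = 4.0 + 6.0
    estimate = virtual_speedup(t_part=4.0, t_rest=6.0, factor=2.0)
    print(f"baseline {baseline:.1f}s, estimated {estimate:.1f}s, "
          f"overall speedup {baseline / estimate:.2f}x")
```

In this model, making a part that accounts for 4 of 10 seconds twice as fast would bring the total down to 8 seconds. Whether a 1.25x overall gain justifies the optimization effort is exactly the question such an estimate is meant to answer.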

And, as mentioned before, we can also use it to assess how accurate profilers are. Though, I’ll leave this for the next blog posts. :)

The current approaches to slowing down programs for these use cases are rather coarse-grained, though. Race detection often adapts the scheduler or uses APIs such as Thread.sleep(). Similarly, Coz pauses the execution of the other threads. Work on measuring whether profilers give actionable results inserts bytecodes into Java programs that compute Fibonacci numbers.

By using more fine-grained slowdowns, we think we could make race detection, speedup estimation, and profiler accuracy assessments more precise. Thus, we looked into inserting slowdown instructions into basic blocks.

## Which x86 Instructions Allow us to Consistently Slow Down Basic Blocks?

Let’s assume we run on some x86 processor, and we are looking at programs from the perspective of processors.

When running a benchmark like Towers, the OpenJDK’s HotSpot JVM may compile it to x86 instructions like this:

mov dword ptr [rsp+0x18], r8d
mov dword ptr [rsp], ecx
mov qword ptr [rsp+0x20], rsi
mov ebx, dword ptr [rsi+0x10]
mov r9d, edx
cmp edx, 0x1
jnz 0x... <Block 55>

This is one of the basic blocks produced by HotSpot’s C2 compiler. For our purposes, it suffices to see that there are some memory accesses with the mov instructions, and we end up checking whether the edx register contains the value 1. If that’s not the case, we jump to Block 55. Otherwise, execution continues in the next basic block. A key property of a basic block is that there’s no control flow inside of it, which means once it starts executing, all of its instructions will execute.
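As a rough illustration of what "no control flow inside" means, the following Python sketch cuts a linear instruction listing into blocks after each control-transfer instruction. It is a simplification and not how HotSpot's C2 builds its blocks: a real splitter must also start a new block at every jump target, and the `TERMINATORS` set is an assumed, incomplete list.

```python
# Mnemonics treated as block terminators in this sketch (assumed, incomplete).
TERMINATORS = {"jmp", "jz", "jnz", "je", "jne", "ret"}


def split_into_blocks(instructions: list[str]) -> list[list[str]]:
    """Cut a linear instruction listing after each control-transfer instruction."""
    blocks: list[list[str]] = []
    current: list[str] = []
    for instr in instructions:
        current.append(instr)
        mnemonic = instr.split()[0].lower()
        if mnemonic in TERMINATORS:
            blocks.append(current)  # the jump closes the current block
            current = []
    if current:  # trailing instructions without a terminator
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    listing = [
        "mov r9d, edx",
        "cmp edx, 0x1",
        "jnz 0x... <Block 55>",
        "mov eax, ebx",  # hypothetical instruction starting the next block
        "ret",
    ]
    for number, block in enumerate(split_into_blocks(listing)):
        print(number, block)
```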

Though, how can we slow it down?

x86 has many different instructions one could try to insert into the block, each of which will probably consume CPU cycles. However, modern CPUs try to execute as many instructions as possible at the same time using out-of-order execution. This means that instructions in our basic block that do not directly depend on each other might be executed at the same time. For instance, the first three mov instructions access neither the same register nor the same memory location, so the order in which they are executed does not matter. Which optimizations a CPU applies, though, depends on the program and the specific CPU generation, or rather microarchitecture.

To find suitable instructions to slow down basic blocks, we experimented only on an Intel Core i5-10600 CPU, which has the Comet Lake-S microarchitecture. On other microarchitectures, things can be very different.

For the slowdown that we want, we can use nop or mov regX, regX instructions on Comet Lake-S. This mov moves the value from register X to itself, so it basically does nothing. These two instructions give us a slowdown that is small enough to slow down most blocks accurately to a desired target speed, and the slowdown seems to affect only the specific block it is meant for.

Our basic block from earlier would then perhaps end up with nop instructions interleaved after each instruction. In practice, the number of instructions we need to insert depends on how much time a basic block takes in the program. Though, for illustration, it might look like this:

mov dword ptr [rsp+0x18], r8d
nop
mov dword ptr [rsp], ecx
nop
mov qword ptr [rsp+0x20], rsi
nop
mov ebx, dword ptr [rsi+0x10]
nop
mov r9d, edx
nop
cmp edx, 0x1
nop
jnz 0x... <Block 55>
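The interleaving in this listing can be sketched as a small transformation over a block's instructions. This is an illustrative Python sketch, not the authors' tooling: `interleave_nops` and its `nops_per_instruction` parameter are hypothetical names, and a real implementation would operate on machine code inside the compiler rather than on strings.

```python
def interleave_nops(block: list[str], nops_per_instruction: int = 1) -> list[str]:
    """Insert `nop`s after every instruction except the final one.

    The last instruction of a basic block is its control transfer, so padding
    placed after it would only run on the fall-through path.
    """
    if not block:
        return []
    slowed: list[str] = []
    for instr in block[:-1]:
        slowed.append(instr)
        slowed.extend(["nop"] * nops_per_instruction)
    slowed.append(block[-1])  # keep the terminator as the last instruction
    return slowed


if __name__ == "__main__":
    block = [
        "mov dword ptr [rsp+0x18], r8d",
        "mov dword ptr [rsp], ecx",
        "mov qword ptr [rsp+0x20], rsi",
        "mov ebx, dword ptr [rsi+0x10]",
        "mov r9d, edx",
        "cmp edx, 0x1",
        "jnz 0x... <Block 55>",
    ]
    print("\n".join(interleave_nops(block)))
```

Raising `nops_per_instruction` is one simple way to model that longer-running blocks need more padding to reach the same relative slowdown.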

We tried six different candidates, including a push-pop sequence, to get a better impression of how Comet Lake-S deals with them. For more details of how and what we tried, please have a look at our short paper below, which we will present at the VMIL workshop.

When inserting these instructions into basic blocks, so that each individual basic block takes about twice as much time as before, we end up with a program that indeed is overall twice as slow, as one would hope. Even better, when we look at the Towers benchmark with the async-profiler for HotSpot, and compare the proportions of run time it attributes to each method, the slowed-down and the normal version match almost perfectly, as illustrated below. The same is not true for the other candidates we looked at.

**Figure 1:** A scatter plot per slowdown instruction with the median run-time percentage for the top six Java methods of Towers. The *X=Y* diagonal indicates that a method’s run‐time percentage remains the same with and without slowdown.
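Why a uniform slowdown keeps methods on the *X=Y* diagonal can be checked with a little arithmetic. The Python sketch below uses made-up timings and hypothetical method names; it only models the proportions and says nothing about how a real CPU reacts to the inserted instructions.

```python
def method_shares(times: dict[str, float]) -> dict[str, float]:
    """Return each method's share of the total run time, in percent."""
    total = sum(times.values())
    return {name: 100.0 * t / total for name, t in times.items()}


def slow_down(times: dict[str, float], factors: dict[str, float]) -> dict[str, float]:
    """Scale each method's time by its slowdown factor (1.0 means unchanged)."""
    return {name: t * factors.get(name, 1.0) for name, t in times.items()}


if __name__ == "__main__":
    # Made-up timings in seconds, purely for illustration.
    baseline = {"method_a": 6.0, "method_b": 3.0, "method_c": 1.0}
    uniform = slow_down(baseline, {name: 2.0 for name in baseline})
    skewed = slow_down(baseline, {"method_a": 3.0})
    print(method_shares(baseline))
    print(method_shares(uniform))  # same shares as the baseline
    print(method_shares(skewed))   # method_a's share grows
```

If every part of the program becomes twice as slow, the total doubles and every share stays the same. A candidate instruction that slows some blocks more than others shifts the shares, which is the kind of bias the figure makes visible.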

The paper has a few more details, including a more detailed analysis of the slowdown each candidate introduces, how precise the slowdown is for all basic blocks in the benchmark, and whether it makes a difference when we put the slowdown all at the beginning, interleaved, or at the end.

Of course, this work is merely a stepping stone to more interesting things, which I will look at in a bit more detail in the next post.

Until then, the paper is linked below, and questions, pointers, and suggestions are welcome on Mastodon, BlueSky, or Twitter.

## Abstract

Slowing down programs has surprisingly many use cases: it helps finding race conditions, enables speedup estimation, and allows us to assess a profiler’s accuracy. Yet, slowing down a program is complicated because today’s CPUs and runtime systems can optimize execution on the fly, making it challenging to preserve a program’s performance behavior to avoid introducing bias.

We evaluate six x86 instruction candidates for controlled and fine-grained slowdown including NOP, MOV, and PAUSE. We tested each candidate’s ability to achieve an overhead of 100%, to maintain the profiler-observable performance behavior, and whether slowdown placement within basic blocks influences results. On an Intel Core i5-10600, our experiments suggest that only NOP and MOV instructions are suitable. We believe these experiments can guide future research on advanced developer tooling that utilizes fine-granular slowdown at the machine-code level.

Evaluating Candidate Instructions for Reliable Program Slowdown at the Compiler Level: Towards Supporting Fine-Grained Slowdown for Advanced Developer Tooling
H. Burchell, S. Marr; In Proceedings of the 17th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages, VMIL'25, p. 8, ACM, 2025.
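Paper: https://stefan-marr.de/downloads/vmil25-burchell-marr-evaluating-candidate-instructions-for-reliable-program-slowdown-at-the-compiler-level.pdf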
DOI: 10.1145/3759548.3763374

BibTeX:

@inproceedings{Burchell:2025:SlowCandidates,
  abstract = {Slowing down programs has surprisingly many use cases: it helps finding race conditions, enables speedup estimation, and allows us to assess a profiler's accuracy. Yet, slowing down a program is complicated because today's CPUs and runtime systems can optimize execution on the fly, making it challenging to preserve a program's performance behavior to avoid introducing bias.
  
  We evaluate six x86 instruction candidates for controlled and fine-grained slowdown including NOP, MOV, and PAUSE. We tested each candidate’s ability to achieve an overhead of 100%, to maintain the profiler-observable performance behavior, and whether slowdown placement within basic blocks influences results. On an Intel Core i5-10600, our experiments suggest that only NOP and MOV instructions are suitable. We believe these experiments can guide future research on advanced developer tooling that utilizes fine-granular slowdown at the machine-code level.},
  author = {Burchell, Humphrey and Marr, Stefan},
  booktitle = {Proceedings of the 17th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages},
  doi = {10.1145/3759548.3763374},
  isbn = {979-8-4007-2164-9/2025/10},
  keywords = {Benchmarking HotSpot ISA Instructions Java MeMyPublication assembly evaluation myown slowdown x86},
  location = {Singapore},
  month = oct,
  pages = {8},
  pdf = {https://stefan-marr.de/downloads/vmil25-burchell-marr-evaluating-candidate-instructions-for-reliable-program-slowdown-at-the-compiler-level.pdf},
  publisher = {{ACM}},
  series = {VMIL'25},
  title = {{Evaluating Candidate Instructions for Reliable Program Slowdown at the Compiler Level: Towards Supporting Fine-Grained Slowdown for Advanced Developer Tooling}},
  year = {2025},
  month_numeric = {10}
}