Dissecting the CPU-memory relationship in garbage collection (OpenJDK 26)

Original link: https://norlinder.nu/posts/GC-Cost-CPU-vs-Memory/

## The Evolving Cost of Java Garbage Collection

For decades, Java's garbage collection (GC) has managed memory automatically, freeing developers from manual lifecycle management. That convenience, however, is paid for in CPU cycles. GC performance has traditionally been measured by pause time, but as GC algorithms have evolved, that metric has become increasingly unreliable.

Modern GCs introduce several kinds of complexity: *explicit costs* (CPU cycles consumed by dedicated GC tasks), *implicit costs* (barriers injected into application code), and *microarchitectural effects* (cache impact). Parallel GC trades CPU for shorter pauses, while concurrent collectors such as G1 and ZGC shift work into the background, obscuring total CPU overhead. ZGC targets minimal pause times, but it does not eliminate the work; it only amortizes it.

This shift means that pause time no longer accurately reflects GC efficiency, and Amdahl's law further limits the benefits of parallelization. To address this, OpenJDK 26 introduces new mechanisms, unified logging via `-Xlog:cpu` and the `MemoryMXBean.getTotalGcCpuTime()` method, to provide precise accounting of the GC's explicit CPU cost.

These tools enable informed decisions about heap sizing and GC algorithm selection, moving beyond reactive pause-time tuning toward proactive resource management. By exposing the true computational cost, developers and researchers can optimize for both throughput and latency, ultimately building more efficient and cost-effective Java applications.

A JVM engineer at OpenJDK developed a new telemetry framework in OpenJDK 26 to better understand and quantify the CPU overhead of garbage collection (GC). Having studied GC during doctoral research, the author found existing tools inadequate for modern concurrent collectors, since pause time alone cannot reveal the full performance impact. The new API lets developers measure GC-related CPU costs precisely, in particular the trade-off between CPU usage and memory management. This closes a blind spot in performance analysis, going beyond pause time to account for costs such as object traversal, object relocation, thread suspension, and memory barriers.

One commenter highlighted the interface's usefulness for tracking down GC-related issues and asked how GC impact could be correlated with application-thread performance, suggesting integration with OpenTelemetry and attaching GC time to spans for better analysis. The author is available to answer questions about the article and the implementation.

Original article

1. Background

Since the popularization of garbage collection (GC) in Lisp almost 70 years ago, managed runtimes have provided developers with a kind of magic: automatic memory management. This freed programmers from complex manual lifecycle management. This, along with many other ideas, influenced the design of Smalltalk. Following this lineage, Smalltalk was also one of several languages that inspired the authors of Java, the language and runtime I spend my days improving.

While the programmer was liberated, the CPU was not. The GC now sat on the critical path to reclaim memory, accruing a debt that could not be deferred forever. For decades, settling this debt meant pausing the application entirely, or “stopping the world” in GC parlance. The collector would halt the application and scan the heap to identify live data and reclaim reusable memory. In the single-core era, the pause time served as a reliable proxy for machine load.

1.1. The GC Cost Taxonomy

To reason about the performance implications of GC, we need to decompose it into three dimensions as depicted in Figure 1.

Figure 1: The three dimensions of GC cost. A source-level write such as `n.next = newNode;` actually executes with GC-injected code around it: a pre-barrier (e.g., enqueueing `n.next` while marking is active) and a post-barrier (e.g., updating a card table). Meanwhile, GC scans of cold data evict “hot” application data from the L3 cache, causing cache misses when the application resumes.
  1. Explicit GC cost

    The CPU cycles consumed by dedicated GC threads performing tasks such as: traversing the object graph to find live data, relocating memory to free space, or updating references.

  2. Implicit GC cost

    Code may be injected directly into the application to support specific GC capabilities. These are often referred to as barriers and are required for features such as reference counting, tracking object age (generations), or ensuring heap consistency when objects move concurrently.
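To make the idea of an injected barrier concrete, here is a minimal, purely illustrative Java sketch of a generational post-barrier. The `CardTable` class, the 512-byte card size, and the integer “addresses” are all invented for illustration; HotSpot emits its real barriers as a few JIT-compiled machine instructions, not library calls.

```java
// Illustrative sketch only: CardTable and the integer "addresses" are
// invented to show the *shape* of a generational post-barrier.
class CardTable {
    static final int CARD_SHIFT = 9;                       // 512-byte cards
    final byte[] cards;
    CardTable(int heapBytes) { cards = new byte[(heapBytes >> CARD_SHIFT) + 1]; }
    void markDirty(int address)  { cards[address >> CARD_SHIFT] = 1; }
    boolean isDirty(int address) { return cards[address >> CARD_SHIFT] != 0; }
}

class Node {
    final int address;   // pretend heap address of this object
    Node next;
    Node(int address) { this.address = address; }
}

class BarrierDemo {
    static final CardTable TABLE = new CardTable(1 << 20); // pretend 1 MB heap

    // What the programmer wrote: n.next = newNode;
    // What executes after GC instrumentation:
    static void update(Node n, Node newNode) {
        n.next = newNode;            // the original store
        TABLE.markDirty(n.address);  // post-barrier: remember that this card may
                                     // now hold an old-to-young pointer to rescan
    }

    public static void main(String[] args) {
        Node old = new Node(4096), young = new Node(8192);
        update(old, young);
        System.out.println("Card dirty: " + TABLE.isDirty(old.address));
    }
}
```

The point of the sketch is that this bookkeeping runs on the application thread for every instrumented store, which is exactly why the implicit cost is so hard to attribute from the outside.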

  3. Microarchitectural effects

    GC also impacts the memory subsystem. It can degrade performance by evicting application data from CPU caches or, alternatively, enhance it by rearranging objects to improve spatial locality.

Measuring the implicit GC cost is difficult. Blackburn and Hosking (2004) [1] augmented Jikes RVM (a VM optimized for research) to establish a baseline without barriers for comparison. However, such approaches do not easily lend themselves to a performance-optimized VM like OpenJDK.

As I will show next, the components of explicit GC cost have expanded, making GC pauses a less powerful proxy for computational efficiency, while our tools to measure them have not kept pace. In Section 2, I present the new Java API in JDK 26 for querying a GC’s explicit cost.

1.2. The Single-Threaded Pause

In OpenJDK, Serial GC exemplifies the classical single-core approach: when the heap is full, application execution halts entirely while the collector reclaims space. As Figure 2 illustrates, this mechanism effectively converts memory pressure into paused time.

Computer Science 101: Wall-Clock vs. CPU Time

Wall-clock time measures the elapsed duration of execution. CPU time quantifies the aggregate time the CPU was actively executing the application.

In a single-threaded, compute-bound scenario, these metrics converge. Conversely, in multi-core environments, they decouple. The ratio \(\frac{\text{CPU time}}{\text{wall-clock time}}\) approximates the average number of cores utilized during execution. This distinction is critical for performance analysis: it decouples responsiveness from efficiency.
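The ratio above is straightforward to compute from standard JDK APIs. The sketch below (using the `com.sun.management.OperatingSystemMXBean` extension, as the article's later example also does) spins two compute-bound threads and reports CPU time divided by wall-clock time; on an otherwise idle machine the result should land close to 2.0, demonstrating how the two metrics decouple on multi-core hardware.

```java
import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;

class CpuVsWallClock {
    // Spins `threads` compute-bound workers for roughly `spinNanos` and
    // returns CPU time / wall-clock time: the average number of cores used.
    static double averageCoresUtilized(int threads, long spinNanos)
            throws InterruptedException {
        OperatingSystemMXBean os =
                ManagementFactory.getPlatformMXBean(OperatingSystemMXBean.class);
        long wallStart = System.nanoTime();
        long cpuStart = os.getProcessCpuTime();

        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                long deadline = System.nanoTime() + spinNanos;
                while (System.nanoTime() < deadline) { /* burn cycles */ }
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();

        double wall = System.nanoTime() - wallStart;
        double cpu = os.getProcessCpuTime() - cpuStart;
        return cpu / wall;
    }

    public static void main(String[] args) throws InterruptedException {
        // On an otherwise idle machine, two busy threads should report near 2.0.
        System.out.printf("Average cores utilized: %.2f%n",
                averageCoresUtilized(2, 300_000_000L));
    }
}
```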

This visible cost drove an obsession with pause times across both industry and academia. Because long pauses were so destructive, we spent decades engineering them away. We leveraged the generational hypothesis to segment objects by age [2] and built dashboards to alert on every pause time spike. The definition was strict: application time is productive; pause time is overhead. This mental model enabled developers to reason about performance costs as a batch processing equation.

The batch processing mental model also clarified the fundamental trade-off: memory buys throughput. Expanding the heap allows the JVM to defer collection, reducing the cumulative cost of pauses. Conversely, constraining memory forces the collector to intervene more frequently, burning CPU cycles just to keep the application afloat (Figure 3).

However, Figure 3 reveals where this abstraction fractures. First, throughput is not determined solely by pause time. Every entry into a GC cycle incurs a safepoint penalty—the CPU cost of synchronizing threads to a halt. At high frequencies, this administrative overhead accumulates, leading to observable overhead in application execution. Second, the mapping between pause time and user latency breaks down. As the interval between GC cycles shrinks, an application’s function is statistically more likely to be interrupted multiple times. As noted by [3], this compounding latency means a user’s experience is no longer bounded by the duration of a single stop, but by the sum of a chain of interruptions.

To see what this means in practice, imagine a web server handling requests during a busy period. When memory pressure is high and GC cycles are frequent, even short pauses can accumulate. A single HTTP request may arrive just before a GC pause starts and then, before it finishes processing, be interrupted again by the next pause. This chain of brief stutters can turn what should be a smooth interaction into a frustrating wait for the user, as their request is repeatedly delayed behind internal housekeeping. Suddenly, the user’s experience isn’t limited by pause time, but by unpredictable total disruption caused by these overlapping safepoint costs.

1.3. The Multi-Threaded Pause

The arrival of multi-core CPUs provided more workers, presenting two fundamental design options: brute-force the pause (parallelism) or run alongside the application (concurrency). While more cores offer the potential for better performance, any cores that remain idle during parts of the application’s execution still incur costs, especially in a cloud environment where billing is based on provisioned resources. Hence, provisioning inefficiency directly translates into higher operational expenses, as organizations pay for the time extra CPUs spend waiting for the next burst of work. Making efficient use of every core is a technical concern as well as a budgetary one.

Parallel GC uses parallelism to reduce the pause time. It is essentially a multi-threaded evolution of Serial GC, defaulting to utilize all available cores to minimize the pause duration. This effectively allowed developers to apply a parallelized batch processing mental model to reason about how the GC trades CPU cycles for memory.

Consider the single-threaded workload from Figure 2, re-deployed on a dual-core instance using Parallel GC in Figure 4. By distributing reclamation work across both cores, the collector halves the pause duration, yielding a 5% net boost in throughput.

The trade is explicit: we leverage hardware parallelism to reduce the stop-the-world window. Crucially, the total CPU time for GC remains constant; the work is simply parallelized, not eliminated. However, this introduces a provisioning inefficiency: the second core remains idle during the single-threaded application phase, utilized only to accelerate the cleanup.
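Assuming, for illustration, that the workload spends roughly 10% of its wall-clock time paused (a number chosen here to reproduce the ~5% figure above, not taken from the article's measurements), the arithmetic works out as follows:

```java
class ParallelPauseMath {
    // Throughput speedup from dividing a stop-the-world pause across `cores`,
    // with total GC CPU time held constant (work is parallelized, not removed).
    static double speedup(double pauseFraction, int cores) {
        double appTime = 1.0 - pauseFraction;            // application phase, unchanged
        double newTotal = appTime + pauseFraction / cores; // pause shrinks by 1/cores
        return 1.0 / newTotal;                           // old total time was 1.0
    }

    public static void main(String[] args) {
        // A 10% pause halved on 2 cores: total time 1.00 -> 0.95,
        // i.e. roughly a 5% throughput boost.
        System.out.printf("Throughput gain: %.1f%%%n",
                (speedup(0.10, 2) - 1.0) * 100);
    }
}
```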

1.4. From Batch Processing to Background Work

While Parallel GC reduced the pause, its pause time remains bounded by the size of the live set and Amdahl’s Law. Amdahl’s Law [4], depicted in Figure 5, defines the theoretical upper bound on speedup. In an ideal world (100% parallel), 20 cores purchase a 20x speedup. But introduce just 1% serial execution (99% parallel), and the currency devalues: 20 cores yield only 17x. At 64 cores, the return collapses to just 39x.

Think of this as hardware inflation. You are paying for 64 cores, but the purchasing power of that silicon has eroded by nearly 40%. The cost of speed inflates until the currency—additional cores—becomes practically worthless. Consequently, relying solely on parallelizing the GC pause is a dead end. Physics dictates that the serial bottleneck will eventually dominate; the pause time problem cannot be solved by simply buying more hardware.
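The figures quoted above fall directly out of Amdahl's formula, \(S(n) = 1 / ((1 - p) + p/n)\), where \(p\) is the parallelizable fraction and \(n\) the core count:

```java
class Amdahl {
    // Amdahl's Law: maximum speedup on n cores when a fraction p of the
    // work is parallelizable and (1 - p) must run serially.
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        System.out.printf("100%% parallel, 20 cores: %.1fx%n", speedup(1.00, 20)); // 20.0x
        System.out.printf("99%% parallel, 20 cores:  %.1fx%n", speedup(0.99, 20)); // ~16.8x
        System.out.printf("99%% parallel, 64 cores:  %.1fx%n", speedup(0.99, 64)); // ~39.3x
    }
}
```

Note how the serial 1% barely matters at 20 cores but dominates at 64: tripling the hardware buys little more than double the speedup.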

1.5. G1: Shifting to Background Work

To further minimize pause times, G1 [5] (among other strategies) shifts work from the pause to run concurrently with the application, i.e., in the background. Figure 6 shows the result: the pause duration is significantly reduced. However, if we estimate the explicit GC cost by only measuring CPU usage during the pause, we overlook the total cost. In this workload, 79% of the GC’s CPU time was spent during concurrent phases, consuming resources while the application was running.

While Parallel GC relies solely on the parallelized batch processing model, G1 is a hybrid. It combines parallelized batch processing with background work. Because of this split, the pause time metric becomes an incomplete measure of a GC’s explicit cost. It no longer fails only in edge cases (such as high GC frequency in Figure 3); it now systematically underestimates the collector’s explicit cost.

1.6. ZGC: Decoupling GC Pause from Overhead

As Figure 7 indicates, ZGC performs virtually all heavy lifting concurrently, including object relocation, and achieves sub-millisecond pauses regardless of heap size.

With ZGC, the correlation between pause duration and GC overhead is effectively decoupled. The work has not vanished; it has been amortized across background threads and the application threads themselves (via load barriers). Consequently, relying on pause time to quantify ZGC’s cost is incorrect.
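As a rough mental model of such a load barrier (not ZGC's actual implementation, which tests metadata bits of a “colored” 64-bit pointer in a couple of machine instructions), consider this sketch: on every reference load, the application thread checks a color, takes a slow path if a GC phase has invalidated it, performs a small piece of GC work, and then “self-heals” the reference so later loads stay cheap.

```java
// Conceptual model only: Ref, goodColor, and slowPathHits are invented
// to show the control flow an application thread runs on a reference load.
class Ref {
    Object target;
    boolean goodColor = true;   // stand-in for ZGC's pointer color bits
    Ref(Object target) { this.target = target; }
}

class LoadBarrierDemo {
    static int slowPathHits = 0;

    static Object load(Ref ref) {
        if (!ref.goodColor) {       // fast path: usually one predictable branch
            slowPathHits++;         // slow path: the app thread does GC work here,
                                    // e.g. remapping a pointer to a moved object
            ref.goodColor = true;   // self-heal: later loads take the fast path
        }
        return ref.target;
    }

    public static void main(String[] args) {
        Ref r = new Ref("payload");
        r.goodColor = false;        // pretend a GC phase invalidated the color
        load(r);                    // first load pays the slow path...
        load(r);                    // ...subsequent loads are cheap again
        System.out.println("Slow-path hits: " + slowPathHits);
    }
}
```

This is the sense in which ZGC's cost is amortized onto application threads: a sliver of GC work rides along on ordinary loads instead of accumulating into a pause.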

1.7. Summary

The correlation between GC pause time and machine resources has weakened with every generation, creating an operational blind spot. Parallel GC introduces provisioning inefficiency, halving the pause only by doubling the CPU cost (Figure 4), while G1 conceals throughput overhead by ignoring the 79% of cycles shifted to background threads (Figure 6). ZGC effectively decouples the metrics entirely; sub-millisecond latency no longer implies low computational effort.

As noted by Kanev et al. [6], in data centers, a substantial fraction of CPU cycles is spent on low-level operations, such as serialization and memory allocation. In managed runtimes, the GC is a dominant driver of this tax. Hassanein [7] corroborated this in Google’s production Java fleet (powering latency-critical services such as Gmail), demonstrating that GC CPU utilization directly translates into substantial hardware and power costs.

Crucially, merely measuring the process’s total CPU time is insufficient. While standard tools capture the aggregate bill, they lack attribution. They cannot distinguish between a compute-intensive application, an aggressive JIT compiler, or a struggling GC. Without isolating the GC’s specific contribution, we cannot understand the efficiency of our memory configuration, neither as developers debugging performance nor as researchers trying to develop the next generation of GC algorithms. We need a precise, internal accounting of the collector’s work.

This brings us to OpenJDK 26.

2. Explicit GC Cost Accounting Through Standard Java API

With OpenJDK 26, I have introduced two new mechanisms to quantify explicit GC costs: unified logging via -Xlog:cpu (printed during JVM exit) and the Java API method MemoryMXBean.getTotalGcCpuTime(). Underlying both is the new cpuTimeUsage.hpp framework, which provides support for any GC implementation within OpenJDK.

Researchers and engineers performing performance analysis/benchmarks can implement the pattern demonstrated below to extract this new telemetry. Measuring GC overhead on a per-iteration basis isolates the workload, effectively disregarding irrelevant noise generated during JVM startup (unless startup latency is the active subject of analysis). Below is an example of how it can be utilized.

import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class Main {
    static final MemoryMXBean memoryBean = ManagementFactory.getPlatformMXBean(MemoryMXBean.class);
    static final OperatingSystemMXBean osBean = ManagementFactory.getPlatformMXBean(OperatingSystemMXBean.class);

    static void main(){
        // Run 10 iterations to account for JIT warmup etc.
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            long startGC = memoryBean.getTotalGcCpuTime();
            long startProcess = osBean.getProcessCpuTime();

            try (var executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors())) {
                IntStream.range(0, 100000).forEach(_ -> {
                    App app = new App();
                    executor.submit(app::critical);
                });
            }

            long end = System.nanoTime();
            long endGC = memoryBean.getTotalGcCpuTime();
            long endProcess = osBean.getProcessCpuTime();

            long duration = end - start;
            long gcCPU = endGC - startGC;
            long processCPU = endProcess - startProcess;

            System.out.println("GC used " + String.format("%.2f", 1.0 * gcCPU / duration) + " cores");
            System.out.println("Process used " + String.format("%.2f", 1.0 * processCPU / duration) + " cores");
            System.out.println("GC used " + (int)(100.0 * gcCPU / processCPU) + " % of total CPU spend");
            System.out.println("---------------------------------");
        }

    }
}

class App {
    byte[] a;
    void critical() {
        a = new byte[100000];
    }
}

Sampling getTotalGcCpuTime and getProcessCpuTime twice provides the deltas. The ratio of these deltas (gcCPU / processCPU) yields the explicit GC cost as a percentage of total CPU time.

Measuring CPU Time on Short-Running Applications

The JVM relies on the operating system’s CPU time accounting. Consequently, for very short-running processes (e.g., a few milliseconds), the results may be unreliable.

3. Applying CPU Cost Accounting to xalan and Spring

To contextualize these metrics, the xalan and Spring workloads from the DaCapo benchmark suite were instrumented using the telemetry pattern demonstrated above. Evaluations were performed on an Intel Xeon Gold 6354 (18 cores, 36 hardware threads, 39 MB LLC), applying the default workload provisioning in DaCapo of one application thread per hardware thread. As will become evident, neither application saturates all 36 available hardware threads. Process utilization at smaller heap sizes indicates the opposite of a stressed system: a low number of cores in use. This is due to GC occupying the critical path. In these situations, pause times have historically served as a proxy for GC stress, but the true computational cost can finally be revealed.

Figure 8 illustrates the CPU-memory tradeoff in xalan. Performance correlates with memory scarcity. We observe a performance cliff at 39 MB, with massive gains, followed by rapidly diminishing returns. Beyond this threshold, Amdahl’s Law dominates: process CPU usage continues to climb, yet throughput improvements are negligible. There is no universally “correct” GC CPU overhead—spending 79% of your CPU on GC (like Parallel in a 19 MB heap) might be perfectly acceptable if your primary constraint is memory footprint and you are willing to accept a low resilience to any increase in load. But now, that is a conscious business decision rather than a silent operational leak.

G1 utilization follows a non-linear relationship here. Interestingly, at the smallest heap size, G1 requires 65% less CPU than Parallel GC while delivering equivalent throughput. While ZGC requires more baseline memory headroom at these constrained heap sizes, it achieves parity with G1 and Parallel when given sufficient memory. This is not a deficiency, but a deliberate design tradeoff: we have exchanged memory footprint for minimal application latency.

In Figure 9, analyzing the Spring PetClinic application, the dynamic shifts are shown. At heap sizes of 202 MB and 405 MB, G1 consumes approximately 3.5x more CPU to maintain throughput—a stark contrast to the efficiency seen in xalan. ZGC again approaches the performance of Parallel and G1 as heap size increases. However, at 405 MB, ZGC’s CPU utilization is capped by a “storm” of allocation stalls. This represents a known anti-pattern for concurrent collectors: insufficient headroom forces the linearization of relocation work, stalling application threads.

4. Conclusion

For too long, understanding the explicit CPU overhead of GC has required invasive profiling, custom builds, or educated guessing. With OpenJDK 26, we have democratized this data. The inclusion of MemoryMXBean.getTotalGcCpuTime() and -Xlog:cpu exposes the explicit GC cost as a tangible, observable metric.

I urge both the academic and engineering communities to adopt these standard APIs.

For researchers, this offers a standardized baseline for reporting overhead, reducing the noise in comparative studies. For engineers, it provides the observability needed to tune the application heap and to detect when you have hit the wall of Amdahl’s Law—before you throw more hardware at a software problem.

The tools are now in the JDK. Let’s use them to bring rigorous accounting to our production systems and our papers.

5. References

[1] S. M. Blackburn and A. L. Hosking, “Barriers: Friend or Foe?,” in ISMM, 2004.
[2] D. Ungar, “Generation Scavenging: A Non-Disruptive High Performance Storage Reclamation Algorithm,” in SDE 1, 1984.
[3] P. Cheng and G. E. Blelloch, “A Parallel, Real-Time Garbage Collector,” in PLDI, 2001.
[4] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” in AFIPS, 1967.
[5] D. Detlefs, C. Flood, S. Heller, and T. Printezis, “Garbage-first garbage collection,” in ISMM, 2004.
[6] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and D. Brooks, “Profiling a warehouse-scale computer,” in ISCA, 2015.
[7] W. Hassanein, “Understanding and Improving JVM GC Work Stealing at the Data Center Scale,” in ISMM, 2016.