How ZGC allocates memory for the Java heap

Original link: https://joelsiks.com/posts/zgc-heap-memory-allocation/

ZGC is a garbage collector in the OpenJDK that manages Java heap memory using logical regions called pages, which come in three size classes: Small (2 MB), Medium (4-32 MB), and Large (dynamically sized, >4 MB). The Page Allocator divides the heap into partitions: a single partition on ordinary systems, and multiple partitions on NUMA systems, aligned with the NUMA nodes to speed up memory access. ZGC decouples physical and virtual memory to combat fragmentation, reserving virtual memory of up to 16 times the maximum heap size (32 times on NUMA systems). The Mapped Cache stores unused mapped memory in a red-black tree using intrusive storage, favoring contiguous ranges. Allocating memory involves claiming capacity from the cache, increasing capacity by committing memory, or harvesting (remapping) smaller ranges; when a single partition lacks capacity, a multi-partition allocation claims memory from several partitions. Uncommitting memory can be disabled for consistent latency, or enabled with -XX:ZUncommitDelay=<seconds> to reduce memory footprint. Flags such as -Xms, -Xmx, and -XX:SoftMaxHeapSize configure the heap size, and -XX:+AlwaysPreTouch ensures the initially committed memory is touched up front so that this work does not slow down execution later.

This Hacker News discussion thread explores how ZGC, a Java garbage collector, allocates memory for the Java heap, focusing on the memory-management tricks enabled by 64-bit architectures. Key points include:

  • Automatic heap sizing: Upcoming changes aim to size ZGC's heap automatically.
  • Memory-mapping tricks: ZGC uses a 32:1 virtual-to-physical memory ratio, enabling memory relocation and colored pointers (flag bits stored in the pointer). Data is placed within 44 bits of the 64-bit address space.
  • Exploiting 64-bit addressing: 64-bit addresses permit techniques such as memory aliasing to lower GC costs, and allow L4 microkernels to speed up IPC by sharing address spaces and swapping only access rights, avoiding TLB thrashing.
  • Privilege isolation: Commenters proposed CPU primitives for privilege isolation within a single process and address space, making it possible to safely run potentially untrusted code without expensive context switches.

The discussion also touches on ideas for hardware-level memory protection and the trade-offs between security and performance.

Original article

This post explores how ZGC, one of the garbage collectors in the OpenJDK, allocates memory for the Java heap, focusing on enhancements introduced in JDK-8350441 with the Mapped Cache. A garbage collector does much more than just collect garbage - and that’s what I want to unpack in this post. Whether you’re a Java nerd yearning for details, a GC enthusiast, or just curious about how ZGC uses memory behind the scenes, this deep dive is for you.

The features described in this post are lined up for the release of OpenJDK 25, and you can try them out by downloading the latest OpenJDK build. To see the latest version of ZGC, look at the mainline OpenJDK source code; the page allocator and Mapped Cache source files are particularly relevant.

The content in this post is generally applicable to all operating systems and platforms that ZGC runs on. Additional Linux-specific details are mentioned where relevant. For example, ZGC only supports NUMA on Linux, not on Windows or BSD.

In ZGC, memory for the Java heap is organized into logical regions known as pages. These should not be confused with operating system (OS) pages. From this point on, any reference to pages refers specifically to ZGC pages unless explicitly stated otherwise.

Pages are categorized into three size classes: Small, Medium, and Large. This classification helps optimize memory usage and allocation strategies based on object size.

  • Small pages are 2 MB and are used for most regular object allocations. In a typical Java program, most allocations are small and will end up on a Small page.
  • Medium pages are generally 32 MB and are intended for larger objects that don’t quite qualify as “large”.
  • Large pages are dynamically sized and are used for very large allocations, typically over 4 MB. Only a single object is stored on a Large page.
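
To make the classification concrete, here is a minimal sketch of how an allocation size could be mapped to a size class. This is not ZGC's actual code, and the thresholds (256 KB for Small objects, one-eighth of the Medium page size for Medium objects) are assumptions used only for illustration.

// A minimal sketch (not ZGC source) of mapping an allocation size to a page
// size class. The thresholds are assumptions: objects up to 256 KB on Small
// pages, objects up to 1/8 of the Medium page size on Medium pages, and
// anything larger on its own dynamically sized Large page.
final class PageSizeClasses {
    enum SizeClass { SMALL, MEDIUM, LARGE }

    static final long KB = 1024, MB = 1024 * KB;
    static final long SMALL_PAGE_SIZE = 2 * MB;                   // fixed
    static final long MEDIUM_PAGE_SIZE = 32 * MB;                 // typical; see Additional Notes
    static final long SMALL_OBJECT_LIMIT = 256 * KB;              // assumed threshold
    static final long MEDIUM_OBJECT_LIMIT = MEDIUM_PAGE_SIZE / 8; // assumed threshold (4 MB)

    static SizeClass classify(long allocationSize) {
        if (allocationSize <= SMALL_OBJECT_LIMIT) {
            return SizeClass.SMALL;
        }
        if (allocationSize <= MEDIUM_OBJECT_LIMIT) {
            return SizeClass.MEDIUM;
        }
        return SizeClass.LARGE; // the Large page is sized to fit this single object
    }
}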

Memory for the Java heap, and in turn for pages, is managed by the Page Allocator. The main part of allocating a page is getting hold of its underlying memory, which the page uses to store objects. Memory is a finite resource, and many strategies are used to make sure that memory can be successfully allocated. The smallest unit that ZGC works with for its underlying memory is 2 MB, referred to as a Granule; this is also the size of a Small page's memory.

When discussing heap size in the context of ZGC, we typically use the term capacity rather than heap size, even though most documentation uses heap size. The current capacity is defined as the current heap size, which may grow and shrink between some defined boundaries. These boundaries can be set explicitly via the command-line, or implicitly by the JVM based on system resources. There are two key flags that can be used to explicitly configure the heap size: Minimum Heap Size (-Xms) and Maximum Heap Size (-Xmx). Out of all available memory on the host computer, the Java heap is allowed to use memory inside these boundaries.

[---------|-----------------------------|-----------]
          ^                             ^
   Minimum Heap Size            Maximum Heap Size
       (-Xms)                        (-Xmx)

Note: Setting the heap size using -Xms sets both the initial AND minimum heap size. Most of the time this is likely the desired configuration. If you want a different minimum and initial heap size, you can use the -XX:MinHeapSize flag. Make sure you set it after -Xms so that -Xms does not override the value set by -XX:MinHeapSize.

The minimum heap size (min capacity) and maximum heap size (max capacity) may be set to the same value, in which case the heap is fixed and will never resize.

[-------------------------|-------------------------]
                          ^
          Fixed Heap Size (min = max)
                        (-Xms = -Xmx)

For example, the following command starts ZGC with a minimum (and initial) heap size of 512 MB and a maximum heap size of 8 GB:

$ java -XX:+UseZGC -Xms512M -Xmx8G <java file>

Now that we’ve looked at how to set the heap size and how it’s organized into pages, let’s zoom in on how those pages are managed and allocated by the Page Allocator.

Partitions

The Page Allocator, which manages the heap, maintains a number of partitions. Each partition represents a subset of the Java heap. On most systems, the entire heap is managed as a single partition. However, some systems may divide the heap into multiple partitions, with each one managing a portion of the total heap capacity.

Single Partition

The single partition covers the entire Java heap. It will have min and max capacity as set by -Xms and -Xmx.

Heap Capacity: from -Xms to -Xmx
+--------------------------------------------------+
|                    Partition 0                   |
|          (entire heap managed as one)            |
+--------------------------------------------------+

Multiple Partitions

With multiple partitions, the Java heap is divided evenly so that the sum of all partitions equals the total heap size. Each partition gets an equal share of both the min and max capacity, and is allowed to grow and shrink independently within its own boundaries.

Heap Capacity: from -Xms to -Xmx (evenly divided)
+------------------+------------------+------------------+
|   Partition 0    |   Partition 1    |   Partition 2    |
|   min = Xms/3    |   min = Xms/3    |   min = Xms/3    |
|   max = Xmx/3    |   max = Xmx/3    |   max = Xmx/3    |
+------------------+------------------+------------------+

Multiple partitions are currently only enabled when running on a system with NUMA (Non-Uniform Memory Access) architecture. NUMA is a memory design that provides faster access to memory that is locally attached to a processor (or NUMA node), while access to memory attached to other processors (remote nodes) is slower. When running on a NUMA system and NUMA is enabled (using the -XX:+UseNUMA flag, which is enabled by default), each partition will correspond to a specific NUMA node. As a result, the number of partitions will match the number of NUMA nodes. For example, if a NUMA system has 4 NUMA nodes, ZGC will divide the heap into 4 partitions. This approach allows ZGC to allocate memory locally on the NUMA node closest to the processor requesting memory, likely improving memory access speed and boosting performance.
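
As a rough sketch of the division described above (the names are hypothetical, not taken from the ZGC sources), the per-partition boundaries follow directly from the heap flags and the number of NUMA nodes:

// Illustrative sketch: one partition per NUMA node, each receiving an equal
// share of the minimum (-Xms) and maximum (-Xmx) heap capacity. With NUMA
// disabled or unsupported, a single partition covers the whole heap.
final class PartitionSetup {
    record Partition(int numaId, long minCapacity, long maxCapacity) {}

    static Partition[] createPartitions(long xms, long xmx, int numaNodes, boolean useNuma) {
        int count = useNuma ? numaNodes : 1;
        Partition[] partitions = new Partition[count];
        for (int i = 0; i < count; i++) {
            partitions[i] = new Partition(i, xms / count, xmx / count);
        }
        return partitions;
    }
}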

The connection between a partition and a specific NUMA node is only a logical connection, and might not always reflect the truth. In some edge-cases, memory might not be available on the NUMA node that matches the partition an allocation is made on. This might be due to other processes on the system using memory unevenly across NUMA nodes, leading to imbalances. In such cases, memory will be allocated on another NUMA node in order to succeed with the allocation. This strategy is called preferred allocation, see MPOL_PREFERRED in the Linux kernel docs for more details.

NUMA support in ZGC is not new, but it has been reworked with the introduction of the Mapped Cache (JDK-8350441). When NUMA is turned off, either explicitly by the user or because the system does not support NUMA, only a single partition is used.

Memory

Let’s zoom in even further into how memory is allocated and used for a partition. ZGC is unique among the GCs in HotSpot in that it separates physical and virtual memory. The following sections highlight how physical and virtual memory are handled differently and what problems they face separately before being used for an allocation.

Physical Memory

Physical memory usually refers to the hardware RAM available on the system, which is a finite resource. It may be limited either artificially or by other tasks using memory on the computer. In ZGC, physical memory is directly tied to the heap capacity, which is tracked within a partition. Physical memory can exist in one of three key states: committed and mapped, committed, and not committed. The diagram below shows the states that physical memory may transition to and from.

(Committed + Mapped) <-> (Committed) <-> (Not Committed)

Committed memory refers to memory that has been reserved for use by the application and is guaranteed to be backed by physical storage (usually RAM in this case). When memory is committed, the system ensures that there is enough physical space available and reserves it for the application process. To access committed physical memory, it must be mapped to virtual memory. ZGC only tracks memory that is either committed and mapped, or not committed. Memory that is committed but not yet mapped is considered an intermediate state and will either be mapped or uncommitted shortly thereafter. When allocating memory, ZGC ensures that it has corresponding virtual memory before mapping it. This means that any memory that is successfully committed is immediately mapped afterward, and committed memory never has to be uncommitted during the allocation. Mapping memory is expected to always succeed; see Additional Notes for a more detailed explanation.

The capacity tracked within a partition represents how much memory the partition is allowed to commit. In the case of a single partition, setting the minimum heap size (-Xms) to 1 GB means at least 1 GB must be committed. Conversely, setting the maximum heap size (-Xmx) to 8 GB means that no more than 8 GB may be committed. Below is an illustration of the terms related to capacity that are tracked within a partition. Apart from the previously mentioned min and max capacity, a partition also keeps track of current capacity and current max capacity, which both move between the min and max boundaries. Increasing capacity means that new memory is being committed and current capacity is increased, while decreasing capacity means that memory is being uncommitted and current capacity is decreased. Current max capacity never grows, only shrinks; it is initially set to the same value as max capacity and is decreased if committing new memory fails.

[--------|--------#-------------@----------|--------]
        min    current     current_max    max
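
The following sketch models a partition's capacity bookkeeping using the four values from the diagram above. The field and method names are illustrative, not ZGC's.

// Illustrative capacity bookkeeping for a partition, following the invariant
// min <= current <= current_max <= max, where current_max only ever shrinks.
final class PartitionCapacity {
    private final long min;   // -Xms share: must always stay committed
    private final long max;   // -Xmx share: may never be exceeded
    private long current;     // currently committed memory
    private long currentMax;  // lowered permanently when a commit fails

    PartitionCapacity(long min, long max) {
        this.min = min;
        this.max = max;
        this.current = min;      // the minimum capacity is committed up front
        this.currentMax = max;   // starts at max and only ever decreases
    }

    // Committing new memory increases current capacity, at most up to current_max.
    long increaseCapacity(long requested) {
        long granted = Math.min(requested, currentMax - current);
        current += granted;
        return granted;
    }

    // Uncommitting decreases current capacity, but never below min.
    long decreaseCapacity(long requested) {
        long released = Math.min(requested, current - min);
        current -= released;
        return released;
    }

    // A failed commit permanently lowers current_max; it is never raised again.
    void recordCommitFailure(long failedAmount) {
        currentMax = Math.max(current, currentMax - failedAmount);
    }
}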

Virtual Memory

The main reason ZGC decouples physical and virtual memory is to combat fragmentation, which can hinder the ability to allocate contiguous virtual memory. By decoupling memory, virtual memory can be over-reserved, meaning more virtual memory is reserved than there is available physical memory (i.e., capacity). This increases the likelihood of finding a contiguous range of virtual memory that can be mapped to physical memory during an allocation. By default, ZGC reserves virtual memory up to 16 times the maximum heap size, evenly split across all partitions.

The minimum requirement when reserving virtual memory is to get at least as much as the maximum heap size. This ensures physical memory can be mapped one-to-one with virtual memory; any less would prevent all physical memory from being used.

On systems with multiple partitions (i.e., NUMA systems with NUMA enabled), ZGC attempts to reserve 32 times the maximum heap size. The extra 16x is used for a special kind of allocation called a multi-partition allocation (described in Multi-Partition Allocation).
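
As a concrete example, with -Xmx8G ZGC will try to reserve up to 128 GB of virtual address space on a single-partition system, and up to 256 GB on a system with multiple partitions, while the hard minimum it needs is 8 GB, matching the maximum heap size.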

Over-reserving virtual memory makes it easier to deal with fragmentation, but in some cases it might not be possible to reserve as much virtual memory as we would like. This is especially noticeable on systems with large amounts of physical memory, where 32x, or even 16x, may not be possible. As a way to alleviate fragmentation in those cases, and for programs with adversarial allocation patterns, ZGC actively defragments the heap when pages are freed. Defragmentation works by re-mapping physical memory to new virtual addresses located in lower memory regions, where the goal is to “fill holes” to create larger contiguous ranges. Currently, only Large pages are defragmented when freed.

Mapped Cache

As mentioned earlier in Physical Memory, a partition tracks memory that is either not committed or both committed and mapped. Memory that is not committed is implicitly tracked as unused capacity and can be “allocated” by increasing capacity and committing new memory. Memory that has been committed and mapped but is not currently in use by any page is stored in the Mapped Cache. The term “Mapped” in Mapped Cache refers to the fact that it stores mapped memory.

The Mapped Cache uses a self-balancing binary search tree (a red-black tree) to store ranges of mapped memory. Since the tree stores unused mapped memory, it can use this memory to store data about itself, eliminating the need for dynamic memory allocation (malloc), which could negatively affect latency during page allocation. This type of storage is known as intrusive storage.

The Mapped Cache aims to maintain the largest contiguous memory ranges possible by merging adjacent virtual memory upon insertion. Additionally, to speed up the search for contiguous memory during allocations, the Mapped Cache tracks several size classes. Each size class contains entries in the tree that are larger than a designated size. The impact of size classes is noticeable in memory allocations for Medium and/or Large pages.

As with all other areas of ZGC, the minimum working size in the Mapped Cache is a Granule (i.e., 2 MB). As such, when removing memory for a Small page from the cache, the first node in the tree can always be used, as long as the tree is not empty. This means that allocations for Small pages never need to search the tree for an entry large enough.
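
The sketch below models the cache as a balanced tree keyed by start address, with adjacent ranges merged on insertion and a first-entry fast path for Small pages. The real Mapped Cache stores its node metadata intrusively inside the cached memory and maintains per-size-class indices, which this simplified Java model does not.

import java.util.Map;
import java.util.TreeMap;

// Simplified model of the Mapped Cache: unused mapped ranges keyed by start
// address, merged with adjacent ranges on insertion. (The real cache uses an
// intrusive red-black tree and size-class indices, not a TreeMap.)
final class MappedCacheSketch {
    static final long GRANULE = 2L * 1024 * 1024; // 2 MB, the minimum working size
    private final TreeMap<Long, Long> ranges = new TreeMap<>(); // start -> size

    void insert(long start, long size) {
        // Merge with a preceding range that ends exactly where this one starts.
        Map.Entry<Long, Long> previous = ranges.floorEntry(start);
        if (previous != null && previous.getKey() + previous.getValue() == start) {
            start = previous.getKey();
            size += previous.getValue();
            ranges.remove(previous.getKey());
        }
        // Merge with a following range that starts exactly where this one ends.
        Long followingSize = ranges.remove(start + size);
        if (followingSize != null) {
            size += followingSize;
        }
        ranges.put(start, size);
    }

    // Small-page fast path: any non-empty tree holds at least one granule, so
    // the first entry can always be used without searching further.
    long removeGranule() {
        Map.Entry<Long, Long> first = ranges.pollFirstEntry();
        if (first == null) {
            return -1; // cache is empty
        }
        if (first.getValue() > GRANULE) {
            ranges.put(first.getKey() + GRANULE, first.getValue() - GRANULE);
        }
        return first.getKey();
    }
}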

Allocation

The first step in allocating memory is claiming capacity, which primarily involves securing enough physical memory. It may also include virtual memory if we manage to obtain mapped and committed memory from the Mapped Cache (referred to as the cache from here on out).

When claiming capacity, we first attempt to get contiguous memory from the cache, which is already mapped, committed and of the correct size, and can be used for a page immediately. This is outcome (1), which is the most common fast-path for Small pages, and is always successful when the cache is not empty.

(1) Contiguous memory from the cache

|-----------------------------------------|
|                 cache                   |

If the cache does not contain a contiguous range large enough, we move on to increasing capacity. Increasing capacity means we need to commit new memory. This is outcome (2).

(2) Only increased capacity

|-----------------------------------------|
|           increased capacity            |

If capacity cannot be increased at all, this means that we have already committed all the memory we’re allowed to (as indicated by the current max capacity). And since we did not get contiguous memory that was large enough from the cache, we move on to removing smaller ranges from the cache, such that they add up to the requested size. This is referred to as “harvesting”, and is outcome (3).

(3) Only harvested

|-----------------------------------------|
|-|-|-|-|----------------|-|-|------|-----|
|               harvested                 |

If the current capacity is near the current max capacity, and we’ve increased capacity to its limit, but it does not cover the requested size, we perform harvesting in addition to increasing capacity. This means that some memory will be committed, as part of increasing capacity, and some part will be harvested from memory in the cache. This is outcome (4).

(4) Combination of harvested and increased capacity

|-----------------------------------------|
|-|-|-----|---------|---------------------|
|     harvested     | increased capacity  | 

If enough capacity is not available either in the cache or from increasing capacity, the allocation fails. The first time capacity claiming fails, a so-called “allocation stall” is triggered. An allocation stall means that the allocating thread will trigger a minor GC and stall until memory is (hopefully) freed up, at which point claiming capacity can be attempted again. If the allocation fails a second time, an OutOfMemoryError (OOME) is thrown.

Before moving on to committing new memory or harvesting, ZGC ensures that virtual memory is available. This is so that memory can be mapped just after committing, since ZGC doesn’t track memory that is committed but not mapped. In most cases, claiming virtual memory is successful, but might fail due to fragmentation preventing a sufficiently large contiguous range from being found. This is generally only a problem for Large pages. If virtual memory cannot be acquired, an OOME is thrown.
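
Putting the four outcomes together, the claim-capacity step can be summarized by the following decision sketch. The names and structure are illustrative; the actual logic lives in ZGC's page allocator and is written in C++.

// Illustrative decision flow for claiming capacity. Outcome (1) is the common
// fast path; outcomes (2)-(4) require claiming a contiguous virtual memory
// range first, since committed-but-unmapped memory is never tracked.
final class ClaimCapacitySketch {
    enum Outcome { FROM_CACHE, INCREASED_CAPACITY, HARVESTED, COMBINED, FAILED }

    // Stand-ins for the partition and cache state the decision is based on.
    long currentCapacity, currentMaxCapacity, cachedTotal, cachedLargestContiguous;

    Outcome claimCapacity(long size) {
        if (cachedLargestContiguous >= size) {
            return Outcome.FROM_CACHE;                  // outcome (1)
        }
        if (!claimContiguousVirtualMemory(size)) {
            // Rare; mostly a concern for Large pages on a fragmented address space.
            throw new OutOfMemoryError("No contiguous virtual memory range available");
        }
        long increasable = Math.min(size, currentMaxCapacity - currentCapacity);
        if (increasable == size) {
            return Outcome.INCREASED_CAPACITY;          // outcome (2): commit it all
        }
        if (increasable == 0 && cachedTotal >= size) {
            return Outcome.HARVESTED;                   // outcome (3): remap cached ranges
        }
        if (increasable + cachedTotal >= size) {
            return Outcome.COMBINED;                    // outcome (4): commit + harvest
        }
        // The first failure triggers an allocation stall and a minor GC;
        // failing a second time results in an OutOfMemoryError.
        return Outcome.FAILED;
    }

    private boolean claimContiguousVirtualMemory(long size) {
        return true; // assumed to succeed in this sketch
    }
}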

Increasing Capacity (Committing)

If an allocation increases capacity, either for the entire allocation or just part of it, new memory is committed. Committing memory from the OS is usually successful, but might fail. If committing memory fails, we record this by decreasing the current max capacity, preventing future attempts to commit that memory. Importantly, ZGC does not assume that memory will become available again in the future. Once the max capacity is lowered, it stays lowered. This conservative approach is based on the fact that it’s hard to predict the behavior of other applications on the system, which may or may not release memory later on.

Because increasing capacity occurs before harvesting from the cache, ZGC "retries" the allocation using the updated current max capacity. This makes subsequent allocations claim physical memory from the cache instead of increasing capacity. A retry may succeed if the cache contains enough memory to harvest; see the section below.

If the commit succeeds, the newly committed memory is immediately mapped to the virtual memory that was claimed earlier, completing the capacity increase.
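
A small sketch of this behavior, with hypothetical helper names standing in for the OS interaction:

// Illustrative capacity-increase step: a partial commit permanently lowers
// current max capacity, and whatever was committed is mapped right away to
// the virtual memory that was claimed earlier.
final class IncreaseCapacitySketch {
    long currentMaxCapacity;

    long increaseCapacity(long requested) {
        long committed = commit(requested);                 // may return less than requested
        if (committed < requested) {
            currentMaxCapacity -= (requested - committed);  // lowered, and never raised again
        }
        if (committed > 0) {
            map(committed);                                 // mapping is expected to succeed
        }
        return committed;
    }

    private long commit(long size) { return size; }         // stand-in for committing OS memory
    private void map(long size) {}                          // stand-in for mapping to virtual memory
}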

Harvesting (Remapping)

In this context, harvesting refers to claiming physical memory (i.e., capacity) from multiple memory ranges that are already mapped and committed but not contiguous in virtual memory. Since virtual memory for a page must be contiguous, the harvested regions need to be remapped (i.e., unmapped and then mapped) into a single contiguous range.

To improve the likelihood of harvesting being successful, the virtual memory from harvested ranges is "reused" when claiming virtual memory. First, the harvested ranges are unmapped, returning their virtual memory to the pool of available virtual memory. Subsequently, a contiguous virtual memory range is claimed. If claiming succeeds, the unmapped ranges are mapped to the new contiguous virtual memory range, completing the harvest.

If a contiguous range cannot be claimed, the harvested physical memory must be mapped back to virtual memory and inserted into the cache, since ZGC does not track memory that is committed but not mapped. If harvesting fails, an OOME is thrown.

The act of harvesting is likely to negatively impact allocation latency, particularly for Medium pages. While Small pages are never harvested and Large pages already involve high allocation costs, Medium pages suffer more due to their smaller size, where the relative cost of multiple system calls, at worst unmapping many 2 MB ranges, can become significant.
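
The remapping step can be sketched as follows; the helper names are hypothetical, and the failure handling is reduced to a comment.

import java.util.List;

// Illustrative harvest (remap) flow: unmap the harvested ranges so their
// virtual memory returns to the pool, claim one contiguous virtual range,
// then map the harvested physical memory into that range.
final class HarvestSketch {
    record Range(long start, long size) {}

    long harvest(List<Range> harvestedRanges, long totalSize) {
        for (Range range : harvestedRanges) {
            unmap(range);                                   // frees up virtual memory for reuse
        }
        long newStart = claimContiguousVirtualMemory(totalSize);
        if (newStart < 0) {
            // In ZGC the physical memory would first be mapped back and
            // re-inserted into the cache, since committed-but-unmapped memory
            // is not tracked; only then is an OutOfMemoryError thrown.
            throw new OutOfMemoryError("No contiguous virtual memory for harvested ranges");
        }
        long offset = 0;
        for (Range range : harvestedRanges) {
            mapAt(newStart + offset, range);                // one mapping per harvested range
            offset += range.size();
        }
        return newStart;
    }

    private void unmap(Range range) {}
    private long claimContiguousVirtualMemory(long size) { return 0; }
    private void mapAt(long virtualAddress, Range physicalRange) {}
}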

Sorting Physical Memory

Just as virtual memory can become fragmented, so can physical memory, especially when harvesting, with or without increased capacity. This is generally not an issue, as it is entirely possible (and somewhat reasonable) for physical memory to be fragmented; only the virtual memory range, through which memory is actually accessed, needs to be contiguous. When physical memory is fragmented, the OS backs the contiguous virtual range with several smaller mappings, each pointing to a contiguous piece of physical memory. Even though the contiguous virtual memory is presented as a single mapping, the OS may in fact maintain many mappings (known as VMAs in Linux).

To reduce pressure on the OS, physical memory is sorted just before mapping virtual memory to it. In the best case, this reduces the number of mappings required by the OS. This is especially impactful for systems with a large heap, where the maximum number of mappings may not be appropriately configured by default. See Additional Notes for more information.

Multi-Partition Allocation

The process described in the above sections is what happens in a so-called single-partition allocation, where capacity is claimed only from a single partition in the Page Allocator. However, there might be cases where no single partition has enough capacity left to satisfy an allocation, even though enough memory is available in total, spread across multiple partitions. To avoid throwing an OOME prematurely when the heap actually contains enough memory, a so-called multi-partition allocation is attempted.

Note: Multi-partition allocations are only enabled when the Page Allocator has multiple partitions (i.e., NUMA is enabled) and when the full 32x virtual memory reservation is successful (see Virtual Memory for more details).

In a multi-partition allocation, the same operation of claiming capacity is performed, but once for each partition. Initially, ZGC attempts to claim an even amount of capacity from each partition, so as not to create a bias toward any particular partition. If an even split is not possible, the remaining capacity is greedily claimed from each partition, round-robin style, until the allocation is satisfied.

The illustration below shows a multi-partition allocation that has claimed capacity from three partitions (0, 1, and 2):

|-----------------------------------------|
|----(0)----|-------(1)-------|----(2)----|
|        multi-partition allocation       |
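
A sketch of the split follows. It is illustrative only; in practice each partition can only contribute what it actually has available, so the shares are adjusted accordingly.

// Illustrative split of a multi-partition allocation: an even share per
// partition first, then the remaining granules handed out round-robin.
final class MultiPartitionSplit {
    static long[] split(long totalSize, long granule, int partitions) {
        long granules = totalSize / granule;          // allocation sizes are granule-aligned
        long[] share = new long[partitions];
        for (int i = 0; i < partitions; i++) {
            share[i] = (granules / partitions) * granule;
        }
        long remainder = granules % partitions;
        for (int i = 0; remainder > 0; i = (i + 1) % partitions, remainder--) {
            share[i] += granule;
        }
        return share;
    }
}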

By default, ZGC periodically uncommits memory that is not currently in use by pages, that is, memory stored in the cache. This behavior is particularly useful in applications or environments where memory footprint is a concern. However, in most use cases, uncommitting comes at a cost: it effectively undoes the expensive work of committing and mapping memory, as described in earlier sections. As a result, future allocations may suffer increased latency, since this work must be done all over again.

Uncommitting can be explicitly disabled by using the -XX:-ZUncommit flag, or by setting the minimum heap size (-Xms) and maximum heap size (-Xmx) to the same value. In this case, the heap becomes fixed in size, and uncommitting no longer makes sense (see Background).

Uncommitting is performed periodically based on the -XX:ZUncommitDelay=<seconds> flag, which is set by default to 300 seconds (5 minutes). When uncommitting, the Page Allocator considers memory usage in the cache since the previous uncommit and may release any amount of memory that it considers unused during that time.
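
For example, to keep uncommitting enabled but return unused memory to the OS more aggressively, the delay can be shortened (here to 60 seconds):

$ java -XX:+UseZGC -Xms512M -Xmx8G -XX:ZUncommitDelay=60 <java file>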

When talking about memory in ZGC, there is a trade-off between startup time and latency during program execution. Setting the minimum and maximum heap sizes to the same value causes all memory to be committed at startup. If the minimum and maximum heap sizes are not set to the same value, memory may be committed during runtime, which will affect allocation latency negatively, as committing needs to be done on-the-fly.

Additionally, committing memory only guarantees that it will be backed by physical memory at some point in the future. To ensure that memory is actually backed upon committing, it must be “touched”. By pre-touching memory during startup using the -XX:+AlwaysPreTouch flag, all the initially committed memory is touched at the start. This again introduces a trade-off between faster startup and potential allocation latency during execution.

An effect of using shared memory on Linux is that ZGC must immediately back physical memory, regardless of whether pre-touching is enabled or not. This makes using the -XX:+AlwaysPreTouch flag redundant on Linux, though I still recommend enabling it in case this behavior changes in the future.

If latency is a critical concern, you should disable uncommit, either explicitly by using the -XX:-ZUncommit flag, or implicitly by setting the minimum and maximum heap sizes to the same value.
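
For example, a latency-focused configuration might fix the heap size (which implicitly disables uncommitting) and pre-touch all memory at startup:

$ java -XX:+UseZGC -Xms8G -Xmx8G -XX:+AlwaysPreTouch <java file>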

Additional Notes

  • For most heap sizes (i.e., any size ≥ 1 GB), Medium pages are typically 32 MB. However, the actual Medium page size is determined by rounding down to the power of two closest to 3.125% of the heap size, within the interval [4 MB, 32 MB]. If the calculated size is less than 4 MB, Medium pages are disabled, and only Small and Large pages are used. The possible Medium page sizes are: 4 MB, 8 MB, 16 MB, and 32 MB (see the sketch after these notes). You can see what size you’ve ended up with for Medium pages in the GC log:

    $ java -XX:+UseZGC -Xlog:gc+init=info --version
    ...
    [0.001s][info][gc,init] Medium Page Size: 32M
    ...
    
  • In addition to setting the -Xmx and -Xms flags, the -XX:SoftMaxHeapSize flag might also be useful to configure. ZGC will use the soft max heap size as the limit for its heuristics, but if it can’t keep the heap size below this limit, it is allowed to temporarily use up to the maximum heap size, set by -Xmx. See the GC tuning guide.

  • This post does not cover the use of Large Pages in the OS (such as THP and HugeTLB on Linux). For guidance on using this feature with ZGC, refer to the ZGC wiki. One important note: Transparent Huge Pages (THP) often require additional configuration to work correctly with ZGC on most Linux distributions.

  • Mapping memory is expected to always succeed. If it fails, ZGC will crash. While this is rarely an issue, it’s worth noting that, for historical reasons, the maximum number of memory mappings on Linux is by default set to 65530, which is relatively low.
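
The Medium page size rule from the first note above can be expressed as a small calculation. This is a sketch of the rule as described, not the HotSpot code:

// Sketch of the Medium page size rule: round 3.125% (1/32) of the max heap
// size down to a power of two and clamp it to [4 MB, 32 MB]; if the result
// falls below 4 MB, Medium pages are disabled entirely.
final class MediumPageSize {
    static final long MB = 1024 * 1024;

    static long mediumPageSize(long maxHeapSize) {
        long roundedDown = Long.highestOneBit(maxHeapSize / 32); // nearest lower power of two
        if (roundedDown < 4 * MB) {
            return 0; // Medium pages disabled; only Small and Large pages are used
        }
        return Math.min(roundedDown, 32 * MB);
    }

    public static void main(String[] args) {
        // e.g. -Xmx1G -> 32M, -Xmx256M -> 8M, -Xmx96M -> disabled
        System.out.println(mediumPageSize(1024 * MB) / MB + "M");
    }
}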
