Modernizing swapping: virtual swap spaces

Original link: https://lwn.net/Articles/1059201/

## Linux kernel swap-subsystem improvements

Recent development has focused on overhauling the Linux kernel's swap subsystem to improve its performance and flexibility. The current swap mechanism ties pages to specific devices, which creates inefficiencies when a device is removed or when zswap (a compression-based swap method) is in use.

A proposed solution introduces a "virtual swap space": a single, unified swap table that is independent of the underlying devices. This allows pages to move seamlessly between devices and removes zswap's need to pre-allocate backing storage that may never be used. While promising, the approach increases memory usage and has shown performance regressions, so further refinement is needed.

Meanwhile, a separate patch set proposes "swap tiers", which let administrators prioritize faster storage for swapping. This complements the virtual-swap-space concept and could simplify moving pages between tiers.

These changes reflect renewed developer attention to the swap subsystem, aimed at better performance, maintainability, and overall efficiency. Concerns about overhead and performance remain, however, meaning that further development will be needed before these changes can be merged.

The related Hacker News discussion centers on the memory-management differences between Windows and Linux. Users report that Windows handles heavy memory usage and swapping more smoothly, remaining responsive even while swapping gigabytes of data; Linux, by contrast, can become sluggish and unresponsive under similar pressure. The core issue appears to be Linux's reluctance to kill processes to free memory, preferring to swap instead. Commenters note, however, that this behavior is configurable: tools such as `earlyoom` (which proactively kills processes) and `systemd-oomd` are suggested to improve responsiveness. Overall, Linux offers flexibility in memory management, letting users tailor system behavior, but it requires deliberate configuration to match Windows's out-of-the-box experience.

Original article

By Jonathan Corbet
February 19, 2026

The kernel's unloved but performance-critical swapping subsystem has been undergoing multiple rounds of improvement in recent times. Recent articles have described the addition of the swap table as a new way of representing the state of the swap cache, and the removal of the swap map as the way of tracking swap space. Work in this area is not done, though; this series from Nhat Pham addresses a number of swap-related problems by replacing the new swap table structures with a single, virtual swap space.

The problem with swap entries

As a reminder, a "swap entry" identifies a slot on a swap device that can be used to hold a page of data. It is a 64-bit value split into two fields: the device index (called the "type" within the code), and an offset within the device. When an anonymous page is pushed out to a swap device, the associated swap entry is stored into all page-table entries referring to that page. Using that entry, the kernel can quickly locate a swapped-out page when that page needs to be faulted back into RAM.
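
This split can be sketched with a pair of helpers. The six-bit type field below is an illustrative assumption; the real encoding lives in `swp_entry()`, `swp_type()`, and `swp_offset()` in `<linux/swapops.h>`, varies by architecture, and reserves some bits for other uses:

```c
#include <stdint.h>

/*
 * Illustrative sketch of a swap entry: a 64-bit value packing a device
 * index (the "type") into the high bits and an offset within that
 * device into the rest.  The six-bit type field is an assumption for
 * illustration; the kernel's actual layout differs.
 */
#define SWP_TYPE_BITS   6
#define SWP_OFFSET_MASK ((UINT64_C(1) << (64 - SWP_TYPE_BITS)) - 1)

static inline uint64_t swp_entry_mk(unsigned type, uint64_t offset)
{
    return ((uint64_t)type << (64 - SWP_TYPE_BITS)) |
           (offset & SWP_OFFSET_MASK);
}

static inline unsigned swp_entry_type(uint64_t entry)
{
    return (unsigned)(entry >> (64 - SWP_TYPE_BITS));
}

static inline uint64_t swp_entry_offset(uint64_t entry)
{
    return entry & SWP_OFFSET_MASK;
}
```

Any code holding such an entry can recover the device and the slot with two shifts, which is what makes the fault path fast.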

The "swap table" is, in truth, a set of tables, one for each swap device in the system. The transition to swap tables has simplified the kernel considerably, but the current design of swap entries and swap tables ties swapped-out pages firmly to a specific device. That creates some pain for system administrators and designers.

As a simple example, consider the removal of a swap device. Clearly, before the device can be removed, all pages of data stored on that device must be faulted back into RAM; there is no getting around that. But there is the additional problem of the page-table entries pointing to a swap slot that no longer exists once the device is gone. To resolve that problem, the kernel must, at removal time, scan through all of the anonymous page-table entries in the system and update them to the page's new location. That is not a fast process.
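
The shape of that scan can be sketched as follows; the helper names and the flat array of entries are invented for illustration (the kernel does this with a full page-table walk per process, in and around `try_to_unuse()`), but the cost structure is the same: every swap entry in the system must be examined to find the ones referencing the departing device:

```c
#include <stddef.h>
#include <stdint.h>

#define SWP_TYPE_BITS 6

/* Extract the device index from the high bits of a (sketched) swap entry. */
static unsigned pte_type(uint64_t pte)
{
    return (unsigned)(pte >> (64 - SWP_TYPE_BITS));
}

/*
 * Rewrite every swap entry that references the departing device
 * `dead_type` to the page's new location `new_entry`, returning the
 * number of entries touched.  In the real kernel the entries are
 * scattered through every process's page tables, which is what makes
 * swapoff slow.
 */
static size_t remap_dead_device(uint64_t *ptes, size_t n,
                                unsigned dead_type, uint64_t new_entry)
{
    size_t changed = 0;

    for (size_t i = 0; i < n; i++) {
        if (pte_type(ptes[i]) == dead_type) {
            ptes[i] = new_entry;
            changed++;
        }
    }
    return changed;
}
```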

This design also, as Pham describes, creates trouble for users of the zswap subsystem. Zswap works by intercepting pages during the swap-out process and, rather than writing them to disk, compresses them and stores the result back into memory. It is well integrated with the rest of the swapping subsystem, and can be an effective way of extending memory capacity on a system. When the in-memory space fills, zswap is able to push pages out to the backing device.

The problem is that the kernel must be able to swap those pages back in quickly, regardless of whether they are still in zswap or have been pushed to slower storage. For this reason, zswap hides behind the index of the backing device; the same swap entry is used whether the page is in RAM or on the backing device. For this trick to work, though, the slot in the backing device must be allocated at the beginning, when a page is first put into zswap. So every zswap usage must include space on a backing device, even if the intent is to never actually store pages on disk. That leads to a lot of wasted storage space and makes zswap difficult or impossible to use on systems where that space is not available to waste.

Virtual swap spaces

The solution that Pham proposes, as is so often the case in this field, is to add another layer of indirection. That means the replacement of the per-device swap tables with a single swap table that is independent of the underlying device. When a page is added to the swap cache, an entry from this table is allocated for it; a swap entry is now just a single integer offset into that table. The table itself is an array of swp_desc structures:

    struct swp_desc {
        union {
            swp_slot_t slot;
            struct zswap_entry *zswap_entry;
        };
        union {
            struct folio *swap_cache;
            void *shadow;
        };
        unsigned int swap_count;
        unsigned short memcgid:16;
        bool in_swapcache:1;
        enum swap_type type:2;
    };

The first union tells the system where to find a swapped-out page; it either points to a device-specific swap slot or an entry in the zswap cache. It is the mapping between the virtual swap slot and a real location. The second union contains either the location of the page in RAM (or, more precisely, its folio) or the shadow information used by the memory-management subsystem to track how quickly pages are faulted back in. The swap_count field tracks how many page-table entries refer to this swap slot, while in_swapcache is set when a page is assigned to the slot. The control group (if any) managing this allocation is noted in memcgid.

The type field tells the kernel what type of mapping is currently represented by this swap slot. If it is VSWAP_SWAPFILE, the virtual slot maps to a physical slot (identified by the slot field) on a swap device. If, instead, it is VSWAP_ZERO, it represents a swapped-out page that was filled with zeroes that need not be stored anywhere. VSWAP_ZSWAP identifies a slot in the zswap subsystem (pointed to by zswap_entry), and VSWAP_FOLIO is for a page (indicated by swap_cache) that is currently resident in RAM.
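
The fault path then becomes a dispatch on that field. The enum values below are the names given in the article; the helper and its strings are an invented sketch of the dispatch, not code from the patch set:

```c
/* The four mapping states of a virtual swap slot, per Pham's series. */
enum swap_type {
    VSWAP_SWAPFILE, /* backed by a physical slot on a swap device */
    VSWAP_ZERO,     /* zero-filled page; nothing is stored anywhere */
    VSWAP_ZSWAP,    /* compressed copy held in memory by zswap */
    VSWAP_FOLIO,    /* folio currently resident in RAM (swap cache) */
};

/* Hypothetical helper: where must a fault handler look for the data? */
static const char *vswap_location(enum swap_type t)
{
    switch (t) {
    case VSWAP_SWAPFILE:
        return "swap device";   /* read from desc->slot */
    case VSWAP_ZERO:
        return "nowhere";       /* just hand back a zeroed page */
    case VSWAP_ZSWAP:
        return "zswap";         /* decompress desc->zswap_entry */
    case VSWAP_FOLIO:
        return "swap cache";    /* use the desc->swap_cache folio */
    }
    return "unknown";
}
```

The VSWAP_ZERO case is a nice side benefit of the indirection: a page of zeroes never needs to consume a slot anywhere.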

The big advantage of this arrangement is that a page can move easily from one swap device to another. A zswap page can be pushed out to a storage device, for example, and all that needs to change is a pair of fields in the swp_desc structure. The slot in that storage device need not be assigned until a decision to push the page out is made; if a given page is never pushed out, it will not need a slot in the storage device at all. If a swap device is removed, a bunch of swp_desc entries will need to be changed, but there will be no need to go scanning through page tables, since the virtual swap slots will be the same.
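
A minimal sketch of why zswap writeback becomes cheap under this scheme (the structure and function below are simplified stand-ins, not patch code): pushing a compressed page out to disk only rewrites two fields of its descriptor, and every page-table entry holding the virtual swap entry remains valid untouched.

```c
/* Simplified stand-in for the relevant parts of swp_desc. */
struct swp_desc_sketch {
    enum { ZSWAP, SWAPFILE } type;
    union {
        void              *zswap_entry; /* compressed copy in RAM */
        unsigned long long slot;        /* slot on a physical device */
    };
};

/*
 * Writeback of a zswap page: the device slot is allocated only now, at
 * writeback time, rather than when the page first entered zswap, so a
 * page that is never written back never consumes disk space at all.
 */
static void writeback_to_device(struct swp_desc_sketch *d,
                                unsigned long long newly_allocated_slot)
{
    d->slot = newly_allocated_slot;
    d->type = SWAPFILE;
}
```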

The cost comes in the form of increased memory usage and complexity. The swap table is one 64-bit word per swap entry; the swp_desc structure triples that size. Pham points out that the added memory overhead is less than it seems, since this structure holds other information that is stored elsewhere in current kernels. Still, it is a significant increase in memory usage in a subsystem whose purpose is to make memory available for other uses. This code also shows performance regressions on various benchmarks, though those have improved considerably from previous versions of the patch set.

Still, while the value of this work is evident, it is not yet obvious that it can clear the bar for merging. Kairui Song, who has done the bulk of the swap-related work described in the previous articles, has expressed concerns about the memory overhead and how the system performs under pressure. Chris Li also worries about the overhead and said that the series is too focused on improving zswap at the expense of other swap methods. So it seems likely that this work will need to see a number of rounds of further development to reach a point where it is more widely considered acceptable.

Postscript: swap tiers

There is a separate project that appears to be entirely independent from the implementation of the virtual swap space, but which might combine well with it: the swap tiers patch set from Youngjun Park. In short, this series allows administrators to configure multiple swap devices into tiers; high-performance devices would go into one tier, while slower devices would go into another. The kernel will prefer to swap to the faster tiers when space is available. There is a set of control-group hooks to allow the administrator to control which tiers any given group of processes is allowed to use, so latency-sensitive (or higher-paying) workloads could be given exclusive access to the faster swap devices.
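
The selection logic at the heart of the idea can be sketched as a priority scan; the data structures, the bitmask interface, and the function below are invented for illustration and do not reflect the actual interfaces in Park's patches:

```c
#include <stddef.h>

/* One swap tier, fastest first; only free-slot accounting is modeled. */
struct tier {
    int free_slots;
};

/*
 * Return the index of the fastest tier that this group of processes is
 * allowed to use (per its control-group mask) and that still has free
 * slots, or -1 if no usable swap space remains.
 */
static int pick_tier(const struct tier *tiers, size_t n,
                     unsigned allowed_mask)
{
    for (size_t i = 0; i < n; i++) {
        if ((allowed_mask & (1u << i)) && tiers[i].free_slots > 0)
            return (int)i;
    }
    return -1;
}
```

A latency-sensitive workload would simply get a mask covering only the fast tiers, while batch workloads could be allowed to spill into the slow ones.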

A virtual swap table would clearly complement this arrangement. Zswap is already a special case of tiered swapping; Park's infrastructure would make it more general. Movement of pages between tiers would become relatively easy, allowing cold data to be pushed in the direction of slower storage. So it would not be surprising to see this patch series and the virtual swap space eventually become tied together in some way, assuming that both sets of patches continue to advance.

In general, the kernel's swapping subsystem has recently seen more attention than it has received in years. There is clearly interest in improving the performance and flexibility of swapping while making the code more maintainable in the long run. The days when developers feared to tread in this part of the memory-management subsystem appear to have passed.



Comments

As it stands, zswap is a great improvement but, due to its implementation details (frontswap), it cannot go further. Couldn't this virtual swap space just point to zswap entries regardless of whether they are in RAM or on the backing device? If such an entry is encountered, have zswap decompress it, i.e. go through frontswap in reverse? This is much like when zswap gets disabled at runtime: new folios will just go directly to and from the backing device, but the already-compressed ones are kept and treated accordingly. In the long run, I think zswap could become the default that way. Don't want compression? Set some dummy compressor to get legacy behavior. Is all this just too much complexity/overhead, or do kernel devs have such an approach on their wish list as well?

I think, for the purposes of zram, plain tmpfs with a path through zswap would suffice, unless I am missing some real use cases that require the block layer. Think of it: folios read from zram need to be decompressed so the reader can use them. That means that some pages exist in compressed and uncompressed form at the same time, wasting memory; or maybe they replace one another? Doesn't really matter. My point is that tmpfs was almost there, but from the other direction: pages live uncompressed in the page cache until they get swapped out. Just make that swapping go through zswap and zram is obsolete. The added bonus is that hot pages, regardless of whether they are anonymous or tmpfs or whatever, naturally stay in the "hot" uncompressed region of RAM, and the rest get (z)swapped out, as per usual; whereas with swap on zram, the pages are already considered cold. And other uses of zram cannot easily make use of swap, because that might just be the same device they are coming from; the kernel says: reclaim pages from zram0, and the swap subsystem writes pages to swap0, which resides on zram0!

And just like that, you ruptured the space-time continuum, by letting Marty McFly prevent his father from getting together with his mother, or some such. ;)

And, of course, that zswap quirk of requiring space to be pre-allocated and (almost) never touched needs to be fixed. But, in my head at least, it could all be way simpler that way. I think the authors of these patches may suffer from tunnel vision, so I am just throwing this out there as food for thought.

Maybe it's a more recent change? I haven't followed any changes since I discovered the noswap mount option. Since my heavy tmpfs usage involves incompressible data exclusively, I now prefer capping the tmpfs size somewhere close to, but below, physical RAM (it was double that before) and having other pages zswapped in their place when the need arises. Otherwise, it would be reject_compress_fail galore for the tmpfs pages anyway, and I would like to save as much I/O as possible from happening. I'll tolerate the odd rejected page, but not on the gigabyte order of magnitude, on top of the outcome being clear before zswap even tried. My use case is only slightly worse off for the limitation of available space.

Depends on how recent we are talking. Is Linux 5.10 old enough?

That may very well be the case. I am running Ubuntu LTS, which is not exactly bleeding edge. Thanks for finally giving me a definitive answer to that question!