Because it speeds up virtual address translation [1]. When a program makes a memory access, the CPU must first translate the virtual address to a physical address by walking a hierarchical data structure called a page table [2]. Walking the page tables is slow, so CPUs implement a small on-CPU cache of virtual-to-physical translations called a TLB [1]. The TLB has a limited number of entries for each page size. With 4 KiB pages, contention on this cache is very high, especially if the workload has a large working-set size, causing frequent evictions and slow page-table walks. With 2 MiB or 1 GiB pages, there is less contention and more of the working set is covered by the TLB. For example, with 4 KiB pages a TLB with 1024 entries can cover at most 4 MiB of working-set memory; with 2 MiB pages, it can cover up to 2 GiB. Often, the CPU has a different number of entries for each page size.
However, larger page sizes have higher internal fragmentation and thus waste more memory. It's a trade-off. But generally speaking, on modern systems the overhead of managing memory in 4 KiB chunks is very high, and we are at the point where switching to 16/64 KiB is almost always a win. 2 MiB is still a bit of a stretch, but transparent 2 MiB pages for heap memory (aka THP) are enabled by default on most major Linux distributions [2].

Source: my PhD is on memory management and address translation for large-memory systems; I've worked both on the hardware architecture of address translation and TLBs and on the Linux kernel. I'm happy to talk about this all day!

[1] https://blogs.vmware.com/vsphere/2020/03/how-is-virtual-memo...
[2] https://docs.kernel.org/admin-guide/mm/transhuge.html
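To make the THP point concrete, here is a minimal sketch (mine, not the commenter's) of how a Linux program can ask for transparent huge pages on an anonymous heap-like region; the 64 MiB region size is just an arbitrary example:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64UL << 20;  /* 64 MiB region, room for several 2 MiB pages */

        /* Anonymous mapping; the kernel may or may not back it with huge pages. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Hint that this range should prefer transparent huge pages. It is only
           a hint; the outcome depends on the THP policy in
           /sys/kernel/mm/transparent_hugepage/enabled. */
        if (madvise(p, len, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");

        /* Touch the memory so physical pages are actually allocated. */
        for (size_t i = 0; i < len; i += 4096)
            ((char *)p)[i] = 1;

        munmap(p, len);
        return 0;
    }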
There's probably no good reason to put code and data on the same page; it costs just one extra TLB entry to use two pages instead, so the data page can be marked non-executable.
It's nice for type 1 hypervisors when carving up memory for guests. When a page walk from guest-virtual to host-physical ends up taking sixteen levels, a 1 GiB page short-circuits that in half, to eight.
The Chrome browser on Android uses the same code base as Chrome on desktop, including the multi-process architecture. But its UI is in Java, communicating with the C++ side over JNI.
If you use a database library that uses mmap to create a db file laid out in _SC_PAGE_SIZE (4 KB) pages, and you then upgrade your device to a 16 KB one and backup/restore the app, your data isn't readable anymore.
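A minimal sketch of the safer pattern, assuming the database file records its own page size in a header instead of hardcoding 4096 (the header value below is hypothetical):

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Query the page size at run time instead of baking in 4096 at build time. */
        long page_size = sysconf(_SC_PAGE_SIZE);
        if (page_size < 0) {
            perror("sysconf");
            return 1;
        }

        /* A db created on a 4 KiB device and restored onto a 16 KiB device keeps
           its old layout, so the on-disk page size has to be stored and compared
           against the runtime value rather than assumed to match it. */
        long file_page_size = 4096;  /* hypothetical value read from the db header */

        if (file_page_size != page_size)
            fprintf(stderr, "db written with %ld-byte pages, system uses %ld\n",
                    file_page_size, page_size);

        printf("runtime page size: %ld bytes\n", page_size);
        return 0;
    }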
TIL about MapViewOfFile3 and NtMapViewOfSectionEx, thanks! Still, the Microsoft docs say [1]:

> [in, optional] BaseAddress
> The desired base address of the view (the address is rounded down to the nearest 64k boundary).
> [...]
> [in] Offset
> The offset from the beginning of the section.
> The offset must be 64k aligned.

The peculiar part is that the base address and offset must be divisible by 64K (also referred to as the "allocation granularity") while the size only needs to be divisible by the page size. Maybe you're right and the docs are wrong?..

[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryap...
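For what it's worth, a hedged sketch of the call under discussion, following the documented constraints quoted above (64 KiB-aligned offset, page-size-multiple view size); hMapping is assumed to be an existing section handle and error handling is minimal:

    #include <windows.h>
    #include <stdio.h>

    /* Sketch: map a slice of an existing file mapping with MapViewOfFile3.
       Needs a recent Windows SDK; link against onecore.lib (or mincore.lib). */
    void *map_slice(HANDLE hMapping, ULONG64 offset, SIZE_T size)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);

        /* Per the quoted docs: the offset must be a multiple of the allocation
           granularity (64 KiB), while the view size only has to be a multiple
           of the page size. */
        if (offset % si.dwAllocationGranularity != 0 || size % si.dwPageSize != 0) {
            fprintf(stderr, "unaligned offset or size\n");
            return NULL;
        }

        return MapViewOfFile3(hMapping, GetCurrentProcess(),
                              NULL,          /* let the system choose the base */
                              offset, size,
                              0,             /* no special allocation type */
                              PAGE_READONLY,
                              NULL, 0);      /* no extended parameters */
    }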
The support for mTHP exists in upstream Linux, but the swap story is not quite there yet. THP availability also needs work, and there are a few competing directions.

Supporting multiple page sizes transparently and well is non-trivial. For a recent summary of one of the approaches, TAO (THP Allocation Optimization), see this LWN article: https://lwn.net/Articles/974636/
It should not break userland. GNU/Linux (not necessarily Android, though) has supported 64K pages pretty much from the start, because that was the page size originally chosen for server-focused kernels and distributions. But there are some things that need to be worked around.

Certain build processes determine the page size at compile time, assume it's the same at run time, and fail if it is not: https://github.com/jemalloc/jemalloc/issues/467

Some memory-mapped file formats have assumptions about page granularity: https://bugzilla.redhat.com/show_bug.cgi?id=1979804

The file format issue applies to ELF as well. Some people patch their toolchains (or use suitable linker options) to produce slightly smaller binaries that can only be loaded if the page size is 4K, even though the ABI is pretty clear that you should link for compatibility with up to 64K pages.
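As a concrete illustration of the linker-option point, a sketch under the assumption of a GNU toolchain: GNU ld's -z max-page-size flag controls segment alignment, and 0x10000 corresponds to the 64K upper bound mentioned above.

    /* page_check.c
     *
     * Link so the ELF segments stay loadable with page sizes up to 64K, e.g.:
     *
     *   gcc page_check.c -Wl,-z,max-page-size=0x10000 -o page_check
     *
     * At run time, print the actual page size so any mismatch with a
     * compile-time assumption is easy to spot.
     */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        printf("runtime page size: %ld bytes\n", sysconf(_SC_PAGE_SIZE));
        return 0;
    }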
That this isn't the only 4K→16K transition in recent history? Some programs that assumed 4K had to be fixed as part of that transition, which can provide insights into the work required for Android.
Apple's M-series chips use a 16 KB page size by default, so the state of things has improved significantly as software has wanted to support Asahi and other related endeavors.
I didn't realize they had reverted it. I used to run RHEL builds on Pi systems to test for 64k page bugs, because it's not like there's a POWER SBC I could buy for this.
I wonder how much help they had from Asahi doing a lot of the kernel and ecosystem work anablibg 16k pages.

RISC-V being fixed to 4k pages seems to be a bit of an oversight as well.
It's pretty cool that I can read "anablibg" and know that means "enabling." The brain is pretty neat. I wonder if LLMs would get it too. They probably would.
I remember reading somewhere that LLMs are actually fantastic at reading heavily mistyped sentences! Mistyped to a level where humans actually struggle.

(I will update this comment if I find a source)
Svnapot is a poor solution to the problem.

On one hand, it means that each page table entry takes up half a cache line in the 16KB case, and two whole cache lines in the 64KB case. This really cuts down on the page walker hardware's ability to effectively prefetch TLB entries, leading to basically the same issues as in this classic discussion about why tree-based page tables are generally more effective than hash-based page tables (shifted forward in time to today's gate counts): https://yarchive.net/comp/linux/page_tables.html

This is why ARM shifted from an Svnapot-like solution to a "translation granule queryable and partially selectable at runtime" solution.

Another issue is that a big reason to switch to 16KB or even 64KB pages is to allow more address range for VIPT caches. You want high-performance implementations to be able to look up the cache line while performing the TLB lookup in parallel, then compare the tag with the result of the TLB lookup. This means that practically only the untranslated bits of the address can be used by the set-selection portion of the cache lookup. With 12 untranslated bits in an address and 64-byte cache lines, you get 64 sets; multiply that by 8 ways and you get the 32KB L1 caches very common in systems with 4KB page sizes (sometimes with some heroic effort to throw a ton of transistors/power at the problem to make a 64KB cache by essentially duplicating large parts of the cache lookup hardware for that extra bit of address).

What you really want is for the arch to be able to disallow 4KB pages, like on Apple silicon, which is the main piece that allows their giant 128KB and 192KB L1 caches.
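To spell out the VIPT arithmetic, a small sketch (my own illustration) of the largest L1 you can index using only the page-offset bits; the 8-way, 64-byte-line parameters are the common values assumed in the comment:

    #include <stdio.h>

    /* Largest VIPT L1 cache indexable with only the untranslated page-offset
       bits: the set count is limited to page_size / line_size, so
       capacity = sets * ways * line_size = page_size * ways. */
    static long vipt_l1_limit(long page_size, long line_size, long ways)
    {
        long sets = page_size / line_size;
        return sets * ways * line_size;
    }

    int main(void)
    {
        const long line = 64, ways = 8;

        /* 4 KiB pages:  64 sets * 8 ways * 64 B = 32 KiB (the common size) */
        printf("4 KiB pages:  %ld KiB\n", vipt_l1_limit(4096, line, ways) / 1024);

        /* 16 KiB pages: 256 sets * 8 ways * 64 B = 128 KiB (Apple-class L1s) */
        printf("16 KiB pages: %ld KiB\n", vipt_l1_limit(16384, line, ways) / 1024);

        return 0;
    }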
Microsoft wanted to make x86 compatibility as painless as possible. They adopted an ABI in which registers can generally be mapped 1:1 between the two architectures.
> 5-10% performance boost sounds huge. Wouldn't we have much larger TLBs if page walks were really this expensive?

It's pretty typical for large programs to spend 15+% of their "CPU time" waiting for the TLB. [1] So larger pages really help, including changing the base 4 KiB -> 16 KiB (a 4x reduction in TLB pressure) and using 2 MiB huge pages (a 512x reduction where it works out). I've also wondered why the TLB isn't larger.

> On the other hand 9% increase in memory usage also sounds huge. How did this affect memory usage that much?

The page is the granularity at which physical memory is assigned, and there are a lot of reasons most of a page might be wasted:

* The heap allocator will typically cram many things together in a page, but it might, say, only use a given page for allocations in a certain size range, so not all allocations will snuggle in next to each other.

* Program stacks each use at least one distinct page of physical RAM because they're placed in distinct virtual address ranges with guard pages between them. So 1,024 threads use at least 4 MiB of RAM with 4 KiB pages, or 16 MiB with 16 KiB pages.

* Anything from the filesystem that is cached in RAM ends up in the page cache, and true to the name, it has page granularity. So caching a 1-byte file would take 4 KiB before and 16 KiB after (the arithmetic is sketched below).

[1] If you have an Intel CPU, toplev is particularly nice for pointing this kind of thing out: https://github.com/andikleen/pmu-tools
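A tiny sketch of the arithmetic behind the stack and page-cache bullets above (my own illustration; the 1,024-thread and 1-byte-file figures come from the comment):

    #include <stdio.h>

    /* Round a size up to the next multiple of the page size. */
    static long round_up(long bytes, long page)
    {
        return (bytes + page - 1) / page * page;
    }

    int main(void)
    {
        const long threads = 1024;

        for (int i = 0; i < 2; i++) {
            long page = (i == 0) ? 4096 : 16384;

            /* Each thread stack touches at least one physical page. */
            long stack_min = threads * page;

            /* A 1-byte file cached in the page cache still occupies a full page. */
            long tiny_file = round_up(1, page);

            printf("%2ld KiB pages: >= %ld MiB for %ld stacks, %ld KiB for a 1-byte file\n",
                   page / 1024, stack_min >> 20, threads, tiny_file / 1024);
        }
        return 0;
    }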
Because the whole thing sounds like they're doing something new, but they're just catching up to something Apple did back when they switched to aarch64.
This will likely reveal bugs that need fixing in some of the 70,000+ packages in the Debian archive.
That ARM64 16 KiB page size is interesting with respect to the Apple M1, where Asahi [0] identified that the DART IOMMU has a minimum page size of 16 KiB, so using that page size as a minimum for everything is going to be more efficient.
[0] https://asahilinux.org/2021/10/progress-report-september-202...