Because it speeds up virtual address translation [1]. When a program makes a memory access, the CPU must first translate the virtual address to a physical address by walking a hierarchical data structure called a page table [2]. Walking the page tables is slow, so CPUs implement a small on-CPU cache of virtual-to-physical translations called a TLB [1]. The TLB has a limited number of entries for each page size. With 4 KiB pages, contention on this cache is very high, especially if the workload has a large working-set size, causing frequent evictions and slow page-table walks. With 2 MiB or 1 GiB pages, there is less contention and more of the working set is covered by the TLB. For example, with 4 KiB pages a TLB with 1024 entries can cover at most 4 MiB of working-set memory; with 2 MiB pages, it can cover up to 2 GiB. Often, the CPU has a different number of entries for each page size.
However, larger page sizes have higher internal fragmentation and thus waste more memory. It's a trade-off. But generally speaking, on modern systems the overhead of managing memory in 4 KiB chunks is very high, and we are at the point where switching to 16/64 KiB is almost always a win. 2 MiB is still a bit of a stretch, but transparent 2 MiB pages for heap memory (aka THP) are enabled by default on most major Linux distributions [2].

Source: my PhD is on memory management and address translation for large-memory systems; I've worked both on the hardware architecture of address translation and TLBs and on the Linux kernel. I'm happy to talk about this all day!

[1] https://blogs.vmware.com/vsphere/2020/03/how-is-virtual-memo...
[2] https://docs.kernel.org/admin-guide/mm/transhuge.html
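To make the THP point concrete, here is a minimal sketch (mine, not the commenter's) of how a Linux program can ask for transparent huge pages on an anonymous heap-like region; the 64 MiB region size is just an arbitrary example:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64UL << 20;  /* 64 MiB region, room for several 2 MiB pages */

        /* Anonymous mapping; the kernel may or may not back it with huge pages. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Hint that this range should prefer transparent huge pages. It is only
           a hint; the outcome depends on the THP policy in
           /sys/kernel/mm/transparent_hugepage/enabled. */
        if (madvise(p, len, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");

        /* Touch the memory so physical pages are actually allocated. */
        for (size_t i = 0; i < len; i += 4096)
            ((char *)p)[i] = 1;

        munmap(p, len);
        return 0;
    }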
There's probably no good reason to put code and data on the same page; it costs just one extra TLB entry to use two pages instead, so the data page can be marked non-executable.
It's nice for type 1 hypervisors when carving up memory for guests. When a page walk from guest-virtual to host-physical ends up taking sixteen levels, a 1 GiB page short-circuits that in half, to eight.
The Chrome browser on Android uses the same code base as Chrome on desktop, including the multi-process architecture. But its UI is in Java, communicating with the C++ side over JNI.
If you use a database library that uses mmap to create a db file laid out in _SC_PAGE_SIZE (4 KB) pages, and you then upgrade your device to a 16 KB one and backup/restore the app, your data isn't readable anymore.
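A minimal sketch of the safer pattern, assuming the database file records its own page size in a header instead of hardcoding 4096 (the header value below is hypothetical):

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Query the page size at run time instead of baking in 4096 at build time. */
        long page_size = sysconf(_SC_PAGE_SIZE);
        if (page_size < 0) {
            perror("sysconf");
            return 1;
        }

        /* A db created on a 4 KiB device and restored onto a 16 KiB device keeps
           its old layout, so the on-disk page size has to be stored and compared
           against the runtime value rather than assumed to match it. */
        long file_page_size = 4096;  /* hypothetical value read from the db header */

        if (file_page_size != page_size)
            fprintf(stderr, "db written with %ld-byte pages, system uses %ld\n",
                    file_page_size, page_size);

        printf("runtime page size: %ld bytes\n", page_size);
        return 0;
    }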
TIL about MapViewOfFile3 and NtMapViewOfSectionEx, thanks! Still, the Microsoft docs say [1]:

> [in, optional] BaseAddress
> The desired base address of the view (the address is rounded down to the nearest 64k boundary).
> [...]
> [in] Offset
> The offset from the beginning of the section.
> The offset must be 64k aligned.

The peculiar part is that the base address and offset must be divisible by 64K (also referred to as the "allocation granularity") while the size only needs to be divisible by the page size. Maybe you're right and the docs are wrong?..

[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryap...
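For what it's worth, a hedged sketch of the call under discussion, following the documented constraints quoted above (64 KiB-aligned offset, page-size-multiple view size); hMapping is assumed to be an existing section handle and error handling is minimal:

    #include <windows.h>
    #include <stdio.h>

    /* Sketch: map a slice of an existing file mapping with MapViewOfFile3.
       Needs a recent Windows SDK; link against onecore.lib (or mincore.lib). */
    void *map_slice(HANDLE hMapping, ULONG64 offset, SIZE_T size)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);

        /* Per the quoted docs: the offset must be a multiple of the allocation
           granularity (64 KiB), while the view size only has to be a multiple
           of the page size. */
        if (offset % si.dwAllocationGranularity != 0 || size % si.dwPageSize != 0) {
            fprintf(stderr, "unaligned offset or size\n");
            return NULL;
        }

        return MapViewOfFile3(hMapping, GetCurrentProcess(),
                              NULL,          /* let the system choose the base */
                              offset, size,
                              0,             /* no special allocation type */
                              PAGE_READONLY,
                              NULL, 0);      /* no extended parameters */
    }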
The support for mTHP exists in upstream Linux, but the swap story is not quite there yet. THP availability also needs work, and there are a few competing directions.

Supporting multiple page sizes transparently and well is non-trivial. For a recent summary of one of the approaches, TAO (THP Allocation Optimization), see this LWN article: https://lwn.net/Articles/974636/
It should not break userland. GNU/Linux (not necessarily Android, though) has supported 64K pages pretty much from the start, because that was the page size originally chosen for server-focused kernels and distributions. But there are some things that need to be worked around.

Certain build processes determine the page size at compile time, assume it's the same at run time, and fail if it is not: https://github.com/jemalloc/jemalloc/issues/467

Some memory-mapped file formats have assumptions about page granularity: https://bugzilla.redhat.com/show_bug.cgi?id=1979804

The file format issue applies to ELF as well. Some people patch their toolchains (or use suitable linker options) to produce slightly smaller binaries that can only be loaded if the page size is 4K, even though the ABI is pretty clear that you should link for compatibility with up to 64K pages.
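As a concrete illustration of the linker-option point, a sketch under the assumption of a GNU toolchain: GNU ld's -z max-page-size flag controls segment alignment, and 0x10000 corresponds to the 64K upper bound mentioned above.

    /* page_check.c
     *
     * Link so the ELF segments stay loadable with page sizes up to 64K, e.g.:
     *
     *   gcc page_check.c -Wl,-z,max-page-size=0x10000 -o page_check
     *
     * At run time, print the actual page size so any mismatch with a
     * compile-time assumption is easy to spot.
     */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        printf("runtime page size: %ld bytes\n", sysconf(_SC_PAGE_SIZE));
        return 0;
    }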
That this isn't the only 4K→16K transition in recent history? Some programs that assumed 4K had to be fixed as part of that transition, which can provide insights into the work required for Android.
Apple's M-series chips use a 16 KB page size by default, so the state of things has improved significantly as software has wanted to support Asahi and other related endeavors.
I didn't realize they had reverted it. I used to run RHEL builds on Pi systems to test for 64k page bugs, because it's not like there's a POWER SBC I could buy for this.
I wonder how much help they had from Asahi doing a lot of the kernel and ecosystem work anablibg 16k pages.

RISC-V being fixed to 4k pages seems to be a bit of an oversight as well.
It's pretty cool that I can read "anablibg" and know that means "enabling." The brain is pretty neat. I wonder if LLMs would get it too. They probably would.
I remember reading somewhere that LLMs are actually fantastic at reading heavily mistyped sentences! Mistyped to a level where humans actually struggle.

(I will update this comment if I find a source)
Svnapot is a poor solution to the problem.

On one hand, it means that each page table entry takes up half a cache line in the 16KB case, and two whole cache lines in the 64KB case. This really cuts down on the page walker hardware's ability to effectively prefetch TLB entries, leading to basically the same issues as in this classic discussion about why tree-based page tables are generally more effective than hash-based page tables (shifted forward in time to today's gate counts): https://yarchive.net/comp/linux/page_tables.html

This is why ARM shifted from an Svnapot-like solution to a "translation granule queryable and partially selectable at runtime" solution.

Another issue is that a big reason to switch to 16KB or even 64KB pages is to allow more address range for VIPT caches. You want high-performance implementations to be able to look up the cache line while performing the TLB lookup in parallel, then compare the tag with the result of the TLB lookup. This means that practically only the untranslated bits of the address can be used by the set-selection portion of the cache lookup. With 12 untranslated bits in an address and 64-byte cache lines, you get 64 sets; multiply that by 8 ways and you get the 32KB L1 caches very common in systems with 4KB page sizes (sometimes with some heroic effort to throw a ton of transistors/power at the problem to make a 64KB cache by essentially duplicating large parts of the cache lookup hardware for that extra bit of address).

What you really want is for the arch to be able to disallow 4KB pages, like on Apple silicon, which is the main piece that allows their giant 128KB and 192KB L1 caches.
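To spell out the VIPT arithmetic, a small sketch (my own illustration) of the largest L1 you can index using only the page-offset bits; the 8-way, 64-byte-line parameters are the common values assumed in the comment:

    #include <stdio.h>

    /* Largest VIPT L1 cache indexable with only the untranslated page-offset
       bits: the set count is limited to page_size / line_size, so
       capacity = sets * ways * line_size = page_size * ways. */
    static long vipt_l1_limit(long page_size, long line_size, long ways)
    {
        long sets = page_size / line_size;
        return sets * ways * line_size;
    }

    int main(void)
    {
        const long line = 64, ways = 8;

        /* 4 KiB pages:  64 sets * 8 ways * 64 B = 32 KiB (the common size) */
        printf("4 KiB pages:  %ld KiB\n", vipt_l1_limit(4096, line, ways) / 1024);

        /* 16 KiB pages: 256 sets * 8 ways * 64 B = 128 KiB (Apple-class L1s) */
        printf("16 KiB pages: %ld KiB\n", vipt_l1_limit(16384, line, ways) / 1024);

        return 0;
    }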
Microsoft wanted to make x86 compatibility as painless as possible. They adopted an ABI in which registers can generally be mapped 1:1 between the two architectures.
> 5-10% performance boost sounds huge. Wouldn't we have much larger TLBs if page walks were really this expensive?

It's pretty typical for large programs to spend 15+% of their "CPU time" waiting for the TLB. [1] So larger pages really help, including changing the base 4 KiB -> 16 KiB (a 4x reduction in TLB pressure) and using 2 MiB huge pages (a 512x reduction where it works out). I've also wondered why the TLB isn't larger.

> On the other hand 9% increase in memory usage also sounds huge. How did this affect memory usage that much?

The page is the granularity at which physical memory is assigned, and there are a lot of reasons most of a page might be wasted:

* The heap allocator will typically cram many things together in a page, but it might, say, only use a given page for allocations in a certain size range, so not all allocations will snuggle in next to each other.

* Program stacks each use at least one distinct page of physical RAM because they're placed in distinct virtual address ranges with guard pages between them. So 1,024 threads use at least 4 MiB of RAM with 4 KiB pages, or 16 MiB with 16 KiB pages.

* Anything from the filesystem that is cached in RAM ends up in the page cache, and true to the name, it has page granularity. So caching a 1-byte file would take 4 KiB before and 16 KiB after (the arithmetic is sketched below).

[1] If you have an Intel CPU, toplev is particularly nice for pointing this kind of thing out: https://github.com/andikleen/pmu-tools
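A tiny sketch of the arithmetic behind the stack and page-cache bullets above (my own illustration; the 1,024-thread and 1-byte-file figures come from the comment):

    #include <stdio.h>

    /* Round a size up to the next multiple of the page size. */
    static long round_up(long bytes, long page)
    {
        return (bytes + page - 1) / page * page;
    }

    int main(void)
    {
        const long threads = 1024;

        for (int i = 0; i < 2; i++) {
            long page = (i == 0) ? 4096 : 16384;

            /* Each thread stack touches at least one physical page. */
            long stack_min = threads * page;

            /* A 1-byte file cached in the page cache still occupies a full page. */
            long tiny_file = round_up(1, page);

            printf("%2ld KiB pages: >= %ld MiB for %ld stacks, %ld KiB for a 1-byte file\n",
                   page / 1024, stack_min >> 20, threads, tiny_file / 1024);
        }
        return 0;
    }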
Because the whole thing sounds like they're doing something new, but they're just catching up to something Apple did back when they switched to aarch64.
This will likely reveal bugs that need fixing in some of the 70,000+ packages in the Debian archive.
That ARM64 16 KiB page size is interesting with respect to the Apple M1, where Asahi [0] identified that the DART IOMMU has a minimum page size of 16 KiB, so using that page size as a minimum for everything is going to be more efficient.
[0] https://asahilinux.org/2021/10/progress-report-september-202...