Path should be a system call

Original link: https://simonsafar.com/2025/path_as_system_call/

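The author points out a common performance bottleneck: repeatedly checking whether a file exists in each of many directories, as happens when Emacs loads Lisp files, when `bash` looks up executables on `$PATH`, and during Python module discovery. These lookups issue a large number of system calls (`newfstatat`, `openat`) and probe several file-name variants (for example, gzip-compressed versions) in every directory on the search list. Python optimizes this somewhat by listing directories, but the underlying problem remains. The author proposes a more efficient approach: instead of separate existence checks, the operating system could offer a query mechanism that searches a set of directories for a set of file names in one operation. This would reduce the number of system calls and network round trips, much as AS/400 libraries address a similar lookup problem. The author suggests that a database such as Postgres can solve this problem efficiently, and asks whether an operating system or file system could provide a comparable built-in, optimized file search.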

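This Hacker News thread discusses a proposal to make looking up an executable on the PATH environment variable a single system call, for performance reasons. Some users point to existing tools or libraries that offer similar functionality, such as `io_uring_prep_stat` on Linux or the functions in Windows' `shlwapi.h`. Others mention that shells cache the locations of binaries to avoid redundant lookups. The discussion also touches on the potential benefits of file system object stores backed by a metadata database. One user links to WinFS, a failed Microsoft project that attempted something similar, noting that performance problems and increased memory requirements contributed to its failure. Finally, one commenter notes that NTFS stores small files efficiently inside the file record, but that the overall file lookup process on Windows remains slow because of the driver stack and compatibility layers.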

  • Original text

    2025/04/22

    (... but... it's a variable... how do you even)

    Let us present the problem.

    ... a bunch of path lookups done by Emacs

    This is Emacs starting up and loading some Lisp files. For which we first need to figure out where to find them.

    As it happens, they could be found at many possible locations. There is a list of these locations in the load-path variable; our method is to check each location in turn for the file we want. (Also, maybe some of them come gzipped; let's check for those, too.)

    On my not especially overcomplicated Emacs install, the list has 59 elements.

    At first sight this sounds like such a niche problem. Not only is it about Emacs, but it's also on Windows; the latter is somewhat known for its less than excellent performance when it comes to small files.

    As it happens though, bash on Linux does the exact same thing. We have a list of directories on PATH, and, whenever we want to launch a program, we'll go and check each and every one of them for the files we are looking for. We're fairly lucky though: the list is pretty short.

    ~ $ strace bash -c asdklfjasldfjaskldfasdljf
    (...)
    newfstatat(AT_FDCWD, ".", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
    newfstatat(AT_FDCWD, "/home/simon/bin/asdklfjasldfjaskldfasdljf", 0x7ffe5ff8d3c0, 0) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "/usr/local/bin/asdklfjasldfjaskldfasdljf", 0x7ffe5ff8d3c0, 0) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "/usr/bin/asdklfjasldfjaskldfasdljf", 0x7ffe5ff8d3c0, 0) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "/bin/asdklfjasldfjaskldfasdljf", 0x7ffe5ff8d3c0, 0) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "/usr/games/asdklfjasldfjaskldfasdljf", 0x7ffe5ff8d3c0, 0) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "/usr/local/games/asdklfjasldfjaskldfasdljf", 0x7ffe5ff8d3c0, 0) = -1 ENOENT (No such file or directory)
              

    ... except wait, now we're looking for ourselves?

    newfstatat(AT_FDCWD, ".", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
    newfstatat(AT_FDCWD, "/home/simon/bin/bash", 0x7ffe5ff8d490, 0) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "/usr/local/bin/bash", 0x7ffe5ff8d490, 0) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "/usr/bin/bash", 0x7ffe5ff8d490, 0) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "/bin/bash", {st_mode=S_IFREG|0755, st_size=1265648, ...}, 0) = 0
    newfstatat(AT_FDCWD, "/bin/bash", {st_mode=S_IFREG|0755, st_size=1265648, ...}, 0) = 0
    
    

    ... and also... let's not forget about our localized messages.

    openat(AT_FDCWD, "/usr/share/locale/en_US.UTF-8/LC_MESSAGES/bash.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
    openat(AT_FDCWD, "/usr/share/locale/en_US.utf8/LC_MESSAGES/bash.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
    openat(AT_FDCWD, "/usr/share/locale/en_US/LC_MESSAGES/bash.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
    openat(AT_FDCWD, "/usr/share/locale/en.UTF-8/LC_MESSAGES/bash.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
    openat(AT_FDCWD, "/usr/share/locale/en.utf8/LC_MESSAGES/bash.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
    openat(AT_FDCWD, "/usr/share/locale/en/LC_MESSAGES/bash.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
    newfstatat(2, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}, AT_EMPTY_PATH) = 0
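
    The per-directory probing visible in the traces above can be sketched roughly as follows. This is a minimal illustration in Python, not bash's actual implementation; the function name `naive_path_lookup` is made up for this example. The point is simply that each directory on the list costs one `stat()` call, whether or not the file is there.

```python
import os
import stat


def naive_path_lookup(name, path_dirs):
    """Probe each directory in order, one stat() call per candidate.

    Returns the first regular, executable file found, or None.
    (Hypothetical helper for illustration only.)
    """
    for directory in path_dirs:
        candidate = os.path.join(directory, name)
        try:
            st = os.stat(candidate)  # one system call per directory
        except OSError:
            continue  # ENOENT and similar: try the next directory
        if stat.S_ISREG(st.st_mode) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    dirs = os.environ.get("PATH", "").split(os.pathsep)
    print(naive_path_lookup("sh", dirs))
```

    With m directories and a name that does not exist anywhere, this always makes m failed calls, which is what the first trace shows.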
    
    
              

    As it happens, Python is slightly smarter than either of the two above. Instead of trying various file names, it will just go and list directories right away; it is probably this & some caching mechanisms that allow it to find some modules pretty quickly. (We're still looking for __init__.py and similar ones one by one though.)

    
    simon@anarillis ~/tmp> strace -f python3 -m our_test_dir.our_test_moduleb 2>&1 |grep our_test
    execve("/usr/bin/python3", ["python3", "-m", "our_test_dir.our_test_moduleb"], 0x7ffc087c2c38 /* 17 vars */) = 0
    newfstatat(AT_FDCWD, "/home/simon/tmp/our_test_dir/__init__.cpython-311-x86_64-linux-gnu.so", 0x7ffc3025b8e0, 0) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "/home/simon/tmp/our_test_dir/__init__.abi3.so", 0x7ffc3025b8e0, 0) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "/home/simon/tmp/our_test_dir/__init__.so", 0x7ffc3025b8e0, 0) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "/home/simon/tmp/our_test_dir/__init__.py", 0x7ffc3025b8e0, 0) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "/home/simon/tmp/our_test_dir/__init__.pyc", 0x7ffc3025b8e0, 0) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "/home/simon/tmp/our_test_dir", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
    newfstatat(AT_FDCWD, "/home/simon/tmp/our_test_dir", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
    newfstatat(AT_FDCWD, "/home/simon/tmp/our_test_dir", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
    newfstatat(AT_FDCWD, "/home/simon/tmp/our_test_dir", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
    
    # here is the dir listing!
    openat(AT_FDCWD, "/home/simon/tmp/our_test_dir", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
    write(2, "/usr/bin/python3: No module name"..., 64/usr/bin/python3: No module named our_test_dir.our_test_moduleb
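
    The "list the directory once, then answer from memory" idea can be sketched like this. It is a simplified illustration in the spirit of what Python's import machinery does, not its actual code; the class name `DirectoryCache` and its `find` method are invented for this example, and refreshing the listing when the directory's modification time changes is an assumption of the sketch.

```python
import os


class DirectoryCache:
    """Cache the set of names in one directory; re-list when its mtime changes.

    Hypothetical helper for illustration only.
    """

    def __init__(self, directory):
        self.directory = directory
        self._mtime = None
        self._names = set()

    def _refresh_if_stale(self):
        try:
            mtime = os.stat(self.directory).st_mtime_ns
            if mtime != self._mtime:
                # A single directory listing replaces many per-file probes.
                self._names = set(os.listdir(self.directory))
                self._mtime = mtime
        except OSError:
            # Directory is missing or unreadable: treat it as empty.
            self._mtime, self._names = None, set()

    def find(self, stem, suffixes):
        """Return the path of the first stem+suffix present, or None."""
        self._refresh_if_stale()
        for suffix in suffixes:
            if stem + suffix in self._names:
                return os.path.join(self.directory, stem + suffix)
        return None
```

    Note that even this version still pays one `stat()` on the directory per lookup to decide whether the cached listing is stale, so it reduces the number of calls rather than eliminating them.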
              

    Nevertheless, it seems that "trying to find files with a set of possible names in a set of possible directories" is a fairly common operation that not everyone has optimized yet.

    (Also, is "optimizing" this really a good goal? Or does it just stand for "OK workarounds for missing file system APIs"?)

    How about... instead of asking the operating system for a combination of n files at m different places, we could just give it the list of possible files and the list of possible places?

    This would already cut down on the number of system calls, and, if this is going over a network, the required roundtrips.
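
    To make the shape of such an interface concrete, here is a user-space emulation. Everything about it is hypothetical: no such system call exists, and the name `find_first` and its signature are made up for this sketch. A real kernel or file-system implementation would answer in one call; this emulation still needs one directory listing per directory, so it only illustrates the interface, not the savings.

```python
import os


def find_first(names, directories):
    """Hypothetical batched lookup: best match for any of `names`.

    Directories are searched in priority order; within a directory,
    earlier entries of `names` win. Returns a full path or None.
    """
    wanted = list(names)
    for directory in directories:
        try:
            present = set(os.listdir(directory))  # one listing per directory
        except OSError:
            continue  # missing or unreadable directory: skip it
        for name in wanted:
            if name in present:
                return os.path.join(directory, name)
    return None
```

    A caller would pass all the name variants at once, for example `find_first(["foo.elc", "foo.el", "foo.el.gz"], load_path)`, instead of issuing one probe per name per directory. Whether listing a very large directory beats a handful of targeted probes depends on the file system, which is part of why this seems better handled below the system call boundary.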

    AS/400 libraries are, by the way, solving a very similar problem. While I'm not sure what implementation they're using underneath, they have at least a good chance of not having to try every combo all the time, given their database "filesystem".

    But then, in the end, we are just trying to perform a query, to select all the source files ever WHERE they have one of the given names & then we pick the ones that are in source directories we prefer the most (e.g. come first on the PATH list). That's it.

    As it happens, Postgres can solve this problem extremely well and quickly. (... there might be a blog post on how, at some point.)
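
    As a rough illustration of the query described above, here is a sketch that uses SQLite (from the Python standard library) purely as a stand-in for Postgres. The schema, the table names `files` and `search_path`, the helper names, and the sample rows are all assumptions invented for this example; it is not the approach the promised blog post would take.

```python
import sqlite3


def build_index(search_dirs, files):
    """Create an in-memory index: which files exist, and directory priority."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE files (
            dir  TEXT NOT NULL,
            name TEXT NOT NULL,
            PRIMARY KEY (name, dir)   -- lookups are by name first
        );
        CREATE TABLE search_path (
            dir      TEXT PRIMARY KEY,
            priority INTEGER NOT NULL -- lower number = earlier on the list
        );
        """
    )
    conn.executemany(
        "INSERT INTO search_path (dir, priority) VALUES (?, ?)",
        [(d, i) for i, d in enumerate(search_dirs)],
    )
    conn.executemany("INSERT INTO files (dir, name) VALUES (?, ?)", files)
    return conn


def lookup(conn, names):
    """Return (dir, name) of the best match for any of `names`, or None."""
    names = list(names)
    if not names:
        return None
    placeholders = ",".join("?" for _ in names)
    return conn.execute(
        f"""
        SELECT f.dir, f.name
        FROM files AS f
        JOIN search_path AS p ON p.dir = f.dir
        WHERE f.name IN ({placeholders})
        ORDER BY p.priority, f.name
        LIMIT 1
        """,
        names,
    ).fetchone()


if __name__ == "__main__":
    conn = build_index(
        ["/home/simon/bin", "/usr/local/bin", "/usr/bin", "/bin"],
        [("/bin", "bash"), ("/usr/bin", "python3"), ("/usr/local/bin", "python3")],
    )
    print(lookup(conn, ["python3"]))
```

    The whole "n names in m directories" search becomes one indexed query: filter by name, join against the ordered directory list, take the top row.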

    Could it be something that the operating system or the file system just... does for you, quickly and efficiently?
