Linux Sandboxes and Fil-C


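## Combining Fil-C Memory Safety with Sandboxing

Memory safety and sandboxing are different but complementary security measures. A program can be memory safe but not sandboxed (and so free to perform any filesystem operation the user is allowed), or sandboxed but not memory safe (and so still open to attack through memory safety bugs). The ideal is to have both. This document details how to integrate Fil-C, a memory safe implementation of C/C++, with Linux sandboxing techniques, in particular those used by OpenSSH. Linux provides `chroot`, user/group permissions, `setrlimit`, and `seccomp-BPF` (syscall filtering) for building sandboxes. Using `chroot` and user/group permissions from Fil-C is straightforward, but `setrlimit` and `seccomp-BPF` need care because the Fil-C runtime has its own threads (used for garbage collection). A key challenge is preventing thread creation inside the sandbox without breaking the runtime; a new Fil-C API, `zlock_runtime_threads`, solves this by creating the necessary runtime threads ahead of time. The changes to the OpenSSH sandbox are: killing all threads when a violation occurs, allowing the `MAP_NORESERVE` flag for `mmap`, and allowing the `sched_yield` syscall, both of which Fil-C needs. Fil-C's `prctl` wrapper applies the sandbox settings to *all* threads, so the sandbox cannot be bypassed through an unfiltered thread even if a memory safety bug is found in Fil-C itself. Combining Fil-C's memory safety with Linux sandboxing gives strong defense in depth.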


Original text

Memory safety and sandboxing are two different things. It's reasonable to think of them as orthogonal: you could have memory safety but not be sandboxed, or you could be sandboxed but not memory safe.

Example of memory safe but not sandboxed: a pure Java program that opens files on the filesystem for reading and writing and accepts filenames from the user. The OS will allow this program to overwrite any file that the user has access to. This program can be quite dangerous even if it is memory safe. Worse, imagine that the program didn't have any code to open files for reading and writing, but also had no sandbox to prevent those syscalls from working. If there was a bug in the memory safety enforcement of this program (say, because of a bug in the Java implementation), then an attacker could cause this program to overwrite any file if they succeeded at achieving code execution via weird state.
Example of sandboxed but not memory safe: a program written in assembly that starts by requesting that the OS revoke all of its capabilities beyond pure compute. If the program then tried to open a file or write to one, the kernel would kill the process, based on the earlier request to have this capability revoked. This program could have lots of memory safety bugs (because it's written in assembly), but even so, an attacker cannot make it overwrite any file unless they find some way to bypass the sandbox.

In practice, sandboxes have holes by design. A typical sandbox allows the program to send and receive messages to broker processes that have higher privileges. So, an attacker may first use a memory safety bug to make the sandboxed process send malicious messages, and then use those malicious messages to break into the brokers.

The best kind of defense is to have both a sandbox and memory safety. This document describes how to combine sandboxing and Fil-C's memory safety by explaining what it takes to port OpenSSH's seccomp-based Linux sandbox code to Fil-C.

Background

Fil-C is a memory safe implementation of C and C++ and this site has a lot of documentation about it. Unlike most memory safe languages, Fil-C enforces safety down to where your code meets Linux syscalls and the Fil-C runtime is robust enough that it's possible to use it in low-level system components like init and udevd. Lots of programs work in Fil-C, including OpenSSH, which makes use of seccomp-BPF sandboxing.

This document focuses on how OpenSSH uses seccomp and other technologies on Linux to build a sandbox around its unprivileged sshd-session process. Let's review what tools Linux gives us that OpenSSH uses:

chroot to restrict the process's view of the filesystem.
Running the process with the sshd user and group, and giving that user/group no privileges.
setrlimit to prevent opening files, starting processes, or writing to files.
seccomp-BPF syscall filter to reduce the attack surface by allowlisting only the set of syscalls that are legitimate for the unprivileged process. Syscalls not in the allowlist will crash the process with SIGSYS.
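
As a rough illustration of the setrlimit item in the list above, here is a minimal sketch. It is not OpenSSH's actual code. The helper names `zero_limit`, `apply_rlimit_sandbox_sketch`, and `rlimit_sandbox_blocks_open` are made up for this example, and mapping "writing to files, opening files, starting processes" to `RLIMIT_FSIZE`, `RLIMIT_NOFILE`, and `RLIMIT_NPROC` is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set both the soft and the hard limit of one resource to zero. */
static int zero_limit(int resource)
{
    struct rlimit rl_zero = { 0, 0 };
    return setrlimit(resource, &rl_zero);
}

/* Hypothetical helper: forbid file growth, new descriptors, new processes. */
static int apply_rlimit_sandbox_sketch(void)
{
    if (zero_limit(RLIMIT_FSIZE) == -1)
        return -1;
    if (zero_limit(RLIMIT_NOFILE) == -1)
        return -1;
    if (zero_limit(RLIMIT_NPROC) == -1)
        return -1;
    return 0;
}

/*
 * Applies the limits in a forked child so the caller keeps its own limits.
 * Returns 0 if open() was refused with EMFILE inside the child.
 */
static int rlimit_sandbox_blocks_open(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        if (apply_rlimit_sandbox_sketch() == -1)
            _exit(2);
        int fd = open("/dev/null", O_RDONLY);
        _exit(fd == -1 && errno == EMFILE ? 0 : 1);
    }
    assert(pid > 0); /* parent: pid is the child's process id */
    int status = 0;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The sketch only checks the `RLIMIT_NOFILE` effect, because `RLIMIT_NPROC` is not enforced for sufficiently privileged processes and exceeding `RLIMIT_FSIZE` raises a signal rather than returning an error.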

The Chromium developers and the Mozilla developers both have excellent notes about how to do sandboxing on Linux using seccomp. Seccomp-BPF is a well-documented kernel feature that can be used as part of a larger sandboxing story.

Fil-C makes it easy to use chroot and different users and groups. The syscalls that are used for that part of the sandbox are trivially allowed by Fil-C and no special care is required to use them.

Both setrlimit and seccomp-BPF require special care because the Fil-C runtime starts threads, allocates memory, and performs synchronization. This document describes what you need to know to make effective use of those sandboxing technologies in Fil-C. First, I describe how to build a sandbox that prevents thread creation without breaking Fil-C's use of threads. Then, I describe what tweaks I had to make to OpenSSH's seccomp filter. Finally, I describe how the Fil-C runtime implements the syscalls used to install seccomp filters.

Preventing Thread Creation Without Breaking The Fil-C Runtime

The Fil-C runtime uses multiple background threads for garbage collection and has the ability to automatically shut those threads down when they are not in use. If the program wakes up and starts allocating memory again, then those threads are automatically restarted.

Starting threads violates the "no new processes" rule that OpenSSH's setrlimit sandbox tries to achieve (since threads are just lightweight processes on Linux). It also relies on syscalls like clone3 that are not part of OpenSSH's seccomp filter allowlist.

It would be a regression to the sandbox to allow process creation just because the Fil-C runtime relies on it. Instead, I added a new API to <stdfil.h>:

void zlock_runtime_threads(void);

This forces the runtime to immediately create whatever threads it needs, and to disable shutting them down on demand. Then, I added a call to zlock_runtime_threads() in OpenSSH's ssh_sandbox_child function before either the setrlimit or seccomp-BPF sandbox calls happen.
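
To make the reasoning concrete, here is a toy model of why locking the runtime threads has to happen before the sandbox is entered. It uses no real threads and nothing from Fil-C: `struct toy_runtime` and every `toy_*` function are hypothetical names invented for this sketch, and the real runtime's behavior is certainly more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in Fil-C. */
struct toy_runtime {
    bool gc_threads_running; /* are the background threads alive? */
    bool threads_locked;     /* set by the toy analogue of zlock_runtime_threads() */
    bool sandboxed;          /* once true, creating a thread is forbidden */
};

/* Start the background threads if needed; fails once the sandbox is active. */
static bool toy_start_threads(struct toy_runtime *rt)
{
    if (rt->gc_threads_running)
        return true;
    if (rt->sandboxed)
        return false; /* models thread creation being denied */
    rt->gc_threads_running = true;
    return true;
}

/* Idle shutdown: stops the threads unless they have been locked. */
static void toy_idle_shutdown(struct toy_runtime *rt)
{
    if (!rt->threads_locked)
        rt->gc_threads_running = false;
}

/* Toy analogue of zlock_runtime_threads(): start now, never shut down. */
static bool toy_lock_threads(struct toy_runtime *rt)
{
    bool ok = toy_start_threads(rt);
    if (ok)
        rt->threads_locked = true;
    return ok;
}

/*
 * Simulates: start threads, enter the sandbox, go idle, allocate again.
 * Returns true if the runtime still has its threads at the end.
 */
static bool toy_survives_sandbox(bool lock_first)
{
    struct toy_runtime rt = { false, false, false };
    bool started = lock_first ? toy_lock_threads(&rt) : toy_start_threads(&rt);
    assert(started); /* nothing forbids thread creation before the sandbox */
    rt.sandboxed = true;
    toy_idle_shutdown(&rt);
    return toy_start_threads(&rt);
}
```

In this model, `toy_survives_sandbox(true)` succeeds while `toy_survives_sandbox(false)` fails, which is the failure the lock-first ordering avoids.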

Tweaks To The OpenSSH Sandbox

Because the use of zlock_runtime_threads() prevents subsequent thread creation from happening, most of the OpenSSH sandbox just works. I did not have to change how OpenSSH uses setrlimit. I did change the following about the seccomp filter:

Failure results in SECCOMP_RET_KILL_PROCESS rather than SECCOMP_RET_KILL, which kills only the offending thread. This ensures that Fil-C's background threads are also killed if a sandbox violation occurs.
MAP_NORESERVE is added to the mmap allowlist, since the Fil-C allocator uses it. This is not a meaningful regression to the filter, since MAP_NORESERVE is not a meaningful capability for an attacker to have.
sched_yield is allowed. This is not a dangerous syscall (it's semantically a no-op). The Fil-C runtime uses it as part of its lock implementation.

Nothing else had to change, since the filter already allowed all of the futex syscalls that Fil-C uses for synchronization.
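
The three tweaks above can be sketched as a very small seccomp-BPF program. This is an illustration under stated assumptions, not OpenSSH's filter: the names `sketch_filter`, `sketch_program`, `mmap_flags_allowed`, and the `SKETCH_*` macros are invented here, the flag allowlist is a guess at a plausible minimum, only x86_64 and aarch64 are handled, and only the low 32 bits of the mmap flags argument are inspected (a production filter should also check the high word).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#if defined(__x86_64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86_64 and aarch64"
#endif

/* Illustrative allowlist of mmap flags; the real OpenSSH list may differ. */
#define SKETCH_MMAP_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE)
#define SKETCH_MMAP_DENIED (~(unsigned)SKETCH_MMAP_FLAGS)

static struct sock_filter sketch_filter[] = {
    /* 0-2: kill the whole process if the architecture is unexpected. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_AUDIT_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* 3: load the syscall number. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* 4-5: allow sched_yield. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_sched_yield, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /* 6-7: allow exit_group so the process can terminate normally. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /* 8: if this is not mmap, skip ahead to the default action at 12. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mmap, 0, 3),
    /* 9-11: allow mmap only if no flag outside the allowlist is set. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[3])),
    BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, SKETCH_MMAP_DENIED, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /* 12: default action kills every thread, not just the caller. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};

/* The same flag predicate as instructions 9-11, written in plain C. */
static int mmap_flags_allowed(unsigned flags)
{
    return (flags & SKETCH_MMAP_DENIED) == 0;
}

/* Wrap the instruction array in the structure that PR_SET_SECCOMP expects. */
static struct sock_fprog sketch_program(void)
{
    size_t n = sizeof(sketch_filter) / sizeof(sketch_filter[0]);
    assert(n <= BPF_MAXINSNS);
    struct sock_fprog prog = { (unsigned short)n, sketch_filter };
    return prog;
}
```

A real allowlist is much longer (for example, it has to cover the futex calls mentioned above); the sketch keeps only enough entries to show where each tweak lives.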

How Fil-C Implements `prctl`

The OpenSSH seccomp filter is installed using two prctl calls. First, we PR_SET_NO_NEW_PRIVS:

if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1) {
        debug("%s: prctl(PR_SET_NO_NEW_PRIVS): %s",
            __func__, strerror(errno));
        nnp_failed = 1;
}

This prevents additional privileges from being acquired via execve. It's required that unprivileged processes that install seccomp filters first set the no_new_privs bit.

Next, we PR_SET_SECCOMP, SECCOMP_MODE_FILTER:

if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &preauth_program) == -1)
        debug("%s: prctl(PR_SET_SECCOMP): %s",
            __func__, strerror(errno));
else if (nnp_failed)
        fatal("%s: SECCOMP_MODE_FILTER activated but "
            "PR_SET_NO_NEW_PRIVS failed", __func__);

This installs the seccomp filter in preauth_program. Note that this will fail in the kernel if the no_new_privs bit is not set, so the fact that OpenSSH reports a fatal error if the filter is installed without no_new_privs is just healthy paranoia on the part of the OpenSSH authors.

The trouble with both prctl calls is that they affect the calling thread, not all threads in the process. Without special care, the Fil-C runtime's background threads would not have the no_new_privs bit set and would not have the filter installed. This would mean that if an attacker busted through Fil-C's memory safety protections (in the unlikely event that they found a bug in Fil-C itself!), then they could use those other threads to execute syscalls that bypass the filter!
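
One way to observe this per-thread state on Linux is to read the `NoNewPrivs:` and `Seccomp:` fields of each thread's status file under `/proc/self/task`. The following is a minimal diagnostic sketch; the function names `read_status_field` and `report_thread_sandbox_state` are invented for this example, and it assumes `/proc` is mounted and the kernel is recent enough to expose both fields (otherwise the sketch reports -1 for a missing field).

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Read one integer field such as "Seccomp:" from a status file; -1 if absent. */
static int read_status_field(const char *path, const char *field)
{
    assert(path != NULL && field != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char line[256];
    int value = -1;
    size_t n = strlen(field);
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, field, n) == 0) {
            if (sscanf(line + n, "%d", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}

/*
 * Print NoNewPrivs and Seccomp for every thread of the calling process.
 * Seccomp is 0 (disabled), 1 (strict), or 2 (filter).
 * Returns the number of threads found, or -1 if /proc is unavailable.
 */
static int report_thread_sandbox_state(void)
{
    DIR *d = opendir("/proc/self/task");
    if (d == NULL)
        return -1;
    int threads = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue; /* skip "." and ".." */
        char path[300];
        snprintf(path, sizeof path, "/proc/self/task/%s/status", e->d_name);
        printf("tid %s: NoNewPrivs=%d Seccomp=%d\n", e->d_name,
               read_status_field(path, "NoNewPrivs:"),
               read_status_field(path, "Seccomp:"));
        threads++;
    }
    closedir(d);
    return threads;
}
```

Calling `report_thread_sandbox_state()` after the sandbox is installed should show the same values on every line if all threads were covered.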

To prevent even this unlikely escape, the Fil-C runtime's wrapper for prctl implements PR_SET_NO_NEW_PRIVS and PR_SET_SECCOMP by handshaking all runtime threads using this internal API:

/* Calls the callback from every runtime thread. */
PAS_API void filc_runtime_threads_handshake(void (*callback)(void* arg), void* arg);

The callback performs the requested prctl from each runtime thread. This ensures that the no_new_privs bit and the filter are installed on all threads in the Fil-C process.
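
The handshake idea can be sketched with plain pthreads. This is a simplified illustration, not Fil-C's implementation: `struct handshake`, `runtime_thread`, `threads_handshake`, `set_no_new_privs`, and `handshake_demo` are all hypothetical names, the thread count is fixed, and only the background-thread side is shown (the calling thread would perform the same prctl itself).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/prctl.h>

#define SKETCH_THREADS 3

/* Hypothetical, simplified handshake state. */
struct handshake {
    pthread_mutex_t lock;
    pthread_cond_t changed;
    void (*callback)(void *arg);
    void *arg;
    unsigned generation; /* bumped for every handshake request */
    int pending;         /* threads that still have to run the callback */
    int shutting_down;
};

static struct handshake hs = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .changed = PTHREAD_COND_INITIALIZER,
};
static int nnp_seen[SKETCH_THREADS];

/* Background thread: wait for a request, run it on this thread, acknowledge. */
static void *runtime_thread(void *slot)
{
    unsigned seen = 0;
    pthread_mutex_lock(&hs.lock);
    for (;;) {
        while (!hs.shutting_down && hs.generation == seen)
            pthread_cond_wait(&hs.changed, &hs.lock);
        if (hs.shutting_down)
            break;
        seen = hs.generation;
        void (*cb)(void *) = hs.callback;
        void *arg = hs.arg;
        pthread_mutex_unlock(&hs.lock);
        cb(arg);
        /* Record what this particular thread now observes. */
        *(int *)slot = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
        pthread_mutex_lock(&hs.lock);
        if (--hs.pending == 0)
            pthread_cond_broadcast(&hs.changed);
    }
    pthread_mutex_unlock(&hs.lock);
    return NULL;
}

/* Ask every background thread to run the callback and wait until all did. */
static void threads_handshake(void (*callback)(void *), void *arg)
{
    pthread_mutex_lock(&hs.lock);
    hs.callback = callback;
    hs.arg = arg;
    hs.pending = SKETCH_THREADS;
    hs.generation++;
    pthread_cond_broadcast(&hs.changed);
    while (hs.pending > 0)
        pthread_cond_wait(&hs.changed, &hs.lock);
    pthread_mutex_unlock(&hs.lock);
}

/* The per-thread work: set no_new_privs on whichever thread runs this. */
static void set_no_new_privs(void *arg)
{
    (void)arg;
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}

/* Returns how many background threads report no_new_privs afterwards. */
static int handshake_demo(void)
{
    pthread_t threads[SKETCH_THREADS];
    hs.generation = 0; /* no threads exist yet, so resetting is safe */
    hs.pending = 0;
    hs.shutting_down = 0;
    for (int i = 0; i < SKETCH_THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, runtime_thread, &nnp_seen[i]);
        assert(rc == 0);
    }
    threads_handshake(set_no_new_privs, NULL);
    pthread_mutex_lock(&hs.lock);
    hs.shutting_down = 1;
    pthread_cond_broadcast(&hs.changed);
    pthread_mutex_unlock(&hs.lock);
    int count = 0;
    for (int i = 0; i < SKETCH_THREADS; i++) {
        pthread_join(threads[i], NULL);
        count += (nnp_seen[i] == 1);
    }
    return count;
}
```

The important property is that the callback runs on each target thread rather than on the requester, since the prctl settings discussed here apply to the thread that makes the call.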

Additionally, because of ambiguity about what to do if the process has multiple user threads, these two prctl commands will trigger a Fil-C safety error if the program has multiple user threads.

Conclusion

The best kind of protection if you're serious about security is to combine memory safety with sandboxing. This document shows how to achieve this using Fil-C and the sandbox technologies available on Linux, all without regressing the level of protection that those sandboxes enforce or the memory safety guarantees of Fil-C.