(comments)

Original link: https://news.ycombinator.com/item?id=41189971

Spanish speakers often combine negative words without producing a true double negative. For example, "No hay nada aquí" translates word-for-word as "there is not nothing here," yet it simply means "there is nothing here." Something similar happens whenever "no" co-occurs with another negative word: the terms do not cancel each other, even though that would violate English grammar rules, and the sentence keeps its overall negative sense.

On the technical side: using eBPF is optional, but it is favored over alternatives such as DTrace because it offers greater flexibility. Engineers at companies like Facebook, Amazon, Apple, Netflix, and Google (FAANG) use eBPF routinely in their work, running tens or hundreds of programs continuously across many servers, on top of one-off uses. These companies fund kernel development teams that fold the technology into their kernels. While eBPF may seem complex and necessary only in specific situations, it provides solutions that are otherwise out of reach, particularly for those without kernel-engineering expertise; it lets system administrators pinpoint problems and resolve them effectively. There is, however, a trade-off between adopting eBPF and maintaining a custom kernel patch indefinitely. For troubleshooting, an alternative tool such as SystemTap is sometimes preferable for debugging: SystemTap is simpler than eBPF and takes minimal effort to learn, letting users observe system behavior, trace function calls, and set probes for analysis. Its potential advantage over eBPF lies in its ease of learning and its effectiveness at solving practical problems; despite its simplicity, it can still address critical issues while giving developers enough control to improve performance. eBPF, by contrast, carries additional risk because it executes arbitrary code at elevated privilege. To mitigate this danger, programs must pass a verification process before they can be loaded, ensuring that all of eBPF's safety requirements are met. Unfortunately, multiple vulnerabilities have surfaced in the eBPF verifier, leading to local privilege escalation or container escapes in isolated environments. Developers should weigh the pros and cons of eBPF carefully, putting security concerns first. DTrace, by comparison, has shown a stronger safety record: few verifier vulnerabilities have been found over its two decades of use.


Original text


A reminder that on the platforms eBPF is most commonly used, verifier bugs don't matter much, because unprivileged code isn't allowed to load eBPF programs to begin with. Bugs like this are thus root -> ring0 vulnerabilities. That's not nothing, but for serverside work it's usually worth the tradeoff, especially because eBPF's track record for kernel LPEs is actually pretty strong compared to the kernel as a whole.

In the setting eBPF is used today, most of the value of the verifier is that it's hard to accidentally crash your kernel with a bad eBPF program. That is comically untrue about an ordinary LKM.



The PoC uses eBPF maps as their out-of-bounds pointer, but it sounds like it would also be exploitable via non-extended BPF programs loadable via seccomp since it's just improper scalar value range tracking, which doesn't require any privileges on most platforms.

And, of course, root -> ring0 is less of a problem with unprivileged user namespaces where you can make yourself "root", as we've seen in every eBPF bug PoC since distros started turning that on (and have since turned it off again, mostly)



LMAO

Ok that's fair. check_seccomp_filter actually has a more restrictive list than just "BPF with no backwards jumps", and in particular doesn't allow BPF_IND in the BPF_LDX, so you can't read out of bounds because you can't use a dynamic displacement...but BPF_STX is allowed, so you can probably write out of bounds? BPF_W is the seccomp_data address and the control flow diagram they show to compute incorrect scalar ranges doesn't require any backwards jumps...



Let's not forget also that we can give CAP_BPF to containers. With things like Cilium on the rise, the attack vector of landing in a container environment that has CAP_BPF is more and more realistic.



I don't believe shared-kernel container systems are real security boundaries to begin with, so, to me, a container running with CAP_BPF isn't much different than any other program a machine owner might opt to run; the point is that you trust the workload, and so the verifier is more of a safety net than a vault door.



That pessimistic view is not shared by everyone who is working on namespaces, cgroups, etc., so I think that's a pretty unproductive comment in this context.

It reminds me of early days in hypervisors when someone would get an exploit to break out of the isolation and someone would dismiss it because “virtual machines aren’t real isolation anyway”.

Look, I get it and I frankly agree with you in the current state of the world, but this is the time to shut up and get out of the way of people trying to make forward progress. Breakouts of containers are a big deal for people pushing the boundary there.



I don't know who you're really talking to (it's not me), but all I'm saying is that CAP_BPF doesn't bother me much, because it's problematic only for a security boundary that is already problematic with a much lower degree of difficulty for attackers than the eBPF verifier.



Verifier bugs matter for the kernel, which wants eBPF to be secure even for unprivileged accounts.

Verifier bugs don't matter that much, for most Linux users, right now, because unprivileged accounts can't use eBPF.



In Spanish, it's common for double negatives to not actually be double negatives. For example, if you wanted to say "there's nothing here", you'd say "no hay nada aquí", which word-for-word means "there's not nothing here".

Checking out the Royal Spanish Academy, here's what they say about it:

https://www.rae.es/espanol-al-dia/doble-negacion-no-vino-nad...

> The so-called "double negation" is due to the obligatory negative agreement that must be established in Spanish, and other Romance languages, in certain circumstances (see New Grammar, § 48.3d), which results in the joint presence in the statement of the adverb no and other elements that also have a negative meaning.

> The concurrence of these two "negations" does not annul the negative meaning of the statement.



Same in French: "Je ne sais pas" means I do not know, not I do not not know (aka I know).

In any case, the meaning of the sentence above: "uno no es ninguno" in Spanish is clearly one is not zero, or one is not none, or one is different than none.

"Uno no es nada" could be "one is nothing", and "one is not nothing". It all depends on the frame of reference (in this case English), but for this sentence, the "one is not none" is correct IMO. I would never even do a second pass on that sentence, as a native Spanish speaker (appeal to authority, I know)



The one time I tried to use eBPF it wasn't expressive enough for what I needed.

Does the limited flexibility it provides really justify the added kernel space complexity? I can understand it for packet filtering but some of the other stuff it's used for like sandboxing just isn't convincing.



There are other technologies for this, such as DTrace. The kernel's choice isn't eBPF or nothing, it's eBPF or something else like it.

You may not use it much, but some people use it all day. I think FAANG engineers have said that they run tens (hundreds?) of these things on all servers, all the time. And that's excluding one-offs. And FAANG has full time kernel coders on staff, so they're also funding this complexity that they use.

But also yes, I've solved problems by using eBPF. Problems that are basically unsolvable by non-kernel-gurus without eBPF. I rarely need it. But when I need it, there's nothing else that does the trick.

In some cases, even for kernel gurus, it's a choice between eBPF or maintaining a custom kernel patch forever.



> There are other technologies for this, such as DTrace. The kernel's choice isn't eBPF or nothing, it's eBPF or something else like it.

To add on this point: I successfully used SystemTap a few years ago to debug an issue I was having.

Before going further: keep in mind that my point of view (at the time) was that of somebody working as a devops engineer, debugging some annoyances with containers (managed by Kubernetes) going OOM. I'm no kernel developer, and I have a basic-to-good understanding of the C language from first-year university courses and geekiness/nerdiness. So in this context I'm a glorified hobbyist.

Learning SystemTap is easier in my opinion. I followed a tutorial by Red Hat to get the hang of the manual parts, but after that I remember it being fairly easy:

1. Try to reproduce the issue you're having (fairly easy for me)

2. Skim the source code of the Linux kernel around the part that you think might be relevant (for me it was the OOM killer)

3. Add probes in there, see if they fire when you reproduce the issue

4. Look back at the source code of the kernel and see what chain of data structures and fields you can follow to reach the piece of information you need

5. Improve your probes

6. If successful, you're done

7. Goto 4

I think it took like one or two days between following the tutorial and getting a working probe.

It was a pleasant couple of days.



DTrace and eBPF are "not so different" in the sense that dtrace programs / hooks are also a form of low-level code / instruction set that the kernel (dtrace driver) validates at load. It's an "internal" artifact of dtrace though, https://github.com/illumos/illumos-gate/blob/master/usr/src/... and to my knowledge, nothing like a clang/gcc "dtrace target" exists to translate more-or-less arbitrary higher-level language "to low-level dtrace".

The additional flexibility eBPF gets from this is amazing, really, while dtrace is a more targeted (and for its intended use cases, in some situations still superior to eBPF) but less general tool.

(citrus vs. stone fruit ...)



I’m curious which part of these tenets you feel would have prevented the bug demonstrated, besides “oh we tried harder”? I don’t see any of those that seem unique to DTrace other than limiting where probes can be placed.



Well, we didn't merely "try harder" -- we treated safety as a constraint which informed every aspect of the design. And yes, treating safety as a constraint rather than merely an objective results in different implementation decisions. From the article:

> This working model significantly increases the attack surface of the kernel, since it allows executing arbitrary code at a high privilege level. Because of this risk, programs have to be verified before they can be loaded. This ensures that all eBPF security assumptions are met. The verifier, which consists of complex code, is responsible for this task.

> Given how difficult the task of validating that a program is safe to execute is, there have been many vulnerabilities found within the eBPF verifier. When one of these vulnerabilities is exploited, the result is usually a local privilege escalation exploit (or container escape in containerized environments). While the verifier’s code has been audited extensively, this task also becomes harder as new features are added to eBPF and the complexity of the verifier grows.

DTrace was developed over 20 years ago; there have not been "many vulnerabilities" found in the verifier -- and we have not grown the complexity of the verifier over time. You can dismiss these as implementation details, but these details reflect different views of the problem and its constraints.



No, like, the bug that was demonstrated seems to be fairly fundamental to running any sort of bytecode in the kernel: they need to verify all branches, and this is potentially slow, so they optimize it (which is where the bug is). What are you doing differently? It seems to me that you’re either not going to optimize this or you are?



The DTrace instruction set is more limited than that of the eBPF VM; eBPF is essentially a fully functional ISA, where DTrace was (if I'm remembering this right) designed around the D script language. An eBPF program is often just a clang C program, and you're trusting the kernel verifier to reject it if it can't be proven safe. Further: eBPF programs are JIT'd to actual machine code; once you've loaded and verified an eBPF program, it has conceptually all the same power as, say, shellcode you managed to load into the kernel via an LPE.

That's not to say that security researchers couldn't find DTrace vulnerabilities if they, for instance, built DIF/DOF fuzzers of 2023 levels of sophistication for them. I don't know that anyone's doing that, because DTrace is more or less a dead letter.



For those who read this thread - DTrace is in use in Solaris and in Illumos, and various of us who use Illumos for our production use cases (like Oxide does) still very much use DTrace.

I appreciate the rest of tptacek's comment which is informative. I also acknowledge that there may not be fuzzers written that have been disclosed.



Oh, sorry, totally fair call-out. There's like a huge implicit "on Linux" thing in my brain about all this stuff.

I'd also be open to an argument that the code quality in DTrace is higher! I spent a week trying to unwind the verifier so I could port a facsimile of it to userland. It is a lot. My point about fuzzers and stuff isn't that I'm concerned DTrace is full of bugs; I'd be surprised if it was. My thing is just that everything written in memory unsafe kernel code falls against Google Project Zero-grade vulnerability research, at some point.

That's true of the rest of the kernel, too! So from a threat perspective, maybe it doesn't matter. I think my bias here --- that's all it is --- is that neither of these instrumentation schemes are things I'd want to expose to a shared-kernel cotenant.

Thanks for helping me clarify this.



The DTrace bytecode VM is simply more limited:
  - it cannot branch backwards (this is also true of eBPF)
  - it can only do ternary operator branches
  - it cannot define functions
  - functions it can call are limited to some builtin ones
  - it can only scribble on the one pre-allocated probe buffer
  - it can only access the probe's defined parameters


If the verifier can prove to itself that a loop is bounded, it'll accept it. A good starting place for eBPF itself: if a normal ARM program could do it, eBPF can do it. It's a fully functional ISA.



It depends on what you're using it for. If you want to expose this to untrusted code, yes, but I wouldn't be comfortable doing that with DTrace either.



There's two untrusted code cases here: untrusted DTrace scripts / users, and untrusted targets for inspection. The latter has to be possible to examine, so the observability tools (like DTrace) have to be secure for that purpose. This means you want to make it difficult to overflow buffers in the observability tools.

There's also a need to make sure that even trusted users don't accidentally cause too much observability load. That's why DTrace has a circular probe buffer pool, it's why it drops probes under load, it's why it pre-allocates each probe's buffer by computing how much the probe's actions will write to it, it's why it doesn't allow looping (since that would make the probe's effect less predictable), etc.

Bryan, Adam, and Mike designed it this way two plus decades ago, and Linux still hasn't caught up.



Linux has a different design than DTrace; eBPF is more capable as a trusted tool, and less capable for untrusted tools. It doesn't make sense to say one approach has "caught up" to the other, unless you really believe the verifier will reach a state where nobody's going find verifier bugs --- at which point eBPF will be strictly superior. Beyond that, it's a matter of taste. What seems clearly to be true is that eBPF is wildly more popular.



It's really hard to bring a host to its knees using DTrace, yet it's quite powerful for observability. In my opinion it is better to start with that then add extra power where it's needed.



I understand the argument, but it's clear which one succeeded in the market. Meanwhile: we take pretty good advantage of the extra power eBPF gives us over what DTrace would, so I'm happy to be on the golden path for the platform here. Like I said, though: this is a matter of taste.



> I've solved problems by using eBPF. Problems that are basically unsolvable by non-kernel-gurus without eBPF. I rarely need it.

Would you mind giving some examples? I recently started learning about eBPF from Liz Rice's book and am curious about what makes eBPF the correct choice in a particular scenario.



I'm not sure "Google engineers use it" is a very good counter argument. They have a very high tolerance for complexity and like most large corporations what actually gets built and used tends to be driven more by internal politics than technical merit.



I don't mean it as a counter argument, or I don't think the way you mean it, at least.

You may not use it at your smaller scale. But there are millions of machines out there that do use it, and the alternative for the same functionality is much worse.

I bet you never use SCTP sockets either. eBPF is used much more than SCTP.

And its users "fund" its development, so it's not a burden to those who don't use it.

But are you sure your systems don't use it? Run "bpftool prog" to see. Whatever you see there someone thought was better than the alternative.



Wouldn't even a classic loadable kernel module be a better choice than a patch or eBPF? I know they are unsafe, but people who deal with them know the power comes with responsibility.



No? SREs roll eBPF programs on the fly just in the process of debugging problems; if you tried to do that with an LKM, you'd almost certainly blow up your system. People who write Linux kernel code routinely crash their systems in the process of development.



In my country we have a saying: "porcupine in the pants". Sounds like for all the good it can do, it isn't written safely and carefully.
