(comments)

Original link: https://news.ycombinator.com/item?id=41289114

In computer programming, "goto" statements and tail calls are both techniques for controlling program flow. A tail call is a call made as the last action of a function, so the caller's frame can be discarded or reused once the callee takes over. Their advantage is minimizing memory use during recursion, allowing much deeper recursion without exhausting system resources; when a compiler does not eliminate tail calls, however, deep recursion still risks stack overflow, where excessive nesting leaves too little memory for function frames. To work around that, developers can use techniques such as trampolines or continuation passing, which provide alternative ways of calling into and returning from functions. Traditional "goto" statements, by contrast, can jump to arbitrary locations in a program, which can make code complex and hard to understand and maintain; they were common in older languages and are now generally considered poor practice because of their impact on code clarity, debugging, and overall software reliability. A related technique, the "computed goto", dispatches to specific points in a program by storing label addresses alongside each handler. With computed goto or guaranteed tail calls, programmers can get fast, predictable dispatch without a central switch; although both remain debated among programmers, they share the goal of more efficient code organization.

Related Articles

Original Article


Thanks for the pointer, I'll have to check this out.

Can you elaborate on how "return goto" is easier to implement? [[musttail]] also ends the lifetime of local objects AFAICS.

One thing I'll mention from a quick scan:

> [The] function called in tail position must have identical type to the callee. This ensures both that the return value does not require any conversion, and also that argument passing space is available and calling convention (if relevant) is maintained.

One complaint I've seen repeatedly about [[musttail]] (which I implemented in Clang) is that this constraint is unnecessarily strict, since some architectures will allow tail calls even for functions that do not perfectly match: https://github.com/llvm/llvm-project/issues/54964

"But then the code would be nonportable." True, but tail call optimization is inherently nonportable, since some targets fundamentally do not support tail call optimization (eg. WASM without the tail call extension).



>[[musttail]] also ends the lifetime of local objects AFAICS.

That's good to know. I had this github issue [0] in the back of my mind, as well as witnessing occasions of clang turning [[musttail]] into inner loops, and concluded clang's implementation must be more sophisticated than simply replacing calls with jumps. Just a little paranoia from trying to be serious with compiler dev[1]: fulfilling a laid-out spec feels more sound versus imitating something out there.

>this constraint is unnecessarily strict

I would agree; at least for the x86 psABI it could be made more permissive, as long as the return value is in the same register and the argument stack doesn't exceed what's provided. The debug/profiling side might hate it, though.

[0] https://github.com/llvm/llvm-project/issues/72555 [1] https://github.com/fuhsnn/slimcc/



> some targets fundamentally do not support tail call optimization

Can't any tail call be rewritten as a loop? Couldn't a WASM compiler without the tail call extension implement it this way?



No -- as discussed upthread, clang's musttail attribute requires the target function to have the same number of arguments as the caller and for each argument to be similar to the corresponding caller argument. That's stricter than the underlying LLVM musttail marker (when targeting the tailcc/swifttailcc calling conventions) and is too restrictive to implement Wasm's tail-call feature (and probably Scheme's, etc.), at least if arguments are getting passed to functions natively.

It would be nice if the more relaxed rules of the LLVM musttail marker with tailcc could be exposed in clang (and gcc). I think that's basically what "return goto" would do.



It can't be rewritten as a loop due to function pointers. Using JS notation to avoid noise:

    function logAndCall(statement, func) { console.log(statement); return func(); }

Tail call optimization is actually possible here since we call func in "tail position". It might be unlikely to blow up the stack, but it can definitely happen if you do a lot of continuation passing.

Perhaps more commonly for C++/Rust, tail call optimization would be enormously valuable to have behind the scenes for destructors. It's actually very difficult to implement a linked list in safe Rust that doesn't explode the stack for large lists since no TCO is done for destroying subobjects. You can still avoid stack overflow, but you have to do things like manually enumerating the list.



I don't get the problem. The generated code for your example would return the same function pointer + evaluated args list to the wrapping trampoline (i.e. the code that contains the loop), where that loop expects each thing it invokes to return a sum type NextInvoke(func, args) | Return(value).

And that's not a special case for passed-in function pointers; that's what a trampoline always does. Trampolines expect static-invoked tail-calls to be codegen'ed into function-pointer references too.
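Roughly, in C it looks like this (names and the single-argument shape are my own; just a sketch of the NextInvoke/Return idea):

    /* Illustrative trampoline: every step returns either the next thing
       to call or a final value; the loop replaces deep recursion. */
    #include <stdio.h>

    struct step;
    typedef struct step (*step_fn)(long arg);

    struct step {
        step_fn next;   /* NULL means "Return(value)" */
        long    arg;    /* argument for next, or the final value */
    };

    static struct step even_step(long n);
    static struct step odd_step(long n);

    static struct step even_step(long n) {
        if (n == 0) return (struct step){ NULL, 1 };   /* Return(true)        */
        return (struct step){ odd_step, n - 1 };       /* NextInvoke(odd, n-1) */
    }

    static struct step odd_step(long n) {
        if (n == 0) return (struct step){ NULL, 0 };   /* Return(false)        */
        return (struct step){ even_step, n - 1 };      /* NextInvoke(even, n-1) */
    }

    static long trampoline(struct step s) {
        while (s.next)              /* bounded stack, no matter how deep */
            s = s.next(s.arg);
        return s.arg;
    }

    int main(void) {
        printf("%ld\n", trampoline(even_step(1000001)));  /* 0: 1000001 is odd */
        return 0;
    }

The downside, as noted elsewhere in the thread, is that every dispatch now goes through the same indirect call site inside the loop.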



More specifically, tail recursion is usually easy to turn into a loop. Tail calls can be difficult to turn into loops when they call a different function, or worse a function passed in as a variable.



To give the standard example:

Consider a state machine where each state is a function, and you transition into a different state by tail-calling another function.

State machines can have arbitrarily complicated graphs, that would be hard to put into a simple loop.

(However, you can do something like a 'trampoline' to remove the mutual recursion.)
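A sketch of that pattern in C, assuming Clang's musttail so the transitions really are jumps rather than calls (the states are invented):

    /* Each state is a function; a transition is a guaranteed tail call.
       This toy machine accepts strings of the form "a...ab...b". */
    static int state_b(const char *p);

    static int state_a(const char *p) {
        if (*p == 'a') {
            __attribute__((musttail)) return state_a(p + 1);
        }
        __attribute__((musttail)) return state_b(p);
    }

    static int state_b(const char *p) {
        if (*p == 'b') {
            __attribute__((musttail)) return state_b(p + 1);
        }
        return *p == '\0';
    }

    int accepts(const char *s) { return state_a(s); }

Without the guarantee, a long enough input could overflow the stack; with it, the machine runs in constant stack space, which is what a trampoline would otherwise buy you.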



As an aside, I’m excited - but also with lots of trepidation - about the new energy in adding new functionality to C. Excited because there are changes and additions that are crying to be made, even clarifying ideas… trepidation because C++’s gung ho update cadence has somehow ended up in wart patched over wart. Especially when feature reacts with feature in an unfelicitous way far earlier than anticipated. I hope the standards process finds a way to be very conservative, really thoroughly test features in large and diverse codebases, rather than just relying on rationales alone, when choosing to include feature.



What new energy? We have had C89, C99, C11, C17, C23

And yet, no fix in sight for proper strings, arrays, or at very least bounds checked slices, like Dennis Ritchie originally proposed to ANSI/ISO.



The way interpreters usually achieve this kind of speedup in C++ is using computed goto. Then there’s no calling convention junk on the path from one opcode to the next.

Also, the main reason why interpreters get faster using either the computed goto style or the tail style versus a classic switch loop is that it reduces pressure on the branch predictor since there’s statically one indirect branch per opcode rather than statically just one indirect branch.



(As the article claims) even with computed goto, register assignment of the most frequently used variables is fragile because the CFG of the function is so complicated.

Register assignment is much less fragile when each function is small and takes the most important variables by argument.



It's also fragile, in a different way, if you're threading state through tail calls.

In my experience writing computed goto interpreters, this isn't a problem unless you have more state than what can be stored in registers. But then you'd also have that same problem with tail calls.



Fallback paths most definitely have more state than what can be stored in registers. Fallback paths will do things like allocate memory, initialize new objects, perform complicated fallback logic, etc. These fallback paths will inevitably spill the core interpreter state.

The goal is for fast paths to avoid spilling core interpreter state. But the compiler empirically has a difficult time doing this when the CFG is too connected. If you give the compiler an op at a time, each in its own function, it generally does much better.



I get that and that’s also been my experience, just not for interpreters.

In interpreters, my experience is that fallback paths are well behaved if you just make them noinline and then ensure that the amount of interpreter state is small enough to fit in callee save regs.
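Concretely, the discipline looks something like this (an invented opcode handler; the point is only that the rare path is a separate noinline function so its spills stay out of the hot code):

    struct vm { long regs[16]; };

    /* The rare path lives out of line so its locals and spills don't
       inflate the fast path's frame. */
    __attribute__((noinline))
    static void add_slow(struct vm *vm, int dst, int a, int b) {
        /* imagine bignum promotion, traps, logging, ... */
        vm->regs[dst] = (vm->regs[a] >> 1) + (vm->regs[b] >> 1);
    }

    void op_add(struct vm *vm, int dst, int a, int b) {
        long r;
        if (__builtin_add_overflow(vm->regs[a], vm->regs[b], &r)) {
            add_slow(vm, dst, a, b);    /* unlikely fallback */
            return;
        }
        vm->regs[dst] = r;              /* fast path, nothing spilled */
    }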



There are a bunch of arguments in there that don't match my experience, which includes the JSC interpreter. JSC had an interpreter written in C++ and one written in assembly, and the main reason for using the assembly one was not raw perf - it was so the interpreter knows the JIT's ABI for fast JIT↔interpreter calls.

Mike's argument about control flow diamonds being bad for optimization is especially questionable. It's only bad if one branch of the diamond uses a lot more registers than the other, which as I said, can be fixed by using noinline.



As I mentioned in another part of the thread - the way you get that under control in a computed goto interpreter (or a switch loop interpreter) is careful use of noinline.

Also, it probably depends a lot on what you’re interpreting. I’ve written, and been tasked with maintaining, computed goto interpreters for quite a few systems and the top problem was always the branches and never the register pressure. My guess is it’s because all of those systems had good noinline discipline, but it could also just be how things fell out for other reasons.



> there’s statically one indirect branch per opcode rather than statically just one indirect branch.

Could you elaborate on this with a couple more paragraphs? What do you mean by one indirect branch per opcode rather than just one indirect branch? How is this achieved?



Instead of a for loop where you hit a switch at the top of the loop, you jump to the code block for next instruction from the end of the previous instruction. That stops you from jumping to the top of the loop and then jumping a second time.

On older CPUs, you’re less likely to hit a pipeline stall. This technique was called “super-pipelining” for that reason. But a few years ago when this was discussed, someone pointed out that’s usually not necessary anymore. That branch predictors can see through double jumps now.

But as I alluded to in another reply, CPU caches are finite, and I have doubts about whether, in a fully realized interpreter, particularly one living side by side with a JIT, that microbenchmark is telling you the truth about how fast the inner loop is under production conditions.



Sure.

Say you write an interpreter like this:

    for (;;) {
        switch (curInst->opcode) {
        case Inst::Foo:
            setOperand(curInst->dest, foo(getOperand(curInst->a), getOperand(curInst->b)));
            curInst += FooSize;
            break;
        case Inst::Bar:
            setOperand(curInst->dest, bar(getOperand(curInst->a), getOperand(curInst->b)));
            curInst += BarSize;
            break;
        ... more stuff like this
        }
    }
Then the generated code will have an indirect branch for the switch. This is costly for two reasons:

- Gotta load curInst->opcode, then lookup opcode in a compiler-generated jump table to get the code address of the case statement.

- Gotta indirect jump.

It turns out that the top cost (or even all of the cost) is the indirect jump. Maybe the indirection is also some cost, at least on some CPUs. But remember, all jumps on modern CPUs are fast if predicted - regardless of the work seemingly required to do the jump. And they are slow if they are not predicted - and that slowness is about much more than the work required to do the jump (the cost is the CPU needing to roll back its speculations).

Reason: the indirect jump prediction the CPU is doing is keyed by the address of the indirect jump instruction. There is one indirect jump in the code above. And for any real program you'll execute, that indirect jump will end up hitting all (or at least most) of the opcodes. Therefore, the CPU has no easy path to predicting the indirect jump. Maybe it'll sometimes get lucky with more sophisticated predictors (like ones that track history and use that to salt the key used to lookup the predicted destinations, or ones that do more fancy stuff, like maybe some neural network to predict).

How do you make the indirect jump faster? Have one indirect jump per instruction! Both the computed goto approach and the tail call approach get you there.

Consider the computed goto version of the interpreter.

    Inst_foo_label:
        setOperand(curInst->dest, foo(getOperand(curInst->a), getOperand(curInst->b)));
        curInst += FooSize;
        goto *curInst->handler;
    Inst_bar_label:
        setOperand(curInst->dest, bar(getOperand(curInst->a), getOperand(curInst->b)));
        curInst += BarSize;
        goto *curInst->handler;
        ... more stuff like this
In this formulation of the interpreter, there is one indirect jump per instruction, rather than just one for the interpreter. Therefore, we're asking the CPU's predictor to do something simpler: to predict, for each instruction, what the next instruction will be. And then on top of that, the predictor still gets to use its neural network, or history buffer, or whatever. In terms of mathematical probability theory, it's like we're giving the CPU a first-order Markov chain, and that's sure to improve prediction accuracy. Empirically, it improves it a lot and it's a big speed-up.

Here's yet another way to think of it. If I asked you to predict, "what's the next instruction", without any knowledge of what the prior one was, then you'd have a hard time - you'd only be able to reason about which instructions are most likely generally. But this computed goto interpreter is instead asking: "if I tell you the instruction I just executed, what's the next instruction", which gives you more leverage. Maybe adds are often followed by moves, for example.

The tail call style also achieves this, because each instruction's handler will have an indirect tail call (literally an indirect jump) to the next instruction, which again, gives the CPU that Markov chain goodness. So let's call both the computed goto style and the tail call style the "Markov style". If you could rewrite the switch statement so that there was one switch statement per instruction (and you could convince the compiler not to combine them into one, lmao) then that would also be a Markov-style interpreter.
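For concreteness, a cut-down sketch of that tail-call ("Markov") style, using Clang's musttail and an invented instruction format:

    /* Every handler ends in one indirect tail call to the next handler,
       so each opcode gets its own indirect-branch site for the predictor. */
    struct inst;
    typedef long (*handler_fn)(const struct inst *ip, long acc);

    struct inst {
        handler_fn handler;
        long       operand;
    };

    long op_add(const struct inst *ip, long acc) {
        acc += ip->operand;
        __attribute__((musttail)) return ip[1].handler(ip + 1, acc);
    }

    long op_halt(const struct inst *ip, long acc) {
        (void)ip;
        return acc;
    }

    long run(const struct inst *prog) {
        return prog->handler(prog, 0);
    }

(A three-instruction program of add 1, add 2, halt then evaluates to 3, and the dispatch at the end of op_add compiles down to a single indirect jump.)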

As for the cost of indirectly loading from the compiler's switch table, or other issues like pushing and popping registers: in my experimentation, these costs are like drops in the ocean compared to the cost of indirect jumps. Even with the Markov style interpreter, the CPU spends most of its time mispredicting and rolling back. So the details of how the work happens for individual instructions are usually less important than what you do about the prediction of that indirect jump.



Awesome, thank you for expanding. Now I see the intuition that branch prediction accuracy could be much higher using the knowledge of the last opcode, so this becomes a game of massaging the code to prod the CPU to use more inputs into its branch prediction. Also helpful color on your empirical observation that branch prediction accuracy dominates other factors like switch indirection and loading registers.

There's one thing I'm still missing. How exactly do we force the CPU to use the last opcode in its branch prediction model? In your first switch example, the CPU "knows" the path it has followed to get to each iteration, so in theory it could use the information of the last node it visited (or the last two nodes, etc.) to aid branch prediction right?

Related to that: in your second example, what exactly happens in `goto *curInst->handler;`? Doesn't this need to revert back to something like a switch statement which has the same problem? (Unless you are doing polymorphism / dynamic dispatch in that example? Which I assume has some performance penalty that you're saying is dwarfed by the extra branch prediction effectiveness). Analogous to the line in the OP's article that says `MUSTTAIL return dispatch(UPB_PARSE_ARGS);` - doesn't the generic dispatch function need another switch statement? Probably missing a couple obvious things here.

Lastly - if you have any books/article recommendations that helped you learn some of these intricacies (esp. regarding intuition about which performance quirks matter vs. don't matter) that would be great as well. Thanks!



> How exactly do we force the CPU to use the last opcode in its branch prediction model?

In the tail call case: because each opcode is implemented by a function that has an indirect tail call at the end.

In the computed goto case: each `goto *curInst->handler` is its own indirect branch.

> Related to that: in your second example, what exactly happens in `goto *curInst->handler;`?

`curInst->handler` is a pointer that points to some label. The goto is an indirect branch to exactly that pointer value, i.e. that label.

It's a super crazy and dangerous feature! It obviates the need for a switch; it is the switch.
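Spelled out, with the GNU C labels-as-values extension (a hypothetical two-opcode fragment; real interpreters usually build the handler table once at decode time):

    /* `&&label` yields the address of a label; `goto *p` jumps to it.
       A one-time pass turns opcodes into label addresses, and from then
       on dispatch is exactly one indirect jump per instruction. */
    enum { OP_ADD, OP_HALT };
    struct inst { int opcode; void *handler; long operand; };

    long run(struct inst *curInst) {
        void *table[] = { [OP_ADD] = &&Inst_add_label, [OP_HALT] = &&Inst_halt_label };
        long acc = 0;

        for (struct inst *i = curInst; ; ++i) {      /* "decode" pass */
            i->handler = table[i->opcode];
            if (i->opcode == OP_HALT) break;
        }

        goto *curInst->handler;

    Inst_add_label:
        acc += curInst->operand;
        ++curInst;
        goto *curInst->handler;

    Inst_halt_label:
        return acc;
    }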



I’ve heard this went out of fashion a while back. That branch predictors got good enough that it’s not necessary anymore.

But I wonder if that stays true as the size of the interpreter increases.



My most recent data says it's still relevant.

It might not matter for very small interpreters, but it does matter for anything substantial.

Definitely worth remeasuring though.



I would advocate that anything substantial is better off with a JIT, even a dumb one, e.g. a template JIT; we aren't dealing with 8-bit home computer memory constraints any longer, except in some IoT deployments.



But then you’ll still have an interpreter for fast start.

That’s why JSC and V8 have interpreters as their bottom tiers.

And you’ll also want an interpreter if you don’t have permissions to JIT, which is common these days.



The only place where there isn't JIT permission is iDevices, and we all know that security isn't really the main reason.

Sure, modern security concerns make use of W^X, and implementations have adapted accordingly.

Android has had an interpreter hand-written in assembly for fast start since Android 7, with a jump into JIT compilation as fast as possible, because that has been proven to be a better solution in the field.

The point being that a fast start doesn't really require an interpreter that wins the micro-benchmarks game, just one that is good enough until the tiered JITs step in.



Clang recently got a new calling convention that makes these tail calls much cheaper (avoids the need for the caller to preserve some registers). I can never remember the name - it’s either preserve_all or preserve_none (whose perspective is the preservation from?).



preserve_none is the new one. It can be applied to the functions performing tail calls to allow them use of the full register space.

I even saw an enhancement recently that will make preserve_none allocate arguments in the registers that are usually callee-saved: https://github.com/llvm/llvm-project/pull/88333

This will make [[musttail]] + preserve_none a winning combination when used together, particularly when making non-tail calls to fallback functions that use a regular calling convention, because all the arguments to [[musttail]] functions can stay pinned in the callee-save registers.

I'm delighted, because this matches what I originally proposed back in 2021, except I called it "reverse_ccc" instead of "preserve_none": https://github.com/haberman/llvm-project/commit/e8d9c75bb35c...

preserve_all also exists, and has existed for a while. You could use it on fallback functions to help the tail calling functions avoid spilling. But this always seemed like an unsatisfactory solution to me, because it's too intrusive (and requires too much diligence) to tag a bunch of regular functions as preserve_all. It's much more practical to tag all the core interpreter functions as preserve_none.
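In code, the combination looks roughly like this (a sketch: preserve_none is Clang-only and fairly recent, x86-64/AArch64, and the names are invented):

    /* The whole hot interpreter core is preserve_none, so its state can
       stay in registers a normal ABI would force it to save and restore;
       musttail keeps each transition a plain jump. Calls out to
       regular-ABI fallback functions are where save/restore happens. */
    __attribute__((preserve_none))
    static long dispatch(const unsigned char *ip, long acc);

    __attribute__((preserve_none))
    static long op_inc(const unsigned char *ip, long acc) {
        __attribute__((musttail)) return dispatch(ip + 1, acc + 1);
    }

    __attribute__((preserve_none))
    static long dispatch(const unsigned char *ip, long acc) {
        switch (*ip) {
        case 0:  return acc;                                         /* halt */
        default: __attribute__((musttail)) return op_inc(ip, acc);   /* inc  */
        }
    }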



I see signs that Google itself is using preserve_none internally, since the public protobuf repo has PROTOBUF_CC (Calling Convention) but it is defined as nothing
  #define PROTOBUF_CC
Is there any chance of this getting out into the wild or is it too dangerous for us mortals?


Since preserve_none is Clang-only and only available on recent compilers, it would introduce an ABI hazard between the core library and generated code. We don't want to cause crashes if you compile protobuf with one compiler and generated code with another.

Also, until quite recently preserve_none was incompatible with the sanitizers, but I believe this may have been fixed by: https://github.com/llvm/llvm-project/pull/81048



I’ve seen a few languages over the years drop and reacquire JIT layers. Some of it has to do with programmer skill and lessons learned, but some is also down to CPU generation.

Like everything else in CS, when the cost balance shifts for different kinds of operations, the best algorithm can shift back to something we haven’t used in fifteen, twenty years. It contributes a lot to the faddishness of programming. Just because we are bringing something back doesn’t mean there’s no reason. But we forget the reasons it wasn’t a panacea last time so that’s still a problem.

If your main JIT gets faster or slower, then the cost-benefit for running it changes, so the threshold to trigger it gets adjusted, and now the amount of code that runs in the other tiers shifts, which might make the amortized cost of that tier worse. It’s like balancing a double pendulum.

If you can make a JIT tier fast and dirty enough, you can skip the interpreter entirely. And, from my armchair position, it seems that the cognitive load of bookkeeping tasks between the interpreter and say two JITs is high enough that a few languages have mothballed the interpreter and used a JIT optimized for compile time not output speed.

And I don’t recall what language, but I’m pretty sure at least one team that did this ended up dropping an intermediate compiler as well, because of that balancing act I mentioned above. It was better to double down on two than to try to handle three.



> I very much hope that the attribute will catch on, spreading to GCC, Visual C++, and other popular compilers,

AFAIK, attribute musttail is in the process of being added to GCC (the patch is under review) with semantics compatible with clang.



What about the preserve_most attribute? Is there any chance something like that will get into GCC? Without it, the non-tail calls ruin the interpreter.



Maybe, but not there yet: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110899

In the meantime some inline assembly macro trickery might help.

edit: code duplication can also be obviated by templating your op function with a fast/slow parameter, with the fast variant tail-calling the slow variant when it cannot perform the fast path, while guarding the slow code via the compile time parameter. The downside is yet more code obfuscation of course.



It's a hard problem because many ABIs cannot do tail calls even for very basic stuff, like calls to extern functions with matching argument and return types. It looks like Clang has some heuristics for changing call sequences for musttail calls. For example, it switches to noplt calls on i686. None of this is mentioned in the Clang documentation: https://clang.llvm.org/docs/AttributeReference.html#musttail

What's realistic here is that you get a diagnostic if the compiler cannot generate a tail call. For many users, that will likely be good enough. Guaranteeing a tail call as in Scheme is unlikely to happen.



It mentions C++ support; it would seem to me that C++ has very few tail-calls. Consider:
  int foo()
  {
    auto a = SomeClassWithADestructor();
    return bar(); // This is not a tail-call because the destruction of a happens *after* the call to bar();
  }


Maybe the example is too simple, but it does not require `__attribute__((musttail))` for good code gen.

Also if the error handling function is unlikely, you wouldn't care too much about how fast it is to call it?

To me it seems like a structure of the form

   

   if (unlikely(malformed))
     return error();

   

   switch (data_type) {
     case x:
       return handle_x();
     case y:
       return handle_y();
   }
generates a nice jump table quite reliably.


Obviously compilers have been doing tail call elimination forever, but for this trick "generates [tail calls] quite reliably" is not enough: you have to GUARANTEE it (or fail compilation), otherwise this structure does not work (it will blow the stack immediately). That's the point of [[musttail]]: tail call elimination is required, that's the only choice the compiler has.



"you have to GUARANTEE it (or fail compilation)"

I've often pondered the utility of similar flags for other optimizations. This is perhaps the largest one, but there are other situations in certain code where I want to know that my optimization has failed.

A more complicated example is, I've sometimes wanted an assertion that a given function is inlined. We know from hard, repeated experience over decades that letting users annotate functions as inline directly doesn't end up working very well, but I've often wondered about creating an assertion that would fail the compile if it isn't. (Ideally with at least a hint as to why the compiler failed it out, but that's easier said than done.) Obviously you don't want to go slapping this in some major open source library that's going to be compiled by half-a-dozen compilers on dozens of operating systems, but for my own code in my own situation it can be an optimization that is the difference between success and failure and it'd be nice to flag a failure.

(Bear in mind I am not proposing to blindly take any particular action if it triggers. The point is to bring it up to human attention and not have it buried deep in invisible, yet potentially consequential, compiler decisions. The human deciding "I guess this optimization isn't happening" and removing the annotation would be one valid decision, for instance.)



I agree, this would be useful. Another one I would like is auto-vectorization: a way to mark a loop with an attribute so that, if it fails to auto-vectorize, the compiler prints out the auto-vectorization report for that loop, explaining why it failed. It's such a brittle optimization, but it's crucial for a tiny number of extremely hot loops, so you would want to know if it failed due to some code change or compiler upgrade. Also, it's just a pain to use the auto-vectorization report normally.



That means that any such code is not portable across compilers anymore. It is effectively written in a non-standard C dialect, because it requires a language extension to work correctly.



The typical way to deal with this is to put the __attribute__ into a C macro which expands to nothing on compilers which don't understand GCC/Clang's __attribute__ keyword. The code without the attribute will still compile and most likely also apply the tail call optimization, you just don't get an error if the compiler can't apply the optimization.

Also TBF, hardly any real-world C code is strictly standard compliant. Many C compilers just agree on a common syntax that includes both the C standard and some popular non-standard extensions.

PS: C++ compilers actually ignore unknown attributes since the `[[attribute]]` syntax has been standardized in C++11. In GCC and Clang you'll get a warning in the standard warning set, but not in MSVC.

PPS: C23 also standardized the `[[attribute]]` syntax and also added a way to check for supported attributes:

https://en.cppreference.com/w/c/language/attributes
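i.e. something along these lines (the macro name is invented):

    /* Expands to the statement attribute where supported, to nothing
       elsewhere; the code still compiles either way, you just lose the
       compile-time guarantee on compilers that lack the attribute. */
    #if defined(__has_attribute)
    #  if __has_attribute(musttail)
    #    define MUSTTAIL __attribute__((musttail))
    #  endif
    #endif
    #ifndef MUSTTAIL
    #  define MUSTTAIL
    #endif

    /* usage: MUSTTAIL return next_op(ip, acc); */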



It will compile, and eventually blow up nicely with a stack overflow OS fault.

Ah, the joys of writing "portable" C code with #ifdef spaghetti, across commercial UNIXes and their own C compilers, 20 years ago.

It only got better because many people, just as they assume Web == Chrome, assume C == GCC, blissfully ignoring everything else.

Nowadays clang is also considered, mostly because a couple of companies wanted to replace GCC, and naturally clang needs to be able to match whatever GCC offers.



> It will compile, and eventually blow up nicely with a stack overflow OS fault.

Not at all guaranteed. Stack overflow is undefined behaviour, which means compilers can optimise your program on the assumption that it doesn’t happen.



Loops are just a special, limited case of recursion.

(And only necessary in languages that have trouble implementing function calls properly.)



Yes, that is correct. You cannot do this trick in standard C, C++ or Rust, it requires some version of [[musttail]]. Strong argument for adding it to the C standard, IMHO.



The portable alternative is being explicit with your loops, possibly in combination with gigantic unwieldy switch statements or some regular goto (it is still part of the standard). But that comes at the cost of readability and sometimes performance. Whether it's practical depends on the use case. For something like recursive data structure algorithms which are relatively small and self contained, I would say it's perfectly doable. Simple interpreters - maybe. Complex interpreters - here it becomes messy.



See Mike Pall’s posts on the subject—the performance cost is considerable, for two reasons. First, you’re forcing the compiler to do register allocation for the whole interpreter at once, which it can virtually never do a good job of. (This is actually the more important part.)

Second, given the existence of branch target buffers (and the ruinous cost of mispredicted branches), you really want the instruction dispatch to be a single indirect branch at the end of each instruction implementation, and for that standard tools are somewhere between unhelpful (you can write a macro containing switch (*insn++) { case INSN_FOO: goto impl_foo; /* ... */ } but it’s anybody’s guess whether you’re getting a single jump table for all copies of that) and actively obstructive (“tail merging” in older versions of Clang would actively destroy any attempts at copying dispatch code). Granted, sometimes things work out (new Clang versions can sometimes go “looks like you’re writing an interpreter” and turn a vanilla switch in a loop into duplicated dispatch code). Then again, sometimes they don’t, and you can’t actually know.



A switch-case is the default way to write an interpreter and I'd even argue it's the most readable.

In the context of this article, it's all about the performance. Switch-case generates suboptimal code for the commonly used fast paths of the protobuf parser, because the mere existence of the slow paths is enough to interfere with the performance of the code around them.



Yeah, that's the way virtual machines have been written forever, some version of
    for instruction in instructions {
        switch (instruction) {
            case OPCODE_X: //....
            case OPCODE_Y: //....
            case OPCODE_Z: //....
        }
    }
This is how VMs have been written since the dawn of time (or using computed gotos, another non-standard addition to C). It has problems though, like the fact that the `switch` branch is extremely unpredictable, and that you get a massive function which is hard to optimize. This [[musttail]] trick is a huge improvement. But yeah, if you got to support compilers that don't have [[musttail]], you in essence have to have two implementations, the [[musttail]] one and the loop/switch one.


The article is pretty clear about this. When it comes to fast lexing and parsing, it is typical for projects to make portability tradeoffs in favor of performance. For example, simdjson is full of assembly.



There's no such thing as "standard C" that you can actually write, due to UB and implementation defined behaviour. There's just (C, compiler version, platform) that defines (if only through the compiler's source code) what will actually happen in any given situation.



Language extensions are a feature, not a bug. They allow C to evolve and C compilers to compete without requiring committee consensus. Good extensions will eventually be picked up by other compilers, and maybe even find their way into the standard.



I love how, when we discuss C or C++, "Language extensions are a feature, not a bug", but then when we discuss other languages that try to do without C, by adding extensions to their language reference, the speech turns into "Language extensions are a weakness, not a feature".

Additionally, "Language extensions are a feature, not a bug" seems only to be valid in the context of C and C++ IF the compilers being discussed are GCC or Clang, because God forbid a commercial C or C++ compiler should have language extensions.

Speaking in general about the way these subjects are discussed online, not you in particular.



sure, i get the joke, i just don't like it. it's the same story as browsers. proprietary extensions in the name of progress because technically it's allowed, but also unimplemented standardized features galore, necessitating polyfill libraries and frequent checking of support matrices.

it's just sprawl with popular support.



It is very different. A web browser is an (almost) fully self-contained environment. Same with something like Java. On the other hand, a standard C or C++ compiler+runtime has (by design) very few features. Anything beyond trivial programs has to reach for platform-specific features. Even POSIX is often not enough.



As the disassembly in the post demonstrates, the problem with the fallback path (which is not necessarily the error path) is not how fast the call to it is, it's that the mere existence of that call can force the compiler to create a stack frame and spill registers into it for the whole function, including the fast path.

OK, maybe “force” is not the right word—nobody says the compiler has to have a single stack frame structure for all possible execution paths of a function. Nobody even says it has to use the standard ABI for a no-linkage (static or anonymous-namespace) function (that doesn’t have its address taken). But the reality is, all compilers I’ve seen do, including Clang, so we want a way to tell them to not worry about the ABI and avoid wasting time on preserving registers across the call.

Re your nice jump table, sure it does. But if you try running the result under, say, perf report, and your test bytecode doesn’t describe a short loop, you’ll see one of two things: either you had a branch mispredict on each dispatch; or the compiler went “looks like you’re trying to write an interpreter” and moved the indirect jump to the end of each case (I’ve seen Clang do this). And either way the register allocation in the resulting code probably sucks.



> so we want a way to tell them to not worry about the ABI and avoid wasting time on preserving registers across the call

that's what -fvisibility=internal already does, no?



That’s what static could do (if the function’s address is not taken, or given sufficiently powerful dataflow analysis), but C and C++ compilers don’t take advantage of that. Getting that out of -fvisibility=hidden -flto would also be possible, but requires even more nonexistent compiler smarts. (From a quick web search, I can't figure out what internal visibility brings over hidden.)

(Granted, it’s not like this is completely impossible—I seem to remember GHC and MLton can invent custom calling conventions for Haskell and SML respectively. But the popular C or C++ compilers can’t.)



Really? I've always assumed that static gives C/C++ compilers the right to disregard calling conventions altogether. Now that I think about it, this assumption might be unjustified. Was there any strong reason for compilers to keep the calling convention?



I think that "static register" should make the compiler allowed to disregard calling conventions (if optimizations are enabled), but neither "static" nor "register" alone should do so. (In a discussion on an IRC some time ago, I had suggested what "register" alone should do in my opinion, which is something else than this.)



I mean, it does give them that right as far as I know [with, again, the caveat of function pointers—you can’t pass a non-ABI-compliant function to qsort(), however static it is—and an additional one of inline assembly], it’s just that they choose not to exercise it. Why? Not sure. Maybe it’s too difficult (register allocation and spilling is NP-complete and fairly expensive). Maybe it’s too difficult when your compiler starts out as a function-at-a-time one and builds on that. The popular ones don’t do that, is my point. (Would be interesting to check what SDCC does, by the way, on register-starved micros such tricks could make sense.)



I wonder how fast this would be when using a trampoline, i.e. returning the next function as a function pointer and calling that from an outer loop. That would have the advantage of being portable C.



C is often used as a target language for compilers of higher level languages. The Scheme programming language requires all tail calls not to grow the stack, so implementors have explored various techniques, including trampolines. I don't have a citation, but you can find the answer in the papers on compiling Scheme to C. If there is no guarantee of TCO in the target language, then the generated programs will be slower.

Incidentally, this is why implementors of (especially) high level languages are annoyed that TCO was removed from the JavaScript specification. There are even solutions for having TCO while still allowing stack inspection.

https://github.com/schemedoc/bibliography/blob/master/page8....



The reason I suspect tail call optimization is fast is that the resultant loop is predictable, and thus CPU instruction prefetch and memory prefetch work very well.

Jumping via function pointers would probably not be as predictable, and you'd be unlikely to see the same benefit.

Of course, one must measure, and I haven't.



The tail call interpreter is also calling through a function pointer. The cost here is purely the call+ret overhead, which can be non-trivial when it is per opcode; on some micro-architectures there is also a limit on taken jumps per cycle (sometimes as low as one taken jump every other cycle).

edit: trampolining would also collapse all indirect jumps to a single source address which is not ideal for branch prediction.



One of the compilation passes in a Scheme compiler (e.g. Guile or Racket) is conversion to continuation passing style. Here the author applies the pass manually as a code design decision, because later passes of the compiler (i.e. the actual C compiler) produce better code given that input. It's neat.



Could someone clarify the example where an unlikely fallback path is wrapped in a musttail?

My understanding is that musttail is useful because it prevents stack frame construction and register spilling; the calling function basically says "you can re-use the existing stack and not worry about my local variables."

But doesn't the stack frame construction overhead / register spilling only occur when the fallback path is actually invoked; therefore if it is unlikely and you care less about performance in this slow path it doesn't matter if you wrap the fallback path in musttail?

(Is this purely a branch prediction effect, where the possibility of extra work needing to be done slows down the fast path even when the work does not need to be done?)



If the fallback path can be invoked repeatedly for one message (depending on message size), the tail call is a correctness issue because your stack is probably not going to be large enough to keep all the fallback path frames.



If added to Clang then I would assume this means it got support from LLVM. If so, does this mean that Rust can now rely on tail calls in LLVM to support something like the `become` keyword?



The problem with doing it in rust is that most calls aren't tail calls, even if they look like tail calls. You need to invoke the destructors for any code path that can drop.



Isn't that the purpose of `become`? I thought it was to say "this IS a tail call, error out if it is not". After that validation is done, then the compiler can drop as needed.



The compiler can't drop as needed, because the drop prevents things from being tail calls. A single drop in a function prevents any calls from being tail calls, and therefore, they can't be eliminated.

In idiomatic rust, this means very few functions can use become.



The musttail return is not allowed to occur within a try block, or within the scope of variables with nontrivial destructors [0]. (In particular, this means that the function parameters cannot have nontrivial destructors.) Otherwise, within a musttail-called function, it's as if the original caller directly called that function, so exceptional stack unwinding is unaffected.

[0] https://www.mail-archive.com/[email protected]/msg95265.html



I wonder how this behaves in combination with __attribute__((cleanup(...))), especially if the to-be-cleaned variable is passed into the tail function as a parameter.



You get an error that the tail call can't be performed, which is kind of the point of the tailcall attribute. Same thing with regular C++ destructors or a dozen other language features which interfere with tail calls; no need for extensions like __attribute__((cleanup(...))).

https://godbolt.org/z/Yex74Wjz7
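For illustration, the shape of code that gets rejected looks something like this (a sketch along the lines of the godbolt example, not the exact case):

    static void closep(int *fd) { (void)fd; /* e.g. close(*fd) */ }

    int next_step(int fd);

    int step(int fd) {
        __attribute__((cleanup(closep))) int guarded = fd;
        /* rejected: the cleanup for `guarded` would have to run after
           the callee returns, so this cannot be a true tail call */
        __attribute__((musttail)) return next_step(guarded);
    }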



I wrote a C Protobuf decoder/encoder as well as an IML parser and bindings for Python. Here's something I have to say about measuring parsing speed.

* Since this library is only offered as a binding to some managed languages there are some extra caveats that, in terms of performance, overshadow everything else. I cannot speak to Ruby or PHP, but with Python I saw a dramatic speed increase when not using enumerators. I.e. if you translate Protobuf's enumerators into Python's enumerators, any possible gains you may have in your C code will be trampled by creation times of different Python objects. The difference is many orders of magnitude. Similarly, you could go even further and implement all the supporting data structures in C, and only expose a minimal interface to them in Python. How fair this sort of comparison would be against code that uses Python's built-in structures is a question I cannot answer.

* Google's Protobuf parser for Python would still be "faster" than 2+GB/s because... it doesn't parse anything besides the top-level message. The internal structure of the message is parsed on demand. This will probably be slower than 2+GB/s if your code instantly reads everything that was parsed, but the obvious question would be: how can you possibly compare these two approaches in practical terms? And, again, there isn't a clear answer, because the practical results will depend on the nature of your application.

* In the general case, Protobuf parsing cannot be streamed (because of the handling of duplicates). This means that, in practice, code that parses Protobuf content will be bottlenecked on I/O (it needs to wait for the end of the message before any parsing starts). Independently, depending on the typical Protobuf message in your application, it might be possible to parallelize parsing, which, again, will most likely outrun any single-threaded parser, but, just as in the examples above, cannot be said to be a winning strategy in general.

* It's usually a lot more efficient to combine parsing with creation of domain objects, which is a step an application will almost always have to take anyways. How this functionality is accessible from the parser will in many cases determine which parser will win the race.

----

Bottom line: Protobuf (or maybe parsing in general) is just a bad candidate for trying to measure / compare speeds. It's too low-level and poorly designed to be a good measuring stick for performance benchmarks.



> In general case, Protobuf parsing cannot be streamed (because of the handling of duplicates)

I don't see how last field wins stops you from streaming parse. Please elaborate.



Streaming in this context means that the parsing code hands off some parsed structures to the code outside of the parser before the entire message is processed. Suppose now you have this message:
    { x: 1, y: 2, x: 3 }
The streaming parser then reads field "x" with the value of "1", dispatches that value to the outside code, then reads "y", then reads "x" again, but the outside code had a side-effect associated with that field, which it already performed (eg. the outside code was printing the value of the field). Now, you have a program with an error. The right output should've been:
    y: 2
    x: 3
but you got:
    x: 1
    y: 2
    x: 3
Might not be so bad, depending on circumstances, but if you are betting on a soccer game outcome...


You could easily design a stream parser that rejects duplicates before it passes them off to the user, by maintaining a set of already encountered keys within the parser state. The space overhead isn't a concern unless your map/set has millions of entries, but your non-streaming parser would have choked from the memory usage long before then, anyways.



> You could easily design a stream parser that rejects duplicates before it passes them off to the user, by maintaining a set of already encountered keys within the parser state.

You could, but you are not allowed to. Protobuf parsing requires that the last duplicate key wins.



I see. But if this ambiguous repetition must be resolved, then it must be resolved either at input or output time. Protobuf seems to have optimized for the output case by allowing for updates to scalar fields by appending.



It doesn't need to be resolved at input time. Protobuf wire format allows repetition. If we want to be more pedantic, Protobuf wire format doesn't have a concept of dictionaries / hash-tables however you want to call them. It only has lists. What IML defines as "map" is, in fact, a list of key-value pairs, so there's no problem with repetition on the technical level.

However, the interpretation of the payload does require removing repetition, and in a bad way: the compliant implementation must remove all but the last matching key-value pair.

It's just plain stupid. But this stupidity isn't unique to Protobuf. There are plenty of stupid formats like this one. For example VHD (Microsoft's VM image format) puts some information necessary to interpret the image in what they call a "footer" (i.e. at the end of the file). MOV (Mac QuickTime videos) and at least early versions of MP4 created on Macs used to put some metadata at the end of the file, making it impossible to stream them.

Unfortunately, it happens a lot that people design binary formats for the first time, and then the format succeeds for unrelated reasons. We have tons of bad but popular formats. PNG is up there at the top of the list together with HTML and PDF. Oh, and don't mention PSD -- that one would probably take the cake when it comes to the product of how bad it is and how popular it is...



It sounds like you have decided this design is "stupid" because you don't understand the motivation. This feature allows a proxy to alter a protobuf message without decoding it. That is a significant efficiency win in some important cases.



Perhaps I've just been looking more (because I've been going back to the books to pick up C again after a 20 year absence for a side project), but it feels like in recent years there has been a little uptick in people appreciating the level of control C provides, and "going back to basics".

I think the work on C23 (which would have still been C2x when this article was written) means people are starting to want to see innovation in the language again. I don't think that would have happened without Go and Rust showing some great thinking, but it's interesting that it's happening at all.

Tail calls are an interesting idea, and now I want to know more about both this extension, and also how to write my code in such a way as to help a compiler optimize for tail calls even when this extension isn't present. This somehow feels more fun to me than writing more Python or Ruby...



The great thinking has existed since Modula-2 and Ada; the only issue was that most people weren't interested in safer systems programming. Now that money is attached to every CVE fix, and C isn't going away in our lifetimes, naturally there needs to exist a way to improve what is already there.



> it feels like in recent years there has been a little uptick in people appreciating the level of control C provides, and "going back to basics".

I don't write C, and maybe it's because I somehow seek out these types of article and projects, but I'd agree, I'm seeing the same thing. It might be a backlash against programming languages like Rust or even JavaScript. Rust being really complicated, but clearly safer, and JavaScript... well it's just really hard to set up a development environment and the tooling is almost more predominant than the language itself.

While I don't write C myself, only for fun, I can see many just reverting to it, because it's "right there", it's fast, the language is fairly simple, even if the standard library is anything but simple, but you can take the stuff you need and ignore the rest.



I think it's fine to go back to C and maybe play around a bit to learn about some of the things that can be done, but I would implore you to bear in mind that the decades have taught us that the "ultimate danger" in question is basically that you're building sand castles in a minefield. We're not talking "oh ha ha I guess I wrote a few more bugs that a stronger type system would have caught", we're talking "oh ha ha I guess remote unauthenticated attackers can run arbitrary assembler as root in my network code because I tripped over one of C's nominally well-known mines that I did not personally know about and all the attackers had to do was slightly tweak one of the exploit scripts already in metasploit to install root kits on my system and add it to their bot net".

The world has gotten a lot more dangerous than people realize. People generally quite correctly assume that hackers aren't going to spend person-months attacking their system personally but don't realize that the attacker tools are quite sophisticated now and they don't have to. Shoving code into a small little buffer overflow to lift to a larger buffer overflow to load exploit code over the network that will run a pre-packaged root priv escalation to install a rootkit to add them to their botnet is no longer the work of a team of hackers for months. It's all off-the-shelf tech now and it's more like writing a 20 line function now; you don't need to attract very much attention now to attract that level of sophistication.

We are leaving C behind collectively for very good reasons. If you are playing with C and you do not intimately understand those reasons, you're going to relearn them the hard way.



Why do people have this idea that it's the language's job to protect you? C is a small piece of a much larger puzzle. That puzzle includes things like the memory page protection in your CPU's MMU. It includes things like SECCOMP BPF. It also includes things like ASAN, UBSAN, TSAN, etc. If you work in defense it might even include ASICs. The list goes on. Whatever language you believe in probably depends on C. You live in a C world. The value prop of the programming languages other than C/C++ for systems programming is that they build cohesive social communities and C is in the background serving them all. The whole "we're going to save you from the bogeyman" is just kool aid. No one will be saved.



> Why do people have this idea that it's the language's job to protect you?

Because we have the benefit of hindsight. We tried putting the responsibility in programmers' hands. It didn't work. We're now learning from that experience and building better tools that can do the same job, more safely. It's foolish not to use them.

> Whatever language you believe in probably depends on C. You live in a C world. The value prop of the programming languages other than C/C++ for systems programming is that they build cohesive social communities and C is in the background serving them all. The whole "we're going to save you from the bogeyman" is just kool aid.

No one is expecting the new languages to make computing suddenly error-free from top to bottom. It's about reducing the size of the surface where errors can be introduced. You're attacking a strawman here.



It is definitely on its way down, and it is a good thing.

You may think you can handle C. I disagree. The evidence is on my side. We need safer languages.

Heck, that even overstates it. It's not like we need super-safe languages per se. Maybe the 2070s will disagree with me. But we need languages that aren't grotesquely unsafe. It's not just that C isn't as safe as a language could be; it is that it is recklessly unsafe.

Interrupting that descent is foolishness, and pinning your career to a resurrection in C even more foolish. These technologies always have this meta-meta-contrarian phase to them just before they expire. I got into the computer field just as the meta-meta-contrarian "everything must be written in assembler" phase was at its peak. It was a false resurrection and I pity anyone who overinvested in learning how to write large applications in 100% assembler as a result of reading someone's over-enthusiastic screed about the virtues of writing pure assembler in 1998.

So I write this in the hopes of helping some other young HN reader not be fooled. C may be a thing you learn at some point, some day, just as assembler is still something you may learn. But don't get overexcited about the last meta-meta-contrarian false resurrection before death. Which is still years away, but at this point I think it's pretty much set on an irrevocable course.



I wrote tens of thousands of lines of C code in the 90s and early 00s (without buffer overflows that I learned about; I did also write an evil network layer for inducing buffer overflows in my and my dependencies' code), and have been doing a lot of other languages since then, and then had occasion to write some more C where I was doing string allocation and manipulation (for an LD_PRELOAD to monitor what various programs I lacked source to were doing), and it was absolutely nerve wracking. The Linux kernel might be mostly C for a long time, but it would be crazy to start a new thing in C. There are growing rewrite projects from C to Rust. It would be farther along except the Rust people seem to dislike laboriously recreating decades of GNU long-opt functionality in all these base packages to actually make Rust drop-in replacements for C.

Maybe for embedded, I haven't done that, but for general purpose, I can't imagine it being worth the risks.



C is great for its basics, but at some level, when you gain awareness of all of your unpunished pointer lifetime sins and how annoying the lack of basic data structures is in stdlib, then you realize that biting the bullet and learning something like Rust is the way to go. Maybe a better stdlib would help, but I've found that the pain of becoming proficient in Rust has paid back multiples over the continual toil of doing certain repetitive tasks in C.



It is not terribly hard to find some library for data structures in C. What I do not like about Rust is: syntax, complexity, long compile times, cargo, ...



C is great and I definitely recommend learning it & becoming proficient in it. I don't think I would recommend it for a new project, for all the reasons jerf mentions in a sibling comment[1]. But there's a ton of existing code out there, and I think it provides a nice little peek into what the hardware is actually doing that you don't get from higher level languages.

> I think the work on C23 means people are starting to want to see innovation in the language again.

I have to admit this makes me nervous. Part of what's nice about C is its simplicity and its stability. It's relatively easy to write a compiler, and the lack of having many different ways to do the same thing means all code written in C tends to "feel" roughly the same (outside some domain-specific areas), which means it's easy to dive into a new codebase and get up to speed. It's basically all structs & functions and abstractions built upon them. That's it.

While I definitely am not opposed to making improvements to the C spec, seeing the inline anonymous functions in a proposal linked elsewhere in this thread[2] really turns me off. It starts to feel like the hideous mess that C++ has turned into, with oodles of implicit behavior, un-greppable code, generated symbol names leading to compiler error hell, and a thousand different colors for your bike shed.

We already have languages that can do all this stuff. I welcome making changes to C to make it nicer to program in C, but I'm wary of proposals that turn C into yet another messy kitchen sink programming language. Let's keep it clean, obvious, and simple, please.

[1] https://news.ycombinator.com/item?id=41291206

[2] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3266.htm#u...



Easy, the direction WG14 is going for is "C++ without classes, templates, improved memory safety".

Already there you can guess what features are on WG14 roadmap without going to the ISO mailings.



C is not a dead language to literally millions of other programmers and the billions of ongoing practical applications of it.

If your threshold for a language's alive-ness is the presence or absence of a package manager, you should reevaluate.



1. I don't think the presence of centralized package management makes or breaks a language for most people.

2. Your platform's package manager should have plenty of C packages.



I think the success of Javascript and Rust certainly tests your biases.

I can say anecdotally that the lack of a standard package manager in C and C++ kept me from exploring those languages.



And even if one wants to use dependencies not available as a distro devel package (or as a fallback to missing distro devel package), things like meson and cmake make it relatively straightforward to use vendored dependencies.



New account, banal observation, invitation to interact. It's a pattern I've seen recently on what are apparently LLM-powered engagement bots on Twitter.



What are those engagement bots for? Are they farming followers and turn into disinformation machines later? Or is that only to drive engagement to cultivate ad revenue for Twitter, or for some “brand” that is adjacent to their topic?



The way hn karma works, you’re incentivized to accumulate a set of bots that meet the threshold for flagging and downvoting. Barring effective anti-bot/brigading measures (some of which I’m sure exist), such a team of bots could manipulate front page content. That’s a valuable capability for a variety of nefarious aims in this space.

A straightforward one would be either subtly or dramatically tilting sentiment here negative against some mid-cap public tech company and profiting from a short position.

(Assuming you think investors care what gets upvoted on hn…)



Honestly I hope the LLM-powered bots will be hugely beneficial for the online discourse in the long term (and keep creating mayhem in the short-term).

My hope is that long-term people will realize that most of the arguments and outrage they see online is manufactured to boost engagement and simply start ignoring it or seeking private invite-only forums to discuss things.

People mostly failed to realize this when humans controlled fake/bot/troll accounts, but with the proliferation of LLMs the realization is spreading much faster.

I know it sounds naive, but that's my hope - that the global village will once again become local instead.



It's a harmless comment that could even generate some interesting responses. I think the bar for accusing other commentators of being LLMs should be set higher than this.



How do you think it makes people who write ("generate" cheapens it) the interesting responses feel? Answering people's questions has two main rewards to the writer: it feels good to help someone else, and it boosts your ego to pontificate on something you know that others might not. Discovering that you're replying to a bot makes you feel like you've been fooled.

Anyway, I'm more aware than ever now that the people we interact with in text, online, are gradually being diluted by bots, and I don't want to participate in a community of bots.

Maybe the comment isn't generated by an LLM, and I didn't make any direct accusations. But it is a weird first comment. You expect first comments by fresh new accounts to show evidence of having been motivated to create an account because they had something to say - it requires non-zero activation energy. Meanwhile, there's plenty of incentive to try and sway what gets attention here on HN.



>who write ("generate" cheapens it)

Eh? I'm saying that the comment could generate (i.e. provoke) interesting responses. I'm not saying that the responses themselves will be "generated" by their authors as opposed to being written.

>Maybe the comment isn't generated by an LLM

Indeed. I think we should definitely bear this possibility in mind!



I understand the potential nefarious motives for spamming the site with LLM comments. It's just that there isn't really any evidence in this case that the comment is generated by an LLM. I wouldn't be surprised if it was, but I also wouldn't be surprised if it wasn't.
