(Comments)

Original link: https://news.ycombinator.com/item?id=39812969

Your analysis seems spot-on. While fibers and async/await offer similar functionality at a high level, the difference lies in their implementation details. Async/await relies on compile-time transformation to generate lightweight state machines, whereas fibers rely on runtime primitives and manual control-flow manipulation. Both have advantages and drawbacks, and the choice ultimately depends on the use case and the properties you need.

An interesting observation is that, despite the potential advantages of low-overhead, manually controlled fibers, they still seem largely absent from the industry. Beyond the technical challenges of implementing fibers, are there other reasons holding back their wider adoption? Or has the industry become so dominated by managed environments that fine-grained, low-overhead concurrency matters less?

Regarding the comparison between fibers and async/await in the Rust context, Rust's offering in this space (namely async Rust) appears to borrow elements of both models. By providing first-class support for fiber-style coroutines and async/await, Rust gives developers both flexibility and simplicity, letting them pick whichever model best fits their needs in a given context. Whether this hybrid approach becomes the de facto standard, or one approach eventually wins out, remains to be seen.

Finally, it is worth noting that regardless of which particular fiber or async/await implementation you choose, carefully weighing the trade-off between throughput and latency remains essential when designing high-performance concurrent systems. The ultimate goal is to minimize waste, ensure resources are used effectively throughout the computation, and keep the system as a whole at the best balance of speed and responsiveness.

Original
Why choose async/await over threads? (notgull.net)
407 points by thunderbong 1 day ago | 400 comments

Async/await with one thread is simple and well-understood. That's the Javascript model. Threads let you get all those CPUs working on the problem, and Rust helps you manage the locking. Plus, you can have threads at different priorities, which may be necessary if you're compute-bound.

Multi-threaded async/await gets ugly. If you have serious compute-bound sections, the model tends to break down, because you're effectively blocking a thread that you share with others.

Compute-bound multi-threaded does not work as well in Rust as it should. Problems include:

- Futex congestion collapse. This tends to be a problem with some storage allocators. Many threads are hitting the same locks. In particular, growing a buffer can get very expensive in allocators where the recopying takes place with the entire storage allocator locked. I've mentioned before that Wine's library allocator, in a .DLL that's emulating a Microsoft library, is badly prone to this problem. Performance drops by two orders of magnitude with all the CPU time going into spinlocks. Microsoft's own implementation does not have this problem.

- Starvation of unfair mutexes. Both the standard Mutex and crossbeam-channel channels are unfair. If you have multiple threads locking a resource, doing something, unlocking the resource, and repeating that cycle, one thread will win repeatedly and the others will get locked out.[1] If you need fair mutexes, there's "parking-lot". But you don't get the poisoning safety on thread panic that the standard mutexes give you.
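A minimal sketch of the lock/unlock/repeat pattern described in the starvation point above, using the standard (unfair) `Mutex`; the thread count and the one-second deadline are arbitrary, and the per-thread acquisition counts you observe will depend on platform and mutex implementation:

    use std::sync::{Arc, Mutex};
    use std::thread;
    use std::time::{Duration, Instant};

    fn main() {
        let shared = Arc::new(Mutex::new(0u64));
        let deadline = Instant::now() + Duration::from_secs(1);
        let mut handles = Vec::new();

        for id in 0..4 {
            let shared = Arc::clone(&shared);
            handles.push(thread::spawn(move || {
                let mut acquired = 0u64;
                while Instant::now() < deadline {
                    // Lock, do a little work, unlock, immediately retry: with an
                    // unfair mutex the thread that just released the lock tends
                    // to win it again, and the other threads can be starved.
                    let mut guard = shared.lock().unwrap();
                    *guard += 1;
                    drop(guard);
                    acquired += 1;
                }
                println!("thread {id} acquired the lock {acquired} times");
            }));
        }

        for h in handles {
            h.join().unwrap();
        }
    }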

If you're not I/O bound, this gets much more complicated.

[1] https://users.rust-lang.org/t/mutex-starvation/89080



Yes, 100%.

I've mostly only dealt with IO-bound computations, but the contention issues arise there as well. What's the point of having a million coroutines when IO throughput is the bound anyway? How will coroutines save me when I immediately exhaust my size-10 DB connection pool? They won't; they just make debugging and working around the issues harder and more difficult to reason about.



The debugging issue is bigger than what it seems.

Use of async/await model in particular ends up with random hanged micro tasks in some random place in code that are very hard to trace back to the cause because they're dispersed potentially anywhere.

Concurrency is also rather undefined, as are priorities most of the time.

This can be partly fixed by labelling, which adds more complexity, but at least is explicit. Then the programmer needs to know what to label... Which they won't do, and Rust has no training wheels to help with concurrency.

Threads, well you have well defined ingress and egress. Priorities are handled by the OS and to some degree fairness is usually ensured.



Just morning bathroom musings based on your posts (yep /g), and this got me thinking: maybe the robust solution (once and for all, for all languages) requires a rethink at the hardware level. The CPU-bound issue comes down to systemic interrupt/resume, I think; if this can be done fairly for n WIP threads-of-execution with efficient queued context swaps (say, a CPU with n WIP contexts), then the problem becomes a resource allocation issue. Your thoughts?


> a cpu with N wip contexts

That's what "hyper-threading" is. There's enough duplicated hardware that beyond 2 hyper-threads, it seems to be more effective to add another CPU. If anybody ever built a 4-hyperthread CPU, it didn't become a major product.

It's been tried a few times in the past, back when CPUs were slow relative to memory. There was a National Semiconductor microprocessor where the state of the CPU was stored in main memory, and, by changing one register, control switched to another thread. Going way back, the CDC 6600, which was said to have 10 peripheral processors for I/O, really had only one, with ten copies of the state hardware.

Today, memory is more of a bottleneck than the CPU, so this is not a win.



The UltraSPARC T1 had 4-way SMT, and its successors bumped that to 8-way. Modern GPU compute is also highly based on hardware multi-threading as a way of compensating for memory latency, while also having wide execution units that can extract fine-grained parallelism within individual threads.


Also, IBM POWER have SMT at levels above 2; at least POWER 7 had 4-way SMT ("hyperthreading").


Missed that. That's part of IBM mainframe technology, where you can have "logical partitions", a cluster on a chip, and assign various resources to each. IBM POWER10 apparently allows up to 8-way hyperthreading if configured that way.


Thanks, very informative.


What you said sounded in my head more like you’re describing a cooperatively scheduled OS rather than a novel hardware architecture.


(This has been a very low priority background thread in my head this morning so cut me some slack on hand waving.)

Historically, the H/W folks addressed (pi) memory related architectural changes, such as when multicore came around and we got level caches. Imagine if we had to deal at software level with memory coherence in different cores [down to the fundamental level of invalidating Lx bytes]. There would be NUMA like libraries and various hacks to make it happen.

Arguably you could say "all that is in principle OS responsibility even memory coherence across cores" and we're done. Or you would agree that "thank God the H/W people took care of this" and ask can they do the same for processing?

The CPU model afaik hasn't changed that much in terms of granularity of execution steps whereas the H/W people could realize that d'oh an execution granularity in conjunction with hot context switching mechanism, could really help the poor unwashed coders in efficiently executing multiple competing sequences of code (which is all they know about at H/W level).

If your CPU's architecture specs n±e clock ticks per context iteration, then you compile for that and you design languages around that. CPU-bound now becomes heavy CPU usage but is not a disaster for any other process sharing the machine with you. It becomes a matter of provisioning instead of ad-hoc programming around provisioning...



If our implementations are bad because of preemption, then I’m not sure why the natural conclusion isn’t “maybe there should be less preemption” instead of “[even] more of the operating system should be moved into the hardware”.


If you have fewer threads ready to run than CPU cores, you never have any good reason to interrupt one of them.


I don't know how the challenges with cooperative (non-preemptive) multitasking keep needing to get rediscovered. Even golang, which I consider a very responsibly designed language, went with cooperative at first, until they were forced to switch to preemptive. Not saying cooperative multitasking doesn't have its place, just that it's gotta have a warning sticker, or even better, statically disallow certain types of code from executing.

Also great time to plug a related post, What color is your function:

https://journal.stuffwithstuff.com/2015/02/01/what-color-is-...



Also there is the extra software development and maintenance cost due to coloured functions that async/await causes

https://journal.stuffwithstuff.com/2015/02/01/what-color-is-...

Unless you are doing high scalability software, async might not be worth of the trade offs.



If you expect to color your functions async by default, it's really easy to turn a sync function into a near-zero-cost async one: a Future that has already been resolved at construction time by calling the sync function.
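The comment is about JS/TS (where this wrapping is essentially `Promise.resolve(syncFn())`); as a rough Rust analog of the same idea, with `parse_config` being a made-up stand-in for the sync function:

    use std::future::{ready, Future};

    // Hypothetical synchronous function.
    fn parse_config(s: &str) -> usize {
        s.len()
    }

    // Async-flavored wrapper: the future is already resolved at construction
    // time, so awaiting it is near zero-cost.
    fn parse_config_async(s: &str) -> impl Future<Output = usize> {
        ready(parse_config(s))
    }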

This way, JS / TS becomes pretty comfortable. Except for stacktraces, of course.



Useful stacktraces are one of the reasons why I use Go instead of JS/TS on the server side.


Except when your sync function calls a function that makes the async runtime deadlock. No, there's no compiler error or lint for that. Good luck!


> Async/await with one thread is simple and well-understood. That's the Javascript model.

Bit of nuance (I'm not an authority of this and I don't know the current up-to-date answer across 'all' of the Javascript runtimes these days):

Isn't it technically "the developer is exposed to the concept of just one thread, without having to worry about order of execution (other than callbacks/that sort of pre-emption)", but under the hood it can actually be using as many threads as it wants (and often does)? It's just "abstracted" away from the user.



I don't think so. Two threads updating the same variable would be observable in code. There's no way around it.


I'm referring to something like this: https://stackoverflow.com/questions/7018093/is-nodejs-really...

It's like a pedantic technical behind the scenes point I think, just trying to learn "what's true"



Looks like the spec refers to the thing-that-has-the-thread as an "agent". https://tc39.es/ecma262/#sec-agents

I don't know the details about how any implementation of a javascript execution environment allows for the creation of new agents.



I mean, yeah, if you go deep enough your OS may decide to schedule your browser thread to a different core as well. I don’t think it has any relevance here - semantically, it is executed on a single thread, which is very different from multi-threading.


It is executed by your runtime which may or may not behind the scenes be using a single thread for your execution and/or the underlying I/O/eventing going on underneath, no?


Most JS runtimes are multi-threaded behind the scenes. If you start a node process:

  node -e "setTimeout(()=>{}, 10_000)" &
Then wait a second and run:

  ps -o thcount $!
Or on macOS:

  ps -M $!
You'll see that there are multiple threads running, but like others have said, it's completely opaque to you as a programmer. It's basically just an implementation detail.


I'm not sure what you mean by "multi-threaded async/await"... Isn't the article considering async/await as an alternative to threads (i.e. coroutines vs threads)?

I'm a C++ programmer, and still using C++17 at work, so no coroutines, but don't futures provide a similar API? Useful for writing async code in serialized fashion that may be easier (vs threads) to think about and debug.

Of course there are still all the potential pitfalls that you enumerate, so it's no magic bullet for sure, but still a useful style of programming on occasion.



They mean async/await running over multiple OS threads compared to over one OS thread.

You can also have threads running on one OS thread (Python) or running on multiple OS threads (everything else).

Every language’s concurrency model is determined by both a concurrency interface (callbacks, promises, async await, threads, etc), and an implementation (single-threaded, multiple OS threads, multiple OS processes).



async/await tasks can be run in parallel on multiple threads, usually no more threads than there are hardware threads. This allows using the full capabilities of the machine, not just one core's worth. In a server environment with languages that support async/await but don't have the ability to execute on multiple cores like Node.js and Python, this is usually done by spawning many duplicate processes and distributing incoming connections round-robin between them.
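As a sketch of the first half of that (an async runtime running tasks in parallel on roughly one worker thread per hardware thread), using Tokio as one concrete runtime; the task bodies are just placeholders:

    use std::thread::available_parallelism;

    fn main() {
        // One worker thread per hardware thread (Tokio's multi_thread runtime
        // defaults to this anyway; shown explicitly for clarity).
        let workers = available_parallelism().map(|n| n.get()).unwrap_or(1);

        let rt = tokio::runtime::Builder::new_multi_thread()
            .worker_threads(workers)
            .enable_all()
            .build()
            .unwrap();

        rt.block_on(async {
            let mut handles = Vec::new();
            for i in 0..1_000 {
                // Tasks are cheap; they are scheduled across the worker threads.
                handles.push(tokio::spawn(async move { i * 2 }));
            }
            let mut sum = 0u64;
            for h in handles {
                sum += h.await.unwrap() as u64;
            }
            println!("sum = {sum}");
        });
    }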


Jemalloc can use separate arenas for different threads which I imagine mostly solves the futex congestion issue. Perhaps it introduces new ones?


IIRC glibc default malloc doesn't use per-thread arenas as they would waste too much memory on programs spawning tens of thousands of threads, and glibc can't really make too many workload assumptions. Instead I think it uses a fixed pool of arenas and tries to minimize contention.

These days on Linux, with restartable sequences, you can have true per-CPU arenas with zero contention. Not sure which allocators use them, though.



> As pressure from thread collisions increases, additional arenas are created via mmap to relieve the pressure. The number of arenas is capped at eight times the number of CPUs in the system (unless the user specifies otherwise, see mallopt), which means a heavily threaded application will still see some contention, but the trade-off is that there will be less fragmentation.

https://sourceware.org/glibc/wiki/MallocInternals

So glibc's malloc will use up to 8x #CPUs arenas. If you have 10_000 threads, there is likely to be contention.



https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree...

Thank you, I didn't know about this one. An allocator that seems to use it at https://google.github.io/tcmalloc/rseq.html



Not only does tcmalloc use rseq, the feature was contributed to Linux by the tcmalloc authors, for this purpose, among other purposes.


I keep having to repeat it on this unfortunate website: it's an implementation detail, a multi-threaded executor of async/await can cope with starvation perfectly well as demonstrated by .NET's implementation that can shrug off some really badly written code that interleaves blocking calls with asynchronous.

https://news.ycombinator.com/item?id=39530435

https://news.ycombinator.com/item?id=39786142

https://news.ycombinator.com/item?id=39721626



Said badly written code will still execute with poor performance, and the mix can be actively hard to spot.


Badly written threaded code will have the same problem, unfortunately


You would be surprised. It ultimately regresses to "thread per request, but with extra steps". I remember truly atrocious codebases that were spamming task.Result everywhere, and yet they were performing tolerably even back on .NET Framework 4.6.1. Performance has since improved (often quite literally) ten-fold, with the threadpool being rewritten, its hill-climbing algorithm receiving further tuning, and it getting proactive blocked-worker detection that can inject threads immediately without going through hill-climbing.


I find this has worked well for me when I can easily state what thread pool work gets executed on.


I think it's Kotlin's model. The language doesn't have a default coroutine executor set; you need to provide your own or spawn one with the stdlib. It can be thread-pooled, single-threaded, or a custom one, but there isn't a default one set.

If you use a single-threaded executor, then race conditions won't happen. If you choose a pooled one, then you obviously should realize there can be race conditions. It's all about the choice you made.



You focus on rust rather than generalizing...

If you are IO bound, consider threads. This is almost the same as async / await.

What was missing above, and the problem with how most compute education is these days: if you are compute bound, you need to think about processes.

If you were dealing with Python's concurrent.futures, you would need to consider ProcessPoolExecutor vs. ThreadPoolExecutor.

ThreadPoolExecutor gives you the same as the above.

With ProcessPoolExecutor, you will have multiple processes executing independently, but you have to copy a memory space, which people don't consider. In Python DS work, multiprocessing workloads need to take memory-space considerations into account.

It's kinda f'd up how JS doesn't have engineers think about their workloads.



I think you are coming at this from a particular Python mindset, driven by the limitations imposed on Python threading by the GIL. This is a peculiarity specific to Python rather than a general purpose concept about threads vs processes.


> [...] if you are compute bound you need to think about processes.

How would that help? Running several processes instead of several threads will not speed anything up [1] and might actually slow you down because of additional inter-process communication overhead.

[1] Unless we are talking about running processes across multiple machines to make use of additional processors.



I think you need to clarify what you mean by "thread". For example they are different things when we compare Python and Java Threads. Or OS threads and green threads. I think the GP was relating to OS threads.


I was also referring to kernel threads. If we are talking about non-kernel threads, then sure, a given implementation might have limitations and there might be something to be gained by running several processes, but that would be a workaround for those limitations. But for kernel threads there will generally be no gain by spreading them across several processes.


Right; a process is just a thread (or set of threads) and some associated resources - like file descriptors and virtual memory allocations. As I understand it, the scheduler doesn’t really care if you’re running 1000 processes with 1 thread each or 1 process with 1000 threads.

But I suspect it’s faster to swap threads within a process than swap processes, because it avoids expensive TLB flushes. And of course, that way there’s no need for IPC.

All things being equal, you should get more performance out of a single process with a lot of threads than a lot of individual processes.



> a process is just a thread (or set of threads) and some associated resources - like file descriptors and virtual memory allocations

Or rather the reverse, in Linux terminology. Only processes exist, some just happen to share the same virtual address space.

> the scheduler doesn’t really care if you’re running 1000 processes with 1 thread each or 1 process with 1000 threads

Not just the scheduler, the whole kernel really. The concept of thread vs process is mainly a userspace detail for Linux. We arbitrarily decided that the set of clone() parameters from fork() create a process, while the set of clone() parameters through pthread_create() create a thread. If you start tweaking the clone() parameters yourself, then the two become indistinguishable.

> it’s faster to swap threads within a process than swap processes, because it avoids expensive TLB flushes

Right, though this is more of a theoretical concern than a practical one. If you are sensitive to a marginal TLB flush, then you may as well "isolcpu" and set affinities to avoid any context switch at all.

> that way there’s no need for IPC

If you have your processes mmap a shared memory, you effectively share address space between processes just like threads share their address space.

For most intents and purposes, really, I do find multiprocessing just better than multithreading. The two are pretty much indistinguishable, but separate processes give you the flexibility of being able to arbitrarily spawn new workers just like any other process, while with multithreading you need to bake in some form of pool manager and hope you got it right.
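A rough sketch of the "processes mmap a shared memory" point above (Linux/Unix only, using the `libc` crate; error handling and proper synchronization are omitted, and a real program would use atomics or a lock in the shared region):

    use std::ptr;

    fn main() {
        unsafe {
            let len = std::mem::size_of::<u32>();
            // An anonymous MAP_SHARED mapping survives fork(), so parent and
            // child see the same memory, much like two threads sharing an
            // address space.
            let mem = libc::mmap(
                ptr::null_mut(),
                len,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_SHARED | libc::MAP_ANONYMOUS,
                -1,
                0,
            ) as *mut u32;

            *mem = 0;
            let pid = libc::fork();
            if pid == 0 {
                // Child: write through the shared mapping and exit.
                *mem = 42;
                libc::_exit(0);
            }
            libc::waitpid(pid, ptr::null_mut(), 0);
            println!("parent sees the child's write: {}", *mem); // prints 42
            libc::munmap(mem as *mut libc::c_void, len);
        }
    }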



> The concept of thread vs process is mainly a userspace detail for Linux. We arbitrarily decided that the set of clone() parameters from fork() create a process, while the set of clone() parameters through pthread_create() create a thread. If you start tweaking the clone() parameters yourself, then the two become indistinguishable.

That's from Plan 9. There, you can fork, with various calls, sharing or not sharing code, data, stack, environment variables, and file descriptors.[1] Now that's in Linux. It leads to a model where programs are divided into connected processes with some shared memory. Android does things that way, I think.

[1] https://news.ycombinator.com/item?id=863939



> Or rather the reverse, in Linux terminology. Only processes exist, some just happen to share the same virtual address space.

Threads and processes are semantically quite different in standard computer science terminology. A thread has an execution state, i.e. its set of processor register values. A process, on the other hand, is a management and isolation unit for resources like memory and handles.



> If you are IO bound, consider threads. This is almost the same as async / await.

Only in Python.

> if you are compute bound you need to think about processes.

Also only in Python.



If you're using threads then consider not using Python. Or, just consider not using Python.


Backend JS just spins up another container and/or lambda and if it's too slow and requires multiple CPUs in a single deployment, oh well, too bad.


That is of course a huge overhead, compared to how other languages solve the problem.


The complaints around async/await vs threads to my mind have not been that one is more or less complex than the other. It is that it bifurcates the ecosystem and one of them ends up being a second class citizen causing friction when you choose the wrong one for your project.

While you can mix and match them, it's hacky and inefficient when you need to. As it stands now, the Rust ecosystem has decided that if you want to do anything involving IO, you are stuck with an all async/await ecosystem. Since nearly everything you might want to do in Rust probably involves IO, with very few exceptions, for the most part you should probably ignore non-async libraries regardless of whether the rest of your application wants to be async or not.

There is a hypothetical world where Rust used abstractions that are even more composable than async/await, whose composability really wants everything else to be async/await too. If that had happened then I think most of the complaints would have disappeared.



I agree with your diagnosis. It's what I concluded in my own Rust async blog post [0] (which are surely mandatory now). It's even worse than bifurcating the ecosystem, because even within async code it's almost always closely tied to the executor, usually Tokio. I talk about this as an extension to function colouring, adopting withoutboats's three-colour proposition with blue (non-IO), green (blocking IO), and red (async IO). In the extended model it's really blue, green, red (Tokio), purple (async-std), orange (smol), etc.

I find that the sans-IO pattern is the best solution to this problem. Under this pattern you isolate all blue code and use inversion of control for I/O and time. This way you end up with the core protocol logic being unaware of IO and it becomes simple to wrap it in various forms of IO.

0: https://hugotunius.se/2024/03/08/on-async-rust.html
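For illustration only (not code from the linked post): a toy line-based protocol written sans-IO. The state machine never touches a socket; the caller, whether blocking or async, reads bytes however it likes and feeds them in:

    // Toy sans-IO "protocol": it only consumes bytes and emits events.
    pub struct LineProtocol {
        buf: Vec<u8>,
    }

    pub enum Event {
        Line(String),
    }

    impl LineProtocol {
        pub fn new() -> Self {
            Self { buf: Vec::new() }
        }

        // Feed bytes that the caller read from *somewhere*: a blocking socket,
        // an async socket, or a test vector.
        pub fn handle_input(&mut self, data: &[u8]) {
            self.buf.extend_from_slice(data);
        }

        // Drain a complete line, if any, as an event for the caller to act on.
        pub fn poll_event(&mut self) -> Option<Event> {
            let pos = self.buf.iter().position(|&b| b == b'\n')?;
            let line: Vec<u8> = self.buf.drain(..=pos).collect();
            Some(Event::Line(
                String::from_utf8_lossy(&line).trim_end().to_string(),
            ))
        }
    }

A blocking wrapper is then just a read loop calling `handle_input`/`poll_event`, and a Tokio (or smol, or async-std) wrapper is the same loop with `.await` on the read; the protocol logic itself never changes.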



I love the fact that people outside the Python ecosystem are spreading the word about sans-IO. I think it should be the next iteration in coding using futures-based concurrency. I only wish it were more popular in the Python land as well.


The Haskell folks got there first.


This is a cool pattern, thanks for the share.


No problem, here are some examples of it:

* Quinn (a QUIC implementation; in particular `quinn-proto` is sans-IO, whereas the outer crate, `quinn`, is Tokio-based) [0]

* str0m(A WebRTC implementation that I work on, it's an alternative to `webrtc-rs`. We don't have any IO-aware wrappers, the user is expected to provide that themselves atm)[1]

0: https://github.com/quinn-rs/quinn

1: https://github.com/algesten/str0m/



> Since nearly everything you might want to do in Rust probably involves IO with very few exceptions that means for the most part you should probably ignore non-async libraries regardless of whether the rest of your application wants it to be async or not.

Only if you have two libraries to choose from and they are otherwise identical, which is rare. Using blocking code in async applications is not as seamless as it should be, but it's not hard. Instead of writing `foo()` you write `tokio::task::spawn_blocking(foo).await`. It will run the blocking code on a separate thread and return a future that resolves once that thread is done.
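A minimal sketch of that pattern, Tokio-specific; `read_report_blocking` is a made-up placeholder for whatever blocking call you need:

    // Hypothetical blocking call, e.g. a synchronous library or heavy file IO.
    fn read_report_blocking(path: &str) -> std::io::Result<String> {
        std::fs::read_to_string(path)
    }

    async fn handler() -> std::io::Result<String> {
        // Move the blocking work onto Tokio's dedicated blocking thread pool,
        // so the async worker threads stay free to run other tasks.
        let contents = tokio::task::spawn_blocking(|| read_report_blocking("report.txt"))
            .await
            .expect("blocking task panicked")?;
        Ok(contents)
    }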



That assumes you are using Tokio. As another poster said not only does the ecosystem fragment along the async/non async lines but along the runtime lines. Async is an extremely leaky abstraction. You are in a way making my point for me. If you want to avoid painful refactoring you should basically always start out tokio-async and shim in non async code as needed because going the other way is going to hurt.


`spawn_blocking` is a single function and it is not complicated, it shouldn't be too hard for other runtimes to do the same.

The end application does have to choose a runtime anyway and will have to stick with it because this area isn't standardized yet. This problem mostly affects the part of the ecosystem that wants to put complicated concurrency logic into libraries.



`spawn_blocking` should be part of a core executor interface that all executors must provide.


And of course most libraries don't even need IO because the application can do it for them, so it only makes sense for them to be async if they're computationally heavy enough to cause problems for the runtime.


>As it stands now the Rust ecosystem has decided that if you want to do anything involving IO you are stuck with an all async/await ecosystem.

I mean, C# works basically the same way: even though there are non-async options for IO, using the async options basically forces you to be async all the way back to Main(). There are ways to safely call async methods from sync methods, but they make debugging infinitely harder.



Well, yes. That doesn't mean it's not annoying though. It happens in every language that provides syntactic support for the distinction between async/await and non async. It's, I think, core to the syntactic and semantic abstractions that were popularized by Javascript.


Issues with the article:

1. Only one example is given (web server), solved incorrectly for threads. I will elaborate below.

2. The question is framed as if people specifically want OS threads instead of async/await .

But programmers want threads conceptually, semantically. Write sequential logic and don't use strange annotations like "async". In other words, if async/await is so good, why not make all functions in the language implicitly async, and instead of "await" just write normal function calls? Then you will suddenly be programming in threads.

OS threads are expensive due to the statically allocated stack, and we don't want that. We want cheap threads that can be run in the millions on a single CPU, but without the clumsy "async/await" words. (The `wait` word remains in its classic sense: when you wait for an event, for another thread to complete, etc. - a blocking operation of waiting. But we don't want it for function invocations.)

Back to #1 - the web server example.

When timeout is implemented in async/await variant of the solution, using the `driver.race(timeout).await`, what happens to the client socket after the `race` signals the timeout error? Does socket remain open, remains connected to the client - essentially leaked?

The timeout solution for the threaded version may look almost the same as it looks for async/await: `threaded_race(client_thread, timeout).wait`. This threaded_race function uses a timer to track a timeout in parallel with the thread, and when the timeout is reached it calls `client_thread.interrupt()` - the Java way. (`Thread.interrupt()`, if the thread is not blocked, simply sets a flag; and if the thread is blocked in an IO call, this call throws an InterruptedException. That's a checked exception, so the compiler forces the programmer to wrap the `client.read_to_end(&mut data)` into try / catch or declare the exception in `handle_client`. So the programmer will not forget to close the client socket.)



> When timeout is implemented in async/await variant of the solution, using the `driver.race(timeout).await`, what happens to the client socket after the `race` signals the timeout error?

Any internal race() values will be `Drop`ed and driver itself will remain (although rust will complain you are not handling the Result if you type it 'as is'), if a new socket was created local to the future it will be cleaned up.

The niceness of futures (in Rust) is that all the behavior around them can be defined. While "all functions are blocking", as you state in a sibling comment, Rust allows you to specify when to defer execution to the next task in the task queue, meaning it will poll tasks arbitrarily quickly with an explicitly held state (the Future struct). This makes it both very fast (compared to threads, which need to sleep() in order to defer) and easy to reason about.
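A sketch of that cleanup-on-timeout behavior using `tokio::time::timeout` in place of the article's `race()`: if the deadline fires first, the inner future is dropped, and the `TcpStream` it owns is closed by its `Drop` impl, so nothing leaks. Names and the two-second deadline are arbitrary:

    use std::time::Duration;
    use tokio::io::AsyncReadExt;
    use tokio::net::TcpStream;
    use tokio::time::timeout;

    async fn fetch_with_deadline(addr: &str) -> Option<Vec<u8>> {
        // The whole connect+read future is raced against a 2s deadline.
        let result = timeout(Duration::from_secs(2), async {
            let mut stream = TcpStream::connect(addr).await.ok()?;
            let mut data = Vec::new();
            stream.read_to_end(&mut data).await.ok()?;
            Some(data)
        })
        .await;

        match result {
            Ok(inner) => inner,     // connect/read finished (successfully or not)
            Err(_elapsed) => None,  // timed out; the inner future was dropped
        }
    }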

Java's Thread.interrupt is also just a sleep loop, which is fine for most applications to be fair. Rust is a system language, you can't have that in embedded systems, and it's not desirable for kernels or low-latency applications.



> Java's Thread.interrupt is also just a sleep loop

You probably mean that Java's socket reading under the hood may start a non-blocking IO operation on the socket, and then run a loop, which can react on Thread.interrupt() (which, in turn, will basically be setting a flag).

But that's an implementation detail, and it does not need to be implemented that way.

It can be implemented the same way as async/await. When a thread calls socket reading, the runtime system will take the current threads continuation off the execution, and use CPU to execute the next task in the queue. (That's how Java's new virtual threads are implemented).

Threads and async/await are basically the same thing.

So why not drop this special word `async`?



> So why not drop this special word `async`?

You can drop the special word in Rust; it's just sugar for 'returns a poll-able function with state'. However, threads and async/await are not the same.

You can implement concurrency any way you like; you can run it in separate processes or separate nodes if you are willing to put in the work. That does not mean they are equivalent for most purposes.

Threads are almost always implemented preemptively while async is typically cooperative. Threads are heavy/costly in time and memory, while async is almost zero-cost. Threads are handed over to the kernel scheduler, while async is entirely controlled by the program('s executor).

Purely from a merit perspective threads are simply a different trade-off. Just like multi-processing and distributed actor model is.



> Threads are almost always implemented preemptively while async is typically cooperative. Threads are heavy/costly in time and memory, while async is almost zero-cost. Threads are handed over to the kernel scheduler, while async is entirely controlled by the program('s executor).

Keyword here being almost. See Project Loom.



@f_devd, cooperative vs preemptive is a good point.

(That threads are heavy or should be scheduled by OS is not required by the nature of the threads).

But preemptive is strictly better (safer at least) than cooperative, right? Otherwise, one accidental endless loop, and this code occupies the executor, depriving all other futures from execution.

@gpderetta, I think Project Loom will need to become preemptive, otherwise the virtual threads can not be used as a drop-in replacement for native threads - we will have deadlocks in virtual threads where they don't happen in native threads.
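On the accidental-endless-loop point, a tiny sketch of cooperative starvation on a single-threaded executor (Tokio's current_thread flavor is used here only as a concrete example):

    #[tokio::main(flavor = "current_thread")]
    async fn main() {
        let _task = tokio::spawn(async {
            // Never runs: the main task below never yields back to the executor.
            println!("you will not see this");
        });

        // No .await inside the loop => no yield point => cooperative starvation.
        loop {
            std::hint::spin_loop();
        }
    }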



Preemptive is safer for liveness since it avoids starvation (one task's poll taking too long); however, in practice it is almost always more expensive in memory and time due to the implicit state.

In async, only the values required to do a poll need to be held (often only references), while for threads the entire stack and registers need to be stored at all times, since at any moment a thread could be interrupted and will need to know where to continue from. And since all registers need to be saved/overwritten at each context switch (plus scheduler/kernel handling), it takes more time overall.
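To make the "only the values required to do a poll" point concrete, here is a hand-written future whose entire saved per-task state is one small struct, in contrast to a thread, which would keep a whole stack alive. This is a toy, not how futures are normally written (you usually let async/await generate the state machine):

    use std::future::Future;
    use std::pin::Pin;
    use std::task::{Context, Poll};

    // The whole saved "context" of this task is one u32: no stack, no registers.
    struct Countdown {
        remaining: u32,
    }

    impl Future for Countdown {
        type Output = ();

        fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
            if self.remaining == 0 {
                Poll::Ready(())
            } else {
                self.remaining -= 1;
                // A real future would register the waker with an event source
                // (timer, socket readiness, ...); here we just ask to be polled again.
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }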

In general threads are a good option if you can afford the overhead, but assuming threads as a default can significantly hinder performance (or make near impossible to even run) where Rust needs to.



Java can afford that. M:N threads come with a heavy runtime. Java already has a heavy runtime, so what is a smidgen more flab?

Source: https://github.com/rust-lang/rfcs/blob/master/text/0230-remo...



So it seems that the biggest issue was having a single IO interface, forcing overhead on both green and native threads and forcing runtime dispatching.

It seems to me that the best would have been to have the two libraries evolve separately and capture the common subset in a trait (possibly using dynamic impl when type erasure is tolerable), so that you can write generic code that can work with both or specialized code to take advantage of specific features.

As it stands now, sync and async are effectively separated anyway, and it is currently impossible to write generic code that handles both.



> In other words, if async/await is so good, why not make all functions in the language implicitly async, and instead of "await" just write normal function calls? Then you will suddenly be programming in threads.

It has been tried various times in the last decades. You want to search for "RPC". All attempts at trying to unify sync and async have failed, because there is a big semantical difference between running code within a thread or between threads or even between computers. Trying to abstract over that will eventually be insufficient. So better learn how to do it properly from the beginning.



I think you've got some of this in your own reply, but... I feel like Erlang has gone all in on "if async is good, why not make everything async". "Everything" in Erlang is built on top of async message passing, or the appearance thereof. Erlang hasn't taken over the world, but I think it's still successful; chat services descended from ejabberd have taken over the world, and RabbitMQ seems pretty popular too. OTOH, the system as a whole only works because Erlang can be effectively preemptive in green threads because of the nature of the language. Another thing to note is that you can build the feeling of synchronous calling by sending a request and immediately waiting for a response, but it's very hard to go the other way. If you build your RPC system on the basis of synchronous calls, it's going to be painful: sometimes you want to start many calls and then wait for the responses together, and that gets real messy if you have to spawn threads/tasks every time.


I'm not very familiar with Erlang, but from my understanding, Erlang actually does have this very distinction: you either run local code or you interact with other actors. And here the big distinction gets quite clear: once you shoot a message out, you don't know what will happen afterwards. Both you and the other actor might crash and/or send other messages, etc.

So Erlang does not try to hide it, instead, it asks the developer to embrace it and it's one of its strength.

That being said, I think that actors are a great way to model a system from a bird's-eye perspective, but they're not so great for handling concurrency within a single actor. I wish Erlang would improve here.



Actors are a building block of concurrency. IMHO, it doesn't make sense to have concurrency within an actor, other than maybe instruction level concurrency. But that's very out of scope of Erlang, BEAM code does compile (JIT) to native code on amd64 and arm64, but the JIT is optimized for speed, since it happens at code load time, it's not an profiling/optimizing JIT like Java's hotspot. There's no register scheduler like you'd need to achieve concurrency, all the beam ops end up using the same registers (more or less), although your processor may be able to do magic with register renaming and out of order operations in general.

If you want instruction level concurrency, you should probably be looking into writing your compute heavy code sections as Native Implemented Functions (NIFs). Let Erlang wrangle your data across the wire, and then manipulate it as you need in C or Rust or assembly.



> IMHO, it doesn't make sense to have concurrency within an actor, other than maybe instruction level concurrency

I think it makes sense to have that, including managing the communication with other actors. Things like "I'll send the message, and if I don't hear back within x minutes, I'll send this other message".

Actors are very powerful and a great tool to have at your disposal, but often they are too powerful for the job and then it can be better to fall back to a more "low level" or "local" type of concurrency management.

At least that's how I feel. In my opinion you need both, and while you can get the job done with just one of them (or even none), it's far from being optimal.

Also, what you mention about NIFs is good for a very specific usecase (high performance / parallelism) but concurrency has a broader scope.



> Things like "I'll send the message, and if I don't hear back within x minutes, I'll send this other message".

I assume you don't want to wait with a x minute timeout (and meantime not do anything). You can manage this in three ways really:

a) you could spawn an actor to send the message and wait for a response and then take the fallback action.

b) you could keep a list (or other structure, whatever) of outstanding messages and timeouts, and prune the list if you get a response, or otherwise periodically check if there's a timeout to process.

c) set a timer and do the thing when you get the timer expiration message, or cancel the timer if you get a response. (which is conceptually done by sending a message to the timer server actor, which will send you a timer handle immediately and a timer expired message later; there is a timer server you can use through the timer module, but erlang:send_after/[2,3] or erlang:start_timer/[3,4] are more efficient, because the runtime provides a lot of native timer functionality as needed for timeouts and what not anyway)

Setting up something to 'automatically' do something later means asking the question of how the Actor's state is managed concurrently, and the thing that makes Actors simple is being able to answer that the Actor always does exactly one thing at a time, and that the Actor cannot be interrupted, although it can be killed in an orderly fashion at any time, at least in theory. Sometimes the requirement for an orderly death means an operation in progress must finish before the process can be killed.



Exactly. Now imagine: a) is unnecessarily powerful. I don't want to manage my own list as in b), but other than that, b) sounds fine, and c) is also fine, though: does it need an actor in the background? No.

In other words, having a well built concept for these cases is important. At least that's my take. You might say "I'll just use actors and be fine", but for me it's not sufficient.



Oh, and just to add onto it: I think async/await is not really the best solution to tackle these semantic differences. I prefer the green-thread IO approach, which feels a bit heavier but leads to a true understanding of how to combine and control logic in a concurrent/parallel setting. Async/await is great to add to languages that already have something like promises and want to improve syntax in an easy way, so it has its place, but I think it was not the best choice for Rust.


There's also writing your code with poll() and select(), which is its own thing.


well that's the great thing with async rust: you write with poll and select without writing poll and select. let the computer and the compiler get this detail out of my way (seriously I don't want to do the fd interest list myself).

and I can still write conceptually similar select code using the, well, select! macro provided by most async runtimes to do the same on a select list of futures. better separation, easier to read, and overall it boils down to the same thing.



> OS threads are expensive due to statically allocated stack, and we don't want that. We want cheap threads, that can be run in millions on a single CPU. But without the clumsy "async/await" words.

Green threads ("cheap threads") are still expensive if you end up spreading a lot of per-client state on the stack. That's because with async/await and CPS you end up compressing the per-client state into a per-client data structure, and you end up having very few function call activation frames on the stack, all of which unwind before blocking in the executor.



> In other words, if async/await is so good, why not make all functions in the language implicitly async, and instead of "await" just write normal function calls?

IIRC withoutboats said in one of the posts that the true answer is compatibility with C.



> if async/await is so good, why not make all functions in the language implicitly async, and instead of "await" just write normal function calls?

You mean like Haskell?

The answer is that you need an incredibly good compiler to make this behave adequately, and even then, every once in a while you'll get the wrong behavior and need to rewrite your code in a weird way.



> But programmers want threads conceptually, semantically. Write sequential logic and don't use strange annotations like "async". In other words, if async/await is so good, why not make all functions in the language implicitly async, and instead of "await" just use normal function calls? Then you will suddenly be programming in threads.

Some programmers do, but many want exactly the opposite as well. Most of the time I don't care if it's an OS blocking syscall or a non-blocking one, but I do care about understanding the control flow of the program I'm reading and see where there's waiting time and how to make them run concurrently.

In fact, I'd kill to have a blocking/block keyword pair whenever I'm working with blocking functions, because they can surreptitiously slow down everything without you paying attention (I can't count how many pieces of software I've seen with blocking syscalls in the UI thread, leading to frustratingly slow apps!).



This is a really common comment to see on HN threads about async/await vs fibers/virtual threads.

What you're asking for is performance to be represented statically in the type system. "Blocking" is not a useful concept for this. As avodonosov points out, nothing stops a syscall from being incredibly fast, or a regular function that doesn't talk to the kernel at all from being incredibly slow. The former won't matter for UI responsiveness; the latter will.

This isn't a theoretical concern. Historically a slow class of functions involved reading/writing to the file system, but in some cases now you'll find that doing so is basically free and you'll struggle to keep the storage device saturated without a lot of work on multi-threading. Fast NVMe SSDs like found in enterprise storage products or MacBooks are a good example of this.

There are no languages that reify performance in the type system, partly because it would mean that optimizing a function might break the callers, which doesn't make sense, and partly because the performance of a function can vary wildly depending on the parameters it's given.

Async/await is therefore basically a giant hack and confuses people about performance. It also makes maintenance difficult. If you start with a function that isn't async and suddenly realize that you need to read something from disk during its execution, even if it only happens sometimes or when the user passes in a specific option, you are forced by the runtime to mark the function async which is then viral throughout the codebase, making it a breaking change - and for what? The performance impact of reading the file could easily be lost in the noise on modern hardware and operating systems.

The right way to handle this is the Java approach (by pron, who is posting in this thread). You give the developer threads and make it cheap to have lots of them. Now break down tasks into these cheap threads and let the runtime/OS figure out if it's profitable to release the thread stack or not. They're the best placed to do it because it's a totally dynamic decision that can vary on a case-by-case basis.



> It also makes maintenance difficult. If you start with a function that isn't async and suddenly realize that you need to read something from disk during its execution, even if it only happens sometimes or when the user passes in a specific option, you are forced by the runtime to mark the function async which is then viral throughout the codebase, making it a breaking change - and for what? The performance impact of reading the file could easily be lost in the noise on modern hardware and operating systems.

You'll typically have an idea of whether or not a function performs IO from the start. Changing that after the fact violates the users' conceptual model and expectation of it, even if all existing code happens to keep working.



If you want to go full Haskell on the problem for purity-related reasons, by all means be my guest. I strongly approve.

However, unless you're in such a language, warping my entire architecture around that objection does not provide a good cost-benefit tradeoff. I've got a lot of fish to fry, and a lot of them are bigger than this in practice. Heck, there are still plenty of programmers who consider it an unambiguous feature that they can add IO to anything they want or need to, and consider it a huge negative when they can't, and a lot of programmers who don't practice IO isolation and don't even conceive of "this function is guaranteed to not do any IO/be impure" as a property a function can have.



In any system that uses mmap or swap it's a meaningless distinction anyway (which is obviously nearly all of them outside of embedded RTOS). Accessing even something like the stack can trigger implicit/automatic IO of arbitrary complexity, so the concept of a function that doesn't do IO is meaningless to begin with. Async/await isn't justified by any kind of interesting type theory, it exists to work around limitations in runtimes and language designs.


> You'll typically have an idea of whether or not a function performs IO from the start.

I think GP's point is: why does that matter? Much writing on Async/Await roughly correlates IO with "slow". GP rightly points out that "slow" is imprecise, changes, means different things to different people and/or use cases.

I completely get the intuition: "there's lag in the [UI|server|...], what's slowing it down?". But the reality is that trying to formalise "slow" in the type system is nigh on impossible - because "slow" for one use case is perfectly acceptable for another.



Even if you ignore performance completely, IO is unreliable. IO is unpredictable. IO should be scrutinized.


While slowness in absolute terms depends on lots of factors, the relative slowness of things doesn't so much. Whatever the app or the device, accessing a register is always going to be faster than accessing random places in RAM, which is always going to be faster than fetching something from disk, and even more so if we talk about fetching stuff over the network. No matter how hardware progresses, the latency hierarchy is doomed to stay.

That doesn't mean it's the only factor of slowness, and that async/await solves all issues, but it's a tool that helps, a lot, to fight against very common sources of performance bugs (like how the borrow checker is useful when it protects against the nastiest class of memory vulnerabilities, even if it cannot solve all security issues).

Because the situation where “my program is stupidly waiting for some IO even though I don't even need the result right now and I could do something in the meantime” is something that happens a lot.
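A small sketch of that "do something in the meantime" case; the two fetch functions are made-up stand-ins that just sleep, and the point is that awaiting them together costs roughly max(a, b) instead of a + b:

    use std::time::Duration;

    // Stand-in "slow IO" operations for illustration.
    async fn fetch_user() -> String {
        tokio::time::sleep(Duration::from_millis(300)).await;
        "user".to_string()
    }

    async fn fetch_posts() -> String {
        tokio::time::sleep(Duration::from_millis(300)).await;
        "posts".to_string()
    }

    async fn load_page() -> (String, String) {
        // Sequentially awaiting each would take ~600ms; awaiting them together
        // takes ~300ms, because neither future sits idle while the other waits.
        tokio::join!(fetch_user(), fetch_posts())
    }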



> Whatever the app or the device, accessing a register is always going to be faster than random places in RAM, which is always going to be faster than fetching something on disk and even moreso if we talk about fetching stuff over the network.

The network is special: the time it takes to fetch something over the network can be arbitrarily large, or even infinite (this can also apply to disk when running over networked filesystems), while for registers/RAM/disk (as long as it's a local disk which is not failing) the time it takes is bounded. That's the reason why async/await is so popular when dealing with the network.



PCIe is a network. USB is a network. There is no such thing as a resource with a guaranteed response time.


There are plenty of language ecosystems where there's no particular expectations up front about whether or when a library will do IO. Consider any library that introduces some sort of config file or registry keys, or an OS where a function that was once purely in-process is extracted to a sandboxed extra process.


> There are plenty of language ecosystems where there's no particular expectations up front about whether or when a library will do IO.

There are languages that don't enforce the expectation on a type level, but that doesn't mean that people don't have expectations.

> Consider any library that introduces some sort of config file or registry keys

Yeah, please don't do this behind my back. Load during init, and ask for permission first (by making me call something like Config::load() if I want to respect it).

> or an OS where a function that was once purely in-process is extracted to a sandboxed extra process.

Slightly more reasonable, but this still introduces a lot of considerations that the application developer needs to be aware of (how should the library find its helper binary? what if the sandboxing mechanism fails or isn't available?).



For the sandbox example I was thinking of desktop operating systems where things like file IO can become brokered without apps being aware of it. So the API doesn't change, but the implementation introduces IPC where previously there wasn't any. In practice it works fine.


> There are no languages that reify performance in the type system,

Async/await is a way of partially doing... just that, but without having to indicate what is "blocking", and if an async function blocks, well, you'll be unhappy, so don't do that. For a great deal of things this is plenty good enough, but from a computer science perspective it's deeply unsatisfying, because one would want the type system to prevent making such mistakes.

At least with async/await an executor could start more threads when an async thread makes a known-blocking call, thus putting a band-aid on the problem.

Perhaps the compiler could reason about the complexity of code (e.g., recursion and nested loops w/ large or unclear bounds -> accidentally quadratic -> slow -> "blocking") and decide if a pure function is "blocking" by dint of being slow. File I/O libraries could check if the underlying devices are local and fast vs. remote and slow, and then file I/O could always be async but completing before returning when the I/O is thought to be fast. This all feels likely to cause more problems than it solves.

If green threads turn out not to be good enough then it's easier to accept the async/await compromise and reason correctly about blocking vs. not.



You can't encode everything about performance in the type system, but that doesn't mean you cannot do it at all: having a type system that allows you to control memory layout and allocation is what makes C++ and Rust faster than most languages. And regarding what you say about storage access: storage bandwidth is now high, but latency when accessing an SSD is still much higher than accessing RAM, and network is even worse. And it will always be the case no matter what progress hardware makes, because of the speed of light.

Saying that async/await doesn't help with all performance issues is like saying Rust doesn't prevent all bugs: the statement is technically correct, but that doesn't make it interesting.

> Async/await is therefore basically a giant hack and confuses people about performance. It also makes maintenance difficult.

Many developers have embraced the async/await model with delight, because it instead makes maintenance easier by making the intent of the code more explicit.

It's been trendy on HN to bash async/await, but you are missing the most crucial point about software engineering: code is written for humans and is read much more than it is written. Async/await may be slightly more tedious to write (it's highly context dependent, though: when you have concurrent tasks to execute or need cancellation, it becomes much easier with futures).

> The right way to handle this is the Java approach (by pron, who is posting in this thread)

No it's not, and Mr Pressler has repeatedly shown that he misses the social and communication aspects, so it's not entirely surprising.



But all functions are blocking.

   fn foo() { bar(1, 2); }
   fn bar(a: i32, b: i32) -> i32 { a + b }
Here bar is a blocking function.


The difference is in quantities: bar blocks for nanoseconds, while the blocking the GP talks about affects the end user, which means it's in seconds.


No they aren't, and that's exactly my point.

Most functions aren't doing any syscall at all, and as such they are neither blocking nor non-blocking.

Now, because of path dependency and because we've been using blocking functions like regular functions, we're accustomed to thinking that blocking is "normal", but that's actually a source of bugs, as I mentioned before. In reality, async functions are more "normal" than regular functions: they don't do anything fancy, they just return a value when you call them, and what they return is a future/promise. In fact you don't even need any async annotation for a function to be async in Rust; this is an async function:

    use std::future::Future;

    fn toto() -> impl Future<Output = ()> {
        std::future::ready(())
    }

The async keyword exists simply so that the compiler knows it has to desugar the awaits inside the function into a state machine. But since Rust has async blocks, it doesn't even need async on functions at all; the information you need comes from the type of the return value, that is, a future.
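For illustration, the two spellings below are roughly equivalent from the caller's point of view; the `async` keyword on the first just asks the compiler to do the desugaring for you:

    use std::future::Future;

    async fn answer_async() -> u32 {
        41 + 1
    }

    // No `async` on the function: the return type alone says "this is a
    // future", and the async *block* provides the generated state machine.
    fn answer_desugared() -> impl Future<Output = u32> {
        async { 41 + 1 }
    }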

Blocking functions, on the contrary, are utterly bizarre. In fact, you cannot make one yourself, you must either call another blocking function[1] or do a system call on your own using inline assembly. Blocking functions are the anomaly, but many people miss that because they've lived with them long enough to accept them as normal.

[1] because blockingness is contagious, unlike asynchronousness which must be propagated manually, yes ironically people criticizing async/await get this one backward too



"makes certain syscalls" is a highly unconventional definition of "blocking" that excludes functions that spin wait until they can pop a message from a queue.

If your upcoming systems language uses a capabilities system to prevent the user from inadvertently doing things that may block for a long time like calling open(2) or accessing any memory that is not statically proven to not cause a page fault, I look forward to using it. I hope that these capabilities are designed so that the resulting code is more composable than Rust code. For example it would be nice to be able to use the Reader trait with implementations that source their bytes in various different ways, just as you cannot in Rust.



Blocking syscalls are a well defined and well scoped class of problems, sure there are other situations where the flow stops and a keyword can't save you from everything.

Your reasoning is exactly like that of the folks who say “Rust doesn't solve all bugs” because it “just” solves the memory safety ones.



I may be more serious than you think. Having worked on applications in which blocking for multiple seconds on a "non-blocking syscall" or page fault is not okay, I think it would really be nice to be able to statically ensure that doesn't happen.


I'm not disputing that; in the general case I suspect this is going to be undecidable, and you'd need careful design to carve out a subset of the problem that is statically addressable (akin to what Rust did for memory safety, by restricting the expressiveness of the safe subset of the language).

For blocking syscalls alone there's not that much PL research to do though and we could get the improvement practically for free, that's why I consider them to be different problems (also because I suspect they are much more prevalent given how much I've encountered them, but it could be a bias on my side).



Any function can block if memory it accesses is swapped out.


bar blocks waiting for the CPU to add the numbers.


Nope, it doesn't: in the final binary the bar function doesn't even exist anymore, as the optimizer inlined it, and CPUs have been using pipelining and speculative execution for decades now; they don't block on a single instruction. That's the problem with abstractions designed in the 70s: they don't map well onto the actual hardware we have 50 years later…


Sure, unless it is the first time you are executing that line of code and you have to wait for the OS to slowly fault it in across a networked filesystem.


Make `a + b` `A * B` then, multiplication of two potentially huge matrices. Same argument still holds, but now it's blocking (still just performing addition, only an enormous number of times).


It's not blocking, it's doing actual work.

Blocking is how the old programming paradigm deals with asynchronous actions, and it works by behaving the same way as when the computer is actually computing something, which is where the confusion comes from. But the two situations are conceptually very different: in one case we are idle (but don't see it), in the other we're busy doing actual work. Maybe in case 2 we could optimize the algorithm so that we spend less time, but that's not guaranteed, whereas in case 1 there's something obvious to do to speed things up: do something else at the same time instead of waiting mindlessly. Having a function marked async gives you a hint that you can actually run it concurrently with something else and expect a speedup, whereas with a blocking syscall there's no indication in the code that those two functions you're calling next to each other, with no data dependency between them, would gain a lot from being run concurrently by spawning two threads.

BTW, if you want something that's more akin to blocking, but at a lower level, it's when the CPU has to load data from RAM: it's really blocked doing nothing useful. Unfortunately that's not something you can make explicit in high-level languages (or at least, the design space hasn't been explored), so when this kind of behavior matters to you, that's when you drop down to assembly.
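To make that first point concrete, a minimal sketch (assuming a tokio runtime; fetch_a/fetch_b are invented, independent async calls): the types make it obvious the two can be joined, whereas two blocking calls written next to each other give no such hint.

    // Two hypothetical, independent async operations (stand-ins for real I/O).
    async fn fetch_a() -> u32 { 1 }
    async fn fetch_b() -> u32 { 2 }

    async fn run_both() -> (u32, u32) {
        // Drives both futures concurrently within the same task.
        tokio::join!(fetch_a(), fetch_b())
    }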



A "non-blocking function" always meant "this function will return before its work is done, and will finish that work in the background through threads/other processes/etc". All other functions are blocking by default, including that simple addition "bar" function above.


Your definition is at odds with (for instance) JavaScript's implementation of a non-blocking function though, as it will perform computation up to the first await point before returning (unlike Rust futures, which are lazy, that is, they do no work before they are awaited).

As I said before, most of what you call a “blocking function” is actually a “no opinion” function, but since, in the idiosyncrasy of most programming languages, blocking functions are called like “no opinion” ones, you are mixing them up. It's not a fundamental rule, though. You could imagine a language where blocking functions (those which contain an underlying blocking syscall) are called with a block keyword and where regular functions are just called like functions. There's no relation between regular functions and blocking functions except the path dependency that led to the particular idiosyncrasy we live in; it is entirely contingent.



> Your definition is at odds with (for instance) JavaScript's implementation of a non-blocking function though, as it will perform computation until the first await point before returning

Yes, that's syntactic sugar for returning a promise. This pattern is something we've long called a non-blocking function in Javascript. The first part that's not in the promise is for setting it up.



If you define a non-blocking function to be what you decide is non-blocking, that's a bit of cheating don't you think? ;)

How about this function:

    async fn toto(input: u8) -> bool {
      if input % 2 == 0 {
        true
      } else {
        false
      }
    }
Is it a non-blocking one or not according to your criteria?


We were just talking about JavaScript and now you're back to Rust? That same function with that same keyword acts differently in the two languages.

My best guess is you're defining it (mostly?) by the syntax while I'm defining it by how the function acts. By what I'm talking about, that's a non-blocking function in Rust, but written the exact same way in Javascript it's a blocking function.



> We were just talking about JavaScript and now you're back to Rust? That same function with that same keyword acts differently in the two languages.

Yes, that's the point: in both cases they are called non-blocking functions, despite behaving differently.

> My best guess is you're defining it (mostly?) by the syntax while I'm defining it by how the function acts

I'm defining them by how they are commonly referred to, whereas you're using an arbitrary criterion that is not even consistent, as you'll see below.

> By what I'm talking about, that's a non-blocking function in Rust, but written the exact same way in JavaScript it's a blocking function.

Gotcha!

If we get back to your definition above

> A "non-blocking function" always meant "this function will return before its work is done, and will finish that work in the background through threads/other processes/etc".

Then it should be a “blocking function” because what this function returns is basically a private enum with two variants and a poll method to unwrap them. There's nothing running in the background ever.

In fact, in Rust an async function never runs anything in the background: all it does is return a Future, which is a passive state machine; there's no magic in there, and it's not very different from a closure or an Option actually (in fact, in this example, it's practically the same as an Option). Then you can send this state machine to an executor that will do the actual work, polling it to completion. But this process doesn't necessarily happen in the background (you can even block_on the future to make the execution synchronous).

So in reality there are two kinds of functions: the ones that return immediately (async and regular functions) and the ones that block execution and return later on (the “blocking” functions). And among the functions that do not block, there are also two flavors: the ones that return results that are immediately ready, and the ones that return results that won't be ready until some time has passed.
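A quick sketch of those two non-blocking flavors (assuming tokio for the timer; the function names are invented): both return immediately, but one hands back a result that is already ready and the other a result that won't be ready until later.

    use std::future::{ready, Future};
    use std::time::Duration;

    // Returns a future that resolves on the first poll.
    fn already_ready() -> impl Future<Output = u32> {
        ready(42)
    }

    // Returns a future that won't be ready for another 100ms.
    fn ready_later() -> impl Future<Output = u32> {
        async {
            tokio::time::sleep(Duration::from_millis(100)).await;
            42
        }
    }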



I don't know what to tell you, but that is how sequential code works. Sure you can find some instruction level parallelism in the code and your optimizer may be able to do it across function boundaries, but that is mostly a happy accident. Meanwhile HDLs are the exact opposite. Parallel by default and you have to build sequential execution yourself. What is needed for both HLS and parallel programming is a parallel by default hybrid language that makes it easy to write both sequential and parallel code.


Except, unless you're using atomics or volatiles, you have no guarantees that the code you're writing sequentially is going to be executed this way…


>In other words, if async/await is so good, why not make all functions in the language implicitly async, and instead of "await" just write normal function calls? Then you will suddenly be programming in threads.

Why not go one step further and invent "Parallel Rust"? And by parallel I mean it. Just a nice little keyword "parallel {}" where every statement inside the parallel block is executed in parallel, the same way it is done in HDLs. Rust's borrow checker should be able to ensure parallel code is safe. Of course one problem with this strategy is that we don't exactly have processors that are designed to spawn and process micro-threads. You would need to go back all the way to Sun's SPARC architecture for that and then extend it with the concept of a tree based stack so that multiple threads can share the same stack.



> Just a nice little keyword "parallel {}" where every statement inside the parallel block is executed in parallel, the same way it is done in HDLs. Rust's borrow checker should be able to ensure parallel code is safe.

The rayon crate lets you do something quite similar.
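For example, a minimal sketch of how that can look with rayon (assuming it as a dependency; the numbers are arbitrary): join runs two closures potentially in parallel, and par_iter parallelizes an ordinary iterator chain, with the borrow checker still enforcing that the closures don't race on shared data.

    use rayon::prelude::*;

    fn main() {
        // Run two independent computations, possibly on different threads.
        let (sum, max) = rayon::join(
            || (1..=1_000_000u64).sum::<u64>(),
            || (1..=1_000_000u64).max(),
        );
        println!("sum = {sum}, max = {max:?}");

        // Data-parallel iteration, using the same iterator adapters as usual.
        let input: Vec<u64> = (1..=10).collect();
        let squares: Vec<u64> = input.par_iter().map(|x| x * x).collect();
        println!("{squares:?}");
    }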



I believe the answer is "that implies a runtime", and Rust as a whole is not willing to pull that up into a language requirement.

This is in contrast to Haskell, Go, dynamic scripting languages, and, frankly, nearly every other language on the market. Almost everything has a runtime nowadays, and while each individually may be fine they don't always play well together. It is important that as C rides into the sunset (optimistic and aspirational, sure, but I hope and believe also true) and C++ becomes an ever more complex choice to make for various reasons that we have a high-power very systems-oriented programming language that will make that choice, because someone needs to.



That would be a good step forward, I support it :)

BTW, do we need the `parallel` keyword, or better to simply let all code be parallel by default?



Haskell has entered the chat…

However, almost all of the most popular programming languages are imperative. I assume most programmers prefer to think of our programs as a series of steps which execute in sequence.

Mind you, Excel is arguably the most popular programming language in use today, and it has exactly this execution model.



You do not need to spawn threads/tasks eagerly; you can do it lazily via work-stealing. See Cilk++.


Doesn't rayon have a syntax like that?


There are a lot of moments not covered. For example:

- async/await runs in the context of one thread, so there is no need for locks or synchronization. Unless one runs async/await across multiple threads to actually utilize CPU cores; then locks and synchronization are necessary again. This complexity may be hidden in some external code. For example, instead of synchronizing access to a single database connection it is much easier to open one database connection per async task. However, such an approach may affect performance, especially with sqlite and postgres.

- error propagation in async/await is not obvious. Especially when one tries to group up async tasks. Happy eyeballs are a classic example.

- since network I/O was mentioned, backpressure should also be mentioned. CPython implementation of async/await notoriously lacks network backpressure causing some problems.



Async/await, just like threads, is a concurrency mechanism, and it still requires locks when accessing shared memory. Where does your statement come from?


If you perform single threaded async in Rust, you can drop down to the cheap single threaded RefCell rather than the expensive multithreaded Mutex/RwLock
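A minimal sketch of that (assuming a single-threaded, non-Send executor such as tokio's LocalSet; the function is made up): shared state can live behind Rc<RefCell<..>> instead of Arc<Mutex<..>>, because no other OS thread will ever touch it.

    use std::cell::RefCell;
    use std::rc::Rc;

    // Rc and RefCell are !Send, so a future holding them can only be spawned
    // on a single-threaded executor (e.g. with spawn_local).
    async fn bump(counter: Rc<RefCell<u64>>) {
        // Cheap, runtime-checked borrow; no atomics, no OS lock.
        *counter.borrow_mut() += 1;
    }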


That's one example of a lock you might eliminate, but there are plenty of other cases where it's impossible to eliminate locks even while single threaded.

Consider, for example, something like this (not real rust, I'm rusty there)

    lock {
      a = foo();
      b = io(a).await;
      c = bar(b);
    }
Eliminating this lock is unsafe because a, b, and c are expected to be updated in tandem. If you remove the lock, then by the time you reach c, a and b may have changed under your feet in an unexpected way because of that await.
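For the curious, a hedged sketch of roughly what that looks like in real Rust (assuming tokio; foo/io/bar are made-up stand-ins): tokio's async-aware Mutex can be held across an .await, so a, b and c are only ever observed as a unit by other tasks taking the same lock.

    use std::sync::Arc;
    use tokio::sync::Mutex;

    fn foo() -> u32 { 1 }
    async fn io(a: u32) -> u32 { a + 1 } // stand-in for real I/O
    fn bar(b: u32) -> u32 { b * 2 }

    async fn update(state: Arc<Mutex<(u32, u32, u32)>>) {
        let mut guard = state.lock().await; // other tasks wait here
        let a = foo();
        let b = io(a).await;                // the guard stays held across the await
        let c = bar(b);
        *guard = (a, b, c);                 // never observed half-updated
    }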


Yeah but this problem goes away entirely if you just don’t await within a critical region like that.

I’ve been using nodejs for a decade or so now. Nodejs can also suffer from exactly this problem. In all that time, I think I’ve only reached for a JS locking primitive once.



There is no problem here with the critical region. The problem would be removing the critical region because "there's just one thread".

This is incorrect code

      a = foo();
      b = io(a).await;
      c = bar(b);
Without the lock, `a` can mutate before `b` is done executing which can mess with whether or not `c` is correct. The problem is if you have 2 independent variables that need to be updated in tandem.

Where this might show up. Imagine you have 2 elements on the screen, a span which indicates the contents and a div with the contents.

If your code looks like this

    mySpan.innerText = `Loading ${foo}`;
    myDiv.innerText = await load(foo);
    mySpan.innerText = "";
You now have incorrect code if 2 concurrent loads happen. It could be the original foo, it could be a second foo. There's no way to correctly determine what the content of `myDiv` is from an end user perspective as it depends entirely on what finished last and when. You don't even know if loading is still happening.


I absolutely agree that that code looks buggy. Of course it is - if you just blindly mix view and model logic like that, you’re going to have a bad day. How many different states can the system be in? If multiple concurrent loads can be in progress at the same time, the answer is lots.

But personally I wouldn’t solve it with a lock. I’d solve it by making the state machine more explicit and giving it a little bit of distance from the view logic. If you don’t want multiple loads to happen at once, add an is_loading variable or something to track the loading state. When in the loading state, ignore subsequent load operations.



> add an is_loading variable or something to track the loading state.

Which is definitionally a mutex AKA a lock. However, it's not a lock you are blocking on but rather one that you are trying and leaving.

I know it doesn't look like a traditional lock, but in a language like JavaScript or Python it's a valid locking mechanism. For JavaScript that's because of the single-threaded execution model: a boolean variable is guaranteed to be set and observed consistently across multiple concurrent actions.

That is to say, you are thinking about concurrency issues, you just aren't thinking about them in concurrency terms.

Here's the Java equivalent to that concept

https://docs.oracle.com/javase/8/docs/api/java/util/concurre...



Yeah I agree. The one time I wrote a lock in javascript it worked like you were hinting at. You could await() the lock's release, and if multiple bits of code were all waiting for the lock, they would acquire it in turn.

But again, I really think in UI code it makes a lot more sense to be clear about what the state is, model it explicitly and make the view a "pure" expression of that state. In the code above:

- The state is 0 or more promises loading data.

- The state is implicit. Ie, the code doesn't list the set of loading promises which are being awaited at any point in time. Its not obvious that there is a collection going on.

- The state is probably wrong. The developer probably wants either 0 or 1 loading states. (Or maybe a queue of them). Because the state hasn't been modelled explicitly, it probably hasn't been considered enough

- The view is updated incorrectly based on the state. If 2 loads happen at the same time, then 1 finishes, the UI removes the "loading..." indicator from the UI. Correct view logic should ensure that the UI is deterministic based on the internal state. 1 in-progress load should result in the UI saying "loading...".

Its a great example. With code like this I think you should always carefully and explicitly consider all of the states of your system, and how the state should change based on user action. Then all UI code can flow naturally from that.

A lock might be a good tool. But without thinking about how you want the program to behave, we have no way to tell. And once you know how you want your program to behave, I find locks to be usually unnecessary.



I think a lot of this type of problem goes away with immutable data and being more careful with side effects (for example, firing them all at once at the end rather than dispersed through the calculation)


> Where does your statement come from?

This is how async/await works in Node (which is single-threaded) so most developers think this is how it works in every technology.



Even in Node, if you perform asynchronous operations on a shared resource, you need synchronization mechanisms to prevent interleaving of async functions.

There has been more than one occasion when I "fixed" a system in NodeJS just by wrapping some complex async function up in a mutex.



This lacks quite a bit of nuance. In node you are guaranteed that synchronous code between two awaits will run to completion before another task (that could access your state) from the event loop gets a turn; with multi-threaded concurrency you could be preempted between any two machine instructions. So while you _do_ have to serialize access to shared IO resources, you do _not_ have to serialize access to memory (just add the connection to the hashset, no locks).

What you usually see with JS for concurrency of shared IO resources in practice is that they are "owned" by the closure of a flow of async execution and rarely available to other flows. This architecture often obviates the need to lock on the shared resource at all as the natural serialization orchestrated by the string of state machines already naturally accomplishes this. This pattern was even quite common in the CPS style before async/await.

For example, one of the first things an app needs to do before talking to a DB is to get a connection, which is often retrieved by pulling from a pool; acquiring the reservation requires no lock, and by virtue of the connection being exclusively closed over in the async query code, it also needs no locking. When the query is done, the connection can be returned to the pool sans locking.

The place where I found synchronization most useful was in acquiring resources that are unavailable. Interestingly, an async flow waiting on a signal for a shared resource resembles a channel in golang in how it shifts the state and execution to the other flow when a pooled resource is available.

All this to say, yeah I'm one of the huge fans of node that finds rust's take on default concurrency painfully over complicated. I really wish there was an event-loop async/await that was able to eschew most of the sync, send, lifetime insanity. While I am very comfortable with locks-required multithreaded concurrency as well, I honestly find little use for it and would much prefer to scale by process than thread to preserve the simplicity of single-threaded IO-bound concurrency.



> So while you _do_ have to serialize access to shared IO resources, you do _not_ have to serialize access to memory(just add the connection to the hashset, no locks).

No, this can still be required. Nothing stops a developer setting up a partially completed data structure and then suspending in the middle, allowing arbitrary re-entrancy that will then see the half-finished change exposed in the heap.

This sort of bug is especially nasty exactly because developers often think it can't happen and don't plan ahead for it. Then one day someone comes along and decides they need to do an async call in the middle of code that was previously entirely synchronous, adds it and suddenly you've lost data integrity guarantees without realizing it. Race conditions appear and devs don't understand it because they've been taught that it can't happen if you don't have threads!



> So while you _do_ have to serialize access to shared IO resources, you do _not_ have to serialize access to memory

Yes, in Node you don't get the usual data races like in C++, but data-structure races can be just as dangerous. E.g. modifying the same array/object from two interleaved async functions was a common source of bugs in the systems I've referred to.

Of course, you can always rely on your code being synchronous and thus not needing a lock, but if you're doing anything asynchronous and you want a guarantee that your data will not be mutated from another async function, you need a lock, just like in ordinary threads.

One thing I deeply dislike about Node is how it convinces programmers that async/await is special, different from threading, and doesn't need any synchronisation mechanisms because of some Node-specific implementation details. This is fundamentally wrong and teaches wrong practices when it comes to concurrency.



But single-threaded async/await _is_ special and different from multi-threaded concurrency. Placing it in the same basket and prescribing the same method of use is fundamentally wrong and fails to teach the magic of idiomatic lock free async javascript.

I'm honestly having a difficult time creating a steel man js sample that exhibits data races unless I write weird C-like constructs and ignore closures and async flows to pass and mutate multi-element variables by reference deep into the call stack. This just isn't how js is written.

When you think about async/await in terms of shepherding data flows it becomes pretty easy to do lock free async/await with guaranteed serialization sans locks.



> I'm honestly having a difficult time creating a steel man js sample that exhibits data races

I can give you a real-life example I've encountered:

    const CACHE_EXPIRY = 1000; // Cache expiry time in milliseconds

    let cache = {}; // Shared cache object

    function getFromCache(key) {
      const cachedData = cache[key];
      if (cachedData && Date.now() - cachedData.timestamp < CACHE_EXPIRY) {
        return cachedData.data;
      }
      return null;
    }

    function updateCache(key, data) {
      cache[key] = { data, timestamp: Date.now() };
    }

    let mockFetchCount = 0;

    async function mockFetch(url) {
      await new Promise(resolve => setTimeout(resolve, 100));
      mockFetchCount += 1;
      return `result from ${url}`;
    }

    async function fetchDataAndUpdateCache(key) {
      const cachedData = getFromCache(key);
      if (cachedData) {
        return cachedData;
      }

      // Simulate fetching data from an external source
      const newData = await mockFetch(`https://example.com/data/${key}`); // Placeholder fetch

      updateCache(key, newData);
      return newData;
    }

    // Race condition:
    (async () => {
      const key = 'myData';

      // Fetch data twice in a sequence - OK
      await fetchDataAndUpdateCache(key);
      await fetchDataAndUpdateCache(key);
      console.log('mockFetchCount should be 1:', mockFetchCount);

      // Reset counter and wait cache expiry
      mockFetchCount = 0;
      await new Promise(resolve => setTimeout(resolve, CACHE_EXPIRY));

      // Fetch data twice concurrently - we executed fetch twice!
      await Promise.all([fetchDataAndUpdateCache(key), fetchDataAndUpdateCache(key)]);
      console.log('mockFetchCount should be 1:', mockFetchCount);
    })();

This is what happens when you convince programmers that concurrency is not a problem in JavaScript. Even though this cache works for sequential fetching and will pass trivial testing, as soon as you have concurrent fetching, the program will execute multiple fetches in parallel. If server implements some rate-limiting, or is simply not capable of handling too many parallel connections, you're going to have a really bad time.

Now, out of curiosity, how would you implement this kind of cache in idiomatic, lock-free javascript?



> how would you implement this kind of cache in idiomatic, lock-free javascript?

The simplest way is to cache the Promise instead of waiting until you have the data:

    -async function fetchDataAndUpdateCache(key: string) {
    +function fetchDataAndUpdateCache(key: string) {
       const cachedData = getFromCache(key);
       if (cachedData) {
         return cachedData;
       }

       // Simulate fetching data from an external source
     -const newData = await mockFetch(`https://example.com/data/${key}`); // Placeholder fetch
     +const newData = mockFetch(`https://example.com/data/${key}`); // Placeholder fetch

       updateCache(key, newData);
       return newData;
     }
From this the correct behavior flows naturally; the API of fetchDataAndUpdateCache() is exactly the same (it still returns a Promise), but it’s not itself async so you can tell at a glance that its internal operation is atomic. (This does mildly change the behavior in that the expiry is now from the start of the request instead of the end; if this is critical to you, you can put some code in `updateCache()` like `data.then(() => cache[key].timestamp = Date.now()).catch(() => delete cache[key])` or whatever the exact behavior you want is.)

I‘m not even sure what it would mean to “add a lock” to this code; I guess you could add another map of promises that you’ll resolve when the data is fetched and await on those before updating the cache, but unless you’re really exposing the guts of the cache to your callers that’d achieve exactly the same effect but with a lot more code.



Ok, that's pretty neat. Using Promises themselves in the cache instead of values to share the source of data itself.

While that approach has a limitation that you cannot read the data from inside the fetchDataAndUpdateCache (e.g. to perform caching by some property of the data), that goes beyond the scope of my example.

> I‘m not even sure what it would mean to “add a lock” to this code

It means the same as in any other language, just with a different implementation:

    class Mutex {
        locked = false
        next = []

        async lock() {
            if (this.locked) {
                await new Promise(resolve => this.next.push(resolve));
            } else {
                this.locked = true;
            }
        }

        unlock() {
            if (this.next.length > 0) {
                this.next.shift()();
            } else {
                this.locked = false;
            }
        }
    }
I'd have a separate map of keys-to-locks that I'd use to lock the whole fetchDataAndUpdateCache function on each particular key.


Don't forget to fung futures that are fungible for the same key.

ETA: I appreciate the time you took to make the example, also I changed the extension to `mjs` so the async IIFE isn't needed.

  const CACHE_EXPIRY = 1000; // Cache expiry time in milliseconds
  
  let cache = {}; // Shared cache object
  let futurecache = {}; // Shared cache of future values
  
  function getFromCache(key) {
    const cachedData = cache[key];
    if (cachedData && Date.now() - cachedData.timestamp < CACHE_EXPIRY) {
      return cachedData.data;
    }
    return null;
  }

  function updateCache(key, data) {
    cache[key] = { data, timestamp: Date.now() };
  }

  let mockFetchCount = 0;

  async function mockFetch(url) {
    await new Promise(resolve => setTimeout(resolve, 100));
    mockFetchCount += 1;
    return `result from ${url}`;
  }
  
  async function fetchDataAndUpdateCache(key) {
    // maybe its value is cached already
    const cachedData = getFromCache(key);
    if (cachedData) {
      return cachedData;
    }
  
    // maybe its value is already being fetched
    const future = futurecache[key];
    if(future) {
      return future;
    }
  
    // Simulate fetching data from an external source
    const futureData = mockFetch(`https://example.com/data/${key}`); // Placeholder fetch
    futurecache[key] = futureData;
  
    const newData = await futureData;
    delete futurecache[key];
  
    updateCache(key, newData);
    return newData;
  }
  
  const key = 'myData';
  
  // Fetch data twice in a sequence - OK
  await fetchDataAndUpdateCache(key);
  await fetchDataAndUpdateCache(key);
  console.log('mockFetchCount should be 1:', mockFetchCount);
  
  // Reset counter and wait cache expiry
  mockFetchCount = 0;
  await new Promise(resolve => setTimeout(resolve, CACHE_EXPIRY));
  
  // Fetch 100 times concurrently - the fetch should now run only once
  await Promise.all([...Array(100)].map(() => fetchDataAndUpdateCache(key)));
  console.log('mockFetchCount should be 1:', mockFetchCount);


I see, this piece of code seems to be crucial:

    // maybe its value is already being fetched
    const future = futurecache[key];
    if(future) {
      return future;
    }
It indeed fixes the problem in a JS lock-free way.

Note that, as wolfgang42 has shown in a sibling comment, the original cache map isn't necessary if you're using a future map, since the futures already contain the result:

    async function fetchDataAndUpdateCache(key) {
        // maybe its value is cached already
        const cachedData = getFromCache(key);
        if (cachedData) {
          return cachedData;
        }

        // Simulate fetching data from an external source
        const newDataFuture = mockFetch(`https://example.com/data/${key}`); // Placeholder fetch

        updateCache(key, newDataFuture);
        return newDataFuture;
    }
---

But note that this kind of problem is much easier to fix than to actually diagnose.

My hypothesis is that the lax attitude of Node programmers towards concurrency is what causes subtle bugs like these to happen in the first place.

Python, for example, also has single-threaded async concurrency like Node, but unlike Node it also has all the standard synchronization primitives also implemented in asyncio: https://docs.python.org/3/library/asyncio-sync.html



Wolfgang's optimization is very nice. I also found it interesting that he uses a non-async function returning a promise as a signal that it is "atomic". I don't particularly like typed JS, so it would be less visible to me.

Absolutely agree on the observability of such things. One area I think shows some promise, though the tooling lags a bit, is in async context[0] flow analysis.

One area where I have actually used it so far is tracking down code that starves the event loop with too much sync work, but I think some visualization/diagnostics around this data would be awesome.

If we view Promises/Futures as just ends of a string of continued computation, whose resumption is gated by some piece of information, then the points where you can weave these ends together are where the async context tracking happens, letting you follow a whole "thread" of state machines that make up the flow.

Thinking of it this way, I think, also makes it more obvious how data between these flows is partitioned in a way that it can be manipulated without locking.

As for the node devs' lax attitude, I would probably be more aggressive and say it's an overall lack of formal knowledge of how computing and data flow work. As an SE in DevOps, a lot of my job is to make software work for people that don't know how computers, let alone platforms, work.

[0]: https://nodejs.org/api/async_context.html



async can be scarier for locks since a block of code might depend on having exclusive access, and since there wasn't an await, it got it. Once you add an await in the middle, the code breaks. Threading at least makes you codify what actually needs exclusive access.

async also signs you up for managing your own thread scheduling. If you have a lot of IO and short CPU-bound code, this can be OK. If you have (or occasionally have) CPU-bound code, you'll find yourself playing scheduler.
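A common escape hatch for the CPU-bound case, sketched here with tokio as one example (the checksum function is invented): push the heavy work onto a blocking-friendly pool so you aren't hand-rolling a scheduler on the async worker threads.

    // Hypothetical CPU-heavy helper, moved off the async worker threads.
    async fn checksum(data: Vec<u8>) -> u64 {
        tokio::task::spawn_blocking(move || {
            data.iter()
                .fold(0u64, |acc, b| acc.wrapping_mul(31).wrapping_add(*b as u64))
        })
        .await
        .expect("blocking task panicked")
    }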



Yeah once your app gets to be sufficiently complex you will find yourself needing mutexes after all. Async/await makes the easy parts of concurrency easy but the hard parts are still hard.


I have lots of issues with async/await, but this is my primary beef with async/await:

Remember the Gang of Four book "Design Patterns"? It was basically a cookbook on how to work around the deficiencies of (mostly) C++. Yet everybody applied those patterns inside languages that didn't have those deficiencies.

Rust can run multiple threads just fine--it's not Javascript. As such, it didn't have to use async/await. It could have tried any of a bunch of different solutions. Rust is a systems language, after all.

However, async/await was necessary in order to shove Rust down the throats of the Javascript programmers who didn't know anything else. Quoting without.boats:

https://without.boats/blog/why-async-rust/

> I drove at async/await with the diligent fervor of the assumption that Rust’s survival depended on this feature.

Whether async/await was even a good fit for Rust technically was of no consequence. Javascript programmers were used to async/await so Rust was going to have async/await so Rust could be jammed down the throats of the Javascript network services programmers--technical consequences be damned.



Async/await was invented for C#, another multithreaded language. It was not designed to work around a lack of true parallelism. It is instead designed to make it easier to interact with async IO without having to resort to manually managed thread pools. It basically codifies at the language level a very common pattern for writing concurrent code.

It is true, though, that async/await has a significant advantage over fibers related to single-threaded code: it makes it very easy to add good concurrency support on a single thread, especially in languages which support both. In C#, it was particularly useful for executing concurrent operations from the single GUI thread of WPF or WinForms, or from parts of the app which interact with COM. This relies on the single-threaded SynchronizationContext, which schedules continuations back on the current thread, so it's safe to run GUI updates or COM interactions from a Task while also using any other async/await code, since await captures and restores the current context by default.



Yeah, Microsoft looked at callback hell, realized that they had seen this one before, dipped into the design docs for F# and lifted out the syntactic sugar of monads. And it worked fine. But really, async/await is literally callbacks. The keyword await just wraps the rest of the function in a lambda and stuffs it in a callback. It's fully just syntactic sugar. It's a great way of simplifying how callback hell is written, but it's still callback hell in the end. Where having everything run in callbacks makes sense, it makes sense. Where it doesn't it doesn't. At some point you will start using threads, because your use case calls for threads instead of callbacks.


Most compilers don't just wrap the rest of the function into a lambda but build a finite state machine with each await point being a state transition. It's a little bit more than just "syntactic sugar" for "callbacks". In most compilers it is most directly like the "generator" approach to building iterators (*function/yield is ancient async/await).

I think the iterator pattern in general is a really useful reference to keep in mind. Of course async/await doesn't replace threads just like iterators don't replace lists/arrays. There are some algorithms you can more efficiently write as iterators rather than sequences of lists/arrays. There are some algorithms you can more efficiently write as direct list/array manipulation and avoid the overhead of starting iterator finite state machines. Iterator methods are generally deeply composable and direct list/array manipulation requires a lot more coordination to compose. All of those things work together to build the whole data pipeline you need for your app. So too, async/await makes it really easy to write some algorithms in a complex concurrent environment. That async/await runs in threads and runs with threads. It doesn't eliminate all thinking about threads. async/await is generally deeply composable and direct thread manipulation needs more work to coordinate. In large systems you probably still need to think about both how you are composing your async/await "pipelines" and also how your threads are coordinated. The benefits of composition such as race/await-all/schedulers/and more are generally worth the extra complexity and overhead (mental and computation space/time), which is why the pattern has become so common so quickly. Just like you can win big with nicely composed stacks of iterator functions. (Or RegEx or Observables or any of the other many cases where designing complex state machines both complicates how the system works and eventually simplifies developer experience with added composability.)



I generally don't agree with the direction withoutboats went with asynchronicity, but you are reading a whole lot more into that sentence than is really there. It is very clear (based on his writing, in this and other articles) that he went with the solution because he thinks it is the right one, on a technical level.

I don't agree, but making it sound like it was about marketing the language to JavaScript people is just wrong.



> was about marketing the language to JavaScript people is just wrong.

No, it seems very right to me. Rust, despite being a "systems language", was not satisfied with the market size of systems programming, and it really needed all those millions of JS programmers to make the language a big success.



Threads have a cost. Context switching between them at the kernel level has a cost. There are some workloads that gain performance by multiplexing requests on a thread. Java virtual threads, golang goroutines, and dotnet async/await (which is multi threaded like Rust+tokio) all moved this way for _performance_ reasons not for ergonomic or political ones.

It's also worth pointing out that async/await was not originally a JavaScript thing. It's in many languages now but was first introduced in C#. So by your logic Rust introduced it so it could be "jammed down the throats" of all the dotnet devs..



> So by your logic Rust introduced it so it could be "jammed down the throats" of all the dotnet devs..

You're missing his point. His point is that the most popular language, which has the most number of programmers forced the hand of Rust devs.

His point is not that the first language had this feature, it's that the most programmers used this feature, and that was due to the most popular programming language having this feature.



That Rust needed async/await to be palatable to JS devs would only be a problem if we think async/await is not needed in Rust, because it is only useful to work around limitations of JS (single-threaded execution, in this case). If instead async/await is a good feature in its own right (even if not critical), then JS forcing Rust's hand would be at best an annoyance.

And the idea that async/await was only added to JS to work around its limitations is simply wrong. So the OP is overall wrong: async/await is not an example of someone taking something that only makes sense in one language and using it another language for familiarity.



> So the OP is overall wrong: async/await is not an example of someone taking something that only makes sense in one language and using it another language for familiarity.

I don't really understand the counter argument here.

My reading of the argument[1] is that "popularity amongst developers forced the Rust devs' hands in adding async". If this is the argument, then a counter-argument of "it never (or only) made sense in the popular language (either)" is a non sequitur.

IOW, if it wasn't added due to technical reasons (which is the original argument, IIRC), then explaining technical reasons for/against isn't a counter argument.

[1] i.e. Maybe I am reading it wrong?



You are not reading it wrong, and your statements are accurate.

My broader point is that the possibility of there being a "technically better" construct was simply not in scope for Rust. In order for Rust to capture Javascript programmers, async/await was the only construct that could possibly be considered.

And, to be fair, it worked. Rust's growth has been almost completely on the back of network services programming.



> all moved this way for _performance_ reasons

They did NOT.

Async performance is quite often (I would even go so far as to say "generally") worse than single threaded performance in both latency AND throughput under most loads that programmers ever see.

Most of the complications of async are much like C#:

1) Async allows a more ergonomic way to deal with a prima donna GUI that must be the main thread and that you must not block. This has nothing to do with "performance"--it is a limitation of the GUI toolkit/Javascript VM/etc..

2) Async adds unavoidable latency overhead and everybody hits this issue.

3) Async nominally allows throughput scaling. Most programmers never gain enough throughput to offset the lost latency performance.



1) it offers a more ergonomic way for concurrency in general. `await Task.WhenAll(tasks);` is (in my opinion) more ergonomic than spinning up a thread pool in any language that supports both.

2) yes, there is a small performance overhead for continuations. Everything is a tradeoff. Nobody is advocating for using async/await for HFT, or in low-level languages like C or Zig. We're talking nanoseconds here; for a typical web API request that's in the tens of ms, that's a drop in the ocean.

3) I wouldn't say it's nominal! I'd argue most non-trivial web workloads would benefit from this increase in throughput. Pre-fork webservers like gunicorn can consume considerably more resources to serve the same traffic than an async stack such as uvicorn+FastAPI (to use Python as an example).

> Most of the complications of async are much like C#

Not sure where you're going with this analogy but as someone who's written back-end web services in basically every language (other than lisp, no hate though), C#/dotnet core is a pretty great stack. If you haven't tried it in a while you should give it a shot.



Eh. Async and to a lesser extent green threads are the only solutions to slowloris HTTP attacks. I suppose your other option is to use a thread pool in your server - but then you need to hide your web server behind nginx to keep it safe. (And nginx is safe because it internally uses async IO).

Async is also usually wildly faster for networked services than blocking IO + thread pools. Look at some of the winners of the techempower benchmarks. All of the top results use some form of non blocking IO. (Though a few honourable mentions use go - with presumably a green thread per request):

https://www.techempower.com/benchmarks/

I’ve also never seen Python or Ruby get anywhere near the performance of nodejs (or C#) as a web server. A lot of the difference is probably how well tuned v8 and .net are, but I’m sure the async-everywhere nature of javascript makes a huge difference.



Async's perfect use case is proxies though- get a request, go through a small decision tree, dispatch the I/O to the kernel. You don't want proxies doing complex logic or computation, the stuff that creates bottlenecks in the cooperative multithreading.


Most API's (rest, graphql or otherwise) are effectively a proxy. Like you say, if you don't have complex logic and you're effectively mapping an HTTP request to a query, then your API code is just juggling incoming and outgoing responses and this evented/cooperative approach is very effective.


Where does the unavoidable latency overhead come from?

Do you have some benchmarks available?



The comment you are responding to is not wrong about higher async overhead, but it is wrong about everything else, either out of lack of experience with the language or out of confusion about what it is that Task and ValueTask solve.

All asynchronous methods (as in, the ones that have the async keyword prefixed to them) are turned into state machines, where, to live across an await, the method's variables that persist across it need to be lifted into a state machine struct, which then often (but not always) needs to be boxed, aka heap allocated. All this makes the cost of what would otherwise have been just a couple of method calls way more significant - a single await like this can cost 50ns vs 2ns spent on calling methods.

There is also the matter of heap allocations for state machine boxes - C# is generally good when it comes to avoiding them for (value)tasks that complete synchronously and for hot async paths that complete asynchronously by pooling them, but badly written code can incur unwanted overhead by spamming async methods with await points where it could have just forwarded a task instead. Years of bad practices arising from low-skill enterprise dev fields do not help either, with only the switch to OSS and a more recent culture shift, aided by better out-of-the-box analyzers, somewhat turning the tide.

This, however, does not stop C#'s task system from being extremely useful for achieving lowest ceremony concurrency across all programming languages (yes, it is less effort than whatever Go or Elixir zealots would have you believe) where you can interleave, compose and aggregate task-returning methods to trivially parallelize/fork/join parts of existing logic leading to massive code productivity improvement. Want to fire off request and do something else? Call .GetStringAsync but don't await it and go back to it later with await when you do need the result - the request will be likely done by then. Instant parallelism.

With that said, Rust's approach to futures and async is a bit different: whereas in C# each async method is its own task, in Rust the entire call graph is a single task with many nested futures, where the size of the sum of all stack frames is known statically. Hence you can't perform recursive calls within async there - you can only create a new (usually heap-allocated) boxed future, which gives you what effectively looks like a linked list of task nodes, as there is no infinite recursion in calculating their sizes. This generally has lower overhead and works extremely well even in no-std, no-alloc scenarios where cooperative multi-tasking is realized through a single bare-metal executor, which is a massive user experience upgrade in embedded land. .NET, OTOH, is working on its own project to massively reduce async overhead, and once the finished experiment sees integration in dotnet/runtime itself, you can expect more posts on this orange site about it.
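To make the recursion point concrete, a minimal sketch: the async state machine must have a statically known size, so a recursive call has to go through a heap-allocated, dynamically sized future.

    use std::future::Future;
    use std::pin::Pin;

    fn countdown(n: u32) -> Pin<Box<dyn Future<Output = ()> + Send>> {
        Box::pin(async move {
            if n > 0 {
                countdown(n - 1).await; // the Box breaks the infinite size calculation
            }
        })
    }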



> .NET OTOH is working on its own project to massively reduce async overhead

Where can I read more about that?



Initial experiment issue: https://github.com/dotnet/runtime/issues/94620

Experiment results write-up: https://github.com/dotnet/runtimelab/blob/e69dda51c7d796b812...

TLDR: The green threads experiment was a failure as it found (expected and obvious) issues that the Java applications are now getting to enjoy, joining their Go colleagues, while also requiring breaking changes and offering few advantages over existing model. It, however, gave inspiration to subsequent re-examination of current async/await implementation and whether it can be improved by moving state machine generation and execution away from IL completely to runtime. It was a massive success as evidenced by preliminary overhead estimations in the results.



The tl;dr that I got when I read these a few months ago was that C# relies on too much FFI which makes implementing green threads hard and on top of that would require a huge effort to rewrite a lot of stuff to fit the green thread model. Java and Go don’t have these challenges since Go shipped with a huge standard library and Java’s ecosystem is all written in Java since it never had good ffi until recently.


Surely you're not claiming that .NET's standard library is not extensive and not written in C#.

If you do, consider giving .NET a try and reading the linked content if you're interested - it might sway your opinion towards more positive outlook :)



> Surely you're not claiming that .NET's standard library is not extensive and not written in C#.

I’m claiming that MSFT seems to care really about P/Invoke and FFI performance and it was one of the leading reasons for them not to choose green threads. So there has to be something in .NET or C# or win forms or whatever that is influencing the decision.

I’m also claiming that this isn’t a concern for Java. 99.9% of the time you don’t go over FFI and it’s what lead the OpenJdk team to choose virtual threads.

> If you do, consider giving .NET a try

I’d love to, but dealing with async/await is a pain :)



You’ve never used it, so how can you know?


I would damn this, if Async/Await wasn't a good enough (TM) solution for certain problems where Threads are NOT good enough.

Remember: there is a reason why Async/Await was created B E F O R E JavaScript was used for more than sprinkling a few fancy effects on some otherwise static webpages



> Rust can run multiple threads just fine

Rust is also used in environments which don't support threads. Embedded, bare metal, etc.



Strong disagree.

> Rust can run multiple threads just fine--it's not Javascript. As such, it didn't have to use async/await. It could have tried any of a bunch of different solutions. Rust is a systems language, after all.

it allows you to have semantic concurrency where there are no threads available. Like, you know, on microcontrollers without an (RT)OS, where such a systems programming language is a godsend.

seriously, using async/await on embedded makes so much sense.



async/await is just a different concurrency paradigm with different strengths and weaknesses than threads. Rust has support for threaded concurrency as well though the ecosystem for it is a lot less mature.


Threads are much much slower than async/await.


> backpressure should also be mentioned

I ran into this when I joined a team using nodejs. Misc services would just ABEND. Coming from Java, I was surprised by this oversight. It was tough explaining my fix to the team. (They had other great skills, which I didn't have.)

> error propagation in async/await is not obvious

I'll never use async/await by choice. Solo project, ...maybe. But working with others, using libraries, trying to get everyone on the same page? No way.

--

I haven't used (language level) structured concurrency in anger yet, but I'm placing my bets on Java's Loom Project. Best as I can tell, it'll moot the debate.



> async/await runs in context of one thread,

Not in Rust.



There is a single thread executor crate you can use for that case if it’s what you desire, FWIW.


Yes of course, but the async/await semantics are not designed only to be single threaded. Typically promises can be resumed on any executor thread, and the language is designed to reflect that.


This is completely wrong. You gotta learn about Send and Sync in Rust before you speak.

Rust makes no assumptions and is explicitly designed to support both single and multi threaded executors. You can have non-Send Futures.



I'm fully aware of this, thanks @iknowstuff.

>>>>> Typically promises are designed...

I'm merely saying Rust async is not restricted to single threaded like many other languages design their async to be, because most people coming from Node are going to assume async is always single threaded.

Most people who write their promise implementations make them Send so they work with Tokio or Async-Std.

Relax, my guy. The shitty tone isn't necessary.

EDIT: Ah, your entire history is you just arguing with people. Got it.



It's interesting to see an almost marketing-like campaign to save face for async/await. It is very clear from my experience that it was not only a technical mistake, it also cost the community dearly. Instead of focusing on language features that are actually useful, the Rust effort has been sidetracked by this mess.

I'm still very hopeful for the language though, and it is the best thing we've got at the moment. I'm just worried that this whole fight will drag on forever.

P.S. The AsyncWrite/AsyncRead example looks reasonable, but in fact you can do the same thing with threads/fds as long as you restrict yourself to *nix.


I've used async in firmware before. It was a lifesaver. The generalizations you make are unfounded and are clearly biased toward a certain workload.


I'd love the detail on this, what did it save you from and how did you ensure your firmware does not, say, hang?


Having to implement my own scheduler for otherwise synchronous network, OLED, serial and USB drivers on the same device, as well as getting automatic power state management when the executor ran out of ready futures to poll.

And a watchdog timer, like always. There's no amount of careful code that absolves you from using a watchdog timer.

For anyone curious, Embassy is the runtime/framework I used. Really well built.



That sounds kind of amazing. Working low level without an OS sounds like exactly the kind of place that Rust's concurrency primitives and tight checking would really be handy. Doing it in straight up C is complicated, and becomes increasingly so with every asynchronous device you have to deal with. Add another developer or two into the mix, and it can turn into a buggy mess rather quickly.

Unless you pull in an embedded OS, one usually ends up with a poor man's scheduler being run out of the main loop. Being able to do that with the Rust compiler looking over your shoulder sounds like it could be a rather massive benefit.



The way to do it in C isn't all that different, is it? You just have explicit state machines for each thing. Yes you have to call thing_process() in the main loop at regular intervals (and probably have each return an am_busy state to determine if you should sleep or not). It's more code but it's easy enough to reason about and probably easier to inspect in a debugger.


Yep, the underlying mechanics have to do the same thing - just swept under a different rug. I imagine the (potential) advantage as being similar to when we had to do the same thing in JavaScript before promises came along: you would make async calls that used callbacks for re-entry, and then you would need to pull context out from someplace and run your state machine.

Being able to write chains of asynchronous logic linearly is rather nice, especially if it's complicated. The tradeoff is that your main loop and re-entry code is now sitting behind some async scheduler, and - as you mention - will be more opaque and potentially harder to debug.



You're making a lot of assumptions. Debugging this thing was easy. Try it before you knock it.


thanks. looked that up. for the curious: https://embassy.dev/


I personally agree that it is great that Rust as a language is able to function in an embedded environment. Someone needs to grasp that nettle. I started by writing "concede" in my first sentence there but it's not a concession. It's great, I celebrate it, and Rust is a fairly good choice for at least programmers working in that space. (Whether it's good for electrical engineers is another question, but that's a debate for another day.)

However, the entire Rust library ecosystem shouldn't bend itself around what is ultimately still a niche use case. Embedded uses are still a small fraction of Rust programs and that is unlikely to change anytime soon. I am confident the vast bulk of Rust programs and programmers, even including some people vigorously defending async/await, would find they are actually happier and more productive with threads, and that they would be completely incapable of finding any real, perceptible performance difference. Such exceptions as there may be, which I not only "don't deny" exist but insist exist, are welcome to pay the price of async/await and I celebrate that they have that choice.

But as it stands now, async/await may be the biggest current premature optimization in common use. Though "I am writing a web site that will experience upwards of several dozen hits per minute, should I use the web framework that benches at 1,000,000 reqs/sec or 2,000,000 reqs/sec?" is stiff competition.



> But as it stands now, async/await may be the biggest current premature optimization in common use.

To be fair, isn't the entire point of OP's essay that async/await is useful specifically for reasons that aren't performance? Rather, it is that async/await is arguably more expressive, composable, and correct than threads for certain workloads.

And I have to say I agree with OP here, given what I've experienced in codebases at work: doing what we currently do with async instead with threads would result in not-insubstantial pain.



I disagree with the original essay comprehensively. async/await is less composable (threads automatically compose by their nature; it is so easy it is almost invisible), a tie on expressiveness (both thread advocates and async advocates play the same game here, where they'll use a nice high-level abstraction on "their" side and compare it to raw use of the primitives on the other; both are perfectly capable of high-level libraries making the various use cases easy), and I would say async being more "correct" is generally not a claim that makes much sense to me either way. The correctness/incorrectness comes from things other than the runtime model.

Basically, async/await for historical reasons grew a lot of propaganda around how they "solved" problems with threads. But that is an accidental later post-hoc rationalization of their utility, which I consider for the most part just wrong. The real reason async-await took off is that it was the only solution for certain runtimes that couldn't handle threading. That's fine on its own terms, but is probably the definitive example of the dangers of taking a cost/benefit balance from one language and applying it to other languages without updating the costs and the benefits (gotta do both!). If I already have threads, the solution to their problems is to use the actor model, never take more than one mutex at a time, share memory by communicating instead of communicating by sharing memory, and on the higher effort end, Rust's lifetime annotations or immutable data. I would never have dreamed of trying to solve the problems with async/await.
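For what it's worth, a minimal sketch of that "share memory by communicating" style using only the standard library: worker threads send results over a channel instead of mutating shared state, and composition falls out of the ownership rules.

    use std::sync::mpsc;
    use std::thread;

    fn main() {
        let (tx, rx) = mpsc::channel();
        for id in 0..4 {
            let tx = tx.clone();
            thread::spawn(move || {
                tx.send(format!("worker {id} done")).unwrap();
            });
        }
        drop(tx); // drop the original sender so the channel closes when the workers finish
        for msg in rx {
            println!("{msg}");
        }
    }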



Honestly that's a fair reply, thanks for the additional nuance to your argument. Upvoted.


> Instead of focusing on language features that are actually useful, the Rust effort has been sidetracked by this mess.

I don't know if you are correct or not (I am not very familiar with Rust) but empirically 9/10 Rust discussions I see nowadays on HN/reddit do revolve around async. It kinda sucks for me because I don't care about async at all, and I am interested in reading stuff about Rust.



> empirically 9/10 Rust discussions I see nowadays on HN/reddit do revolve around async. It kinda sucks for me because I don't care about async at all, and I am interested in reading stuff about Rust.

100% yes. I really feel bad precisely for people in your situation. But, there’s a good reason why you see this async spam on HN:

> (I am not very familiar with Rust)

Once you get past the initial basics (which are amazing, with pattern matching, sum types, traits, RAII, even borrowing can be quite fun), you’ll want to do something “real world”, which typically involves some form of networked IO. That’s when the nightmare begins.

So there’s traditional threaded IO (mentioned in the post), which lets you defer the problem a bit (maybe you can stay on the main thread if eg you’re building a CLI). But every crate that does IO needs to pick a side (async or sync). And so the lib you want to use may require it, which means you need to use async too. There are two ways of doing the same thing - which are API incompatible - meaning if you start with sync and need async later - get ready for a refactor.

Now, you (and the crates you’re using) also have to pick a faction within async, ie which executor to use. I’ve been OOTL a while but I think it’s mostly settled on Tokio these days(?), which probably is for the best. There are even sub-choices about Send-ness (for single-vs-multithreaded executors) and such that also impact API-compatibility.

In either case, a helluva lot of people with simple use-cases absolutely need to worry about async. This is problematic for three main reasons: (1) these choices are front-loaded and hard to reverse, (2) async is more complex to use, debug, and understand, and (3) in practice it constrains the use of Rust best-practice features like static borrowing.

I don’t think anyone questions the strive for insane performance that asynchronous IO (io_uring, epoll etc) can unlock together with low-allocation runtimes. However, async was supposed to be an ergonomic way to deliver that, ideally without splitting the ecosystem into camps. Otherwise, perhaps just do manual event looping with custom state machines, which doesn’t need any special language features.



Thank you for posting this. Async vs sync is a tough design decision, and the ecosystem of crates in async gets further partitioned by the runtime. Tokio seems to be the leader, but making the async runtime a second-class consideration has just made the refactor risk much higher.

I like async, and think it makes a lot of sense in many use cases, but it also feels like a poisoned pill, where it’s all or nothing.



Maybe it's my lack of experience, but I find it much easier to wrap my head around threads than async/await. Yes, with threads there is more "infrastructure" required, but it's straightforward and easy to reason about (for me). With async/await I really don't fully understand what's going on behind the scenes.

Granted, in my job the needs for concurrency\parallelism tend to be very simple and limited.



Imo this is because threads are a good abstraction. Not perfect, and quite inflexible, but powerful and simple. I would argue green threads are equally simple, too.

Async is implemented differently in different languages. Eg in JS it’s a decent lightweight sugar over callbacks, ie you can pretty much “manually” lower your code from async => promises => callbacks. It also helps that JS is single threaded.

Doing the same in Rust would require you to implement your own scheduler/runtime, and self-referencing “stackless” state machines. It’s orders of magnitude more complex.
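
To make the "stackless state machine" point concrete, here is a minimal hand-written Future (purely illustrative; real compiler-generated state machines also have to handle borrows and self-references across await points): poll() advances explicit state until it returns Ready, which is roughly the shape async/await desugars to.

    use std::future::Future;
    use std::pin::Pin;
    use std::task::{Context, Poll};

    // A tiny leaf future: returns Pending a couple of times, then completes.
    struct YieldTwice {
        polls_remaining: u8,
    }

    impl Future for YieldTwice {
        type Output = &'static str;

        fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
            if self.polls_remaining == 0 {
                Poll::Ready("done")
            } else {
                self.polls_remaining -= 1;
                // Ask to be polled again. A real leaf future would instead hand
                // the waker to an I/O reactor and only wake when data is ready.
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }

    fn main() {
        // Drive it with any executor; here, the `futures` crate's simple block_on.
        let msg = futures::executor::block_on(YieldTwice { polls_remaining: 2 });
        println!("{msg}");
    }

Writing this by hand for every await point is exactly the work the compiler's lowering does for you.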



Honestly: just ignore them. Just start using Rust! It's a lovely, useful language. HN/reddit are not representative most of the people out there in the real world writing Rust code that solves problems. I am not saying their concerns are invalid, but there is a tendency on these forums to form a self-reinforcing collective opinion that is overfit to the type of people who like to spend time on these forums. Reality is almost always richer and more complicated.


It's true for general-programming-focused subreddits, but eh, those will always try to poke fun at languages from the extreme perspective. E.g. the things they complain about are a problem for people new to a language but not for others who have gone through learning the language extensively.

Remember, there are only two types of languages - languages no one uses and languages people complain about. When was the last time you heard Brainfuck doesn't give you good tools to access files?

On /r/rust Reddit, most talk is about slow compile/build times.



If you think that threads are faster than poll(), I would like to know in what use case that happens, because I have never once encountered this in my life.


Do you have any evidence to back up the claim that the async efforts have taken away from other useful async features?

Also, lots of major rust projects depend on async for their design characteristics, not just for the significant performance improvements over the thread-based alternatives. These benefits are easy to see in just about any major IO-bound workload. I think the widespread adoption of async in major crates (by smart people solving real world problems) is a strong indicator that async is a language feature that is "actually useful".

The fight is mostly on hackernews and reddit, and mostly in the form of people who don't need async being upset it exists, because all the crates they use for IO want async now. I understand that it isn't fun when that happens, and there are clearly some real problems with async that they are still solving. It isn't perfect. But it feels like the split over async that is apparent in forum discussions just isn't nearly as wide or dramatic in actual projects.



I have to respectfully disagree. People are allowed to like stuff, it doesn't make it a marketing campaign or a conspiracy. This negativity on async, as if it's just a settled debate that async is a failure and therefor Rust is a failure, feels self-reinforcing.

I've written a fair amount of Rust both with threads and with async and you know what? Async is useful, very often for exactly the reasons the OP mentions, not necessarily for performance. I don't like async for theoretical reasons. I like it because it works. We use async extensively at my job for a lot of lower-level messaging and machine-control code and it works really well. Having well-defined interfaces for Future and Stream that you can pass around is nice. Having a robust mechanism for timers is nice. Cancellation is nice. Tokio is actually really solid (I can't speak to the other executors). I see this "async sucks" meme all over the various programming forums and it feels like a bunch of people going "this thing, that isn't perfect, is therefor trash". Are we not better than this?

This is not to say that async doesn't have issues. Of course it does. I'm not going to enumerate them here, but they exist and we should work to fix them. There are plenty of legitimate criticisms that have been made and can continue to be made about Rust's concurrency story, but async hasn't "cost the community dearly". People who wouldn't use Rust otherwise are using async every single day to solve real problems for real users with concurrent code, which is quite a bit more than can be said for all kinds of other theoretical different implementations Rust could have gone with, but didn't.



It's not a technical mistake, it's a brilliant solution for when you need ultra-low-latency async code. The mistake is pushing it for the vast majority of use-cases where this isn't needed.


The reason it's pushed everywhere is because it only works well if the ecosystem uses it. If the larger rust ecosystem used sync code in most places, async await in rust would be unusable by large swaths of the community.


I think it wouldn't be so painful even as it's pervasive if the ergonomics were far better. Unfortunately, things are still far off on that front. Static dispatch for async functions recently landed in stable, though without a good way to bound the return type. Things like a better pinning API, async drop, interoperability and standardization, dynamic dispatch for async functions, and structured concurrency are open problems, some moving along right now. It'll be a process spanning years.


It's not pushed, it's pulled by the hype. The Rust community got onboard that train and started to async all the things without regards to actual need or consequences.


> It's interesting to see an almost marketing-like campaign to save face for async/await.

It's not, it's just Rust people want to explain choices that led them here, and HN/Reddit crowd is all about async controversy, so more Rust people make blog about it, so more HN/Reddit focus on it. Like any good controversy, it's a self-reinforcing cycle.

> it also cost the community dearly

Citation needed. Having async/await opened a door to new contributors as well, it was a REALLY requested feature, and it made stuff like Embassy possible.

And it made stuff like effects and/or keyword generics a more requested feature.



The big one not mentioned is cancellation. It's very easy to cancel any future. OTOH cancellation with threads is a messy whack-a-mole problem, and a forced thread abort can't be reliable due to a risk of leaving locks locked.

In Rust's async model, it's possible to add timeouts to all futures externally. You don't need every leaf I/O function to support a timeout option, and you don't need to pass that timeout through the entire call stack.

Combined with use of Drop guards (which is Rust's best practice for managing in-progress state), it makes cancellation of even large and complex operations easy and reliable.
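
A minimal sketch of what that looks like in practice, assuming the Tokio runtime (with its timer enabled) and a made-up fetch_user function: the deadline is applied from the outside, and nothing below fetch_user knows it exists.

    use std::time::Duration;
    use tokio::time::timeout;

    // Stand-in for a deep chain of async calls that never heard of our deadline.
    async fn fetch_user(id: u64) -> Result<String, std::io::Error> {
        Ok(format!("user-{id}"))
    }

    #[tokio::main]
    async fn main() {
        match timeout(Duration::from_secs(2), fetch_user(42)).await {
            Ok(Ok(user)) => println!("got {user}"),
            Ok(Err(e)) => eprintln!("I/O error: {e}"),
            // On timeout the inner future is no longer polled and gets dropped,
            // cancelling the whole operation.
            Err(_elapsed) => eprintln!("timed out"),
        }
    }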



It's not easy to cancel any future. It's easy to *pretend* to cancel any future. E.g. if you cancel (drop) anything that uses spawn_blocking, it will just continue to run in the background without you being aware of it. If you cancel any async fs operation that is implemented in terms of a threadpool, it will also continue to run.

This all can lead to very hard to understand bugs - e.g. "why does my service fail because a file is still in use, while I'm sure nothing uses the file anymore"
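
A sketch of that pitfall, assuming Tokio: dropping the handle returned by spawn_blocking detaches the blocking closure rather than stopping it, so the "cancelled" work keeps running (and keeps its resources busy) in the background.

    use std::time::Duration;

    #[tokio::main]
    async fn main() {
        let handle = tokio::task::spawn_blocking(|| {
            // Stand-in for long blocking work holding a resource (e.g. a file).
            std::thread::sleep(Duration::from_secs(10));
            println!("blocking work finished anyway");
        });

        // "Cancel" by dropping the handle; the closure above is unaffected.
        drop(handle);

        // Wait long enough to see the detached work complete regardless.
        tokio::time::sleep(Duration::from_secs(11)).await;
    }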



Yes, if you have a blocking thread running then you have to use the classic threaded methods for cancelling it, like periodically checking a boolean. This can compose nicely with Futures if they flip the boolean on Drop.

I’ve also used custom executors that can tolerate long-blocking code in async, and then an occasional yield.await can cancel compute-bound code.
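
A minimal sketch of that composition (all names here are illustrative, not a standard API): a guard sets a shared flag in its Drop impl, so dropping the future for any reason requests cancellation of the blocking loop.

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;

    // Requests cancellation of the background work when dropped.
    struct CancelOnDrop(Arc<AtomicBool>);

    impl Drop for CancelOnDrop {
        fn drop(&mut self) {
            self.0.store(true, Ordering::Relaxed);
        }
    }

    async fn supervised_blocking_work() {
        let cancelled = Arc::new(AtomicBool::new(false));
        let _guard = CancelOnDrop(cancelled.clone());

        std::thread::spawn(move || {
            while !cancelled.load(Ordering::Relaxed) {
                // ... one bounded chunk of blocking work, then re-check the flag ...
            }
        });

        // ... await the rest of the operation here. If this future is dropped
        // at any await point (timeout, select, abort), _guard's Drop runs and
        // the worker loop exits on its next flag check.
    }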



If you implemented async futures, you could have also instead implemented cancelable threads. The problem is fairly isomorphic. System calls are hard, but if you make an identical system call in a thread or an async future, then you have exactly the same cancellation problem.


I don't get your distinction. Async/await is just a syntax sugar on top of standard syscalls and design patterns, so of course it's possible to reimplement it without the syntax sugar.

But when you have a standard futures API and a run-time, you don't have to reinvent it yourself, plus you get a standard interface for composing tasks, instead of each project and library handling completion, cancellation, and timeouts in its own way.



I don't follow how threads are hard to cancel.

Set some state (like a flag) that all threads have access to.

In their work loop, they check this flag. If it's false, they return instead and the thread is joined. Done.
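
For what it's worth, that pattern in plain std threads looks something like this (the flag here means "keep running", so clearing it requests exit):

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;
    use std::thread;
    use std::time::Duration;

    fn main() {
        let keep_running = Arc::new(AtomicBool::new(true));

        let workers: Vec<_> = (0..4)
            .map(|i| {
                let flag = keep_running.clone();
                thread::spawn(move || {
                    while flag.load(Ordering::Relaxed) {
                        // one unit of work per loop iteration
                        thread::sleep(Duration::from_millis(50));
                    }
                    println!("worker {i} exiting");
                })
            })
            .collect();

        thread::sleep(Duration::from_millis(200));
        keep_running.store(false, Ordering::Relaxed); // request cancellation
        for w in workers {
            w.join().unwrap(); // returns once each loop observes the flag
        }
    }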



Cancellation is not worth worrying over in my experience. If an op is no longer useful, then it is good enough if that information eventually becomes visible to whatever function is invoked on behalf of the op, but don't bother checking for it unless you are about to do something very expensive, like start an RPC.


It's been incredibly important for me in both high-traffic network back-ends, as well as in GUI apps.

When writing complex servers it's very problematic to have requests piling up waiting on something unresponsive (some other API, microservice, borked DNS, database having a bad time, etc.). Sometimes clients can be stuck waiting forever, eventually causing the server to run out of file descriptors or RAM. Everything needs timeouts and circuit breakers.

Poorly implemented cancellation that leaves some work running can create pathological situations that eat all CPU and RAM. If some data takes too long to retrieve, and you time out the request without stopping processing, the client will retry, asking for that huge slow thing again, piling up another and another and another huge task that doesn't get cancelled, making the problem worse with each retry.

Often threading is mixed with callbacks for returning results. The un-cancelled callbacks firing after the other part of the application aborted an operation can cause race conditions, by messing up some state or being misattributed to another operation.



Right, this is compatible with what I said and meant. Timeouts that fire while the op is asleep, waiting on something: good, practical to implement. Cancellations that try to stop an op that's running on the CPU: hard, not useful.


I think a better question is "why choose async/await over fibers?". Yes, I know that Rust had green threads in the pre-1.0 days and it was intentionally removed, but there are different approaches for implementing fiber-based concurrency, including those which do not require a fat runtime built-in into the language.

If I understand the article correctly, it mostly lauds the ability to drop futures at any moment. Yes, you can not do a similar thing with threads for obvious reasons (well, technically, you can, but it's extremely unsafe). But this ability comes at a HUGE cost. Not only can you not use stack-based arrays with completion-based executors like io-uring or execute sub-tasks on different executor threads, but it also introduces certain subtle footguns and reliability issues (e.g. see [0]), which become very unpleasant surprises after writing sync Rust.

My opinion is that cancellation of tasks fundamentally should be cooperative and uncooperative cancellation is more of a misfeature, which is convenient at the surface level, but has deep issues underneath.

Also, praising composability of async/await sounds... strange. Its viral nature makes it anything but composable (with the current version of Rust without a proper effect system). For example, try to use async closure with map methods from std. What about using the standard io::Read/Write traits?

[0]: https://smallcultfollowing.com/babysteps/blog/2022/06/13/asy...
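
A small illustration of the map friction mentioned above (it doesn't cover the io::Read/Write case): handing an async computation to a std combinator like Iterator::map compiles, but it only produces unawaited futures, and you cannot .await inside the synchronous closure itself, so the awaiting still has to happen by hand (or via the futures crate's stream combinators, outside std).

    async fn double(x: u32) -> u32 {
        x * 2
    }

    async fn run() -> Vec<u32> {
        let inputs = vec![1u32, 2, 3];

        // Compiles, but `futs` is an iterator of futures, not of results.
        let futs = inputs.iter().map(|&x| double(x));

        // The awaiting has to happen outside the closure, one by one.
        let mut out = Vec::new();
        for f in futs {
            out.push(f.await);
        }
        out
    }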



For rust, fibers (as a user-space, cooperative concurrency abstraction) would mandate a lot of design choices, such as whether stacks should be implemented using spaghetti stacks or require some sort of process-level memory mapping library, or even if they were just limited to a fixed size stack.

All three of these approaches would cause issues when interacting with code in another language with a different ABI. It can get really complicated, for example, when C code gets called from one fiber and wants to then resume another.

One of the benefits of async/await is the 'await' keyword itself. The explicit wait-points give you the ability to actually reason about the interactions of a concurrent program.

Yielding fibers are a bit like the 'goto' of the concurrency world - whenever you call a method, you don't know if as a side effect it may cause your processing to pause, and if when it continues the state of the world has changed. The need to be defensive when interfacing with the outside world means fibers tend to be better for tasks which run in isolation and communicate by completion.

Green threads, fibers and coroutines all share the same set of problems here, but really user space cooperative concurrency is just shuffling papers on a desk in terms of solving the hard parts of concurrency. Rust async/await leaves things more explicit, but as a result doesn't hide certain side effects other mechanisms do.



> you don't know if as a side effect it may cause your processing to pause, and if when it continues the state of the world has changed

This may be true in JS (or Haskell), but not in Rust, where you already have multithreading (and unrestricted side-effects), and so other code may always be interleaved. So this argument is irrelevant in languages that offer both async/await and threads.

Furthermore, the argument is weak to begin with because the difference is merely in the default choice. With threads, the default is that interleaving may happen anywhere unless excluded, while with async/await it's the other way around. The threading approach is more composable, maintainable, and safer, because the important property is that of non-interference, which threads state explicitly. Any subroutine specifies where it does not tolerate contention regardless of which other subroutines it calls. In the async/await model, adding a yield point to a previously non-yielding subroutine requires examining the assumptions of its callers. Threads' default -- requiring an explicit statement of the desired property of non-interference -- is the better one.



I never fully understood the FFI issue. When calling an FFI function that is not known to be coroutine-safe, you would switch from the coroutine stack to the original thread stack and back. This need not be more expensive than a couple of instructions on the way in and out.

Interestingly, reenabling frame pointers was in the news recently, which would add a similar amount of overhead to every function call. That was considered a more than acceptable tradeoff.



The issue is that the FFI might use C thread-local storage and end up assuming 1-1 threading between the language and C, but if you're using green threads then that won't be the case.


> For rust, fibers (as a user-space, cooperative concurrency abstraction) would mandate a lot of design choices, such as whether stacks should be implemented using spaghetti stacks or require some sort of process-level memory mapping library, or even if they were just limited to a fixed size stack.

> All three of these approaches would cause issues when interacting with code in another language with a different ABI. It can get really complicated, for example, when C code gets called from one fiber and wants to then resume another.

Java (in OpenJDK 21) is doing it. To be fair, Java had no other choice because there is so much sequential code written in Java, but also the Java language and bytecode compilation make it easy to implement spaghetti stacks transparently. Given those two things it's obviously a good idea to go with green threads. The same might not apply to other languages.

My personal preference is for async/await, but it's true that its ecosystem bifurcating virality is a bit of a problem.



Java also has the benefit that most of the Java ecosystem is in Java. This makes it easy to avoid the ffi problem since you never leave the VM.


In my opinion, by default fibers should use "full" stacks, i.e. a reasonable amount of unpopulated memory pages (e.g. 2 MiB) with guard page. Effectively, the same stack which we use for threads. It should eliminate all issues about interfacing with external code. But it obviously has performance implications, especially for very small tasks.

Further, on top of this we can then develop spawning tasks which would use parent's stack. It would require certain language development to allow computing maximum stack usage bound of functions. Obviously, such computation would mean that programmers have to take additional restrictions on their code (such as disallowing recursion, alloca, and calling external functions without attributing stack usage), but compilers already routinely compute stack usage of functions, so for pure Rust code it should be doable.

>It can get really complicated, for example, when C code gets called from one fiber and wants to then resume another.

It's a weird example. How would a C library know about fiber runtime used in Rust?

>Yielding fibers are a bit like the 'goto' of the concurrency world - whenever you call a method, you don't know if as a side effect it may cause your processing to pause, and if when it continues the state of the world has changed.

I find this argument funny. Why don't you have the same issue with preemptive multitasking? We live with exactly this "issue" in the threading world just fine. Even worse, we can not even rely on "critical sections": a thread's execution can be preempted at ANY moment.

As for `await` keyword, in almost all cases I find it nothing more than a visual noise. It does not provide any practically useful information for programmer. How often did you wonder when writing threading-based code about whether function does any IO or not?



> It’s a weird example. How would a C library know about the fiber runtime used in Rust.

Well, if your green thread were launched on an arbitrary free OS thread (work stealing), then for example your TLS variables would be very wrong when you resume execution. Does it break all FFI? No. But it can cause issues for some FFI in a way that async/await cannot.

> I find this argument funny. Why don't you have the same issue with preemptive multitasking? We live with exactly this "issue" in the threading world just fine. Even worse, we can not even rely on "critical sections", thread's execution can be preempted at ANY moment.

It’s not about critical sections as much. Since the author referenced go to, I think the point is that it gets harder to reason about control flow within your own code. Whether or not that’s true is debatable since there’s not really any implementation of green threads for Rust. It does seem to work well enough for Go but it has a required dedicated keyword to create that green thread to ease readability.

> As for `await` keyword, in almost all cases I find it nothing more than a visual noise. It does not provide any practically useful information for programmer. How often did you wonder when writing threading-based code about whether function does any IO or not?

Agree to disagree. It provides very clear demarcation of which lines are possible suspension points which is important when trying to figure out where “non interruptible” operations need to be written for things to work as intended.



Obviously you would not use operating system TLS variables when your code does not correspond to operating system threads.

They're just globals, anyway - why are we on Hacker News discussing the best kind of globals? Avoid them and things will go better.



Not sure why you’re starting a totally unrelated debate. If you’re pulling in a library via FFI, you have no control over what that library has done. You’d have to audit the source code to figure out if they’ve done anything that would be incompatible with fibers. And TLS is but one example. You’d have to audit for all kinds of OS thread usage (e.g. if it uses the current thread ID as an index into a hashmap or something). It may not be common, but Go’s experience is that there’s some issue and the external ecosystem isn’t going to bend itself over backwards to support fibers. And that’s assuming that these are solved problems within your own language ecosystem which may not be the case either when you’re supporting multiple paradigms.


You've obviously never worked with fibers if you think these are obvious. These problems are well documented and observed empirically in the field.


> In my opinion, by default fibers should use "full" stacks, i.e. a reasonable amount of unpopulated memory pages (e.g. 2 MiB) with guard page.

That's a disaster. If you're writing a server that needs to serve 10K clients concurrently then that's 20GiB of RAM just for the stacks, plus you'll probably want guard pages, and so MMU games every time you set up or tear down a fiber.

The problem with threads is the stacks, the context switches, the cache pressure, the total memory footprint. A fiber that has all those problems but just doesn't have an OS schedulable entity to back it barely improves the situation relative to threads.

Dealing with slow I/O is a spectrum. On one end you have threads, and on the other end you have continuation passing style. In the middle you have fibers/green threads (closer to threads) and async/await (closer to CPS). If you want to get closer to the middle than threads then you want spaghetti stack green threads.



How do fibers solve your cancellation problem? Aren't they more or less equivalent?

(I find fiber-based code hard to follow because you're effectively forced to reason operationally. Keeping track of in-progress threads in your head is much harder than keeping track of to-be-completed values, at least for me)



With fibers you send a cancellation signal to a task; then on the next IO operation (or, more generally, yield) it will get a cancellation error code, with the ability to get the true result of the IO operation, if there is any. Note that it does not mean that the task will sleep until the IO operation gets completed; the cancellation signal causes any ongoing IO to "complete" immediately if that is possible (e.g. IIRC disk IO can not be cancelled).

It then becomes the responsibility of the task to handle this signal. It may either finish immediately (e.g. by bubbling the "cancellation" error), finish some critical section before that and do some cleanup IO, or it may even outright ignore the signal.

With futures you just drop the task's future (i.e. its persistent stack) maybe with some synchronous cleanup and that's it, you don't give the task a chance to say a word in its cancellation. Hypothetical async Drop could help here (though you would have to rely on async drop guards extensively instead of processing "cancellation errors"), but adding it to Rust is far from easy and AFAIK there are certain fundamental issues with it.

With io-uring sending cancellation signals is quite straightforward (though you need to account for different possibilities, such as task being currently executed on a separate executor thread, or its CQE being already in completion queue), but with epoll, unfortunately, it's... less pleasant.



> Hypothetical async Drop could help here (though you would have to rely on async drop guards extensively instead of processing "cancellation errors"), but adding it to Rust is far from easy and AFAIK there are certain fundamental issues with it.

Wouldn't fiber cancellation be equivalent and have equivalent implementation difficulties? You say you just send a signal to the task, but in practice picking up and running the task to trigger its cancellation error handling is going to look the same as running a future's async drop, isn't it?



Firstly, Rust does not have async Drop and it's unlikely to be added in the foreseeable future. Secondly, cancellation signals are a more general technique than async Drop, i.e. you can implement the latter on top of the former, but not the other way around. For example, with async Drop you can not ignore a cancellation event (unless you copy the code of your whole task into the Drop impl). Some may say that's a good thing, but it's an obvious example of cancellation signals being more powerful than hypothetical async Drop.

As for implementation difficulties, I don't think so. For async Drop you need to mess with some fundamental parts of the Rust language (since Futures are "just types"), while fiber-based concurrency, in a certain sense, is transparent for compiler and implementation complexity is moved to executors.

If you are asking about how it would look in user code, then, yes, they would be somewhat similar. With cancellation signals you would call something like `let res = task_handle.cancel_join();`, while with async Drop you would use `drop(task_future)`. Note that the former also allows getting a result from a cancelled task, another example of greater flexibility.



> Keeping track of in-progress threads in your head is much harder than keeping track of to-be-completed values, at least for me

I think that's true for everybody. Our minds barely handle state for sequential code - the explosion of complexity of multiple state-modifying threads is almost impossible to follow.

There are ways to convert "keeping track of in-progress threads" to "keeping track of to-be-completed values" - in particular, Go uses channels as a communication mechanism which explicitly does the latter, while abstracting away the former.
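
The same idea can be sketched in Rust with std channels (a rough analogue of the Go pattern, not a claim about Go's implementation): the consumer only ever reasons about the values arriving on the channel, never about which worker thread is where.

    use std::sync::mpsc;
    use std::thread;

    fn main() {
        let (tx, rx) = mpsc::channel();

        for job in 0..8u32 {
            let tx = tx.clone();
            thread::spawn(move || {
                let result = job * job; // stand-in for real work
                tx.send((job, result)).expect("receiver still alive");
            });
        }
        drop(tx); // channel closes once every worker's sender is gone

        // The receiving side tracks to-be-completed values, not thread state.
        for (job, result) in rx {
            println!("job {job} -> {result}");
        }
    }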



> There are ways to convert "keeping track of in-progress threads" to "keeping track or to-be-completed" values - in particular, Go uses channels as a communication mechanism which explicitly does the latter, while abstracting away the former.

I find the Go style pretty impossible to follow - you have to keep track of which in-progress threads are waiting on which lines, because what will happen when you send to a given channel depends on what was waiting to receive from that channel, no? The only way of doing this stuff that I've ever found comprehensible is iteratees, where you reify the continuation step as a regular value that runs when you call it explicitly.



Because stackful fibers suck for low-level code. See Gor Nishanov's review for the C++ committee http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p136... (linked from https://devblogs.microsoft.com/oldnewthing/20191011-00/?p=10... ). It even sums things up nicely: DO NOT USE FIBERS!


To be clear, not everybody agrees with Gor and although they still don't have much traction, stackful coroutines are still being proposed.

Most (but admittedly not all) of Gor's issues with stackful coroutines are due to current implementations being purely a library feature with no compiler support. Most of the issues (context switch overhead, thread-local issues) can be solved with compiler support.

The complaint about lack of support in some existing libraries is unfair: async/await doesn't have that support either, and in fact it would be significantly harder to add there, as it would require a full rewrite. The takeaway here is not to try to make M:N fully transparent, not to avoid it completely.

The issue with stack usage is real though; split stacks are not a panacea, and OS support would be required.

edit: for the other side of the coin, also from Microsoft, about issues with the async/await model in Midori (a high-performance single-address-space OS): https://joeduffyblog.com/2015/11/19/asynchronous-everything/

They ended up implementing the equivalent of stackful coroutines specifically for performance.



> The take away here is not to try to make M:N fully transparent

But the whole point of the 'Virtual processors' or 'Light-weight process' patterns as popularized by Golang and such is to try and make M:N transparent to user code.



I mean fully transparent to pre-existing code. Go had the benefit of having stackful coroutines from day one, but when retrofitting to an existing language, expecting the whole library ecosystem to work out of the box without changes is a very high bar.


That paper has some significant flaws:

In section 2.3.1, "Dangers of N:M model", the "danger" lies in using thread-local storage that was built for OS threads, without modification, for stackful fibers. The bottom line here should have been "don't do that", not that stackful fibers are dangerous. Obviously, any library that interacts with the implementation mechanism for concurrency must be built for the mechanism actually used, not a different one.

Section 2.3.2, "Hazards of 1:N model", again points out the dangers of making such blind assumptions, but its main point is that blocking APIs must block on the fiber level, not on the host thread level -- that is, it even proposes the solution for the problem mentioned, then totally ignores that solution. Java's approach (even though, IIRC, N:M) does exactly that: "rewrite it in Java". This is one place where I hope "rewrite it in Rust" becomes more than a meme in the future and actually becomes a foundation for stackful fibers.

Then there are a bunch of case studies that show that you cannot just sprinkle some stackful fibers over an existing codebase and hope for it to work. Surprise: You can't do that with async/await either. "What color is my function" etc.

I'm still hoping for a complete, clearly presented argument for why Java can do it and Rust cannot. (Just for example: "it needs a GC" could be the heart of such an argument).



Somewhere there's a blog post from the old Sun about how M:N threading was a disaster in Solaris.


> Just for example: "it needs a GC" could be the heart of such an argument

Rust can actually support high-performance concurrent GC, see https://github.com/chc4/samsara for an experimental implementation. But unlike other languages it gives you the option of not using it.



> My opinion is that cancellation of tasks fundamentally should be cooperative and uncooperative cancellation is more of a misfeature, which is convenient at the surface level, but has deep issues underneath.

Cooperative cancellation can be pretty annoying with mathematical problems. You can have optimization algorithms calling rootfinding routines calling ODE integrators, any of which can spin for a very long time, so you need to thread cancellation tokens through everywhere, and numerical frameworks generally don't support it. You can and should use iteration counts in all the algorithms, but once you're dealing with nested algorithms those can only guarantee that your problem will stop sometime this year, not that it stops within 5 seconds. With these problems I can promise that I'm just doing: lots of math, allocations with their associated page faults, no I/O, and writing strings to a standard library Queue object for logging that are handled back in the main thread, which I'm never going to cancel (and whatever other features you think I might need -- which I haven't needed for years now -- I'd be happy to ship that information back to the main thread on a Queue). It feels like that problem should be solvable in the 21st century without making me thread cancellation tokens everywhere and defensively code against spinning without checking the token (where I can make mistakes and cause bugs, which I guess you'll just blame me for).



> I think a better question is "why choose async/await over fibers?"

> there are different approaches for implementing fiber-based concurrency, including those which do not require a fat runtime built-in into the language

This keynote below is Ruby so the thread/GVL situation is different than for Rust, but is that the kind of thing you mean?

https://m.youtube.com/watch?v=qKQcUDEo-ZI

I think it makes a good case that async/await is infectious and awkward, and fibers (at least as implemented in Ruby) is quite simply a better paradigm.



> I think it makes a good case that async/await is infectious and awkward, and fibers (at least as implemented in Ruby) is quite simply a better paradigm.

As someone who has done fibers development in Ruby, I disagree.

CRuby has the disadvantage of a global interpreter lock. This means that parallelism can only be achieved via multiple processes. This is not the case in Rust, where you have access to true parallelism in a single process.

Second, this talk is not arguing for the use of fibers as much as it is arguing for using fibers to rig up a bespoke green-threads-like system for a specific web application server, and advocating for Ruby runtime features to make the code burden of doing this lighter.

Ruby has a global interpreter lock, so even though it uses native threads only one of them can be executing ruby code at a time. Fibers have native stacks, so they have all the resource requirements of a thread sans context switching - but the limitations from the GIL actually mean you aren't _saving_ context switching by structuring your code to use fibers in typical (non "hello world" web server) usage.



> CRuby has the disadvantage of a global interpreter lock. This means that parallelism can only be achieved via multiple processes.

Even for Ruby code (and not native code running in a Ruby app) this is false since the introduction of Ractors, which are units within a single process that communicate without shared mutable (user) state, as the GVL is only “global” to a Ractor, not a process.

(Also worth noting, given the broader topic here, that Ruby has an available async implementation built on top of fibers which doesn't rely on special syntax and allows code to be blocking or non-blocking depending on context.)



I'm sorry but you seem to have missed the point I'm referring to, even though you have directly quoted it.

None of your paragraphs are relevant to async await vs fiber: async await requires you to put keywords in all sorts of unexpected places, fibers do not.

> CRuby has the disadvantage of a global interpreter lock. This means that parallelism can only be achieved via multiple processes.

I am very well cognisant of this fact, but this bears absolutely no relationship with async await vs fibers: one is sweeping callbacks on an event loop under leaky syntactic sugar, the other is cooperative coroutines; they're all on the same Ruby thread, the GVL simply does not intervene.

> this talk is not arguing for use of fibers as much as it is arguing as using fibers to rig up a bespoke green threads-like system for a specific web application server, and advocating for ruby runtime features to make the code burden on them of doing this lighter.

That sounds like a very cynical view of it. I believe there are arguments made in this talk that are entirely orthogonal to how Falcon benefits from fibers.

> Fibers have native stacks, so they have all the resource requirements of a thread

For starters fibers don't have the resource requirements of a thread because they're not backed by native OS threads, while CRuby creates an OS thread for each Ruby VM thread (until MaNy lands, but even with M:N they'd still be subject to the GVL).

Are you arguing for a stackless design like Stackless Python? Goroutines are stack-based too and the benefit of coroutines (and a M:N design) is very apparent. Anyway async await is stack-based too so I don't see how this is relevant: if you have 1M requests served at the same time truly concurrently you're going to have either 1M event callbacks or 1M fibers; that's 1 million stacks either way.

I seem to gather through what I read as a bitter tone that your experience with fibers was not that good. I appreciate another data point but I can't seem to reconcile it vs async await.

But really, the only reason I brought up threads is because the GVL makes threads less useful in CRuby than in Rust (or Go since I mentioned it).



> I am very well cognisant of this fact, but this bears absolutely no relationship with async await vs fibers

The ruby environment does not have async/await, and has a desire for experimentation with user mode concurrency systems due to the impact of GIL on the utility of native threads. Running under an interpreter with primarily heap-allocated objects and a patchable runtime also means that you have less impact from switching to such a model as an application developer.

There were discussions that this somehow pertained to a wider discussion on async/await vs fibers, specifically in rust which does have proper utilization of threads and does have async/await support. Since rust is also operating based on native code, it would need to have changes such as putting fiber suspension logic in all I/O calls between itself and the underlying OS to support developer-provided concurrency.



I only scrolled the video, but it sounds similar, yes. Though, implementation details would probably vary significantly, since Rust is a lower-level language.


Could you share an example of a fiber implementation not relying on a fat runtime built into the language?


https://github.com/Xudong-Huang/may

The project has some serious restrictions and unsound footguns (e.g. around TLS), but otherwise it's usable enough. There are also a number of C/C++ libraries, but I can not comment on those.



For example https://github.com/creationix/libco

The required thing is mostly just to dump your registers on the stack and jump.



I do think implementations like that are not particularly useful though.

You want a runtime to handle and multiplex blocking calls - otherwise if you perform any blocking calls (mostly I/O) in one fiber, you block everything - so what use are those fibers ?



The answer is the same as in async Rust, right? "Don't do that."

If you wanted to use this for managing a bunch of I/O contexts per OS thread then you would need to bring an event loop and a bunch of functions that hook up your event loop to whatever asynchronous I/O facilities your OS provides. Sort of like in async Rust.



The sole advantage of async/await over fibers is the possibility to achieve ultra-low-latency via the compiler converting the async/await into a state machine. This is important for Rust as a systems language, but if you don't need ultra-low-latency then something with a CSP model built on fibers, like Goroutines or the new Java coroutines, is much easier to reason about.


Fibers and async/await are backed by the same OS APIs, so they can achieve more or less the same latency. The main advantage of async/await (or, to be more precise, stackless coroutines) is that they require less memory for task stacks, since tasks can reuse the executor's stack for the non-persistent part of their own stack (i.e. stack variables which do not cross yield points). It has very little to do with latency. At most you can argue that the executor's stack stays in the CPU cache, which reduces the number of cache misses a bit.

Stackless coroutines also make it easier to use a parent's stack for its children's stacks. But IMO that is only because compilers currently do not have tools to communicate the maximum stack usage bound of functions to programming languages.



> communicate maximum stack usage bound of functions

This would be useful in all sorts of deeply embedded code (as well as more exotic things, such as coding for GPU compute). Unfortunately it turns out to be unfeasible when dealing with true reentrant functions (e.g. any kind of recursion) or any use of FFI, dynamic dispatch etc. etc. So it can only really be accomplished when dealing with near 'leaf' code, where stack usage is expected to be negligible anyway.



>Fibers and async/await are backed by the same OS APIs, they can achieve more or less the same latency.

The key requirement for ultra-low-latency software is minimising/eliminating dynamic memory allocation, and stackless coroutines allow avoiding memory allocation. For managed coroutines on the other hand (e.g. goroutines, Java coroutines) as far as I'm aware it's impossible to have an implementation that doesn't do any dynamic memory allocation, or at least there aren't any such implementations in practice.



Yes, it's what I wrote about in the last paragraph. If you can compute maximum stack size of a function, then you can avoid dynamic allocation with fibers as well (you also could provide stack size manually, but it would break horribly if the provided number is wrong). You are right that such implementations do not exist right now, but I think it's technically feasible, as demonstrated by tools such as https://github.com/japaric/cargo-call-stack The main stumbling block here is FFI, historically shared libraries do not have any annotations about stack usage, so functions with bounded stack usage would not be able to use even libc.


> Fibers and async/await are backed by the same OS APIs

async/await doesn't require any OS APIs, or even an OS at all.

You can write async rust that runs on a microcontroller and poll a future directly from an interrupt handler.

And there's a huge advantage to doing so, too: you can write out sequences of operations in a straightforward procedural form, and let the compiler do the work of turning that into a state machine with a minimal state representation, rather than doing that manually.
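
A minimal sketch of that (assumptions: a busy-poll loop with a hand-rolled no-op waker; everything used lives in core, so the same shape works in no_std, even though this demo prints from an ordinary main): the future is just a value you poll, with no runtime or OS services behind it.

    use core::future::Future;
    use core::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

    // A waker that does nothing when woken; fine for a busy-poll demonstration.
    fn noop_raw_waker() -> RawWaker {
        fn clone(_: *const ()) -> RawWaker {
            noop_raw_waker()
        }
        fn noop(_: *const ()) {}
        static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
        RawWaker::new(core::ptr::null(), &VTABLE)
    }

    async fn blink_twice() -> u32 {
        // On real hardware this would await "timer elapsed" futures, etc.
        1 + 1
    }

    fn main() {
        let waker = unsafe { Waker::from_raw(noop_raw_waker()) };
        let mut cx = Context::from_waker(&waker);
        // Pin the future on the stack and poll the state machine directly.
        let mut fut = core::pin::pin!(blink_twice());
        loop {
            match fut.as_mut().poll(&mut cx) {
                Poll::Ready(v) => {
                    println!("done: {v}");
                    break;
                }
                // On a microcontroller: wait-for-interrupt, then poll again.
                Poll::Pending => {}
            }
        }
    }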



Sigh... It gets tiring to hear about embedded from async/await advocates as if it's a unique advantage of the model. Fibers and similar mechanisms are used routinely in embedded world as demonstrated by various RTOSes.

Fibers are built on yielding execution to someone else, which is implemented trivially on embedded targets. Arguably, in a certain sense, fibers can be even better suited for embedded, since they allow preemption of a task by interrupts at any moment, with the interrupt being processed by another task, while with async/await you have to put an event into a queue and continue execution of the previously executed future.



Exactly. Proof by implementation: asio is a very well known and well regarded C++ event loop library and can be transparently used with old-school hand-written continuations, more modern future/promise, language based async/await coroutines and stackful coroutines (of the boost variety).

The event loop and io libraries are in practice the same for any solution you decide, everything else is just sugar on top and in principle you can mix and match as needed.



> The key requirement for ultra-low-latency software is minimising/eliminating dynamic memory allocation

First, that is only true in languages -- like C++ or Rust -- where dynamic memory allocation (and deallocation) is relatively costly. In a language like Java, the cost of heap allocation is comparable to stack allocation (it's a pointer bump).

Second, in the most common case of writing high throughput servers, the performance comes from Little's law and depends on having a large number of threads/coroutines. That means that all the data required for the concurrent tasks cannot fit in the CPU cache, and so switching in a task incurs a cache-miss, and so cannot be too low-latency.
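
For reference, the Little's law relationship being invoked here, with an illustrative back-of-the-envelope calculation (the numbers are made up):

    % L: mean number of requests in flight
    % \lambda: arrival rate, W: mean time each request spends in the system
    L = \lambda \cdot W
    % e.g. sustaining 50\,000 requests/s at 20 ms mean latency needs
    % L = 50\,000 \times 0.02\,\text{s} = 1\,000 concurrent tasks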

The only use-cases where avoiding memory allocation could be useful and achieving very low latency is possible are when the number of threads/coroutines is very small, e.g. generators.

The questions, then, are which use-case you pick to guide the design, servers or generators, and what the costs of memory management are in your language.



>Second, in the most common case of writing high throughput servers,

High-throughput servers are not ultra-low-latency software; they prioritise throughput over latency. Ultra-low-latency software is stuff like audio processing, microcontrollers and HFT. There's a trade-off between throughput and latency.



Not here. You don't trade off latency, because you cannot reduce it below a cache-miss per context-switch anyway if your working set is not tiny. The point is that if you have lots of tasks then your latency has a lower bound (due to hardware limitations) regardless of the design.

In other words, if your server serves some amount of data that is larger than the CPU cache size and can be accessed at random, there is some latency that you have to pay, and so many micro-optimisations are simply ineffective even if you want to get the lowest latency possible. Incurring a cache miss and allocating memory (if your allocation is really fast) and even copying some data around isn't significantly slower than just incurring a cache miss and not doing those other things. They matter only when you don't incur a cache miss, and that happens when you have a very small number of tasks whose data fits in the cache (i.e. a generator use-case and not so much a server use-case).

Put in yet another way, some considerations only matter when the workload doesn't involve many cache misses, but a server workload virtually always incurs a cache-miss when serving a new request, even in servers that care mostly about latency. In general, in servers you're then working in the microsecond range, anyway, and so optimisations that operate at the nanosecond range are not useful.


