*Between two distributed systems that don't share the same transactionality domain or are not logically monotone.

It's easy to see that moving data from one row to another is doable in a clustered database, and could be interpreted as a message being delivered. The point is that you can get exactly-once delivery if your whole system is either idempotent, or you can treat the distributed system as one single unit that can be rolled back together (i.e. side-effect free with respect to some other system outside the domain). Both are cases of some form of logical monotonicity: idempotence is easier to see, but transactionality is also built on monotonicity, through the WALs and algorithms like Raft underneath. The article should really mention CALM (Consistency As Logical Monotonicity); it's much easier to understand and a more fundamental result than CAP. https://arxiv.org/pdf/1901.01930
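A minimal sketch of the idempotence half of that argument (the `apply_transfer` name and the in-memory "ledger" are made up for illustration): if the operation is a monotone insert keyed by a stable ID, redelivering the same message any number of times leaves the state unchanged, so at-least-once delivery plus idempotent processing behaves like exactly-once.

```python
# Hypothetical illustration: exactly-once *effect* from at-least-once delivery,
# achieved by making the operation idempotent (a monotone, grow-only insert).

applied_transfers = {}  # transfer_id -> amount; stands in for a real table

def apply_transfer(transfer_id: str, amount: int) -> None:
    """Apply a transfer at most once, no matter how often it is redelivered."""
    if transfer_id in applied_transfers:
        return  # duplicate delivery: the insert already happened, do nothing
    applied_transfers[transfer_id] = amount

# Redelivering the same message is harmless:
for _ in range(3):
    apply_transfer("tx-42", 100)

assert applied_transfers == {"tx-42": 100}
```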
You can have at-least-once-or-fail though. Just send a message and require an ack. If the ack doesn't come after a timeout, try n more times, then report an error, crash, or whatever.
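A sketch of that retry loop (not anyone's production code), assuming a hypothetical `send(message, timeout)` that returns True once an ack arrives within the timeout:

```python
import time

class DeliveryError(Exception):
    """Raised when no ack arrives after all retries."""

def send_at_least_once_or_fail(send, message, retries=3, timeout=2.0, backoff=1.0):
    """Send `message`, waiting up to `timeout` seconds for an ack per attempt.

    `send(message, timeout)` is assumed to return True if an ack was received.
    Retries `retries` more times after the first attempt, then raises.
    """
    for attempt in range(1, retries + 2):
        if send(message, timeout):
            return  # ack received; the receiver may still see duplicates
        if attempt <= retries:
            time.sleep(backoff * attempt)  # brief pause before the next attempt
    raise DeliveryError(f"no ack after {retries + 1} attempts")
```

Note the guarantee this gives: the sender either knows the message was received at least once, or it fails loudly; it never silently gives up.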
But it's mostly the processing that is interesting. This message I'm writing now might see several retransmissions before you read it, but it's still a single message.
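That is the receiver-side view of the same idea: collapse retransmissions back into one logical message by deduplicating on a message ID before processing. A minimal sketch (the `message` dict shape and the `handle` callback are made up for illustration):

```python
# Hypothetical receiver that turns retransmissions back into a single message
# by remembering which message IDs it has already processed.

seen_ids = set()

def deliver(message: dict, handle) -> bool:
    """Call `handle` at most once per message ID; drop retransmissions.

    `message` is assumed to carry a stable, sender-assigned "id" field.
    Returns True if the message was processed, False if it was a duplicate.
    """
    msg_id = message["id"]
    if msg_id in seen_ids:
        return False  # retransmission of something we already handled
    seen_ids.add(msg_id)
    handle(message)
    return True
```

In a real system the seen-set would have to be bounded or persisted in the same transaction as the side effect, otherwise a crash between `add` and `handle` reopens the duplicate/loss question.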
Even comparatively senior engineers make this mistake relatively often. If you're a SaaS dealing with at most 100GB of analytical data per customer, (eventually, sharded) postgres is all you need.
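To illustrate the "(eventually, sharded) postgres" point, here is a minimal per-customer shard router; the DSN list and the `customer_id` scheme are invented for the example, and a lookup table or consistent hashing would be the next step once shards need to move.

```python
import hashlib

# Hypothetical shard map: each entry is a separate Postgres instance.
SHARD_DSNS = [
    "postgresql://app@pg-shard-0/app",
    "postgresql://app@pg-shard-1/app",
    "postgresql://app@pg-shard-2/app",
]

def shard_for_customer(customer_id: str) -> str:
    """Route every query for a customer to the same Postgres shard.

    Uses a stable hash of the customer ID so the mapping does not change
    between processes or restarts (unlike Python's built-in hash()).
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARD_DSNS)
    return SHARD_DSNS[index]

# All of one customer's (<=100GB) analytical data lives on a single shard:
dsn = shard_for_customer("customer-1234")
```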
I think a lot has changed since 2013, when this article was written. Back then cloud services were less mature and there were more legitimate cases where you had to care about the theoretical distributed aspects of your backend architecture, although even then these were quickly disappearing unless you worked at a few select big-tech companies like the FAANGs.

But today, in 2024, if you just standardize on AWS, you can pretty much use one of the AWS services for pretty much anything. And that AWS service will already be distributed on the backend, for free, in the sense that you don't have to worry about it. Additionally, it will be run by AWS engineers for you, with all sorts of failover, monitoring, alerting, etc. behind the scenes that will be much better than what you could build. So these days, for 99% of people, it doesn't really make sense to worry too much about this theoretical stuff: Paxos, Raft, consistency, vector clocks, byzantine failures, CAP, distributed locks, distributed transactions, etc. And that's good progress; it has been abstracted away behind API calls. I think it's pretty rational to just build on top of AWS (or similar) services and accept that it's a black-box distributed system that may still go down sometimes, but it'll still be 10-100x more reliable than if you try to build your own distributed system. Of course, even for the 99%, there are still important practical things to keep in mind, like logging, debugging, backpressure, etc.

Another thing I learned is that some concepts, such as availability, are less important, and less achievable, than they seem on paper. On paper it sounds like a worthy exercise to design systems that will fail over and come back automatically if a component fails, with only a few seconds of downtime. Magically, with everything working like a well-oiled machine. In practice this pretty much never works out, because there are components of the system that the designer didn't think of, and it's those that will fail and bring the system down. E.g. see the recent CrowdStrike incident.

And, with respect to the importance of availability: of the ~10 companies I worked at in the past ~20 years, there wasn't a single one that couldn't tolerate a few hours of downtime with zero to minimal business and PR impact (people are used to things going down a couple of times a year). I remember having outages at a SaaS company I worked for 10 years ago; no revenue came in for a few days, but then people would just spend more in the following days. Durability is more important, but even that is less important in practice than we'd like to think [1].

The remaining 1% of engineers, who get to worry about the beautiful theoretical AND practical aspects of distributed computing [because they work at Google or Facebook or AWS], should count themselves lucky! I think it's one of the most interesting fields in Computer Science. I say this as somebody who deeply cares/cared about theoretical distributed computing; I wrote distributed databases [2] and papers [3] in the past. But working in industry (also managing a Platform Eng team), I cannot recall the last time I had to worry about such things.

[1] PostgreSQL used fsync incorrectly for 20 years - https://news.ycombinator.com/item?id=19119991
The downside is that you are walking into an inviting mono/duopoly. Things like this and Cloudflare are amazing until the wind changes and the entire internet shares a single failure mode.
Idempotence, CRDTs, WALs, Raft: they are all special cases of the CALM principle [1].

[1]: https://arxiv.org/pdf/1901.01930
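To make the monotonicity connection concrete, here is a grow-only set (G-Set), one of the simplest CRDTs: state only ever grows and merge is set union, so the order, timing, and repetition of merges never change the outcome, which is exactly the kind of logically monotone program CALM says can be coordination-free. This is an illustrative sketch, not code from the paper.

```python
# Minimal G-Set CRDT sketch: a grow-only set whose merge is set union.
# Because state only grows and merge is commutative, associative, and
# idempotent, replicas converge no matter how messages are reordered,
# duplicated, or delayed -- no coordination needed (the CALM intuition).

class GSet:
    def __init__(self, elements=()):
        self.elements = set(elements)

    def add(self, value) -> None:
        self.elements.add(value)          # state only ever grows

    def merge(self, other: "GSet") -> None:
        self.elements |= other.elements   # union: order/duplication don't matter

# Two replicas diverge, then exchange state in any order, any number of times:
a, b = GSet(), GSet()
a.add("x"); b.add("y")
a.merge(b); b.merge(a); a.merge(b)        # redundant merges are harmless
assert a.elements == b.elements == {"x", "y"}
```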