If you’ve ever built a web service or a web app, you know the drill: pick a database, pick a web service framework (and these days, pick a front-end framework too, but let’s not get into that).
This has been the case for several decades now, and people don’t stop to question whether this is still the best way to build a web app. Many things have changed in the last decade:
- Disk is a lot faster (NVMe)
- Disk is a lot more robust (EBS/EFS etc.)
- RAM is super cheap; for most startups, you could probably fit all your data in RAM
- You can rent a machine with hundreds of cores if your heart desires.
This was not the case when I first worked at a Rails startup in 2010. But most importantly, there’s one new very important change that’s happened in the last decade: the Raft consensus protocol, which we’ll get to in the Expand section below.
In this blog post, we’re going to break down a new architecture for web development. We use it successfully for Screenshotbot, and we hope you’ll use it too.
I’ll break this blog post into three parts: Explore, Expand and Extract, obviously referencing Kent Beck’s 3X. Your needs are going to vary in each of these stages of your startup, and I’m going to demonstrate how to use the architecture in all three phases.
Explore
So you’re a new startup. You’re iterating on a product, you have no idea how people are going to use it, or even if they’re going to use it.
For most startups today, this would mean you’ll pick Rails or Django or Node or some such, backed with a MySQL or PostgreSQL or MongoDB or some such.
“Keep it simple, silly,” you say, and this seems simple enough.
But is this as simple as it can possibly be? Could we make it simpler? What if the web service and the database instance were exactly one and the same? I’m not talking about something like SQLite, where your data is still serialized; I’m saying: what if the objects in your RAM are your database?
Imagine all the wonderful things you could build if you never had to serialize data into SQL queries. First, you don’t need multiple front-end servers talking to a single DB; just get a bigger server with more RAM and more CPU if you need it. What about indices? You can use in-memory indices, effectively just hash tables to look up objects. You don’t need clever index structures like B-trees that are optimized for disk latency. (In fact, you can use some indices that were probably not possible with traditional databases. One such index, using functional collections, was critical to the scalability of Screenshotbot.)
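To make that concrete, here’s a minimal sketch of such an index (hypothetical names, not Screenshotbot’s actual code): the “index” is just a hash table keyed on a field.

(defstruct account email name)

;; The "index" is a hash table from email to account object.
(defvar *accounts-by-email* (make-hash-table :test #'equal))

(defun index-account (account)
  "Register ACCOUNT under its email address."
  (setf (gethash (account-email account) *accounts-by-email*) account))

(defun find-account-by-email (email)
  "O(1) lookup: a plain read from RAM, no SQL round-trip."
  (gethash email *accounts-by-email*))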
You also won’t need special architectures to reduce round-trips to your database. In particular, you won’t need any of that Async-IO business, because your threads are no longer IO bound. Retrieving data is just a matter of reading RAM. Suddenly debugging code has become a lot easier too.
You don’t need any services to run background jobs, because background jobs are just threads running in this large process.
You don’t need crazy concurrency protocols, because most of your concurrency requirements can be satisfied with simple in-memory mutexes and condition variables.
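As a sketch of those last two points, here’s what a background job and a lock might look like using the bordeaux-threads library (the job itself is made up for illustration):

;; Background jobs are plain threads; concurrency is plain locks.
(ql:quickload "bordeaux-threads") ; assumes Quicklisp is available

(defvar *stats-lock* (bt:make-lock "stats"))
(defvar *request-count* 0)

(defun record-request ()
  "Called from any request-handling thread."
  (bt:with-lock-held (*stats-lock*)
    (incf *request-count*)))

(defun start-reset-job ()
  "A 'background job': just a thread in the same process."
  (bt:make-thread
   (lambda ()
     (loop
       (sleep 60)
       (bt:with-lock-held (*stats-lock*)
         (setf *request-count* 0))))
   :name "stats-reset"))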
But then comes the important part: how do you recover when your process crashes? It turns out the answer is easy: periodically take a snapshot of everything in RAM.
Hold on, what if you’ve made changes since the last snapshot? And this is the clever bit: you ensure that every time you change persistent state in RAM, you first write a transaction to disk (essentially a write-ahead log). So if you have a line like foo.setBar(2), this will first write a transaction saying that the bar field of foo changed to 2, and then actually set the field to 2. An operation like new Foo() writes a transaction to disk saying that a Foo object was created, and then returns the new object.
And so, if your process crashes and restarts, it first reloads the snapshot, then replays the transaction log to fully recover the state. (Notice that index changes don’t need to be part of the transaction log. For instance, if there’s an index on the bar field of Foo, then setBar should just update the index, which will get updated whether the object comes from a snapshot or from a transaction.)
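Here’s a toy version of this whole mechanism, with hypothetical names and a made-up log format (a real library handles this for you, more on that below): log the mutation first, apply it second, and replay the log on startup.

(defstruct foo id bar)

(defvar *foos* (make-hash-table))  ; id -> foo, rebuilt on startup
(defvar *log-path* #p"/var/data/transactions.log") ; hypothetical path

(defun log-transaction (txn)
  "Append TXN, a readable form like (:set-bar 42 2), to the log."
  (with-open-file (out *log-path* :direction :output
                                  :if-exists :append
                                  :if-does-not-exist :create)
    (prin1 txn out)
    (terpri out)))

(defun set-bar (foo value)
  (log-transaction (list :set-bar (foo-id foo) value)) ; write the log first...
  (setf (foo-bar foo) value))                          ; ...then mutate RAM

(defun apply-transaction (txn)
  (ecase (first txn)
    (:set-bar (destructuring-bind (id value) (rest txn)
                (setf (foo-bar (gethash id *foos*)) value)))))

(defun replay-log ()
  "After loading the snapshot, re-apply every logged transaction."
  (with-open-file (in *log-path* :if-does-not-exist nil)
    (when in
      (loop for txn = (read in nil nil)
            while txn
            do (apply-transaction txn)))))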
Finally, this architecture enables a new kind of code that couldn’t be written before. Since all requests are served by the same long-lived process, you can keep closures in memory and use them to serve pages. For example, on Screenshotbot, if you ever see a https://screenshotbot.io/n/nnnnnnn URL, the nnnnnnn is a token that maps to a closure on the server. This simple change means we don’t need to serialize objects across page transitions: the closure holds references to the objects it needs, so we don’t pass object ids around on every single request. In JavaScript, this might hypothetically look like:
function renderMyObject(obj) {
  return <html>...
    <a href={() => obj.delete()}>Delete</a>
    ...
  </html>;
}
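In Common Lisp, our actual stack, a heavily simplified sketch of the idea might look like this (hypothetical names; a real implementation needs expiry and access control):

(defvar *closures* (make-hash-table :test #'equal))

(defun register-closure (fn)
  "Store FN under a fresh token; return the /n/<token> URL path."
  (let ((token (format nil "~36r" (random (expt 36 8)))))
    (setf (gethash token *closures*) fn)
    (format nil "/n/~a" token)))

(defun handle-closure-url (token)
  "Called by the web server's /n/ route."
  (let ((fn (gethash token *closures*)))
    (if fn
        (funcall fn)
        (error "Unknown or expired token: ~a" token))))

A rendered page calls register-closure with a lambda that captures obj directly, so no object id ever crosses the request boundary.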
What this all means is that you can iterate quickly. If you have to debug, there’s exactly one service that you need to debug. If you need to profile code, there’s exactly one service you need to profile (no more MySQL slow query logs). There’s exactly one service to monitor: if that one service goes down the site certainly goes down, but since there’s only one service and one server, the probability of failure is also much lower. If the server dies, AWS will automatically bring up a new server to replace it within a few minutes.
It’s also a lot easier to write test code, since you no longer have to mock out databases.
Expand
So you’re moving fast, iterating, and building out ideas, and slowly getting customers along the way.
Then one day, you get a high-profile customer. Bingo, you’re now in the Expand phase of your startup.
But there’s a catch: this high-profile customer requires 99.999% availability.
Surely, the architecture we just described cannot handle this. If the server goes down, we would need to wait several minutes for AWS to bring it back up. Once it’s back up, we might wait several more minutes for our process to restore the snapshot from disk. Even re-deploys are tricky: restarting the service can take the site down for several minutes.
And this is where the Raft Consensus Protocol comes in.
Raft is a wonderful algorithm and protocol. It takes your finite state machine (your web server/database), and essentially replicates the transaction log. So now, we can take our very simple architecture and replicate it across three machines. If the leader goes down, then a new leader is elected within seconds and will continue to serve requests.
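As a toy illustration of the commit rule at the heart of this (a deliberate simplification; real Raft also handles leader election, terms, and log repair), an entry is applied to the state machine only after a majority of the cluster has acknowledged it:

(defun majority-p (acks cluster-size)
  (> acks (floor cluster-size 2)))

(defun replicate-entry (entry followers apply-fn)
  "Send ENTRY to each follower; apply it only once a majority acks.
FOLLOWERS are functions returning T on a successful log append."
  (let ((acks 1)) ; the leader's own log counts as one ack
    (dolist (follower followers)
      (when (funcall follower entry)
        (incf acks)))
    (when (majority-p acks (1+ (length followers)))
      (funcall apply-fn entry) ; committed: safe to apply to RAM
      t)))

With a cluster of three, one node can be down and entries still commit, which is exactly the availability property we’re after.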
We’ve just made our simple little service into a full-fledged highly-available database, without fundamentally changing how developers write code.
With this mechanism, you can also do a rolling deploy without ever bringing the server down. (Although we rarely restart our server processes, more on that in a moment.) Because there’s just one service, it’s also easy to calculate your availability guarantees.
Extract
So your startup is doing well, and you have thousands of large customers.
To be honest, Screenshotbot is not at this stage; I’ll talk about where we are in a moment. But we’re preparing for this possibility, with monitoring in place for predicted bottlenecks.
The solution here is something large companies already do with their databases: sharding. You can break up your web service into shards, each shard being its own cluster. In fact, at Screenshotbot we already do this: each of our enterprise customers gets their own dedicated cluster. (Fun story: Meta switched to Raft to handle replication for each of its MySQL clusters, so we’re essentially doing the same thing, just without a separate database.)
I don’t know exactly what else to expect, since I’m more of a solve-today’s-problem kind of person. The main bottleneck I anticipate is scaling the commit thread. Read threads parallelize beautifully, but there’s a single commit thread applying transactions one at a time. Disk latency is largely irrelevant here, since Raft batches multiple transactions into a single disk commit. My main concern is that the CPU cost of applying transactions could exceed what a single core can handle. I highly doubt we’ll ever see this, but it’s a possibility. At that point we could profile the cost of commits and improve it (for instance, by moving some of the work out of the commit thread), or we could just figure out sharding. I’ll probably write another blog post when that happens.
Our Stack
Now that I’ve described the idea to you, let me tell you about our stack, and why it turned out to be so suitable for this architecture.
We use Common Lisp. My initial implementation of Screenshotbot did use MySQL, but I quickly swapped it out for bknr.datastore, precisely because handling concurrency with MySQL was hard and Screenshotbot is a highly concurrent app. bknr.datastore is a library that implements the architecture described in the Explore section, built for Common Lisp. (There are similar libraries for other languages, but not a whole lot of them.)
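For a flavor of what this looks like, here’s roughly how a persistent class is defined and mutated with bknr.datastore, based on its documented API (the store directory is hypothetical):

;; A persistent class: instances live in RAM; creation and mutation
;; are logged to disk automatically.
(defclass channel (bknr.datastore:store-object)
  ((name :initarg :name
         :accessor channel-name))
  (:metaclass bknr.datastore:persistent-class))

;; Open the store once at startup; it loads the latest snapshot and
;; replays the transaction log, as described earlier.
(make-instance 'bknr.datastore:mp-store
               :directory #p"/var/screenshotbot/store/" ; hypothetical path
               :subsystems (list (make-instance
                                  'bknr.datastore:store-object-subsystem)))

;; Creating an object writes a transaction; so does any mutation
;; wrapped in with-transaction.
(let ((channel (make-instance 'channel :name "main")))
  (bknr.datastore:with-transaction ()
    (setf (channel-name channel) "release")))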
Common Lisp implementations are also heavily multi-threaded, and this is crucial for this architecture, since all your web requests are handled by threads in a single process. Ruby or Python would be disqualified by this requirement (their global interpreter locks rule out true thread parallelism).
We also use the idea of closures that I mentioned earlier. But this means we can’t keep restarting the server frequently (if you restart the server, you lose the closures). So instead of restarting, we hot-reload code into the running process. It turns out Common Lisp is excellent at this: so much so that a notable part of the standard is about handling code reloading. (For instance, if a class definition changes, how do you update existing objects of that class? There’s a standard protocol for it.)
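For example, update-instance-for-redefined-class is part of the ANSI standard: you redefine a class in the running image and tell CLOS how to migrate existing objects. A self-contained toy example:

(defclass user ()
  ((full-name :initarg :full-name :accessor full-name)))

(defvar *ada* (make-instance 'user :full-name "Ada Lovelace"))

;; Redefine the class in the running image:
(defclass user ()
  ((first-name :accessor first-name)
   (last-name :accessor last-name)))

;; Tell CLOS how to migrate old instances when they're next touched:
(defmethod update-instance-for-redefined-class :after
    ((u user) added deleted plist &key)
  (declare (ignore added deleted))
  (let* ((old (getf plist 'full-name))
         (space (and old (position #\Space old))))
    (when space
      (setf (first-name u) (subseq old 0 space)
            (last-name u) (subseq old (1+ space))))))

(first-name *ada*) ; => "Ada"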
Occasionally, we do restart the servers; currently only about once every month or two. When we need to, we just do a rolling restart across the Raft cluster. We use a cluster of three servers per installation, which lets us tolerate one server being down at a time. We don’t use Kubernetes; we don’t need it (at least, not yet).
For the Raft implementation, we built and open-sourced our own library, bknr.cluster, on top of bknr.datastore. Under the hood it uses the fantastic Braft library from Baidu. Braft is super solid, and I can highly recommend it. It also handles background snapshots, which means the server can keep serving requests while a snapshot is being taken.
To store image files, or blobs that shouldn’t be part of the datastore, we use EFS (AWS’s highly available NFS service), shared between the three servers. EFS is easier to work with than S3, because we don’t have to handle network error conditions in application code. It also makes our code more testable, since tests just write to a local disk instead of talking to an external service.
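Concretely, storing a blob becomes ordinary file I/O against the EFS mount (the paths here are hypothetical):

(defun blob-path (blob-id)
  (merge-pathnames (format nil "blobs/~a" blob-id)
                   #p"/mnt/efs/"))  ; the EFS mount point

(defun save-blob (blob-id bytes)
  "Plain file I/O: no S3 client, no retry logic in application code."
  (with-open-file (out (ensure-directories-exist (blob-path blob-id))
                       :direction :output
                       :element-type '(unsigned-byte 8)
                       :if-exists :supersede)
    (write-sequence bytes out)))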
How well does this scale? We have a couple of big enterprise customers, including one especially well-known one. Screenshotbot runs on their CI, so we get hundreds of API requests for every single commit and pull request. Despite this, we only need a 4-core, 16GB machine to serve their requests (plus similar machines for the replicas, which mostly sit idle). Even then, CPU usage maxes out around 20%, and most of that is image processing, so we have a lot of room to scale before we need to bump up the core count.
Summary
I think this architecture is excellent for new startups, and I’m hoping more companies will adopt it. Obviously, you’ll need to build some of the tooling we’ve described for the language of your choice. (Although, if you choose Common Lisp, it’s all here for you to use, and all open-source.)
We’re super grateful to the folks behind bknr.datastore, Braft and Raft, because without their work we wouldn’t be able to do any of this.
If you think this was useful or interesting, I would appreciate it if you could share it on social media. Please reach out to me at [email protected] if you have any questions.