If you’ve ever built a web service or a web app, you know the drill: pick a database, pick a web service framework (and these days, pick a front-end framework too, but let’s not get into that).
This has been the case for several decades now, and people don’t stop to question whether this is still the best way to build a web app. Many things have changed in the last decade:
- Disk is a lot faster (NVMe)
- Disk is a lot more robust (EBS/EFS etc.)
- RAM is super cheap; for most startups, you could probably fit all your data in RAM
- You can rent a machine with hundreds of cores if your heart desires.
This was not the case when I first worked at a Rails startup in 2010. But most importantly, there’s one new very important change that’s happened in the last decade: the Raft consensus protocol, which we’ll get to in the Expand section below.
In this blog post, we’re going to break down a new architecture for web development. We use it successfully for Screenshotbot, and we hope you’ll use it too.
I’ll break this blog post into three parts: Explore, Expand and Extract, obviously referencing Kent Beck’s 3X. Your needs are going to vary in each of these stages of your startup, and I’m going to demonstrate how to use the architecture in all three phases.
Explore
So you’re a new startup. You’re iterating on a product, you have no idea how people are going to use it, or even if they’re going to use it.
For most startups today, this would mean you’ll pick Rails or Django or Node or some such, backed with a MySQL or PostgreSQL or MongoDB or some such.
“Keep it simple, silly,” you say, and this seems simple enough.
But is this as simple as it can possibly be? Could we make it simpler? What if the web service and the database instance were exactly one and the same? I’m not talking about something like SQLite, where your data is still serialized; I’m saying: what if the objects in your RAM are your database?
Imagine all the wonderful things you could build if you never had to serialize data into SQL queries. First, you don’t need multiple front-end servers talking to a single DB; just get a bigger server with more RAM and more CPU if you need it. What about indices? You can use in-memory indices, effectively just hash tables to look up objects. You don’t need clever index structures like B-trees that are optimized for disk latency. (In fact, you can use some indices that were probably not possible with traditional databases. One such index, using functional collections, was critical to the scalability of Screenshotbot.)
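To make that concrete, here’s a minimal sketch of such an index (hypothetical names, not Screenshotbot’s actual code): the “index” is just a hash table keyed on a field.

(defstruct account email name)

;; The "index" is a hash table from email to account object.
(defvar *accounts-by-email* (make-hash-table :test #'equal))

(defun index-account (account)
  "Register ACCOUNT under its email address."
  (setf (gethash (account-email account) *accounts-by-email*) account))

(defun find-account-by-email (email)
  "O(1) lookup: a plain read from RAM, no SQL round-trip."
  (gethash email *accounts-by-email*))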
You also won’t need special architectures to reduce round-trips to your database. In particular, you won’t need any of that Async-IO business, because your threads are no longer IO bound. Retrieving data is just a matter of reading RAM. Suddenly debugging code has become a lot easier too.
You don’t need any services to run background jobs, because background jobs are just threads running in this large process.
You don’t need crazy concurrency protocols, because most of your concurrency requirements can be satisfied with simple in-memory mutexes and condition variables.
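As a sketch of those last two points, here’s what a background job and a lock might look like using the bordeaux-threads library (the job itself is made up for illustration):

;; Background jobs are plain threads; concurrency is plain locks.
(ql:quickload "bordeaux-threads") ; assumes Quicklisp is available

(defvar *stats-lock* (bt:make-lock "stats"))
(defvar *request-count* 0)

(defun record-request ()
  "Called from any request-handling thread."
  (bt:with-lock-held (*stats-lock*)
    (incf *request-count*)))

(defun start-reset-job ()
  "A 'background job': just a thread in the same process."
  (bt:make-thread
   (lambda ()
     (loop
       (sleep 60)
       (bt:with-lock-held (*stats-lock*)
         (setf *request-count* 0))))
   :name "stats-reset"))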
But then comes the important part: how do you recover when your process crashes? It turns out the answer is easy: periodically take a snapshot of everything in RAM.
Hold on, what if you’ve made changes since the last snapshot? And this is the clever bit: you ensure that every time you change persistent state in RAM, you first write a transaction to disk (essentially a write-ahead log). So if you have a line like foo.setBar(2), this will first write a transaction saying that the bar field of foo changed to 2, and then actually set the field to 2. An operation like new Foo() writes a transaction to disk saying that a Foo object was created, and then returns the new object.
And so, if your process crashes and restarts, it first reloads the snapshot, then replays the transaction log to fully recover the state. (Notice that index changes don’t need to be part of the transaction log. For instance, if there’s an index on the bar field of Foo, then setBar should just update the index, which will get updated whether the object comes from a snapshot or from a transaction.)
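Here’s a toy version of this whole mechanism, with hypothetical names and a made-up log format (a real library handles this for you, more on that below): log the mutation first, apply it second, and replay the log on startup.

(defstruct foo id bar)

(defvar *foos* (make-hash-table))  ; id -> foo, rebuilt on startup
(defvar *log-path* #p"/var/data/transactions.log") ; hypothetical path

(defun log-transaction (txn)
  "Append TXN, a readable form like (:set-bar 42 2), to the log."
  (with-open-file (out *log-path* :direction :output
                                  :if-exists :append
                                  :if-does-not-exist :create)
    (prin1 txn out)
    (terpri out)))

(defun set-bar (foo value)
  (log-transaction (list :set-bar (foo-id foo) value)) ; write the log first...
  (setf (foo-bar foo) value))                          ; ...then mutate RAM

(defun apply-transaction (txn)
  (ecase (first txn)
    (:set-bar (destructuring-bind (id value) (rest txn)
                (setf (foo-bar (gethash id *foos*)) value)))))

(defun replay-log ()
  "After loading the snapshot, re-apply every logged transaction."
  (with-open-file (in *log-path* :if-does-not-exist nil)
    (when in
      (loop for txn = (read in nil nil)
            while txn
            do (apply-transaction txn)))))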
Finally, this architecture enables a new kind of code that couldn’t be written before. Since all requests are served by the same long-lived process, you can keep closures in memory and use them to serve pages. For example, on Screenshotbot, if you ever see a https://screenshotbot.io/n/nnnnnnn URL, the nnnnnnn is a token that maps to a closure on the server. This simple change means we don’t need to serialize objects across page transitions: the closure holds references to the objects it needs, so we don’t pass object ids around on every single request. In JavaScript, this might hypothetically look like:
function renderMyObject(obj) {
  return <html>...
    <a href={() => obj.delete()}>Delete</a>
    ...
  </html>;
}
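In Common Lisp, our actual stack, a heavily simplified sketch of the idea might look like this (hypothetical names; a real implementation needs expiry and access control):

(defvar *closures* (make-hash-table :test #'equal))

(defun register-closure (fn)
  "Store FN under a fresh token; return the /n/<token> URL path."
  (let ((token (format nil "~36r" (random (expt 36 8)))))
    (setf (gethash token *closures*) fn)
    (format nil "/n/~a" token)))

(defun handle-closure-url (token)
  "Called by the web server's /n/ route."
  (let ((fn (gethash token *closures*)))
    (if fn
        (funcall fn)
        (error "Unknown or expired token: ~a" token))))

A rendered page calls register-closure with a lambda that captures obj directly, so no object id ever crosses the request boundary.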
What this all means is that you can iterate quickly. If you have to debug, there’s exactly one service that you need to debug. If you need to profile code, there’s exactly one service you need to profile (no more MySQL slow query logs). There’s exactly one service to monitor: if that one service goes down the site certainly goes down, but since there’s only one service and one server, the probability of failure is also much lower. If the server dies, AWS will automatically bring up a new server to replace it within a few minutes.
It’s also a lot easier to write test code, since you no longer have to mock out databases.
Expand
So you’re moving fast, iterating, and building out ideas, and slowly getting customers along the way.
Then one day, you get a high-profile customer. Bingo, you’re now in the Expand phase of your startup.
But there’s a catch: this high-profile customer requires 99.999% availability.
Surely, the architecture we just described cannot handle this. If the server goes down, we would need to wait several minutes for AWS to bring it back up. Once it’s back up, we might wait several more minutes for our process to restore the snapshot from disk. Even re-deploys are tricky: restarting the service can take the site down for several minutes.
And this is where the Raft Consensus Protocol comes in.
Raft is a wonderful algorithm and protocol. It takes your finite state machine (your web server/database), and essentially replicates the transaction log. So now, we can take our very simple architecture and replicate it across three machines. If the leader goes down, then a new leader is elected within seconds and will continue to serve requests.
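As a toy illustration of the commit rule at the heart of this (a deliberate simplification; real Raft also handles leader election, terms, and log repair), an entry is applied to the state machine only after a majority of the cluster has acknowledged it:

(defun majority-p (acks cluster-size)
  (> acks (floor cluster-size 2)))

(defun replicate-entry (entry followers apply-fn)
  "Send ENTRY to each follower; apply it only once a majority acks.
FOLLOWERS are functions returning T on a successful log append."
  (let ((acks 1)) ; the leader's own log counts as one ack
    (dolist (follower followers)
      (when (funcall follower entry)
        (incf acks)))
    (when (majority-p acks (1+ (length followers)))
      (funcall apply-fn entry) ; committed: safe to apply to RAM
      t)))

With a cluster of three, one node can be down and entries still commit, which is exactly the availability property we’re after.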
We’ve just made our simple little service into a full-fledged highly-available database, without fundamentally changing how developers write code.
With this mechanism, you can also do a rolling deploy without ever bringing the server down. (Although we rarely restart our server processes, more on that in a moment.) Because there’s just one service, it’s also easy to calculate your availability guarantees.
Extract
So your startup is doing well, and you have thousands of large customers.
To be honest, Screenshotbot is not at this stage; I’ll talk about where we are in a moment. But we’re preparing for this possibility, with monitoring in place for predicted bottlenecks.
The solution here is something large companies already do with their databases: sharding. You can break up your web service into shards, each shard being its own cluster. In fact, at Screenshotbot we already do this: each of our enterprise customers gets their own dedicated cluster. (Fun story: Meta switched to Raft to handle replication for each of its MySQL clusters, so we’re essentially doing the same thing, just without a separate database.)
I don’t know exactly what else to expect, since I’m more of a solve-today’s-problem kind of person. The main bottleneck I anticipate is scaling the commit thread. Read threads parallelize beautifully, but there’s a single commit thread applying transactions one at a time. Disk latency is largely irrelevant here, since Raft batches multiple transactions into a single disk commit. My main concern is that the CPU cost of applying transactions could exceed what a single core can handle. I highly doubt we’ll ever see this, but it’s a possibility. At that point we could profile the cost of commits and improve it (for instance, by moving some of the work out of the commit thread), or we could just figure out sharding. I’ll probably write another blog post when that happens.
Our Stack
Now that I’ve described the idea to you, let me tell you about our stack, and why it turned out to be so suitable for this architecture.
We use Common Lisp. My initial implementation of Screenshotbot did use MySQL, but I quickly swapped it out for bknr.datastore, precisely because handling concurrency with MySQL was hard and Screenshotbot is a highly concurrent app. bknr.datastore is a library that implements the architecture described in the Explore section, built for Common Lisp. (There are similar libraries for other languages, but not a whole lot of them.)
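For a flavor of what this looks like, here’s roughly how a persistent class is defined and mutated with bknr.datastore, based on its documented API (the store directory is hypothetical):

;; A persistent class: instances live in RAM; creation and mutation
;; are logged to disk automatically.
(defclass channel (bknr.datastore:store-object)
  ((name :initarg :name
         :accessor channel-name))
  (:metaclass bknr.datastore:persistent-class))

;; Open the store once at startup; it loads the latest snapshot and
;; replays the transaction log, as described earlier.
(make-instance 'bknr.datastore:mp-store
               :directory #p"/var/screenshotbot/store/" ; hypothetical path
               :subsystems (list (make-instance
                                  'bknr.datastore:store-object-subsystem)))

;; Creating an object writes a transaction; so does any mutation
;; wrapped in with-transaction.
(let ((channel (make-instance 'channel :name "main")))
  (bknr.datastore:with-transaction ()
    (setf (channel-name channel) "release")))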
Common Lisp implementations are also heavily multi-threaded, and this is crucial for this architecture, since all your web requests are handled by threads in a single process. Ruby or Python would be disqualified by this requirement (their global interpreter locks rule out true thread parallelism).
We also use the idea of closures that I mentioned earlier. But this means we can’t keep restarting the server frequently (if you restart the server, you lose the closures). So instead of restarting, we hot-reload code into the running process. It turns out Common Lisp is excellent at this: so much so that a notable part of the standard is about handling code reloading. (For instance, if a class definition changes, how do you update existing objects of that class? There’s a standard protocol for it.)
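For example, update-instance-for-redefined-class is part of the ANSI standard: you redefine a class in the running image and tell CLOS how to migrate existing objects. A self-contained toy example:

(defclass user ()
  ((full-name :initarg :full-name :accessor full-name)))

(defvar *ada* (make-instance 'user :full-name "Ada Lovelace"))

;; Redefine the class in the running image:
(defclass user ()
  ((first-name :accessor first-name)
   (last-name :accessor last-name)))

;; Tell CLOS how to migrate old instances when they're next touched:
(defmethod update-instance-for-redefined-class :after
    ((u user) added deleted plist &key)
  (declare (ignore added deleted))
  (let* ((old (getf plist 'full-name))
         (space (and old (position #\Space old))))
    (when space
      (setf (first-name u) (subseq old 0 space)
            (last-name u) (subseq old (1+ space))))))

(first-name *ada*) ; => "Ada"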
Occasionally, we do restart the servers; currently only about once every month or two. When we need to, we just do a rolling restart across the Raft cluster. We use a cluster of three servers per installation, which lets us tolerate one server being down at a time. We don’t use Kubernetes; we don’t need it (at least, not yet).
For the Raft implementation, we built and open-sourced our own library, bknr.cluster, on top of bknr.datastore. Under the hood it uses the fantastic Braft library from Baidu. Braft is super solid, and I can highly recommend it. It also handles background snapshots, which means the server can keep serving requests while a snapshot is being taken.
To store image files, or blobs that shouldn’t be part of the datastore, we use EFS (AWS’s highly available NFS service), shared between the three servers. EFS is easier to work with than S3, because we don’t have to handle network error conditions in application code. It also makes our code more testable, since tests just write to a local disk instead of talking to an external service.
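Concretely, storing a blob becomes ordinary file I/O against the EFS mount (the paths here are hypothetical):

(defun blob-path (blob-id)
  (merge-pathnames (format nil "blobs/~a" blob-id)
                   #p"/mnt/efs/"))  ; the EFS mount point

(defun save-blob (blob-id bytes)
  "Plain file I/O: no S3 client, no retry logic in application code."
  (with-open-file (out (ensure-directories-exist (blob-path blob-id))
                       :direction :output
                       :element-type '(unsigned-byte 8)
                       :if-exists :supersede)
    (write-sequence bytes out)))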
How well does this scale? We have a couple of big enterprise customers, including one especially well-known one. Screenshotbot runs on their CI, so we get hundreds of API requests for every single commit and pull request. Despite this, we only need a 4-core, 16GB machine to serve their requests (plus similar machines for the replicas, which mostly sit idle). Even then, CPU usage maxes out around 20%, and most of that is image processing, so we have a lot of room to scale before we need to bump up the core count.
Summary
I think this architecture is excellent for new startups, and I’m hoping more companies will adopt it. Obviously, you’ll need to build some of the tooling we’ve described for the language of your choice. (Although, if you choose Common Lisp, it’s all here for you to use, and all open-source.)
We’re super grateful to the folks behind bknr.datastore, Braft and Raft, because without their work we wouldn’t be able to do any of this.
If you think this was useful or interesting, I would appreciate it if you could share it on social media. Please reach out to me at [email protected] if you have any questions.