Litestream Writable VFS

Original link: https://fly.io/blog/litestream-writable-vfs/

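## Litestream and Fly.io Sprites: Summary

Litestream is a free, open-source tool for keeping SQLite databases synchronized with S3-style object storage, providing robust backup and restore without sacrificing SQLite's speed and simplicity. Recently, Litestream has become a core component of Fly.io's new "Sprites": extremely fast, scalable serverless containers. Sprites use Litestream in two key ways: as the foundation of the global orchestrator (in place of a traditional Postgres cluster), and directly in each Sprite's storage stack. The storage stack uses Litestream to achieve fast start times (under one second) while providing 100GB of durable storage. New features such as Litestream VFS (virtual file system) allow point-in-time SQLite queries directly against object storage, even during a cold start. A writable VFS mode, which buffers writes before syncing them to object storage, further improves performance. "Hydration", a background download of the database, improves steady-state performance by eventually serving queries from a local copy. These features were designed for the demanding needs of Sprites and prioritize speed and eventual durability. While they may be useful to other applications, Litestream run as a sidecar process remains a robust and efficient solution for standard read/write SQLite synchronization.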

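## Litestream & Postgres Backup Discussion

A Hacker News discussion revolves around solutions for streaming database backups to S3, inspired by Litestream (a writable VFS). The original poster asks whether there is a Litestream-like alternative for Postgres, specifically hoping to stream `pg_dump` output or use `barman` to ship data to S3. Several suggestions follow. One user points to Postgres's built-in WAL archiving. Others recommend **ZeroFS**, a file system that works well with Postgres for replication, and **wal-g**, a tool dedicated to moving WAL files to S3 for server replication. The conversation briefly covers how these tools differ from traditional backup methods such as `pg_dump` and `barman`, clarifying that wal-g focuses on continuous archiving rather than point-in-time snapshots.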

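## Litestream & Database Synchronization Discussion

A recent Hacker News discussion centers on Litestream, a tool for synchronizing SQLite databases. One user is exploring adding write capability to their Litestream VFS implementation and is considering write leases to prevent data corruption from multiple writers. They aim to achieve durability using `fullfsync` and propose a system in which readers can access the previous state while writers queue up their changes. Other commenters praise how well Litestream works with Pocketbase and ask whether it can synchronize multiple `.db` files in a flexible fan-out fashion, which is not currently supported. The conversation also extends to alternatives for PostgreSQL synchronization. Suggestions include `wal-g`, for streaming WAL files to S3, and ZeroFS, which provides a file system designed to work with Postgres replicas. Finally, the poster apologizes for inaccurate timestamps caused by a bug in the submission process.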

Original text

I’m Ben Johnson, and I work on Litestream at Fly.io. Litestream is the missing backup/restore system for SQLite. It’s free, open-source software that should run anywhere, and you can read more about it here.

Each time we write about it, we get a little bit better at golfing down a description of what Litestream is. Here goes: Litestream is a Unix-y tool for keeping a SQLite database synchronized with S3-style object storage. It’s a way of getting the speed and simplicity wins of SQLite without exposing yourself to catastrophic data loss. Your app doesn’t necessarily even need to know it’s there; you can just run it as a tool in the background.

It’s been a busy couple weeks!

We recently unveiled Sprites. If you don’t know what Sprites are, you should just go check them out. They’re one of the coolest things we’ve ever shipped. I won’t waste any more time selling them to you. Just, Sprites are a big deal, and so it’s a big deal to me that Litestream is a load-bearing component for them.

Sprites rely directly on Litestream in two big ways.

First, Litestream SQLite is the core of our global Sprites orchestrator. Unlike our flagship Fly Machines product, which relies on a centralized Postgres cluster, our Elixir Sprites orchestrator runs directly off S3-compatible object storage. Every organization enrolled in Sprites gets their own SQLite database, synchronized by Litestream.

This is a fun design. It takes advantage of the “many SQLite databases” pattern, which is under-appreciated. It’s got nice scaling characteristics. Keeping that Postgres cluster happy as Fly.io grew has been a major engineering challenge.

But as far as Litestream is concerned, the orchestrator is boring, and so that’s all I’ve got to say about it. The second way Sprites use Litestream is much more interesting.

Litestream is built directly into the disk storage stack that runs on every Sprite.

Sprites launch in under a second, and every one of them boots up with 100GB of durable storage. That’s a tricky bit of engineering. We’re able to do this because the root of storage for Sprites is S3-compatible object storage, and we’re able to make it fast by keeping a database of in-use storage blocks that takes advantage of attached NVMe as a read-through cache. The system that does this is JuiceFS, and the database — let’s call it “the block map” — is a rewritten metadata store, based (you guessed it) on BoltDB.

I kid! It’s Litestream SQLite, of course.

Sprite Storage Is Fussy

Everything in a Sprite is designed to come up fast.

If the Fly Machine underneath a Sprite bounces, we might need to reconstitute the block map from object storage. Block maps aren’t huge, but they’re not tiny; maybe low tens of megabytes worst case.

The thing is, this is happening while the Sprite boots back up. To put that in perspective, that’s something that can happen in response to an incoming web request; that is, we have to finish fast enough to generate a timely response to that request. The time budget is small.

To make this even faster, we are integrating Litestream VFS to improve start times. The VFS is a dynamic library you load into your app. Once you do, you can do stuff like this:

    sqlite> .open file:///my.db?vfs=litestream
    sqlite> PRAGMA litestream_time = '5 minutes ago';
    sqlite> SELECT * FROM sandwich_ratings ORDER BY RANDOM() LIMIT 3;
    22|Veggie Delight|New York|4
    30|Meatball|Los Angeles|5
    168|Chicken Shawarma Wrap|Detroit|5

Litestream VFS lets us run point-in-time SQLite queries hot off object storage blobs, answering queries before we’ve downloaded the database.
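
To make "point-in-time" concrete, here is a toy Python model of how a reader could resolve a page as of some moment from a time-ordered series of page updates. It is a sketch under assumptions, not Litestream's implementation: `TxFile`, `page_as_of`, the timestamps, and the page contents are all hypothetical stand-ins.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class TxFile:
    """Hypothetical stand-in for one replicated transaction file."""
    timestamp: float   # when the transaction was committed
    pages: dict        # page number -> page bytes written by that transaction


def page_as_of(tx_files, page_no, as_of):
    """Return the newest version of page_no committed at or before as_of.

    tx_files must be sorted by timestamp, oldest first.
    Returns None if the page did not exist yet at that time.
    """
    timestamps = [tx.timestamp for tx in tx_files]
    # Index of the first file committed strictly after as_of.
    cutoff = bisect_right(timestamps, as_of)
    # Walk backwards from the cutoff: the first hit is the newest version.
    for tx in reversed(tx_files[:cutoff]):
        if page_no in tx.pages:
            return tx.pages[page_no]
    return None


history = [
    TxFile(timestamp=100.0, pages={1: b"header v1", 2: b"ratings v1"}),
    TxFile(timestamp=200.0, pages={2: b"ratings v2"}),
    TxFile(timestamp=300.0, pages={2: b"ratings v3", 3: b"new table"}),
]

print(page_as_of(history, 2, as_of=250.0))  # b'ratings v2'
print(page_as_of(history, 3, as_of=250.0))  # None
```

The point of the model is that nothing has to be downloaded up front: a query only touches the pages it asks for, and "5 minutes ago" just moves the cutoff.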

This is good, but it’s not perfect. We had two problems:

We could only read, not write. People write to Sprite disks. The storage stack needs to write, right away.
Running a query off object storage is a godsend in a cold start where we have no other alternative besides downloading the whole database, but it’s not fast enough for steady state.

These are fun problems. Here’s our first cut at solving them.

Writable VFS

The first thing we’ve done is made the VFS optionally read-write. This feature is pretty subtle; it’s interesting, but it’s not as general-purpose as it might look. Let me explain how it works, and then explain why it works this way.

Keep in mind as you read this that this is about the VFS in particular. Obviously, normal SQLite databases using Litestream the normal way are writeable.

The VFS works by keeping an index of (file, offset, size) for every page of the database in object storage. The data comprising the index is stored in LTX files, so it's efficient for us to reconstitute it quickly when the VFS starts, and lookups are heavily cached. When we queried sandwich_ratings earlier, our VFS library intercepted the SQLite read method, looked up the requested page in the index, fetched it, and cached it.
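
The read path described above (index lookup, fetch, cache) can be sketched in a few lines of Python. This is a toy model under assumptions: `PageReader`, the file names, and the LRU policy are hypothetical, and a dict of bytes stands in for object storage.

```python
from collections import OrderedDict


class PageReader:
    """Toy model: a page index over object storage with an LRU page cache."""

    def __init__(self, index, store, cache_size=2):
        self.index = index          # page number -> (file, offset, size)
        self.store = store          # file name -> bytes (object storage stand-in)
        self.cache = OrderedDict()  # page number -> page bytes, oldest first
        self.cache_size = cache_size
        self.remote_reads = 0       # counts simulated object storage fetches

    def read_page(self, page_no):
        if page_no in self.cache:
            self.cache.move_to_end(page_no)  # mark as most recently used
            return self.cache[page_no]
        file, offset, size = self.index[page_no]
        # A real implementation would issue a ranged read against the blob here.
        data = self.store[file][offset:offset + size]
        self.remote_reads += 1
        self.cache[page_no] = data
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)   # evict the least recently used page
        return data


store = {"0001.ltx": b"AAAABBBB", "0002.ltx": b"CCCC"}
index = {1: ("0001.ltx", 0, 4), 2: ("0001.ltx", 4, 4), 3: ("0002.ltx", 0, 4)}
reader = PageReader(index, store)

print(reader.read_page(2))  # b'BBBB'
print(reader.read_page(2))  # b'BBBB', served from the cache
print(reader.remote_reads)  # 1
```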

This works great for reads. Writes are harder.

Behind the scenes in read-only mode, Litestream polls, so that we can detect new LTX files created by remote writers to the database. This supports a handy use case where we’re running tests or doing slow analytical queries of databases that need to stay fast in prod.

In write mode, we don’t allow multiple writers, because multiple-writer distributed SQLite databases are the Lament Configuration and we are not explorers over great vistas of pain. So the VFS in write-mode disables polling. We assume a single writer, and no additional backups to watch.

Next, we buffer. Writes go to a local temporary buffer (“the write buffer”). Every second or so (or on clean shutdown), we sync the write buffer with object storage. Nothing written through the VFS is truly durable until that sync happens.
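
A minimal Python sketch of that buffering behavior, assuming a dict as the stand-in for object storage. `BufferedPageStore` and its methods are hypothetical names for illustration, not Litestream's API.

```python
class BufferedPageStore:
    """Toy model: writes land in a local buffer and sync to a remote later."""

    def __init__(self):
        self.remote = {}        # page number -> bytes (object storage stand-in)
        self.write_buffer = {}  # page number -> bytes not yet synced

    def write_page(self, page_no, data):
        # Fast path: only local state changes, so this is not yet durable.
        self.write_buffer[page_no] = data

    def read_page(self, page_no):
        # The buffer holds the newest version, so check it first.
        if page_no in self.write_buffer:
            return self.write_buffer[page_no]
        return self.remote.get(page_no)

    def sync(self):
        """Flush buffered pages to the remote; returns how many were flushed."""
        flushed = len(self.write_buffer)
        self.remote.update(self.write_buffer)
        self.write_buffer.clear()
        return flushed


db = BufferedPageStore()
db.write_page(7, b"block map row")
print(db.read_page(7))  # b'block map row', read back from the buffer
print(7 in db.remote)   # False: a crash here would lose the write
db.sync()               # in the post, this happens every second or so
print(7 in db.remote)   # True
```

The gap between `write_page` and `sync` is exactly the window in which a write is visible locally but not yet durable.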

Now, remember the use case we’re looking to support here. A Sprite is cold-starting and its storage stack needs to serve writes, milliseconds after booting, without having a full copy of the 10MB block map. (Most storage block maps are much smaller than this, but still.) This writeable VFS mode lets us do that.

Critically, we support that use case only up to the same durability requirements that a Sprite already has. All storage on a Sprite shares this “eventual durability” property, so the terms of the VFS write make sense here. They probably don’t make sense for your application. But if for some reason they do, have at it! To enable writes with Litestream VFS, just set the LITESTREAM_WRITE_ENABLED environment variable to "true".
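
If you launch your app from Python, one way to set that variable is to build the child process environment explicitly. This is only a sketch: the variable name and the value "true" come from this post, while `vfs_environment` is a hypothetical helper.

```python
import os


def vfs_environment(base_env=None):
    """Return a copy of the environment with VFS writes switched on.

    Hypothetical helper: only the variable name and the value "true"
    come from the post.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LITESTREAM_WRITE_ENABLED"] = "true"
    return env


env = vfs_environment({"PATH": "/usr/bin"})
print(env["LITESTREAM_WRITE_ENABLED"])  # true
```

You would then pass `env` to whatever starts the process that loads the VFS, for example through the `env` argument of `subprocess.run`.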

Hydration

The Sprite storage stack uses SQLite in VFS mode. In our original VFS design, most data is kept in S3. Again: fine at cold start, not so fine in steady state.

To solve this problem, we shoplifted a trick from systems like dm-clone: background hydration. In hydration designs, we serve queries remotely while running a loop to pull the whole database. When you start the VFS with the LITESTREAM_HYDRATION_PATH environment variable set, we’ll hydrate to that file.

Hydration takes advantage of LTX compaction, writing only the latest versions of each page. Reads don’t block on hydration; we serve them from object storage immediately, and switch over to the hydration file when it’s ready.
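
The switch-over can be modeled with a background thread and a flag. This is a toy Python sketch under assumptions: `HydratingReader` is a hypothetical name, dicts stand in for object storage and the hydration file, and real hydration copies compacted pages rather than dict entries.

```python
import threading


class HydratingReader:
    """Toy model: serve reads remotely while a background copy fills in."""

    def __init__(self, remote_pages):
        self.remote = remote_pages         # page number -> bytes (object storage stand-in)
        self.local = {}                    # the "hydration file", filled in the background
        self.hydrated = threading.Event()  # set once the local copy is complete
        self._thread = threading.Thread(target=self._hydrate, daemon=True)

    def start(self):
        self._thread.start()

    def _hydrate(self):
        # Copy the latest version of every page, then flip the switch.
        for page_no, data in self.remote.items():
            self.local[page_no] = data
        self.hydrated.set()

    def read_page(self, page_no):
        """Return (source, data); never blocks on hydration."""
        if self.hydrated.is_set():
            return "local", self.local[page_no]
        return "remote", self.remote[page_no]


reader = HydratingReader({1: b"page one", 2: b"page two"})
print(reader.read_page(1))  # ('remote', b'page one'): hydration has not started
reader.start()
reader.hydrated.wait(timeout=5)
print(reader.read_page(1))  # ('local', b'page one')
```

Reads only consult the local copy once it is complete, which keeps the model simple: there is never a half-hydrated page to reason about.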

As for the hydration file? It’s simply a full copy of your database. It’s the same thing you get if you run litestream restore.

Because this is designed for environments like Sprites, which bounce a lot, we write the database to a temporary file. We can’t trust that the database is using the latest state every time we start up, not without doing a full restore, so we just chuck the hydration file when we exit the VFS. That behavior is baked into the VFS right now. This feature’s got what Sprites need, but again, maybe not what your app wants.

Putting It All Together

This is a post about two relatively big moves we’ve made with our open-source Litestream project, but the features are narrowly scoped for problems that look like the ones our storage stack needs. If you think you can get use out of them, I’m thrilled, and I hope you’ll tell me about it.

For ordinary read/write workloads, you don’t need any of this mechanism. Litestream works fine without the VFS, with unmodified applications, just running as a sidecar alongside your application. The whole point of that configuration is to efficiently keep up with writes; that’s easy when you know you have the whole database to work with when writes happen.

But this whole thing is, to me, a valuable case study in how Litestream can get used in a relatively complicated and demanding problem domain. Sprites are very cool, and it’s satisfying to know that every disk write that happens on a Sprite is running through Litestream.