What Does a Database for SSDs Look Like?

Original link: https://brooker.co.za/blog/2025/12/15/database-for-ssd.html

## Reimagining the Database for Modern Hardware

The core question: if we took advantage of the progress made since the 1990s, in particular fast local SSDs and powerful cloud infrastructure, how should we design a relational database from scratch today? Current databases were built for slow spinning disks, prioritizing techniques such as write-ahead logging and bulk buffering. The analysis suggests several key changes. First, **cache size** should target the window of accesses expected over the next 30 seconds to 5 minutes; despite much faster storage, this still matches Jim Gray's "five minute rule". Second, **transfer size** should target roughly 32kB to maximize SSD throughput. Crucially, **durability** shifts from relying on local disks to distributed replication across availability zones, accepting the resulting latency in exchange for strong consistency. This shapes the **isolation** strategy, favoring approaches like Aurora DSQL's, which defer cross-AZ communication to commit time. Finally, the traditional reliance on a **write-ahead log** for single-system durability matters less, replaced by recovery from a distributed log. While core relational principles such as SQL and atomicity remain valuable, the emphasis moves to distributed-systems concerns: scale, availability, and performance. A modern database should optimize for real network conditions and application architectures, exploiting capabilities such as high-quality clocks for consistency.

This Hacker News discussion centers on how SSDs affect database design. The main point is that even though SSDs are far faster than spinning disks, many established database techniques still apply. Today, databases typically write a full 4KB page even for a tiny change (such as updating a timestamp), a consequence of the block-level interface to the operating system. That is what makes techniques like write-ahead logging and buffering necessary: they batch writes and improve efficiency. Commenters argue these strategies were not built purely for *speed*, but to bridge the enormous performance gap between sequential and random access on older disks. Because SSDs *still* show a substantial difference between sequential and random I/O, the benefits of buffering and logging likely remain, which means an aggressive database redesign for SSDs is not necessarily warranted.

Original Article

Maybe not what you think.

Over on X, Ben Dicken asked:

What does a relational database designed specifically for local SSDs look like? Postgres, MySQL, SQLite and many others were invented in the 90s and 00s, the era of spinning disks. A local NVMe SSD has ~1000x improvement in both throughput and latency. Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random. If we had to throw these databases away and begin from scratch in 2025, what would change and what would remain?

How might we tackle this question quantitatively for the modern transaction-orientated database?

But first, the bigger picture. It’s not only SSDs that have come along since databases like Postgres were first designed. We also have the cloud, with deployments to excellent datacenter infrastructure, including multiple independent datacenters with great network connectivity between them, available to all. Datacenter networks offer 1000x (or more) increased throughput, along with latency in the microseconds. Servers with hundreds of cores and thousands of gigabytes of RAM are mainstream.

Applications have changed too. Companies are global, businesses are 24/7. Down time is expensive, and that expense can be measured. The security and compliance environment is much more demanding. Builders want to deploy in seconds, not days.

Approach One: The Five Minute Rule

Perhaps my single favorite systems paper, The 5 Minute Rule… by Jim Gray and Franco Putzolu, gives us a very simple way to answer one of the most important questions in systems: how big should caches be? The five minute rule is that, back in 1986, if you expected to read a page again within five minutes you should keep it in RAM. If not, you should keep it on disk. The basic logic is to look at the page that's least likely to be re-used. If it's cheaper to keep it around until its next expected re-use, you should cache more. If it's cheaper to reload it from storage than to keep it around, you should cache less [1]. Let's update the numbers for 2025, assuming that pages are around 32kB [2] (this becomes important later).

The EC2 i8g.48xlarge delivers about 1.8 million read iops at this size, at a price of around $0.004576 per second, which works out to about \(10^{-9}\) dollars per transfer (assuming we're allocating about 40% of the instance price to storage). About one dollar per billion reads. It also has enough RAM for about 50 million pages of this size, costing around \(3 \times 10^{-11}\) dollars to store a page for one second.
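As a sanity check, here's that breakeven arithmetic as a small Python sketch; the 40% storage share comes from the figures above, while the RAM share of the instance price is my own illustrative assumption:

```python
# Back-of-envelope version of the five minute rule for 32kB pages on an
# i8g.48xlarge. The 40% storage share is from the text above; the RAM share
# of the instance price is an illustrative assumption.

instance_price_per_s = 0.004576     # dollars per second for the instance
read_iops            = 1.8e6        # ~32kB reads per second
storage_share        = 0.40         # fraction of price attributed to storage
ram_pages            = 50e6         # ~32kB pages that fit in RAM
ram_share            = 0.33         # assumed fraction of price attributed to RAM

cost_per_read        = instance_price_per_s * storage_share / read_iops
cost_per_page_second = instance_price_per_s * ram_share / ram_pages

# Cache a page if you expect to re-use it sooner than the breakeven interval.
breakeven_s = cost_per_read / cost_per_page_second
print(f"cost per read:        {cost_per_read:.1e} dollars")
print(f"cost per page-second: {cost_per_page_second:.1e} dollars")
print(f"breakeven:            {breakeven_s:.0f} seconds")
```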

So, on this instance type, we should size our RAM cache to store pages for about 30 seconds. Not too different from Gray and Putzolu’s result 40 years ago!

That’s answer number one: the database should have a cache sized so that the hot set contains pages expected to be accessed in the next 30 seconds, for optimal cost. For optimal latency, however, the cache may want to be considerably bigger.

Approach Two: The Throughput/IOPS Breakeven Point

The next question is what size accesses we want to send to our storage devices to take best advantage of their performance. In the days of spinning media, the answer to this was surprisingly big: a 100MB/s disk could generally do around 100 seeks a second, so if your transfers were less than around 1MB you were walking away from throughput. Give or take a factor of 2. What does it look like for modern SSDs?

SSDs are much faster on both throughput and iops. They’re less sensitive than spinning drives to workload patterns, but read/write ratios and the fullness of the drives still matter. Absent benchmarking on the actual hardware with the real workload, my rule of thumb is that SSDs are throughput limited for transfers bigger than 32kB, and iops limited for transfers smaller than 32kB.
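One way to see where a rule of thumb like that comes from: the breakeven transfer size is roughly the device's throughput ceiling divided by its IOPS ceiling. Here is a hedged sketch, with illustrative numbers chosen to land near 32kB rather than measurements of any particular drive:

```python
# Where a drive flips from IOPS-limited to throughput-limited: roughly
# where (transfer size) * (IOPS ceiling) reaches the throughput ceiling.
# These device numbers are illustrative, not specs for any particular SSD.

max_iops       = 1.8e6    # small-transfer IOPS ceiling
max_throughput = 60e9     # bytes per second throughput ceiling

breakeven_bytes = max_throughput / max_iops
print(f"breakeven transfer size: {breakeven_bytes / 1024:.0f} kB")
# Smaller transfers run out of IOPS before bandwidth; larger ones stop
# adding throughput and just consume IOPS.
```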

Making transfers bigger than 32kB doesn’t help throughput much, reduces IOPS, and probably makes the cache less effective because of false sharing and related effects. This is especially important for workloads with poor spatial locality.

So that’s answer number two: we want our transfers to disk not to be much smaller than 32kB on average, or we’re walking away from throughput.

Approach Three: Durability and Replication

Building reads on local SSDs is great: tons of throughput, tons of iops. Writes on local SSDs, on the other hand, have the distinct problem of only being durable on the local box, which is unacceptable for most workloads. Modern hardware is very reliable, but thinking through the business risks of losing data on failover isn’t very fun at all, so let’s assume that our modern database is going to replicate off-box, making at least one more synchronous copy. Ideally in a different availability zone (AZ).

That i8g.48xlarge we were using for our comparison earlier has 100Gb/s (or around 12GB/s) of network bandwidth. That puts a cap on how much write throughput we can have for a single-leader database. Cross-AZ latency in EC2 varies from a couple hundred microseconds to a millisecond or two, which puts a minimum on our commit latency.
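As a rough sketch of both limits, assuming a single synchronous off-box copy and the network numbers above:

```python
# Two back-of-envelope limits from synchronous off-box replication.
# Bandwidth and RTT figures are from the text; one synchronous copy is assumed.

network_bytes_per_s = 12e9     # ~100Gb/s instance network bandwidth
synchronous_copies  = 1        # at least one extra copy, ideally in another AZ
cross_az_rtt_s      = 500e-6   # a few hundred microseconds to a millisecond or two

# Every committed byte must also be shipped to each synchronous copy,
# so leader network bandwidth caps single-leader write throughput.
write_throughput_cap = network_bytes_per_s / synchronous_copies
# And no commit can acknowledge faster than one cross-AZ round trip.
commit_latency_floor = cross_az_rtt_s

print(f"write throughput cap: {write_throughput_cap / 1e9:.0f} GB/s")
print(f"commit latency floor: {commit_latency_floor * 1e6:.0f} microseconds")
```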

That gives us answer number three: we want to incur cross-AZ latency only at commit time, and not during writes.

Which is where we run into one of my favorite topics: isolation. The I in ACID. A modern database design will avoid read-time coordination using multiversioning, but to offer isolation stronger than READ COMMITTED it will need to coordinate either on each write or at commit time. It can do that the way, say, Aurora Postgres does, with a single leader at a time running in a single AZ. This means great latency for clients in that zone, and higher latency for clients in different AZs. Given that most applications are hosted in multiple AZs, this can add up for latency-sensitive applications that make a lot of round trips to the database. The alternative approach is the one Aurora DSQL takes: doing the cross-AZ round trip only at COMMIT time, saving round trips.
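To make the commit-time approach concrete, here's a minimal toy sketch of the general pattern (my own illustration, not DSQL's or Aurora's actual protocol): reads come from a local multiversioned snapshot, writes are buffered locally, and the only cross-AZ work, a conflict check plus replication, happens once at commit.

```python
# Sketch of deferring coordination to COMMIT time: reads and writes stay
# local, and the (simulated) cross-AZ work happens once, at commit.

class Peer:
    """Stands in for a synchronous replica in another availability zone."""
    def __init__(self):
        self.log = []

    def append(self, commit_record):
        self.log.append(commit_record)   # in reality: a cross-AZ RPC

class Store:
    """A toy multiversioned key/value store."""
    def __init__(self):
        self.versions = {}               # key -> list of (ts, value)
        self.ts = 0

    def get(self, key, as_of):
        candidates = [(t, v) for t, v in self.versions.get(key, []) if t <= as_of]
        return max(candidates)[1] if candidates else None

    def last_write_ts(self, key):
        history = self.versions.get(key, [])
        return history[-1][0] if history else 0

class Transaction:
    def __init__(self, store, peers):
        self.store = store
        self.peers = peers
        self.snapshot_ts = store.ts      # reads see a consistent snapshot
        self.write_set = {}              # buffered locally until commit

    def read(self, key):
        return self.store.get(key, as_of=self.snapshot_ts)   # no coordination

    def write(self, key, value):
        self.write_set[key] = value                           # no network yet

    def commit(self):
        # First-committer-wins conflict check against writes since our snapshot.
        for key in self.write_set:
            if self.store.last_write_ts(key) > self.snapshot_ts:
                return False                                  # caller retries
        # One round of (simulated) cross-AZ replication, then apply locally.
        for peer in self.peers:
            peer.append(dict(self.write_set))
        self.store.ts += 1
        for key, value in self.write_set.items():
            self.store.versions.setdefault(key, []).append((self.store.ts, value))
        return True

store, peers = Store(), [Peer(), Peer()]
tx = Transaction(store, peers)
tx.write("balance", 100)
print(tx.commit())   # True: one commit-time replication round, no per-write trips
```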

Here’s me talking about the shape of that trade-off at re:Invent this year:

There’s no clear answer here, because there are real trade-offs between the two approaches. But do make sure to ask your database vendor whether those impressive latency benchmarks are running where your application actually runs. In the spirit of the original question, though, the incredible bandwidth and latency available in modern datacenter networks is as transformative as SSDs for database designs. Or should be.

While we’re incurring the latency cost of synchronous replication, we may as well get strongly consistent scale-out reads for free. In DSQL, we do this using high-quality hardware clocks that you can use too. Another nice win from modern hardware. There are other approaches too.

That’s answer number four for me: The modern database uses high-quality clocks and knowledge of actual application architectures to optimize for real-world performance (like latency in multiple availability zones or regions) without compromising on strong consistency.
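For a feel of how bounded-error clocks buy strongly consistent scale-out reads, here's a minimal sketch (my own illustration, not how DSQL implements it): the reader picks a timestamp using the clock's error bound, and any replica can answer once it has applied the log up to that timestamp.

```python
# Sketch: strongly consistent reads on any replica, using a clock with a
# known error bound. Illustrative only; the bound and wait loop are assumptions.

import time

CLOCK_ERROR_BOUND_S = 0.0005    # assumed bound on clock error across nodes

def strong_read_timestamp():
    # A timestamp no earlier than the commit timestamp of any write that
    # finished before this read started, per the clock's error bound.
    return time.time() + CLOCK_ERROR_BOUND_S

class Replica:
    """A toy read replica that tracks how far it has applied the log."""
    def __init__(self):
        self.data = {}
        self.applied_ts = 0.0    # all writes with commit ts <= this are applied

    def apply(self, commit_ts, key, value):
        self.data[key] = value
        self.applied_ts = max(self.applied_ts, commit_ts)

    def read(self, key, ts):
        # Any replica returns the same answer for the same ts: it simply
        # waits until it has applied everything up to ts before answering.
        while self.applied_ts < ts:
            time.sleep(0.0001)   # in reality: wait on the replication stream
        return self.data.get(key)

replica = Replica()
ts = strong_read_timestamp()
# Simulate the replication stream catching up past the read timestamp.
replica.apply(time.time() + CLOCK_ERROR_BOUND_S, "balance", 100)
print(replica.read("balance", ts))   # 100, the same answer from any replica
```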

Approach Four: What about that WAL?

Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random.

WALs, and related low-level logging details, are critical for database systems that care deeply about durability on a single system. But the modern database isn’t like that: it doesn’t depend on commit-to-disk on a single system for its durability story. Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails).

That’s answer number five: the modern database commits transactions to a distributed log, which provides multi-machine multi-AZ durability, and might provide other services like atomicity. Recovery is a replay from the distributed log, on any one of a number of peer replicas.
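A minimal sketch of that shape (illustrative; the majority-quorum rule and the replica layout are assumptions, not a description of any particular system): commit means the record is durable on a quorum of log replicas spread across AZs, and recovery is replaying a peer's log.

```python
# Sketch: durability from a distributed log rather than a local WAL.
# Commit = the record is acknowledged by a majority of log replicas
# (ideally in different AZs); recovery = replay the log on any peer.

class LogReplica:
    def __init__(self, az):
        self.az = az
        self.records = []

    def append(self, record):
        self.records.append(record)   # in reality: a cross-AZ RPC
        return True                   # acknowledgement

class DistributedLog:
    def __init__(self, replicas):
        self.replicas = replicas

    def commit(self, record):
        acks = sum(1 for r in self.replicas if r.append(record))
        # Durable once a majority has it; losing any single box loses nothing.
        return acks >= len(self.replicas) // 2 + 1

def recover(replica):
    """Rebuild database state on a fresh node by replaying one peer's log."""
    state = {}
    for record in replica.records:
        state.update(record)
    return state

log = DistributedLog([LogReplica("az1"), LogReplica("az2"), LogReplica("az3")])
log.commit({"balance": 100})
log.commit({"owner": "alice"})
print(recover(log.replicas[0]))   # {'balance': 100, 'owner': 'alice'}
```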

What About Data Structures?

B-Trees versus LSM-trees versus B-Tree variants versus LSM variants versus other data structures are trade-offs that have a lot to do with access patterns and workload patterns. Picking a winner would be a whole series of blog posts, so I'm going to chicken out and say it's complicated.

Conclusion

If we had to throw these databases away and begin from scratch in 2025, what would change and what would remain?

I’d keep the relational model, atomicity, isolation (but would probably pick SNAPSHOT as a default), strong consistency, SQL, interactive transactions, and the other core design decisions of relational databases. But I’d move durability, read and write scale, and high availability into being distributed rather than single-system concerns. I think that helps with performance and cost, while making these properties easier to achieve. I’d mostly toss out local durability and recovery, and all the huge history of optimizations and data structures around that [3], in favor of getting better properties in the distributed setting. I’d pay more attention to strong internal isolation (in the security sense) between clients and workloads. I’d size caches for a working set of between 30 seconds and 5 minutes of accesses. I’d optimize for read transfers around that 32kB sweet spot from local SSD, and around the 8kB sweet spot for networks.

Probably more stuff too, but this is long enough as-is.

Other topics worth covering include avoiding copies on IO, co-design with virtualization (e.g. see our Aurora Serverless paper), trade-offs of batching, how the relative performance of different isolation levels changes, what promises to give clients, encryption and authorization of data at rest and in motion, dealing with very hot single items, new workloads like vector, verifiable replication journals, handing off changes to analytics systems, access control, multi-tenancy, forking and merging, and even locales.

Footnotes

  1. The reasoning is slightly smarter, thinking about the marginal page and marginal cost of memory, but this simplification works for our purposes here. The marginal cost of memory is particularly interesting in a provisioned system, because it varies between zero (you’ve paid for it already) and huge (you need a bigger instance size). One of the really nice things about serverless (like DSQL) and dynamic scaling (like Aurora Serverless) is that it makes the marginal cost constant, greatly simplifying the task of reasoning about cache size.
  2. Yes, I know that pages are typically 4kB or 2MB, but bear with me here.
  3. Sorry ARIES.