SSDs, power loss protection and fsync latency

Original link: http://smalldatum.blogspot.com/2026/01/ssds-power-loss-protection-and-fsync.html

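## SSD performance and `fsync` latency: summary

This analysis examines the performance impact of frequent `fsync` calls (one after each write) for files accessed with `O_DIRECT`, with a focus on InnoDB's `innodb_flush_method` option. The core finding: **writes to an SSD are fast, but `fsync` can be much slower**, especially on consumer drives that lack power loss protection (PLP).

The study compares servers with different SSDs (Crucial T500, Samsung 990 Pro, Intel D7-P5520, Samsung PM-9a3) and storage configurations (local NVMe, Google Hyperdisk Balanced). The results show that `fsync` latency on consumer SSDs is far higher than on enterprise SSDs with PLP. `fdatasync` usually has lower latency than `fsync` but still affects performance.

Using `O_DIRECT_NO_FSYNC` reduces the frequency of `fsync` calls and so improves performance, but it needs careful consideration. Older implementations had reliability problems, but the feature has since been enhanced to call `fsync` when filesystem metadata must be updated.

**Key takeaways:**

* **Context matters:** performance varies with SSD type and system configuration.
* **PLP helps:** SSDs *without* PLP suffer from slower `fsync`.
* **Test your setup:** understand `fsync` and `fdatasync` latency before deploying performance-critical applications.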

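The Hacker News discussion centers on whether `fsync` or flush commands are needed when writing to SSDs, NVMe drives in particular. The main argument is over whether these commands are redundant for modern enterprise SSDs with power loss protection (PLP).

Some argue that a flush command is a no-op if the drive has a non-volatile write cache (usually battery-backed DRAM) or when direct IO is used. Others point out that the commands still matter for compatibility, especially with consumer SSDs that lack PLP or when direct IO is not used.

One key point is that even drives that *have* PLP sometimes report a volatile write cache, which makes reliable detection difficult. In addition, tests have shown that some drives do not reliably honor flush commands across a power loss.

The discussion highlights the trade-off between performance (flush commands add latency) and data integrity, and stresses that the best approach depends heavily on the specific hardware and software environment. A related article on a high-performance DBMS using `io_uring` was also shared.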

Original text

This has results to measure the impact of calling fsync (or fdatasync) per-write for files opened with O_DIRECT. My goal is to document the impact of the innodb_flush_method option. 

The primary point of this post is to document the claim:

For an SSD without power loss protection, writes are fast but fsync is slow.

The secondary point of this post is to provide yet another example where context matters when reporting performance problems. This post is motivated by results that look bad when run on a server with slow fsync but look OK otherwise. 

tl;dr

  • for my mini PCs I will switch from the Samsung 990 Pro to the Crucial T500 to get lower fsync latency. Both are nice devices but the T500 is better for my use case.
  • with a consumer SSD writes are fast but fsync is often slow
  • use an enterprise SSD if possible; if not, run tests to understand fsync and fdatasync latency

Updates:

InnoDB, O_DIRECT and O_DIRECT_NO_FSYNC

When innodb_flush_method is set to O_DIRECT there are calls to fsync after each batch of writes. While I don't know the source like I used to, I did browse it for this blog post and then I looked at SHOW GLOBAL STATUS counters. I think that InnoDB does the following with it set to O_DIRECT:

  1. Do one large write to the doublewrite buffer, call fsync on that file
  2. Do the batch of in-place (16kb) page writes
  3. Call fsync once per database file that was written by step 2
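
The three steps above can be sketched in Python as follows. This is only an illustration of the pattern as described here, not InnoDB code: the function name `flush_batch`, the tuple layout used for pages and the use of ordinary file descriptors (without O_DIRECT, to keep the sketch portable) are assumptions made for the example.

```python
import os

PAGE_SIZE = 16 * 1024  # page size used in this post (16kb)


def flush_batch(doublewrite_fd, pages):
    """Write one batch of dirty pages using the three steps listed above.

    pages is a list of (data_file_fd, page_number, page_bytes) tuples.
    Returns (num_writes, num_fsyncs) so the fsync-per-write ratio is visible.
    """
    num_writes = 0
    num_fsyncs = 0

    # Step 1: one large write to the doublewrite buffer, then fsync that file.
    os.pwrite(doublewrite_fd, b"".join(data for _, _, data in pages), 0)
    num_writes += 1
    os.fsync(doublewrite_fd)
    num_fsyncs += 1

    # Step 2: the batch of in-place page writes.
    touched = set()
    for fd, page_number, data in pages:
        os.pwrite(fd, data, page_number * PAGE_SIZE)
        num_writes += 1
        touched.add(fd)

    # Step 3: one fsync per data file that was written in step 2.
    for fd in touched:
        os.fsync(fd)
        num_fsyncs += 1

    return num_writes, num_fsyncs
```

With this pattern a batch of 64 pages spread over 2 files does 65 writes and 3 fsyncs, so the number of fsyncs per write is far below 1 even when every batch is synced.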

When set to O_DIRECT_NO_FSYNC the frequency of calls to fsync is greatly reduced, and fsync is only done in cases where important filesystem metadata needs to be updated, such as after extending a file. The reference manual is misleading with respect to the following sentence. I don't think that InnoDB ever does an fsync after each write. It can do an fsync after each batch of writes:

O_DIRECT_NO_FSYNC: InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call after each write operation.

Many years ago it was risky to use O_DIRECT_NO_FSYNC on some filesystems because the feature as implemented (either upstream or in forks) didn't do fsync for cases where it was needed (see comment about metadata above). I experienced problems from this and I only have myself to blame. But the feature has been enhanced to do the right thing. And if the #whynotpostgres crowd wants to snark about MySQL not caring about data, let's not forget that InnoDB had per-page checksums long before Postgres -- those checksums made web-scale life much easier when using less than stellar hardware.

The following table uses results while running the Insert Benchmark for InnoDB to compute the ratio of fsyncs per write using the SHOW GLOBAL STATUS counters:

Innodb_data_fsyncs / Innodb_data_writes

And from this table a few things are clear. First, there isn't an fsync per write with O_DIRECT but there might be an fsync per batch of writes as explained above. Second, the rate of fsyncs is greatly reduced by using O_DIRECT_NO_FSYNC. 

5.7.44  8.0.44  innodb_flush_method
.01046  .00729  O_DIRECT
.00172  .00053  O_DIRECT_NO_FSYNC
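
As a small sketch of how such a ratio can be computed, the function below takes two samples of the SHOW GLOBAL STATUS counters and divides the change in Innodb_data_fsyncs by the change in Innodb_data_writes. Sampling at the start and end of a run and using the difference is an assumption of this example, the function name `fsyncs_per_write` is hypothetical, and the counter values are made up to show the arithmetic.

```python
def fsyncs_per_write(before, after):
    """Return Innodb_data_fsyncs / Innodb_data_writes for the interval between two samples.

    before and after are dicts of SHOW GLOBAL STATUS counters sampled at the
    start and end of a benchmark run.
    """
    fsyncs = after["Innodb_data_fsyncs"] - before["Innodb_data_fsyncs"]
    writes = after["Innodb_data_writes"] - before["Innodb_data_writes"]
    if writes == 0:
        return 0.0
    return fsyncs / writes


# Made-up counter values, chosen only to show the arithmetic.
start = {"Innodb_data_fsyncs": 1_000, "Innodb_data_writes": 50_000}
end = {"Innodb_data_fsyncs": 2_046, "Innodb_data_writes": 150_000}
ratio = fsyncs_per_write(start, end)  # 1046 fsyncs / 100000 writes
```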

Power loss protection

I am far from an expert on this topic, but most SSDs have a write-buffer that makes small writes fast. And one way to achieve speed is to buffer those writes in RAM on the SSD while waiting for enough data to be written to an extent. But that speed means there is a risk of data loss if a server loses power. Some SSDs, especially those marketed as enterprise SSDs, have a feature called power loss protection that makes data loss unlikely. Other SSDs, let's call them consumer SSDs, don't have that feature, while some of the consumer SSDs claim to make a best effort to flush writes from the write buffer on power loss.

One solution to avoiding risk is to only buy enterprise SSDs. But they are more expensive, less common, and many are larger (22110 rather than 2280) because more room is needed for the capacitor or other HW that provides the power loss protection. Note that power loss protection is often abbreviated as PLP.

For devices without power loss protection it is often true that writes are fast but fsync is slow. When fsync is slow then calling fsync more frequently in InnoDB will hurt performance.

Results from fio

I used this fio script to measure performance for writes for files opened with O_DIRECT. The test was run twice per configuration for 5 minutes per run followed by a 5 minute sleep. This was repeated for 1, 2, 4, 8, 16 and 32 fio jobs but I only share results here for 1 job. The configurations tested were:

  • O_DIRECT without fsync, 16kb writes
  • O_DIRECT with an fsync per write, 16kb writes
  • O_DIRECT with an fdatasync per write, 16kb writes
  • O_DIRECT without fsync, 2M writes
  • O_DIRECT with an fsync per write, 2M writes
  • O_DIRECT with an fdatasync per write, 2M writes
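
For readers who want a quick check without fio, the loop below is a rough Python approximation of one of these configurations: write a fixed-size block, optionally call fsync or fdatasync, and time the sync calls. It is not the fio script used for the results in this post. The function names and parameters are hypothetical, and the fallback to a normal open when O_DIRECT is rejected is an assumption made so that the sketch also runs on filesystems without O_DIRECT support.

```python
import mmap
import os
import time


def open_for_write(path, direct=True):
    """Open path for writing, using O_DIRECT when the platform and filesystem allow it."""
    flags = os.O_WRONLY | os.O_CREAT
    if direct and hasattr(os, "O_DIRECT"):
        try:
            return os.open(path, flags | os.O_DIRECT, 0o644)
        except OSError:
            pass  # some filesystems reject O_DIRECT, fall back to a normal open
    return os.open(path, flags, 0o644)


def measure_sync_latency(path, write_size=16 * 1024, num_writes=100,
                         sync="fsync", direct=True):
    """Return (writes_per_sec, mean_sync_latency_usecs) for a write-then-sync loop."""
    sync_fns = {
        "fsync": os.fsync,
        "fdatasync": getattr(os, "fdatasync", os.fsync),  # not available everywhere
        "none": None,
    }
    sync_fn = sync_fns[sync]
    # An anonymous mmap is page aligned, which O_DIRECT requires for the buffer.
    buf = mmap.mmap(-1, write_size)
    buf.write(b"x" * write_size)
    fd = open_for_write(path, direct)
    sync_secs = 0.0
    start = time.perf_counter()
    try:
        for i in range(num_writes):
            written = os.pwrite(fd, buf, i * write_size)
            if written != write_size:
                raise OSError("short write")
            if sync_fn is not None:
                t0 = time.perf_counter()
                sync_fn(fd)
                sync_secs += time.perf_counter() - t0
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        buf.close()
    return num_writes / elapsed, 1e6 * sync_secs / num_writes
```

Calling `measure_sync_latency("/path/on/the/ssd/testfile", sync="fdatasync")` returns writes/s and the mean sync latency in microseconds, which are the same quantities as the w/s and sync columns in the result tables. A short loop like this is much noisier than a 5 minute fio run.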

Results from all tests are here. I did the test on several servers:

  • dell32
    • a large server I have at home. The SSD is a Crucial T500 2TB using ext-4 with discard enabled and Ubuntu 24.04. This is a consumer SSD. While the web claims it has PLP via capacitors, the fsync latency for it was almost 1 millisecond.
  • gcp
    • a c3d-standard-30-lssd from the Google cloud with 2 local NVMe devices using SW RAID 0 and 1TB of Hyperdisk Balanced storage configured for 50,000 IOPs and 800MB/s of throughput. The OS is Ubuntu 24.04 and I repeated tests for both ext-4 and xfs, both with discard enabled. I was not able to determine the brand of the local NVMe devices.
  • hetz
    • an ax162-s from Hetzner with 2 local NVMe devices using SW RAID 1. Via udisksctl status I learned the devices are Intel D7-P5520 (now Solidigm). These are datacenter SSDs and the web claims they have power loss protection. The OS is Ubuntu 24.04 and the drives use ext-4 without discard enabled.
  • ser7
    • one of my mini PCs. The SSD is a Samsung 990 Pro, a consumer SSD without power loss protection.
  • socket2
    • a 2-socket server I have at home. The SSD is a Samsung PM-9a3. This is an enterprise SSD with power loss protection. The OS is Ubuntu 24.04 and the drives use ext-4 with discard enabled.

Results: overview

This table lists fsync and fdatasync latency per server:

  • for servers with consumer SSDs (dell, ser7) the latency is much larger on the ser7 that uses a Samsung 990 Pro than on the dell that uses a Crucial T500. This is to be expected given that the T500 has PLP while the 990 Pro does not.
  • sync latency is much lower on servers with enterprise SSDs
  • sync latency after 2M writes is sometimes much larger than after 16kb writes
  • for the Google server with Hyperdisk Balanced storage the fdatasync latency was good but fsync latency was high. With the local NVMe devices the latencies were larger than for enterprise SSDs but much smaller than for consumer SSDs.

--- Sync latency in microseconds for sync after 16kb writes

dell    hetz    ser7    socket2
891.1   12.4    2974.2  1.6     fsync
447.4    9.8    2783.2  0.7     fdatasync

gcp
local devices           hyperdisk
ext-4   xfs             ext-4   xfs
56.2    39.5            738.1   635.0   fsync
28.1    29.0             46.8    46.0   fdatasync

--- Sync latency in microseconds for sync after 2M writes

dell    hetz    ser7    socket2
980.1   58.2    5396.8  139.1   fsync
449.7   10.8    3508.2    2.2   fdatasync

gcp
local devices           hyperdisk
ext-4   xfs             ext-4   xfs
1020.4  916.8           821.2   778.9   fsync
 832.4  809.7            63.6    51.2   fdatasync

Results: dell

Summary:

  • Write throughput drops dramatically when there is an fsync or fdatasync per write because sync latency is large.
  • This server uses a consumer SSD, so high sync latency is expected
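
The size of the drop can be roughly predicted from the sync latency. With a single job the writes are serialized, so each write costs about the no-sync write time plus one sync. The sketch below applies that model to the 16 KB numbers reported for this server. The model and the function name `predicted_writes_per_sec` are assumptions made for illustration, not part of the benchmark.

```python
def predicted_writes_per_sec(no_sync_writes_per_sec, sync_latency_usecs):
    """Estimate writes/s for one job that does a sync after every write."""
    write_latency_usecs = 1_000_000 / no_sync_writes_per_sec
    return 1_000_000 / (write_latency_usecs + sync_latency_usecs)


# Inputs are the no-sync rate and sync latencies from the 16 KB results for this server.
fsync_estimate = predicted_writes_per_sec(43400, 891.1)      # measured: 1083 writes/s
fdatasync_estimate = predicted_writes_per_sec(43400, 447.4)  # measured: 2100 writes/s
```

The estimates are about 1094 writes/s with fsync and about 2126 writes/s with fdatasync, close to the measured 1083 and 2100. With one sync per write, throughput at low concurrency is almost entirely determined by sync latency.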

Legend:

  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)

16 KB writes

w/s     MB/s    sync    test

43400   646.6   0.0     no-sync

43500   648.5   0.0     no-sync

-

1083    16.1    891.1   fsync

1085    16.2    889.2   fsync

-

2100    31.3    447.4   fdatasync

2095    31.2    448.6   fdatasync

2 MB writes

w/s     MB/s    sync    test

2617    4992.5  0.0     no-sync

2360    4502.3  0.0     no-sync

-

727     1388.5  980.1   fsync

753     1436.2  942.5   fsync

-

1204    2297.4  449.7   fdatasync

1208    2306.0  446.9   fdatasync

Results: gcp

Summary

  • At low concurrency (1 fio job) local NVMe devices have more throughput with and without a sync per write. They also have lower sync latency for 16 KB writes, but not for 2 MB writes.
  • At higher concurrency (32 fio jobs), the Hyperdisk Balanced setup provides similar throughput to local NVMe and would do even better had I paid more to get more IOPs and throughput. Results don't have nice formatting but are here for xfs on the local and Hyperdisk Balanced devices.
  • fsync latency is ~2X larger than fdatasync on the local devices and closer to 15X larger on the Hyperdisk Balanced setup. That difference is interesting. I wonder what the results are for Hyperdisk Extreme.

Legend:

  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)

--- ext-4 and local devices

16 KB writes

w/s     MB/s    sync    test

10100   150.7   0.0     no-sync

10300   153.5   0.0     no-sync

-

6555    97.3    56.2    fsync

6607    98.2    55.1    fsync

-

8189    122.1   28.1    fdatasync

8157    121.1   28.2    fdatasync

2 MB writes

w/s     MB/s    sync    test

390     744.8   0.0     no-sync

390     744.8   0.0     no-sync

-

388     741.0   1020.4  fsync

388     741.0   1012.7  fsync

-

390     744.8   832.4   fdatasync

390     744.8   869.6   fdatasync

--- xfs and local devices

16 KB writes

w/s     MB/s    sync    test

9866    146.9   0.0     no-sync

9730    145.0   0.0     no-sync

-

7421    110.6   39.5    fsync

7537    112.5   38.3    fsync

-

8100    121.1   29.0    fdatasync

8117    121.1   28.8    fdatasync

2 MB writes

w/s     MB/s    sync    test

390     744.8   0.0     no-sync

390     744.8   0.0     no-sync

-

389     743.9   916.8   fsync

389     743.9   919.1   fsync

-

390     744.8   809.7   fdatasync

390     744.8   806.5   fdatasync

--- ext-4 and Hyperdisk Balanced

16 KB writes

w/s     MB/s    sync    test

2093    31.2    0.0     no-sync

2068    30.8    0.0     no-sync

-

804     12.0    738.1   fsync

798     11.9    740.6   fsync

-

1963    29.3    46.8    fdatasync

1922    28.6    49.0    fdatasync

2 MB writes

w/s     MB/s    sync    test

348     663.8   0.0     no-sync

367     701.0   0.0     no-sync

-

278     531.2   821.2   fsync

271     517.8   814.1   fsync

-

358     683.8   63.6    fdatasync

345     659.0   64.5    fdatasync

--- xfs and Hyperdisk Balanced

16 KB writes

w/s     MB/s    sync    test

2033    30.3    0.0     no-sync

2004    29.9    0.0     no-sync

-

870     13.0    635.0   fsync

858     12.8    645.0   fsync

-

1787    26.6    46.0    fdatasync

1727    25.7    49.6    fdatasync

2 MB writes

w/s     MB/s    sync    test

343     655.2   0.0     no-sync

343     655.2   0.0     no-sync

-

267     511.2   778.9   fsync

268     511.2   774.7   fsync

-

347     661.8   51.2    fdatasync

336     642.8   54.4    fdatasync

Results: hetz

Summary

  • this has an enterprise SSD with excellent (low) sync latency

Legend:

  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)

16 KB writes

w/s     MB/s    sync    test

37700   561.7   0.0     no-sync

37500   558.9   0.0     no-sync

-

25200   374.8   12.4    fsync

25100   374.8   12.4    fsync

-

27600   411.0   0.0     fdatasync

27200   404.4   9.8     fdatasync

2 MB writes

w/s     MB/s    sync    test

1833    3497.1  0.0     no-sync

1922    3667.8  0.0     no-sync

-

1393    2656.9  58.2    fsync

1355    2585.4  59.6    fsync

-

1892    3610.6  10.8    fdatasync

1922    3665.9  10.8    fdatasync

Results: ser7

Summary:

  • this has a consumer SSD with high sync latency
  • results had much variance (see the 2MB results below), as did results at higher concurrency. This is a great SSD, but not for my use case.

Legend:

  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)

16 KB writes

w/s     MB/s    sync    test

34000   506.4   0.0     no-sync

40200   598.9   0.0     no-sync

-

325     5.0     2974.2  fsync

333     5.1     2867.3  fsync

-

331     5.1     2783.2  fdatasync

330     5.0     2796.1  fdatasync

2 MB writes

w/s     MB/s    sync    test

362     691.4   0.0     no-sync

364     695.2   0.0     no-sync

-

67      128.7   10828.3 fsync

114     218.4   5396.8  fsync

-

141     268.9   3864.0  fdatasync

192     368.1   3508.2  fdatasync

Results: socket2

Summary:

  • this has an enterprise SSD with excellent (low) sync latency after small writes, but fsync latency after 2MB writes is much larger

Legend:

  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)

16 KB writes

w/s     MB/s    sync    test

49500   737.2   0.0     no-sync

49300   734.3   0.0     no-sync

-

44500   662.8   1.6     fsync

45400   676.2   1.5     fsync

-

46700   696.2   0.7     fdatasync

45200   674.2   0.7     fdatasync

2 MB writes

w/s     MB/s    sync    test

707     1350.4  0.0     no-sync

708     1350.4  0.0     no-sync

-

703     1342.8  139.1   fsync

703     1342.8  122.5   fsync

-

707     1350.4  2.2     fdatasync

707     1350.4  2.1     fdatasync
