This post has results that measure the impact of calling fsync (or fdatasync) per write for files opened with O_DIRECT. My goal is to document the impact of the innodb_flush_method option.
The primary point of this post is to document the claim:
For an SSD without power loss protection, writes are fast but fsync is slow.
The secondary point of this post is to provide yet another example where context matters when reporting performance problems. This post is motivated by results that look bad when run on a server with slow fsync but look OK otherwise.
tl;dr
- for my mini PCs I will switch from the Samsung 990 Pro to the Crucial T500 to get lower fsync latency. Both are nice devices but the T500 is better for my use case.
- with a consumer SSD writes are fast but fsync is often slow
- use an enterprise SSD if possible; if not, run tests to understand fsync and fdatasync latency
Updates:
InnoDB, O_DIRECT and O_DIRECT_NO_FSYNC
When innodb_flush_method is set to O_DIRECT there are calls to fsync after each batch of writes. While I don't know the source like I used to, I did browse it for this blog post and then looked at SHOW GLOBAL STATUS counters. I think that InnoDB does the following when it is set to O_DIRECT (a sketch of this pattern follows the list):
- Do one large write to the doublewrite buffer, call fsync on that file
- Do the batch of in-place (16kb) page writes
- Call fsync once per database file that was written by step 2
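Here is that pattern as plain syscalls. This is not InnoDB's source, just my reading of it expressed as code: the flush_batch function and its arguments are hypothetical stand-ins, and with O_DIRECT the buffers passed to pwrite must be block-aligned.

import os

# Sketch of the flush pattern described above, not InnoDB's actual code.
# pages is a list of (fd, buf, offset) tuples for dirty 16kb pages; with
# O_DIRECT each buf must be block-aligned.
def flush_batch(dblwr_fd, dblwr_buf, pages, o_direct_no_fsync=False):
    # Step 1: one large write to the doublewrite buffer, then fsync it
    os.pwrite(dblwr_fd, dblwr_buf, 0)
    os.fsync(dblwr_fd)
    # Step 2: the batch of in-place 16kb page writes
    touched = set()
    for fd, buf, offset in pages:
        os.pwrite(fd, buf, offset)
        touched.add(fd)
    # Step 3: call fsync once per database file written by step 2.
    # With O_DIRECT_NO_FSYNC this is skipped except when filesystem
    # metadata must be updated (for example after extending a file)
    if not o_direct_no_fsync:
        for fd in touched:
            os.fsync(fd)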
When it is set to O_DIRECT_NO_FSYNC the calls to fsync are greatly reduced and are only done in cases where important filesystem metadata needs to be updated, such as after extending a file. The reference manual is misleading WRT the following sentence. I don't think that InnoDB ever does an fsync after each write; it can do an fsync after each batch of writes:
O_DIRECT_NO_FSYNC: InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call after each write operation.
Many years ago it was risky to use O_DIRECT_NO_FSYNC on some filesystems because the feature as implemented (either upstream or in forks) didn't do fsync for cases where it was needed (see the comment about metadata above). I experienced problems from this and I only have myself to blame. But the feature has been enhanced to do the right thing. And if the #whynotpostgres crowd wants to snark about MySQL not caring about data, let's not forget that InnoDB had per-page checksums long before Postgres -- those checksums made web-scale life much easier when using less-than-stellar hardware.
The following table uses results from running the Insert Benchmark for InnoDB to compute the ratio of fsyncs per write via these SHOW GLOBAL STATUS counters:
Innodb_data_fsyncs / Innodb_data_writes
And from this table a few things are clear. First, there isn't an fsync per write with O_DIRECT but there might be an fsync per batch of writes as explained above. Second, the rate of fsyncs is greatly reduced by using O_DIRECT_NO_FSYNC. A sketch of the computation follows the table.
5.7.44 8.0.44
.01046 .00729 O_DIRECT
.00172 .00053 O_DIRECT_NO_FSYNC
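For anyone who wants to repeat the arithmetic, here is a minimal sketch that reads the two counters via the mysql CLI. It assumes the CLI can connect with default credentials; for a benchmark run you would sample the counters before and after the run and use the deltas.

import subprocess

# Fetch one SHOW GLOBAL STATUS counter via the mysql CLI
# (-B = batch output, -N = skip column names)
def status_counter(name):
    out = subprocess.run(
        ["mysql", "-B", "-N", "-e", "SHOW GLOBAL STATUS LIKE '%s'" % name],
        capture_output=True, text=True, check=True).stdout
    return int(out.split()[1])

# Ratio of fsyncs per write; for a benchmark, compute this from the
# deltas of the counters measured before and after the run
fsyncs = status_counter("Innodb_data_fsyncs")
writes = status_counter("Innodb_data_writes")
print("Innodb_data_fsyncs / Innodb_data_writes = %.5f" % (fsyncs / writes))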
Power loss protection
I am far from an expert on this topic, but most SSDs have a write buffer that makes small writes fast. One way to achieve that speed is to buffer those writes in RAM on the SSD while waiting for enough data to be written to an extent. But that speed means there is a risk of data loss if a server loses power. Some SSDs, especially those marketed as enterprise SSDs, have a feature called power loss protection that makes data loss unlikely. Other SSDs, let's call them consumer SSDs, don't have that feature, although some of them claim to make a best effort to flush the write buffer on power loss.
One way to avoid that risk is to only buy enterprise SSDs. But they are more expensive, less common, and many are larger (22110 rather than 2280) because more room is needed for the capacitors or other HW that provide the power loss protection. Note that power loss protection is often abbreviated as PLP.
For devices without power loss protection it is often true that writes are fast but fsync is slow. When fsync is slow then calling fsync more frequently in InnoDB will hurt performance.
Results from fio
I used this fio script to measure the performance of writes to files opened with O_DIRECT. The test was run twice per configuration for 5 minutes per run followed by a 5 minute sleep. This was repeated for 1, 2, 4, 8, 16 and 32 fio jobs but I only share results here for 1 job. The configurations tested were (a minimal sketch of one of them follows the list):
- O_DIRECT without fsync, 16kb writes
- O_DIRECT with an fsync per write, 16kb writes
- O_DIRECT with an fdatasync per write, 16kb writes
- O_DIRECT without fsync, 2M writes
- O_DIRECT with an fsync per write, 2M writes
- O_DIRECT with an fdatasync per write, 2M writes
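For a quick approximation without fio, here is a minimal sketch of one configuration (16kb O_DIRECT writes with an fdatasync per write). This is not the script I ran: the file path and write count are hypothetical, and it assumes Linux, where O_DIRECT needs block-aligned buffers (the anonymous mmap provides page alignment).

import mmap, os, time

PATH = "/data/fio-scratch"   # hypothetical scratch file on the device to test
NWRITES = 10000
WSIZE = 16384

buf = mmap.mmap(-1, WSIZE)   # anonymous mmap is page-aligned, as O_DIRECT requires
buf.write(b"x" * WSIZE)

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
sync_secs = 0.0
for i in range(NWRITES):
    os.pwrite(fd, buf, i * WSIZE)      # 16kb O_DIRECT write
    t0 = time.monotonic()
    os.fdatasync(fd)                   # sync per write, timed
    sync_secs += time.monotonic() - t0
os.close(fd)
print("avg fdatasync latency: %.1f usec" % (sync_secs * 1e6 / NWRITES))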
The servers tested were:
- dell32
- a large server I have at home. The SSD is a 2TB Crucial T500 using ext-4 with discard enabled and Ubuntu 24.04. This is a consumer SSD. While the web claims it has PLP via capacitors, the fsync latency for it was almost 1 millisecond.
- gcp
- a c3d-standard-30-lssd from the Google cloud with 2 local NVMe devices using SW RAID 0 and 1TB of Hyperdisk Balanced storage configured for 50,000 IOPS and 800MB/s of throughput. The OS is Ubuntu 24.04 and I repeated tests for both ext-4 and xfs, both with discard enabled. I was not able to determine the brand of the local NVMe devices.
- hetz
- an ax162-s from Hetzner with 2 local NVMe devices using SW RAID 1. Via udisksctl status I learned the devices are Intel D7-P5520 (now Solidigm). These are datacenter SSDs and the web claims they have power loss protection. The OS is Ubuntu 24.04 and the drives use ext-4 without discard enabled.
- ser7
- a mini PC I have at home. The SSD is a Samsung 990 Pro, a consumer SSD without power loss protection.
- socket2
- a 2-socket server I have at home. The SSD is a Samsung PM9A3. This is an enterprise SSD with power loss protection. The OS is Ubuntu 24.04 and the drives use ext-4 with discard enabled.
Results: overview
The tables below list fsync and fdatasync latency per server. A few observations:
- for the servers with consumer SSDs (dell, ser7) the latency is much larger on the ser7 that uses a Samsung 990 Pro than on the dell that uses a Crucial T500. This is to be expected given that the T500 claims to have PLP while the 990 Pro does not.
- sync latency is much lower on servers with enterprise SSDs
- sync latency after 2M writes is sometimes much larger than after 16kb writes
- for the Google server with Hyperdisk Balanced storage the fdatasync latency was good but fsync latency was high. With the local NVMe devices the latencies were larger than for enterprise SSDs but much smaller than for consumer SSDs.
--- Sync latency in microseconds for sync after 16kb writes
dell hetz ser7 socket2
891.1 12.4 2974.2 1.6 fsync
447.4 9.8 2783.2 0.7 fdatasync
gcp
local devices hyperdisk
ext-4 xfs ext-4 xfs
56.2 39.5 738.1 635.0 fsync
28.1 29.0 46.8 46.0 fdatasync
--- Sync latency in microseconds for sync after 2M writes
dell hetz ser7 socket2
980.1 58.2 5396.8 139.1 fsync
449.7 10.8 3508.2 2.2 fdatasync
gcp
local devices hyperdisk
ext-4 xfs ext-4 xfs
1020.4 916.8 821.2 778.9 fsync
832.4 809.7 63.6 51.2 fdatasync
Results: dell
Summary:
- Write throughput drops dramatically when there is an fsync or fdatasync per write because sync latency is large.
- This server uses a consumer SSD so high sync latency is expected
Legend:
- w/s - writes/s
- MB/s - MB written/s
- sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s MB/s sync test
43400 646.6 0.0 no-sync
43500 648.5 0.0 no-sync
-
1083 16.1 891.1 fsync
1085 16.2 889.2 fsync
-
2100 31.3 447.4 fdatasync
2095 31.2 448.6 fdatasync
2 MB writes
w/s MB/s sync test
2617 4992.5 0.0 no-sync
2360 4502.3 0.0 no-sync
-
727 1388.5 980.1 fsync
753 1436.2 942.5 fsync
-
1204 2297.4 449.7 fdatasync
1208 2306.0 446.9 fdatasync
Results: gcp
Summary
- Local NVMe devices have lower sync latency and more throughput with and without a sync per write at low concurrency (1 fio job).
- At higher concurrency (32 fio jobs), the Hyperdisk Balanced setup provides similar throughput to local NVMe and would do even better had I paid more to get more IOPS and throughput. Results don't have nice formatting but are here for xfs on the local and Hyperdisk Balanced devices.
- fsync latency is ~2X larger than fdatasync on the local devices and closer to 15X larger on the Hyperdisk Balanced setup. That difference is interesting. I wonder what the results are for Hyperdisk Extreme.
Legend:
- w/s - writes/s
- MB/s - MB written/s
- sync - latency per sync (fsync or fdatasync)
--- ext-4 and local devices
16 KB writes
w/s MB/s sync test
10100 150.7 0.0 no-sync
10300 153.5 0.0 no-sync
-
6555 97.3 56.2 fsync
6607 98.2 55.1 fsync
-
8189 122.1 28.1 fdatasync
8157 121.1 28.2 fdatasync
2 MB writes
w/s MB/s sync test
390 744.8 0.0 no-sync
390 744.8 0.0 no-sync
-
388 741.0 1020.4 fsync
388 741.0 1012.7 fsync
-
390 744.8 832.4 fdatasync
390 744.8 869.6 fdatasync
--- xfs and local devices
16 KB writes
w/s MB/s sync test
9866 146.9 0.0 no-sync
9730 145.0 0.0 no-sync
-
7421 110.6 39.5 fsync
7537 112.5 38.3 fsync
-
8100 121.1 29.0 fdatasync
8117 121.1 28.8 fdatasync
2 MB writes
w/s MB/s sync test
390 744.8 0.0 no-sync
390 744.8 0.0 no-sync
-
389 743.9 916.8 fsync
389 743.9 919.1 fsync
-
390 744.8 809.7 fdatasync
390 744.8 806.5 fdatasync
--- ext-4 and Hyperdisk Balanced
16 KB writes
w/s MB/s sync test
2093 31.2 0.0 no-sync
2068 30.8 0.0 no-sync
-
804 12.0 738.1 fsync
798 11.9 740.6 fsync
-
1963 29.3 46.8 fdatasync
1922 28.6 49.0 fdatasync
2 MB writes
w/s MB/s sync test
348 663.8 0.0 no-sync
367 701.0 0.0 no-sync
-
278 531.2 821.2 fsync
271 517.8 814.1 fsync
-
358 683.8 63.6 fdatasync
345 659.0 64.5 fdatasync
--- xfs and Hyperdisk Balanced
16 KB writes
w/s MB/s sync test
2033 30.3 0.0 no-sync
2004 29.9 0.0 no-sync
-
870 13.0 635.0 fsync
858 12.8 645.0 fsync
-
1787 26.6 46.0 fdatasync
1727 25.7 49.6 fdatasync
2 MB writes
w/s MB/s sync test
343 655.2 0.0 no-sync
343 655.2 0.0 no-sync
-
267 511.2 778.9 fsync
268 511.2 774.7 fsync
-
347 661.8 51.2 fdatasync
336 642.8 54.4 fdatasync
Results: hetz
Summary
- this server has enterprise SSDs with excellent (low) sync latency
Legend:
- w/s - writes/s
- MB/s - MB written/s
- sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s MB/s sync test
37700 561.7 0.0 no-sync
37500 558.9 0.0 no-sync
-
25200 374.8 12.4 fsync
25100 374.8 12.4 fsync
-
27600 411.0 0.0 fdatasync
27200 404.4 9.8 fdatasync
2 MB writes
w/s MB/s sync test
1833 3497.1 0.0 no-sync
1922 3667.8 0.0 no-sync
-
1393 2656.9 58.2 fsync
1355 2585.4 59.6 fsync
-
1892 3610.6 10.8 fdatasync
1922 3665.9 10.8 fdatasync
Results: ser7
Summary:
- this has a consumer SSD with high sync latency
- results had much variance (see the 2MB results below), as did results at higher concurrency. This is a great SSD, but not for my use case.
Legend:
- w/s - writes/s
- MB/s - MB written/s
- sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s MB/s sync test
34000 506.4 0.0 no-sync
40200 598.9 0.0 no-sync
-
325 5.0 2974.2 fsync
333 5.1 2867.3 fsync
-
331 5.1 2783.2 fdatasync
330 5.0 2796.1 fdatasync
2 MB writes
w/s MB/s sync test
362 691.4 0.0 no-sync
364 695.2 0.0 no-sync
-
67 128.7 10828.3 fsync
114 218.4 5396.8 fsync
-
141 268.9 3864.0 fdatasync
192 368.1 3508.2 fdatasync
Results: socket2
Summary:
- this has an enterprise SSD with excellent (low) sync latency after small writes, but fsync latency after 2MB writes is much larger
Legend:
- w/s - writes/s
- MB/s - MB written/s
- sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s MB/s sync test
49500 737.2 0.0 no-sync
49300 734.3 0.0 no-sync
-
44500 662.8 1.6 fsync
45400 676.2 1.5 fsync
-
46700 696.2 0.7 fdatasync
45200 674.2 0.7 fdatasync
2 MB writes
w/s MB/s sync test
707 1350.4 0.0 no-sync
708 1350.4 0.0 no-sync
-
703 1342.8 139.1 fsync
703 1342.8 122.5 fsync
-
707 1350.4 2.2 fdatasync
707 1350.4 2.1 fdatasync