For those not already familiar, DWPD is an initialism for (total) Drive Writes Per Day, and is intended as a specification for SSD write endurance.
With very few exceptions, all solid-state drives currently available for purchase are based on NAND flash. At the very lowest level, they work by charging an individual cell to indicate that it stores a 1, or discharging that cell to indicate that it stores a 0.
(If you’re already bristling because we’ve just described SLC flash and most modern drives are TLC or higher, don’t fret–we’ll get to that in a bit.)
Compared to mechanical hard drives, NAND flash SSDs offer incredibly increased IOPS, lower latency, greater resistance to fragmentation-related performance issues, and more. They do, however, introduce a few new problems, one of which is limited write endurance.
We don’t think we can really do the DWPD question justice without covering a fair bit of information on the hows and whys of drive failures first–so buckle up!
Fail Different
With mechanical hard drives, failure is typically not due to degradation of the media itself–it is more commonly either a mechanical issue (such as a seized bearing or a crashed head), an electrical issue (such as a motor failure due to a short in the windings), or a controller failure.
Although SSDs largely do away with both the mechanical and electrical issues, they are still quite subject to controller failure–and they also introduce a couple of new potential issues with the media itself.
As the NAND flash media ages, it becomes both slower to charge and more difficult to accurately charge to a precise desired voltage. In addition, NAND flash cells cannot maintain a charge unaided forever–the voltage in a charged cell slowly decreases over time, which can change the apparent stored value in that cell–and age and use can also cause this problem to accelerate.
It may be helpful at this point to list and briefly describe common failure modes and their symptoms.
Mechanical HDD Failure Modes
For the greybeards among us, it might be helpful to remind ourselves of the ways that mechanical drives fail… and for the younger crowd, it might be helpful to hear about them potentially for the first time!
- Controller failure: entire drive suddenly drops off the SATA/SAS bus, despite the motor spinning up just fine
- Head failure: one or more drive heads may crash (damaging the media) after a violent physical event, or may become deranged and begin returning bad data (resulting in I/O errors). May or may not be accompanied by repeated “clunk” noise with attempted seeks
- Motor failure: the drive may or may not still appear on the SATA/SAS bus, but audibly no longer spins up when powered on, or in some cases, spins up but at the wrong speed, which will generally be accompanied by lots of failed head-seek “clunks”
- Media failure: especially with advanced age, an occasional sector may become unreadable due to a microscopic fault in the media at a physical location. This results in I/O errors when attempting to access the bad sector(s), and/or the drive remapping those sector(s) to a different physical area of the drive reserved for exactly this purpose
In many, if not most, cases, when a mechanical hard drive suddenly fails, the data is technically still there and can be retrieved. This process is called forensic recovery, and generally involves opening the failed drive up in a surgical cleanroom, removing its platters, and installing them in a working drive of the same make and model.
After the physical part of a forensic recovery, surgery on a partially corrupt filesystem is generally also required. All in all, the cost of this sort of data recovery typically starts at $800 per drive for a fairly simple case, and can easily cost more, especially in the case of complicating factors like RAID or other multiple-drive distributed systems!
NAND Flash SSD Failure Modes
This brings us back to the main topic of today’s article, NAND flash SSDs. This is the storage mechanism found in nearly every consumer device on the market today, ranging from eMMC flash found in smartphones, tablets, and IoT gadgets through SATA and NVMe M.2 consumer SSDs, all the way up to massive enterprise-targeted and hot-swappable NVMe U.2 drives.
These drives are logically more complex than mechanical HDDs, but they’re far simpler otherwise. Here are the typical failure modes for modern SSDs:
- Controller failure: by far the most common catastrophic failure you’ll see in SSDs, and it looks just like the controller failures we saw in physical HDDs: typically, the entire drive simply drops off the bus as though it were no longer there.
- Failure to retain charge: typically only seen in SSDs, thumb drives, and similar devices left unpowered for long periods of time. The longer NAND flash is left unpowered, the more likely it is that slow charge dissipation will make it difficult or impossible to determine the exact charge state each cell is intended to have, and therefore the data it stores.
- Write endurance exhaustion: each NAND flash cell degrades slightly each time it is charged and discharged (this is referred to as a P/E cycle, for Program/Erase). Eventually, it becomes so difficult to store precisely the charge level you’d like to store that the drive effectively becomes “read only.” In practice, the performance has typically degraded far enough to prompt replacement long before a true “read-only” state is reached.
One argument proponents of mechanical hard drives love to repeat is that when your hard drive fails, you can send it off to an outfit like DriveSavers for forensic recovery. (Most of the people making that argument, in our opinion, have never actually done this or paid that forensic recovery invoice, but we digress.)
In fact, SSDs are also candidates for forensic recovery–just as the platters in a mechanical HDD can be removed and reinstalled in a mechanically functional drive, the media in an SSD with a failed controller can be reattached to a functioning controller.
Write Endurance Exhaustion
Now that we’ve got a better idea of how and why drives can fail, let’s talk about write endurance exhaustion in NAND flash SSDs.
As we covered earlier, each NAND flash cell is only rated for a certain, limited number of total P/E cycles. With each successive cycle, that cell gets a bit “sloppier”—slower and less accurate to charge, and frequently slower and less accurate to read as well.
How Symptoms Manifest
As the cells get sloppier, we see this at first as a simple loss of performance–sloppy cells take longer to charge and may not retain the correct value on the first attempt, requiring the controller’s continued attention until the proper voltage level is set and retained.
We can and usually do see a loss of read performance as well–the older, sloppier cells are likely to return values “close to” what they ought to return, which may be misinterpreted as an adjacent value–for example, is a cell charged to 40% of its rail voltage intended to return a 0, or a 1?
Much like mechanical hard drives, SSDs feature relatively simple hardware-level CRC error detection. This error detection is far from foolproof, but it generally catches hundreds or thousands of errors for every one that slips past in a hash collision. When the controller gets a CRC error, it re-reads the cells in question and tries again. If it gets a CRC pass after a re-read, it returns what it hopes is the correct data to the system.
Ideally, a controller which detects a CRC failure will also refresh the cells in question–erase them, and rewrite them (or a different set of physical cells, logically mapped to replace the failing ones) with the CRC-verified values. A good controller will also recognize when a charge level has been maintained for longer than spec recommends and automatically erase and reprogram the correct value into that cell (or a different one, as above), to mitigate charge dissipation issues, which also get worse as write endurance approaches exhaustion.
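To make that re-read and refresh behavior a little more concrete, here is a deliberately simplified toy model in Python (not any vendor's actual firmware): a read is retried until its CRC checks out, and the verified data is then reprogrammed to restore the cell charges.

```python
import random
import zlib

# Toy model of the behavior described above: re-read on CRC mismatch, and
# refresh (reprogram) the data once a read has been verified. The error
# rate and data structures are purely illustrative.

def raw_read(block):
    """Simulate a raw NAND read that occasionally returns a flipped bit."""
    data = bytearray(block["data"])
    if random.random() < 0.3:  # hypothetical misread probability, for the demo
        i = random.randrange(len(data))
        data[i] ^= 1 << random.randrange(8)
    return bytes(data)

def read_block(block, max_retries=8):
    """Re-read until the stored CRC matches, then refresh the worn cells."""
    for attempt in range(max_retries):
        data = raw_read(block)
        if zlib.crc32(data) == block["crc"]:
            if attempt > 0:
                # Refresh: rewrite the verified data (here, in place; a real
                # controller may remap it to healthier physical cells).
                block["data"] = data
            return data
    raise IOError("unrecoverable read error")

block = {"data": b"some sector contents", "crc": zlib.crc32(b"some sector contents")}
print(read_block(block))
```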
Capacity vs Longevity: The Ugly Side
In the opening statements, we referred to a charged cell representing a one and a discharged cell representing a zero. This is a reasonably accurate description of how Single Level Cell (SLC) flash operates, but most of the NAND flash you encounter is not SLC!
The levels can be confusing because they refer to the number of bits stored per cell, not to the number of discrete charge levels necessary. A single-level cell requires two charge states, representing a single 0 or 1. By contrast, an MLC cell must keep track of four (2^2) charge states to store two bits, and a TLC cell must manage eight (2^3) charge states to store three bits:
- Single Level (SLC): two charge states / values (0,1)
- Multi (dual) Level (MLC): four charge states / values (0, 1, 2, 3)
- Triple Level (TLC): eight charge states / values (0, 1, 2, 3, 4, 5, 6, 7)
- Quad Level (QLC): sixteen charge states / values (0, 1, 2, 3 … 15)
- Penta Level (PLC): thirty-two charge states / values (0, 1, 2, 3… 31)
The majority of modern NAND flash is TLC media, often with a small cache area of the same physical media being used in SLC mode (which is much, much faster).
Although SLC is the most performant, highest endurance, and most reliable encoding method, it’s expensive. By contrast, MLC offers literally double the storage that SLC does, in the same number of cells!
It’s very difficult to turn down the doubling of capacity you get moving from SLC to MLC, and even the 50% bump in capacity you get when moving from MLC to TLC is pretty compelling. But once you’re looking at a mere 33% bump from TLC to QLC–let alone the 25% you get when moving from QLC to PLC–the trade-offs get starker.
Worse, we’re seeing the ugly side of an exponential scale issue: in order to add each new bit, we have to double the number of charge states that we can discretely set and reliably identify.
Unsurprisingly, the more closely we need to set and examine the precise charge state of a cell, the longer each operation takes, the less reliable it is, and the larger impact that write endurance exhaustion (and charge dissipation over time) has on that cell!
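If it helps to see that arithmetic laid out, the quick Python sketch below (purely illustrative) prints the charge-state count and the incremental capacity gain for each encoding level:

```python
# Each extra bit per cell doubles the number of charge states the controller
# must distinguish, while adding progressively less capacity relative to the
# previous level. Prints 2/4/8/16/32 states and +100%/+50%/+33%/+25% gains.
for bits, name in enumerate(["SLC", "MLC", "TLC", "QLC", "PLC"], start=1):
    states = 2 ** bits
    gain = "" if bits == 1 else f", +{100 / (bits - 1):.0f}% capacity vs previous level"
    print(f"{name}: {bits} bit(s) per cell, {states} charge states{gain}")
```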
Capacity vs Longevity: The Pretty Side
Ultimately, NAND flash write endurance is rated in P/E (Program/Erase) cycles per cell, and it does not really differ much from one manufacturer to another. The problem is, we need to manage endurance for an entire drive and its workload, not a single cell.
As we covered in the last section, the more data we expect each individual cell to store, the lower the performance and the shorter lifespan we can expect from that cell. But the reverse is true if we add more physical cells to the same drive!
The more physical cells a drive has on which to store data, the fewer P/E cycles are needed to write the same amount of data over the same number of years. In other words, within the same storage class (SLC, MLC, TLC), you can expect a larger drive to last longer!
This frequently means it’s a good idea to buy a much larger SSD than you might naively believe you need, simply by looking at the amount of data you permanently store. In many cases, the majority of the actual write operations committed to disk are ephemeral: they’re log file entries, temp files used by operating systems and applications, writes to swap space, and so forth.
Although these “ephemeral” writes–if properly managed–may never really get your attention in terms of drive capacity needed, since they (hopefully) disappear at roughly the same rate they’re written, they can easily form most of the actual wear and tear on your SSDs.
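As a rough back-of-the-envelope sketch (with purely hypothetical write volumes), the same daily write load spread across a larger drive costs each cell proportionally fewer P/E cycles:

```python
# Hypothetical numbers for illustration only: the same host write volume
# spread across more cells means fewer P/E cycles per cell. Write
# amplification and uneven wear are ignored for simplicity.
daily_writes_gb = 40   # assumed host writes per day, much of it ephemeral
years = 5
total_writes_gb = daily_writes_gb * 365 * years

for capacity_gb in (512, 1024, 2048):
    cycles_per_cell = total_writes_gb / capacity_gb
    print(f"{capacity_gb:>4} GB drive: ~{cycles_per_cell:.0f} P/E cycles per cell over {years} years")
```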
Real-World Expectations
In our rather extensive experience, the majority of consumer SSDs are replaced when the consumer needs more space than the original SSD offered… but they’re also noticeably less performant by then, and that loss of performance is the next most common reason for consumer SSDs to be replaced!
In particular, we see write endurance exhaustion as one of the most common reasons that smartphones, tablets, and cheap IoT gear get replaced: after many years of program/erase cycles on the smallest viable amount of the cheapest possible eMMC flash, the device storage gets erratic.
If your last phone or tablet got really slow and unreliable in terms of booting up or opening and closing applications–despite generally doing okay running those applications once they got going–you most likely experienced write endurance exhaustion.
Similarly, a three-year-old laptop will generally exhibit markedly lower I/O performance–when actually tested–than it did when new, and this is also largely due to the beginnings of write endurance exhaustion.
Mitigating Write Endurance Problems
Now that we’ve established that write endurance is, in fact, a thing which can cause problems, and talked about some of the ways that it might be seen, let’s talk about how to avoid it.
Bit Density
Obviously, the higher the bit density–TLC vs MLC, or QLC vs TLC–the lower the write endurance, and the more rapidly the device will wear out. We can get an idea of how rapidly this occurs by looking at design P/E cycles for these various encoding methods.
According to this detailed comparison of SLC, MLC, TLC, QLC, and PLC NAND flash and MEMKOR’s evaluation of NAND technologies, SLC NAND can be expected to last for 30,000 to 100,000 P/E cycles, depending on grade. MLC and TLC are generally expected to last for roughly 10K cycles, QLC is generally expected to last fewer than a thousand P/E cycles, and the rarely-seen PLC may only be good for a few hundred cycles!
There’s not much that can be done about the P/E cycle endurance limit of a NAND cell, which leaves us with three methods to extend drive longevity: wear leveling, usable capacity increase, and raw capacity increase.
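Before moving on to those three methods, a crude estimate can tie the per-cell P/E figures above to whole-drive numbers. The sketch below assumes the simplified relationship TBW ≈ capacity × P/E cycles ÷ write amplification; the write amplification factor used here is a hypothetical placeholder, not a measured value.

```python
# Crude whole-drive endurance estimate from per-cell P/E ratings.
# The P/E figures come from the ranges quoted above (the SLC midpoint is
# chosen arbitrarily); the WAF of 2.0 is a hypothetical placeholder.
def approx_tbw(capacity_tb: float, pe_cycles: int, waf: float = 2.0) -> float:
    """Rough terabytes-written estimate: capacity * P/E cycles / WAF."""
    return capacity_tb * pe_cycles / waf

for name, pe_cycles in (("SLC", 50_000), ("MLC/TLC", 10_000), ("QLC", 1_000)):
    print(f"1 TB {name:>7}: ~{approx_tbw(1.0, pe_cycles):,.0f} TBW")
```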
Wear Leveling
Wear leveling simply tries to ensure that cells wear at roughly the same rate–an SLC SSD whose cells all have between 10k and 20k P/E cycles on them is almost certainly going to be in fine form, while one with several hundred thousand P/E cycles on a tiny fraction of the total number of cells will likely be kaput. The controller inside the SSD can move data from nearly worn-out cells to cells that have not been used up. This is part of why TRIM can be important to SSDs; knowing which cells contain active data and which are empty allows the controller to make better placement decisions.
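As a toy illustration of the idea (and nothing like a real controller's firmware), a wear-leveling placement decision can be thought of as simply preferring the erased block that has consumed the fewest P/E cycles so far:

```python
# Toy wear-leveling placement: prefer the free (erased) block that has
# consumed the fewest P/E cycles, keeping wear roughly even. Block IDs
# and cycle counts are made up for the example.
def pick_block(free_blocks: dict) -> int:
    """free_blocks maps block id -> P/E cycles already consumed."""
    return min(free_blocks, key=free_blocks.get)

free_blocks = {0: 1200, 1: 380, 2: 9900, 3: 410}
target = pick_block(free_blocks)
print(f"writing to block {target} ({free_blocks[target]} P/E cycles so far)")
free_blocks[target] += 1  # the erase/program that follows costs one more cycle
```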
Usable Capacity Increase
Next, we have usable capacity increase–although you might choose a 2TB SSD instead of a 512GB SSD for your laptop, and you might even keep more data because of it, all those repeated ephemeral writes and re-writes your operating system and applications are doing in the background aren’t also going to quadruple in size!
Raw Capacity Increase
Finally, we have raw–or perhaps we might even say hidden–capacity increase. Have you ever noticed how a consumer SSD and an enterprise SSD from the same manufacturer tend to be 512GB and 480GB, respectively? That’s because the heavy-duty devices reserve some of their total capacity to make sure the wear leveling algorithm always has plenty of space to work with.
This both increases write endurance due to the raw capacity upgrade–more cells to split the P/E cycles among means fewer P/E cycles per cell–and decreases the likelihood of the wear leveling algorithm being forced to keep re-using badly over-worked cells because all of the less-worn ones are occupied by relatively static data.
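The arithmetic behind that 512GB-versus-480GB example is simple, but worth spelling out (the figures below just restate those capacities):

```python
# Over-provisioning in the 512 GB raw vs 480 GB usable example above.
raw_gb, usable_gb = 512, 480
reserved_gb = raw_gb - usable_gb
print(f"reserved: {reserved_gb} GB, or {reserved_gb / raw_gb:.1%} of raw capacity")
# Those reserved cells give the wear leveler guaranteed room to rotate
# writes into, even when every usable byte is occupied by static data.
```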
Putting These Lessons to Use
At this point, it’s obvious that three major things determine an SSD’s effective lifetime: its per-cell bit depth, its raw capacity, and its ability to wear-level effectively.
In a PC that we build or repair, we can generally pick and choose among all of these factors: we might opt for a small, cheap TLC drive in a light-duty laptop, a larger “prosumer” TLC drive in a desktop with a heavier workload, and an over-provisioned SLC monster in a sky’s-the-limit production server or workstation.
But what about devices where you have little or limited input? Frequently, the only thing you can really control is the overall capacity–and these are often the devices where a wise choice matters most.
If you put the wrong SSD in your laptop, you can typically easily clone its data onto a newer, larger drive later. But if you choose the wrong storage in a phone, tablet, or IoT device, you typically have no recourse once that usually soldered-on storage begins to fail.
For example, when shopping for your favorite smartphone brand with a friend, you might choose the 64GB model (because you don’t save much to your phone) while your friend opts for the 512GB model. Most likely, your 64GB phone will begin displaying symptoms of balky, slowing storage within a couple of years, while your friend will have eight times the longevity before they begin to see the same issues!
Bringing It Back to DWPD
Now that we’ve got a pretty good idea of what write endurance exhaustion is, why it’s a problem, and what affects it, we can return to our original topic: the one longevity specification most storage vendors offer, which is Drive Writes Per Day.
Unfortunately, to the best of our knowledge, there is no well-defined industry-universal standard for this metric. You have to read the hardware specifications in full, including the definitions of terms, if you want to be absolutely certain what any given manufacturer is telling you about its drives!
However, for the most part, we can expect DWPD to represent a rated number of full drive writes per day over a five-year period. This may be considered an expanded metric–the raw metric is TBW, or the total number of TeraBytes Written over the lifespan of the drive–but the raw metric is frequently not directly offered, leaving the purchaser with a need to derive the raw TBW stat from the expanded DWPD stat.
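Under that common interpretation (full drive writes per day, sustained over a five-year warranty period), the conversion between DWPD and TBW is straightforward. A minimal sketch, assuming that interpretation:

```python
# Conversion between DWPD and TBW, assuming DWPD is defined as full drive
# writes per day sustained over a five-year warranty period.
def tbw_from_dwpd(dwpd: float, capacity_tb: float, years: int = 5) -> float:
    return dwpd * capacity_tb * 365 * years

def dwpd_from_tbw(tbw: float, capacity_tb: float, years: int = 5) -> float:
    return tbw / (capacity_tb * 365 * years)

# For example, a 250 GB drive rated for 150 TBW works out to roughly 0.33 DWPD:
print(f"{dwpd_from_tbw(150, 0.25):.2f} DWPD")
```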
Some Example Manufacturer Specs
Let’s look at some sample data. When we examine the Samsung 870 EVO consumer SSD datasheet, we see metrics in raw TBW, not DWPD: 150TB for the 250GB model, 300TB for the 500GB model, and an additional doubling in TBW for each doubling in capacity, all the way up to their 4TB model.
Regardless of size, this breaks down to 600 total drive writes, which over a five-year period comes out to roughly 0.3 DWPD.
If we look at the more-expensive Samsung 990 “Pro” M.2 drive–which, unlike the earlier SATA models of “Pro”, uses TLC flash, just like the EVO does–we get 600TBW for a 1TB drive and 1200TBW for a 2TB drive. This should look awfully familiar, because it’s the same 0.3 DWPD we saw for the much-cheaper EVO models!
Did we mention the earlier-model SATA “Pro” drives from Samsung? The 860 Pro, released in 2018, offered double the TBW at the same sizes, and therefore double the DWPD–600TB for a 512GB drive, working out to 0.64 DWPD. This is because those earlier drives used MLC flash–not TLC–and as we discussed earlier, lower bit densities mean higher performance and greater longevity.
Now, let’s take a look at the DC600M Series 2.5” SATA Enterprise SSD datasheet for one of our favorite enterprise-grade drives: Kingston’s DC600M. This drive offers endurance metrics in raw TBW, DWPD (five years), and DWPD (three years). For all sizes–ranging from 480GB to 7.68TB–these drives are rated for a full 1.0 DWPD over a five-year period–roughly 50% more than the older MLC Samsung Pros offered.
Sharp-eyed readers should have already noted the obvious: there’s more going on here than merely capacity or bit depth can account for. Although Kingston’s DC600M is 3D TLC like Samsung’s EVO (and newer “Pro”) models, it offers roughly 50% more endurance than Samsung’s older MLC drives, and roughly triple that of the cheaper TLC! What gives?
What Does TBW or DWPD-Rated Endurance Actually Mean?
The answer, unfortunately, is largely “it’s up to you to figure that out.” Although enterprise-grade drives will generally specify that their endurance ratings are based on the JEDEC JESD219A.01 SSD Endurance Workload, what they don’t tell you is how a drive’s behavior under that workload translates into the rating printed on the datasheet.
The good news is, the JEDEC workload is a pretty heavily specified, quasi-random read/write access pattern derived from storage traces on typical enterprise and/or client storage workloads–so writing an enormous amount of data using the JEDEC workload is probably a very good metric indeed for deciding how well the drive tested will stand up to real-world use.
The bad news is, the published document only describes the workload itself–not the failure conditions which indicate a drive being tested has reached its rated limit. This leaves manufacturers free to decide what that means for themselves.
Does this mean “half the drives tested went fully read-only or otherwise died by this time?” It could, but it usually doesn’t. Does it mean “half the drives tested exhibited significant performance degradation by this time?” Again, it could, but who knows for certain? With failure criteria left unspecified, manufacturers have an extremely broad range of excuses for picking essentially whatever number they believe they can get away with.
So, Is DWPD a Good Metric, or Not?
Ultimately, DWPD is just as good–and just as bad–as every other performance metric storage vendors have offered. Although it can be useful as a way to compare two different drives within the same manufacturer’s lineup, extreme caution should be used when attempting to directly compare different vendors’ DWPD ratings.
We also can’t really recommend using DWPD for its most obvious intended purpose–an estimation of how heavy a workload you can throw at a drive, irrespective of its capacity–for the same reason. Without real, demonstrably correct, and precisely defined metrics, OEMs can and do claim essentially whatever the marketing department thinks is likely to fly.
In actual practice, we recommend evaluating drive endurance first on bit depth and usable capacity, and second on apparent raw capacity and intended market. A drive that clearly reserved a large chunk of its raw media–for example, a 480GB enterprise drive from a vendor that offers 512GB consumer drives–is offering you a good indication not only of additional capacity, but also of more effective wear leveling.
When you really need to evaluate wildly unfamiliar drives in terms of their rated DWPD and your own knowledge of your own workload in TBW per day, we recommend serious caution. Just because a vendor warrants a drive to survive 1.0 DWPD for five years doesn’t necessarily mean that drive will perform at even half its original performance level by the end of that time!
Conclusions
Much like nearly any other storage vendor metric, DWPD is heavily gamed, overly lawyered, and poorly specified. It can be useful to compare rated DWPD to a well-understood workload, but we advise extreme caution in doing so.
One might, for example, very confidently expect 20GB per day to be written to a LOG vdev in a pool with synchronous NFS exports, and therefore spec a tiny 128GB consumer SSD rated for 0.3 DWPD.
On the surface, this seems more than fine: 0.3 * 128GB * 365 * 5 == 70TB, while 20GB * 365 * 5 == only 36.5TB. But in practice, most SSDs have already dropped to half or less of their original rated performance by the time they reach half of their rated endurance. Will it really be okay if your LOG vdev is operating considerably less predictably and at half its rated speed in five years?
Perhaps it will, and perhaps it won’t–but it’s unfortunately on you, the administrator, to decide and respond accordingly.
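For anyone who wants to check that math, here it is spelled out, using the same numbers as above:

```python
# The LOG vdev example above: rated endurance vs expected writes.
capacity_gb = 128
rated_dwpd = 0.3
daily_writes_gb = 20
days = 365 * 5

rated_tb = rated_dwpd * capacity_gb * days / 1000     # ~70 TB of rated endurance
expected_tb = daily_writes_gb * days / 1000           # ~36.5 TB actually written
print(f"rated: ~{rated_tb:.0f} TB, expected: ~{expected_tb:.1f} TB "
      f"({expected_tb / rated_tb:.0%} of rated endurance)")
```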
Klara has deployed or maintained tens of thousands of SSD and NVMe storage devices across various industries and workloads. Our team can help you select the best hardware to achieve your storage performance goals, maintain and monitor it throughout its lifespan, and do a performance analysis to help you decide when it is time to replace worn-out flash devices.