| A picture can be worth way more than a thousand words, but sometimes you have to iterate through a thousand pictures to find the one that tells the right story, or helps you ask the right question! |
| I liked the part about them manually retrofitting an SSD in every EBS unit in 2013. That looks a lot like a Samsung SATA SSD:
https://www.allthingsdistributed.com/images/mo-manual-ssd.pn... I think we got SSDs installed in blades from Dell well before that, but I may be misremembering. I/O performance was a big thing in 2010/2011/2012, when we went from spinning HDs to Flash memory. I remember experimenting with raw Flash-based devices with no error correction or wear leveling at all. Insanity, but we were all desperate for that insane I/O performance bump from spinning rust to silicon.
| Imagine you got your license and were then tasked with building a CRUD service with a simple UI, because that is what the client needs and they cannot use unlicensed developers.
| Imagine some military institution or government body needs a piece of custom but trivial software. I doubt they would be able to hire a contractor with unlicensed developers to do the job.
| That was probably back when SLI/CrossFire was still around, so high-end consumer boards had a reason to include plenty of slots for 3 GPUs.
| Thanks for checking. I realize now that I wasn’t clear in my original comment.
My use case was to bring up a Windows instance using instance storage as the root device instead of EBS, which is the default root device. I wanted to run some benchmarks directly on drive C:\ — backed by an NVMe SSD-based instance store — because of an app that will only install to drive C:\, but it seems there’s no way to do this. The EC2 docs definitely gave me the impression that instance storage is not supported on Windows as a root volume. Here’s one such note from the docs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/RootDevi... “Windows instances do not support instance-store backed root volumes.”
Really nice that you are engaging with the comments here on HN as the article’s author. (For others who may not be aware, msolson = Marc Olson.)
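For what it's worth, one way to sanity-check that doc statement from the API side is to ask EC2 for Amazon-owned Windows AMIs whose root device is instance store; if the docs are right, the list should come back empty. A rough boto3 sketch, not anything from the thread (the region and the filter choices are mine):

```python
# Quick sanity check (my own sketch): list Amazon-owned Windows AMIs whose
# root device is instance store. Per the docs quoted above, this should be empty.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an arbitrary choice

resp = ec2.describe_images(
    Owners=["amazon"],
    Filters=[
        {"Name": "platform", "Values": ["windows"]},
        {"Name": "root-device-type", "Values": ["instance-store"]},
    ],
)

print(f"Windows AMIs with instance-store root devices: {len(resp['Images'])}")
for image in resp["Images"][:10]:
    print(image["ImageId"], image.get("Name"))
```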
| So true and valid of almost all software development:
> In retrospect, if we knew at the time how much we didn’t know, we may not have even started the project! |
| The most surprising thing is that the author had no previous experience in the domain. It's almost impossible to get hired at AWS now without domain expertise, AFAIK.
| Great article.
> EBS is capable of delivering more IOPS to a single instance today than it could deliver to an entire Availability Zone (AZ) in the early years on top of HDDs.
Dang!
| Ok, but we're running a business here, not catering to the satisfaction of random programmers. The NIH part of this stuff is a bit weird.
> Compounding this latency, hard drive performance is also variable depending on the other transactions in the queue. Smaller requests that are scattered randomly on the media take longer to find and access than several large requests that are all next to each other. This random performance led to wildly inconsistent behavior.
The effect of this can be huge! Given a reasonably sequential workload, modern magnetic drives can do >100MB/s of reads or writes. Given an entirely random 4kB workload, they can be limited to as little as 400kB/s of reads or writes. Queuing and scheduling can help avoid the truly bad end of this, but real-world performance still varies by over 100x depending on workload. That's really hard for a multi-tenant system to deal with (especially with reads, where you can't do the "just write it somewhere else" trick).
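To put rough numbers on that, here's a back-of-the-envelope sketch using assumed, typical 7200 RPM figures (not measurements from the article): each small random request pays the full seek plus rotational cost, while large sequential requests amortize it away.

```python
# Back-of-the-envelope sketch with assumed, typical 7200 RPM drive numbers:
# why random 4 KiB I/O is so much slower than sequential I/O on a magnetic drive.

avg_seek_ms = 8.0          # assumed average seek time
avg_rotational_ms = 4.17   # half a revolution at 7200 RPM
request_kib = 4            # small random request size

# Each random 4 KiB request pays the full mechanical cost.
ms_per_random_io = avg_seek_ms + avg_rotational_ms
random_iops = 1000 / ms_per_random_io                 # ~82 IOPS
random_throughput_kib_s = random_iops * request_kib   # ~330 KiB/s

sequential_throughput_mib_s = 100  # typical large sequential transfer rate

ratio = sequential_throughput_mib_s * 1024 / random_throughput_kib_s
print(f"random: ~{random_throughput_kib_s:.0f} KiB/s, "
      f"sequential: ~{sequential_throughput_mib_s} MiB/s, "
      f"ratio: ~{ratio:.0f}x")
```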
> To know what to fix, we had to know what was broken, and then prioritize those fixes based on effort and rewards.
This was the biggest thing I learned from Marc in my career (so far). He'd spend time working on visualizations of latency (like the histogram time series in this post) which were much richer than any of the telemetry we had, then tell a story using those visualizations, and completely change the team's perspective on the work that needed to be done. Each peak in the histogram came with its own story, and its own work to optimize. Really diving into performance data - and looking at that data in multiple ways - unlocks efficiencies and opportunities that are invisible without that work and investment.
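None of that internal tooling is shown in the post, but the general idea of a histogram time series is easy to sketch: bucket request latencies per time window and render the result as a heatmap, so multimodal behavior shows up as distinct bands instead of being averaged into a single percentile line. A toy version with entirely synthetic data (the workload shape and bucket choices are mine):

```python
# Toy sketch of a latency histogram time series (heatmap), in the spirit of
# the visualizations described above. All data here is synthetic.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
minutes = 120
samples_per_min = 2000

# Synthetic workload: a fast SSD-like mode plus a slow HDD-like mode whose
# share grows over time, producing two distinct bands in the heatmap.
lat_buckets = np.logspace(-1, 3, 60)  # 0.1 ms .. 1000 ms, log-spaced
heat = np.zeros((len(lat_buckets) - 1, minutes))

for t in range(minutes):
    slow_fraction = 0.1 + 0.3 * t / minutes
    n_slow = int(samples_per_min * slow_fraction)
    fast = rng.lognormal(mean=np.log(0.5), sigma=0.3, size=samples_per_min - n_slow)
    slow = rng.lognormal(mean=np.log(20.0), sigma=0.5, size=n_slow)
    latencies = np.concatenate([fast, slow])
    heat[:, t], _ = np.histogram(latencies, bins=lat_buckets)

plt.imshow(heat, aspect="auto", origin="lower",
           extent=[0, minutes, 0, len(lat_buckets) - 1], cmap="viridis")
plt.xlabel("time (minutes)")
plt.ylabel("latency bucket (log-spaced, 0.1 ms to 1 s)")
plt.title("Latency histogram time series (synthetic)")
plt.colorbar(label="requests per bucket")
plt.show()
```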
> Armed with this knowledge, and a lot of human effort, over the course of a few months in 2013, EBS was able to put a single SSD into each and every one of those thousands of servers.
This retrofit project is one of my favorite AWS stories.
> The thing that made this possible is that we designed our system from the start with non-disruptive maintenance events in mind. We could retarget EBS volumes to new storage servers, and update software or rebuild the empty servers as needed.
This is a great reminder that building distributed systems isn't just for scale. Here, we see how building the system in a way that can seamlessly tolerate the failure of a server, and move data around without loss, makes large-scale operations possible - everything from day-to-day software upgrades to a massive hardware retrofit project. A "simpler" architecture would make these operations much harder, to the point of being impossible, and would ultimately make the end-to-end problem we're trying to solve for the customer harder.
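The post doesn't show what that retargeting machinery looks like, but the shape of the workflow is roughly: stop placing volumes on a server, move its volumes to healthy peers while clients keep doing I/O, rebuild the now-empty server, then return it to the pool. A deliberately simplified, hypothetical sketch (none of these names or types are real EBS code):

```python
# Hypothetical, simplified sketch of the maintenance pattern described above:
# drain a storage server by retargeting its volumes to healthy peers, then
# rebuild the now-empty server. Illustrative only, not real EBS code.
from dataclasses import dataclass, field

@dataclass
class Server:
    name: str
    schedulable: bool = True
    volumes: set = field(default_factory=set)

@dataclass
class Fleet:
    servers: list

    def pick_healthy_server(self, exclude):
        # Naive placement: least-loaded schedulable server other than `exclude`.
        candidates = [s for s in self.servers if s is not exclude and s.schedulable]
        return min(candidates, key=lambda s: len(s.volumes))

def drain_and_rebuild(server, fleet):
    """Empty one storage server so it can be upgraded without disrupting volumes."""
    server.schedulable = False            # stop placing new volumes here
    for volume in list(server.volumes):
        target = fleet.pick_healthy_server(exclude=server)
        # In the real system the data would be re-replicated before the old
        # copy is dropped; here we just move the assignment.
        server.volumes.remove(volume)
        target.volumes.add(volume)
    # Server is now empty: safe to upgrade software or retrofit hardware.
    server.schedulable = True

# Example: drain one server out of a small fleet.
fleet = Fleet([Server("a", volumes={"vol-1", "vol-2"}), Server("b"), Server("c")])
drain_and_rebuild(fleet.servers[0], fleet)
print([(s.name, sorted(s.volumes)) for s in fleet.servers])
```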