What we learned from a 22-day storage bug (and how we fixed it)

Original link: https://www.mux.com/blog/22-day-storage-bug

## Mux Video Incident Summary (January 8 – February 4)

Between January 8 and February 4, Mux Video experienced an incident affecting roughly 0.33% of video and audio segments, causing some viewers to see brief audio dropouts or visual stuttering. No video data was lost, and all affected assets were eventually remediated.

The root cause traced back to a combination of factors related to a recent storage system update. The update was intended to improve scalability and performance, but it introduced a bottleneck during traffic spikes. Specifically, a race condition in the file-deletion process, together with context cancellations during mezzanine file reads, led to corrupted segments being generated and served.

Mux resolved the issue by fixing the deletion process, adjusting how remote reads are handled, and adding storage node capacity. A full regeneration of the affected segments was triggered, and CDN caches were purged to ensure viewers received the corrected content.

The incident highlighted areas for improvement in Mux's monitoring, logging, and escalation processes. They are focusing on stronger observability within the transcoding pipeline, better error detection, and streamlined support escalation to identify and resolve issues proactively. Mux emphasized transparency and a commitment to preventing similar incidents.

## The Mux Storage Bug and Lessons Learned

Mux's recent 22-day bug was compounded by a logging problem: rate limiting in their log agent caused logs to be silently dropped. The missing data hampered debugging, underscoring the importance of monitoring both agent-level and provider-level rate limits when using log services such as Grafana.

The discussion also covered Mux's distinctive approach to video transcoding. Rather than transcoding everything up front, Mux transcodes video "just in time" as it is requested. This saves CPU and storage costs, especially since many videos are never watched, or are watched only shortly after upload, which allows unused renditions to be deleted.

Mux founder Jon Dahl explained that this enables a very fast time-to-publish (a median of 9 seconds) and that access logs are used to predict encoding needs. The system effectively caches encoded files, evicts them as needed, and dynamically adjusts encoding based on client requests.

Original post

One of Mux Video’s most distinguishing features is the ability to Just-In-Time transcode video segments (and thumbnails, storyboards, etc.) during playback. It’s key to our goal of making any uploaded content viewable as quickly as possible, and our customers rely on it to create snappy experiences for their users.

Building a video platform that can do this requires a lot of moving parts: workers handling the actual encoding, storage and replication, low-latency transmission and streaming of segments as we generate them, and CDN caching and distribution to name a few. Of course doing this at scale means doing all of the above and more in a highly distributed system, which inevitably invites our friend Murphy and his ever-present law to the party.

Let’s talk about why we’re here. Between January 8th and February 4th, roughly 0.33% of audio and video segments across all VOD assets played back during this timeframe were served in a corrupted state. The ensuing behavior likely varied between players and depended on the degree to which the segments were incomplete, but in general some viewers experienced brief audio dropouts or visual stuttering during playback. No source video data was lost and all affected assets have been fully remediated.

Nobody likes incidents, and unfortunately nobody is immune to them. We take every incident seriously but this one in particular had a combination of wide-ranging impact and duration that fell short of our standards. We've fixed the immediate causes and remediated every affected asset, but we're still investigating exactly why our systems behaved the way they did under load. We're sharing what we know now because we believe transparency matters more than having all the answers.

You should never have to worry about Mux internals if you're building on our platform, but the challenges here are interesting and provide us an opportunity to be honest about what we're doing to improve.

To set the stage for what went wrong, it helps to know a bit about how our storage and transcoding systems interact.

When we encode renditions for streaming, we read the source frames from a higher-quality source file we store internally and refer to as the “mezzanine.” Segments typically get generated in parallel, so our encoders are often concurrently reading overlapping portions of the same mezzanine file.

At playback time, our HLS delivery services will request a particular segment from our storage system. If it doesn’t exist, a request will be made to our JIT services to generate the segment while the delivery service waits to start receiving the segment data from our storage system. Here is a very high level visual of the request flow:
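That check-then-generate flow can be sketched as follows. This is an illustrative model only; the service names, types, and APIs are stand-ins, not Mux's real interfaces:

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("segment not found")

// Storage is a stand-in for the storage system.
type Storage struct{ segments map[string][]byte }

func (s *Storage) Get(key string) ([]byte, error) {
	if data, ok := s.segments[key]; ok {
		return data, nil
	}
	return nil, errNotFound
}

// JIT is a stand-in for the just-in-time transcoding service.
type JIT struct{ store *Storage }

// Generate "transcodes" a segment and writes it to storage.
func (j *JIT) Generate(key string) {
	j.store.segments[key] = []byte("transcoded:" + key)
}

// fetchSegment models the delivery path: serve from storage if the
// segment already exists; otherwise trigger JIT generation, then
// read the freshly generated segment back from storage.
func fetchSegment(store *Storage, jit *JIT, key string) ([]byte, error) {
	if data, err := store.Get(key); err == nil {
		return data, nil // hit: segment already exists
	}
	jit.Generate(key) // miss: generate the segment just in time
	return store.Get(key)
}

func main() {
	store := &Storage{segments: map[string][]byte{}}
	jit := &JIT{store: store}
	data, _ := fetchSegment(store, jit, "asset123/video/seg_42.ts")
	fmt.Println(string(data)) // prints "transcoded:asset123/video/seg_42.ts"
}
```

In production the delivery service streams segment bytes as they are produced rather than waiting for a complete write, which is where context cancellation mid-read becomes relevant to the incident described above.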