Case study: recovery of a corrupted 12 TB multi-device pool

Original link: https://github.com/kdave/btrfs-progs/issues/1107

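This article describes in detail the successful recovery of a severely corrupted 12 TB Btrfs filesystem after an unexpected power loss. The filesystem spanned a 3-device pool (data single, metadata DUP, DM-SMR disks). The standard `btrfs check --repair` command failed, entering an infinite loop because of problems in the extent tree and the free space tree. Recovery was achieved with 14 custom C tools built against the btrfs-progs API, with minimal data loss: about 7.2 MB out of 4.59 TB (0.00016%). The author shares the case as a study, *not* a bug report, and offers constructive feedback on potential improvements to btrfs-progs. Nine specific improvement areas are proposed, focusing on enhanced repair tooling (progress detection, extent tree rebuild, orphan inode cleanup), clearer documentation, and fixes for the edge cases identified. A reference implementation of the custom tools, along with one patch, is publicly available on GitHub as a resource for further investigation and discussion rather than as a direct patch submission.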


Original text

Hello, and thanks in advance for reading.

This is not a bug report. It is a case study write-up of a recovery effort on a severely corrupted 12 TB multi-device pool, shared here in case any of the observations are useful to btrfs-progs development. The goal is constructive, not a complaint.

One paragraph summary

A hard power cycle on a 3-device pool (data single, metadata DUP, DM-SMR disks) left the extent tree and free space tree in a state that no native repair path could resolve. A subsequent btrfs check --repair run entered an infinite loop of 46,000+ commits with zero net progress, rotating the 4 backup_roots slots past any pre-crash rollback point. Recovery eventually succeeded through a set of 14 custom C tools built against the internal btrfs-progs API, with a final data loss of about 7.2 MB out of 4.59 TB (0.00016 percent). The pool is now fully operational.
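To illustrate why a commit loop with no net progress is so destructive, here is a minimal Python model, not btrfs-progs code. It assumes backup_roots behaves like a four-slot ring in which every commit overwrites the oldest entry. The function name `run_commits`, the `repair_pass` callback, the starting error count, and the `stall_limit` guard are all hypothetical and exist only for this sketch.

```python
from collections import deque

BACKUP_SLOTS = 4  # backup_roots has four slots


def run_commits(pre_crash_gen, commits, repair_pass=lambda errors: errors,
                stall_limit=None):
    """Model a repair loop and return (commits_done, generations in the window).

    repair_pass takes the current error count and returns the new one; the
    default fixes nothing. stall_limit is a hypothetical guard that stops the
    loop after that many consecutive commits without a drop in errors.
    """
    # Before the crash, the window holds the last four good generations.
    window = deque(range(pre_crash_gen - BACKUP_SLOTS + 1, pre_crash_gen + 1),
                   maxlen=BACKUP_SLOTS)
    errors = 100  # arbitrary starting error count for the model
    stalled = 0
    done = 0
    for _ in range(commits):
        remaining = repair_pass(errors)
        done += 1
        window.append(pre_crash_gen + done)  # each commit overwrites the oldest slot
        stalled = stalled + 1 if remaining >= errors else 0
        errors = remaining
        if stall_limit is not None and stalled >= stall_limit:
            break  # a real tool would report the stall here
    return done, list(window)
```

In this model, an unguarded run of 46,000 commits leaves only post-crash generations in the window, while a guard with `stall_limit` below the slot count stops early enough that at least one pre-crash generation survives.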

Full analysis

I wrote the case up in a structured way that covers environment, timeline, root cause classification, the bulletproof safety criterion we derived empirically, and 9 specific areas where a relatively small upstream change would have prevented the need for most of the custom tooling.

https://github.com/msedek/btrfs_fixes/blob/main/INCIDENT-ANALYSIS.md

The nine proposed improvement areas, in order of expected impact on operators hitting similar cases:

A. Progress detection in btrfs check --repair so 46,000 commit loops abort with a clear message instead of destroying backup_roots.
B. Symmetric handling of BTRFS_ADD_DELAYED_REF in reinit_extent_tree, matching the existing BTRFS_DROP_DELAYED_REF exemption.
C. Sibling safety precheck in btrfs_del_items rebalance so a drain below LEAF_DATA_SIZE/4 does not trigger push_leaf_left on a stale sharable sibling.
D. Supervised EEXIST handling in alloc_reserved_tree_block with three explicit modes (error, silent, update).
E. A btrfs rescue rebuild-extent-tree subcommand that operates from a pre-scanned ref list, as an alternative to the currently deadlocking --init-extent-tree.
F. A btrfs rescue clean-orphan-inodes subcommand with a built-in dry-run that applies the bulletproof 5-condition check and produces a machine-readable plan.
G. A btrfs rescue fix-bg-accounting subcommand for surgical BLOCK_GROUP_ITEM.used fixes after a bulk extent tree rebuild.
H. Clearer documentation that backup_roots[0..3] is a four-commit sliding window, not a historical backup (widely misunderstood).
I. Documentation of the DIR i_size = sum(namelen * 2) rule, which bit us during orphan dir entry cleanup and is not currently written down in any user-facing place.
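The rule in item I can be shown with a small worked example. This is a sketch of the arithmetic only, not code from the published tools. It assumes namelen is the length of the entry name in bytes, and the helper name `expected_dir_isize` and the sample names are hypothetical.

```python
def expected_dir_isize(names):
    """Return the directory i_size implied by the rule sum(namelen * 2).

    names may contain str (encoded as UTF-8 here) or bytes; the length
    that counts is the byte length of each entry name.
    """
    total = 0
    for name in names:
        raw = name.encode("utf-8") if isinstance(name, str) else name
        total += 2 * len(raw)
    return total


entries = ["a.txt", "photos", "日本語"]       # 5, 6 and 9 bytes
size_before = expected_dir_isize(entries)     # 2 * (5 + 6 + 9)
# Removing an orphan entry must also shrink i_size by twice its name length.
size_after = expected_dir_isize([n for n in entries if n != "photos"])
```

A cleanup tool that deletes directory entries without adjusting i_size by the same amount would leave the directory inode inconsistent with its contents, which is the kind of mismatch the item describes.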

Reference implementation

All 14 custom tools, along with the single-line patch to alloc_reserved_tree_block, are published under GPL-2.0 at:

https://github.com/msedek/btrfs_fixes

Every tool has a read-only scan mode by default and a --write mode that is opt-in. The README.md explains the execution order used during recovery. I am not proposing these as upstream patches directly. Most of the proposals above are not single-function changes, and getting any of them accepted would require a design discussion with people more familiar than me with the subsystems involved. Sharing the reference implementation felt more useful than opening nine separate pull requests without context.
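The read-only default with an opt-in --write flag is a pattern that can be sketched in a few lines. The example below is an illustration in Python, not taken from the published C tools. The program name `example-tool`, the `device` argument, and the helper names are placeholders.

```python
import argparse
import os


def build_parser():
    # "example-tool" and its arguments are hypothetical placeholders.
    parser = argparse.ArgumentParser(prog="example-tool")
    parser.add_argument("device", help="block device or image to inspect")
    parser.add_argument(
        "--write",
        action="store_true",
        help="apply the planned changes (default: scan only)",
    )
    return parser


def open_flags(args):
    # Without --write the device is opened read-only, so a scan cannot modify it.
    return os.O_RDWR if args.write else os.O_RDONLY


def run(argv):
    args = build_parser().parse_args(argv)
    mode = "write" if args.write else "scan"
    return mode, open_flags(args)
```

Tying the open mode to the flag, rather than only skipping the write calls, means a bug in the scan path cannot reach the disk unless the operator asked for it.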

How I would like this to be received

Please treat this as input, not as a demand. If any single observation or proposal is worth pursuing, I am happy to expand the analysis, provide additional evidence from the session logs, or test any proposed patch against the class of damage we hit. If none of it is useful, no problem, and thanks for the tool set that got us most of the way there.

Thanks again.