Hello, and thanks in advance for reading.
This is not a bug report. It is a case study write-up of a recovery effort on a severely corrupted 12 TB multi-device pool, shared here in case any of the observations are useful to btrfs-progs development. The goal is to be constructive, not to complain.
One paragraph summary
A hard power cycle on a 3 device pool (data single, metadata DUP, DM-SMR disks) left the extent tree and free space tree in a state that no native repair path could resolve. A subsequent btrfs check --repair run entered an infinite loop of 46,000+ commits with zero net progress, rotating the 4 backup_roots slots past any pre-crash rollback point. Recovery eventually succeeded through a set of 14 custom C tools built against the internal btrfs-progs API, with a final data loss of about 7.2 MB out of 4.59 TB (0.00016 percent). The pool is now fully operational.
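To make the backup_roots rotation concrete, here is a minimal toy model in C. It is a sketch under stated assumptions, not btrfs-progs code: it tracks only a generation number per slot, assumes every commit overwrites exactly one slot in round-robin order, and all names in it (backup_ring, ring_commit, oldest_after_commits) are hypothetical. The only thing taken from the real format is the slot count of 4.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BACKUP_ROOTS 4 /* slot count in the btrfs superblock */

/* Toy model: each slot remembers only the generation it was written at. */
struct backup_ring {
    uint64_t generation[NUM_BACKUP_ROOTS];
    unsigned next; /* slot the next commit will overwrite */
};

/* Fill the ring with the four most recent generations, oldest first. */
static void ring_init(struct backup_ring *r, uint64_t newest_gen)
{
    assert(newest_gen >= NUM_BACKUP_ROOTS);
    for (unsigned i = 0; i < NUM_BACKUP_ROOTS; i++)
        r->generation[i] = newest_gen - (NUM_BACKUP_ROOTS - 1) + i;
    r->next = 0;
}

/* One commit: overwrite the oldest slot (assumed round-robin order). */
static void ring_commit(struct backup_ring *r, uint64_t gen)
{
    r->generation[r->next] = gen;
    r->next = (r->next + 1) % NUM_BACKUP_ROOTS;
}

/* Oldest generation that is still reachable through any slot. */
static uint64_t ring_oldest(const struct backup_ring *r)
{
    uint64_t oldest = r->generation[0];
    for (unsigned i = 1; i < NUM_BACKUP_ROOTS; i++)
        if (r->generation[i] < oldest)
            oldest = r->generation[i];
    return oldest;
}

/*
 * Start from a ring whose newest slot is crash_gen, apply `commits`
 * further commits, and return the oldest generation still retained.
 */
uint64_t oldest_after_commits(uint64_t crash_gen, uint64_t commits)
{
    struct backup_ring r;

    ring_init(&r, crash_gen);
    for (uint64_t i = 1; i <= commits; i++)
        ring_commit(&r, crash_gen + i);
    return ring_oldest(&r);
}
```

In this model, a slot at or before the crash generation survives only while fewer than four commits have run. With a hypothetical crash generation of 1000, three further commits still leave generation 1000 in a slot; the fourth removes it, and after 46,000 commits every slot holds a generation written by the looping repair itself.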
Full analysis
I wrote the case up in a structured way that covers environment, timeline, root cause classification, the bulletproof safety criterion we derived empirically, and 9 specific areas where a relatively small upstream change would have prevented the need for most of the custom tooling.
https://github.com/msedek/btrfs_fixes/blob/main/INCIDENT-ANALYSIS.md
The nine proposed improvement areas, in order of expected impact on operators hitting similar cases:
- A. Progress detection in btrfs check --repair so 46,000 commit loops abort with a clear message instead of destroying backup_roots.
- B. Symmetric handling of BTRFS_ADD_DELAYED_REF in reinit_extent_tree, matching the existing BTRFS_DROP_DELAYED_REF exemption.
- C. Sibling safety precheck in btrfs_del_items rebalance so a drain below LEAF_DATA_SIZE/4 does not trigger push_leaf_left on a stale sharable sibling.
- D. Supervised EEXIST handling in alloc_reserved_tree_block with three explicit modes (error, silent, update).
- E. A btrfs rescue rebuild-extent-tree subcommand that operates from a pre-scanned ref list, as an alternative to the currently deadlocking --init-extent-tree.
- F. A btrfs rescue clean-orphan-inodes subcommand with a built-in dry-run that applies the bulletproof 5-condition check and produces a machine-readable plan.
- G. A btrfs rescue fix-bg-accounting for surgical BLOCK_GROUP_ITEM.used fixes after bulk extent tree rebuild.
- H. Clearer documentation that backup_roots[0..3] is a four commit sliding window, not historical backup (widely misunderstood).
- I. Documentation of the DIR i_size = sum(namelen * 2) rule, which bit us during orphan dir entry cleanup and is not currently written down in any user facing place.
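As an illustration of the rule in item I, the following C sketch computes the i_size a directory inode should carry for a given set of entry names, and the value it should have after one entry is removed. It is a standalone example, not code from the published tools: the function names are hypothetical, and name lengths are taken as byte counts via strlen, on the assumption that names are NUL-terminated strings here (on disk they are length-prefixed).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Expected i_size of a directory inode: every entry contributes its
 * name length twice, i.e. i_size = sum(namelen * 2).
 */
uint64_t expected_dir_isize(const char *const *names, size_t count)
{
    uint64_t total = 0;

    assert(names != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        total += (uint64_t)strlen(names[i]) * 2;
    return total;
}

/*
 * i_size after removing one entry whose name is `namelen` bytes long.
 * Removing an orphan dir entry without applying this adjustment leaves
 * the directory inode inconsistent with its remaining entries.
 */
uint64_t dir_isize_after_unlink(uint64_t isize, size_t namelen)
{
    uint64_t delta = (uint64_t)namelen * 2;

    assert(isize >= delta); /* otherwise i_size was already wrong */
    return isize - delta;
}
```

For a directory holding the entries "a", "bb" and "ccc", the expected i_size is (1 + 2 + 3) * 2 = 12; dropping "bb" should bring it down to 8.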
Reference implementation
All 14 custom tools, along with the single-line patch to alloc_reserved_tree_block, are published under GPL-2.0 at:
https://github.com/msedek/btrfs_fixes
Every tool runs in a read-only scan mode by default; the --write mode is opt-in. The README.md explains the execution order used during recovery. I am not proposing these as upstream patches directly. Most of the proposals above are not single-function changes, and getting any of them accepted would require a design discussion with people more familiar than I am with the subsystems involved. Sharing the reference implementation felt more useful than opening nine separate pull requests without context.
How I would like this to be received
Please treat this as input, not as a demand. If any single observation or proposal is worth pursuing, I am happy to expand the analysis, provide additional evidence from the session logs, or test any proposed patch against the class of damage we hit. If none of it is useful, no problem, and thanks for the tool set that got us most of the way there.
Thanks again.