Incremental Backups of Gmail Takeouts

Original link: https://baecher.dev/stdout/incremental-backups-of-gmail-takeouts/

This article details the author's process of building an efficient incremental backup solution for Gmail Takeout data. While Google Takeout conveniently delivers a complete backup as an mbox file, repeatedly backing up the *entire* file is inefficient because of its structure: new mail is not simply appended to the end of the existing file. An initial attempt focused on separating out attachments to reduce backup size, but proved too complex due to the intricacies of the mbox format and its many encoding schemes. The final solution uses a "chunking" heuristic that splits the mbox file on the "From" line preceding each email. Each chunk is content-addressed by its MD5 hash, making the approach resistant to reordering. This allows incremental backups that store only the new chunks plus a small sequence record. While the author's account produces roughly 99.8K chunks, he acknowledges that larger accounts might need to tune the chunking frequency to stay within file-system limits. The code is available on GitHub.

This Hacker News discussion revolves around backing up Gmail data via "takeouts" (exported mbox files). The original poster (pbhn) built a tool to improve the backup process, since Gmail takeouts produce files in nondeterministic order that are large and ill-suited to efficient incremental backup with tools like restic. One commenter (yooogurt) argued that restic's hash-based chunking *can* handle updates effectively, especially if message order stays stable or attachments are reused. The core problem, however, is that each takeout completely replaces the file. The conversation then turned to the *need* for *long-term* email archives: some users find it valuable to consult old email for price comparisons, support history, or personal recollection, while others advocate aggressive deletion, arguing that old mail is rarely needed even though storage is cheap. The debate highlights differing strategies for email management and backup.

Original Article
Incremental backups of Gmail takeouts


December 2025

In an earlier writeup I discussed how to create reproducible bundles of git repositories such that a file-based backup strategy can operate incrementally. My next target in this vein is Gmail Takeout: your Google account can be locked for arbitrary reasons, legitimate or otherwise, so it's imperative to have a regular backup of your mail. Google's Takeout service is a straightforward way to achieve this. In my case I have about 20 years of mail history in the account, going back all the way to the invite-only beta. Surprisingly, that amounts to only 5.7 GiB, with attachments being the driving factor, of course, all delivered in a single text-based mbox file.

This is completely fine for a one-time snapshot, but if you want to back this file up regularly with something like restic, then you will quickly end up in a world of pain: since new mails are not even appended to the end of the file, each cycle of takeout-then-backup essentially produces a new giant file. It would be nice if incremental backups only added the delta of mails that are actually new since the last backup.

I considered (and actually implemented) multiple different solutions to this problem. In one approach, I parsed the entire file and stripped out the attachments in order to store them as separate files, leaving only a link in the corresponding mail. This works reasonably well because the attachments account for the overwhelming majority of the data. However, I was not particularly happy with this solution because parsing the file correctly is not trivial and resulted in a lot of complex code. The mail format is very forgiving, so you end up with many special cases around peculiar behavior of mail clients. To give you an idea, consider that there is no length encoding; it's all multipart boundaries, and they can be nested, too. Actual attachment data also comes in a variety of different encodings, and even file-name encoding is done in several micro-formats. In the end my parser yielded the correct number of mails compared to the Gmail interface (accounting for threaded view), but the complexity of that code felt wrong.
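To give a flavor of what this entails, here is a minimal sketch of the attachment-stripping idea using Python's standard mailbox and email modules. This is not the author's implementation; the function name and blob layout are illustrative, and writing the modified messages back out is omitted:

```python
import hashlib
import mailbox

def strip_attachments(mbox_path: str, blob_dir: str) -> None:
    """Illustrative sketch: detach attachment payloads into separate
    blob files and replace each with a placeholder referencing its hash.
    The edge cases (nested multiparts, exotic transfer encodings,
    RFC 2231 file-name encoding) are what make the real thing complex."""
    box = mailbox.mbox(mbox_path)
    for message in box:
        for part in message.walk():
            if part.get_content_disposition() != "attachment":
                continue
            # decode=True undoes base64 / quoted-printable encoding
            payload = part.get_payload(decode=True)
            if payload is None:
                continue
            digest = hashlib.md5(payload).hexdigest()
            with open(f"{blob_dir}/{digest}", "wb") as f:
                f.write(payload)
            # Leave only a link to the stored blob in the mail itself.
            part.set_payload(f"[attachment stored as {digest}]")
```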

What I eventually settled on instead is a simple chunking heuristic based on the From ... line in front of every mail. The catch is that this line can also appear in the body of a mail. This results in slight oversplitting: every mail boundary is also a chunk boundary, but not every chunk boundary is a mail boundary. In other words, one mail may be partitioned into multiple chunks. Each chunk is then saved as a file, content-addressed by its MD5 sum. Content addressing makes the approach resistant against mail reordering in the mbox file. We could have used the Gmail mail id for this purpose, but the uniform distribution of the content hash enables easy creation of well-distributed subdirectories such that no single directory contains too many files. To ensure recovery of the original mbox file, we finally record the sequence of chunks as encountered. With this we satisfy the requirement that new mails only add new chunks plus a new sequence of chunks, the latter being fairly negligible in size.
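A minimal sketch of this scheme might look as follows (Python; the store layout and file names here are hypothetical, not those of the actual tool):

```python
import hashlib
import os

def chunk_mbox(mbox_path: str, store_dir: str) -> None:
    """Split an mbox into content-addressed chunks plus a sequence file."""
    sequence = []
    chunk_lines = []

    def flush():
        if not chunk_lines:
            return
        data = b"".join(chunk_lines)
        digest = hashlib.md5(data).hexdigest()
        # The first two hex digits pick a subdirectory, so the uniform
        # hash keeps any single directory from holding too many files.
        subdir = os.path.join(store_dir, digest[:2])
        os.makedirs(subdir, exist_ok=True)
        path = os.path.join(subdir, digest)
        if not os.path.exists(path):  # a new chunk: the incremental delta
            with open(path, "wb") as f:
                f.write(data)
        sequence.append(digest)
        chunk_lines.clear()

    with open(mbox_path, "rb") as f:
        for line in f:
            # "From " at the start of a line marks a possible mail
            # boundary; bodies can contain it too, hence oversplitting.
            if line.startswith(b"From "):
                flush()
            chunk_lines.append(line)
        flush()

    # Recording the chunk order allows byte-exact reconstruction of the
    # original mbox by concatenating the chunks in sequence.
    with open(os.path.join(store_dir, "sequence"), "w") as f:
        f.write("\n".join(sequence))
```

Restoring the original mbox is then just a matter of concatenating the chunks named in the sequence file, in order; between backups, only the new chunks and the rewritten sequence file change.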

With my low-traffic Gmail account, I end up with about 99.8K chunks ≈ mails from 50.6K threads. This is tolerable enough for me, but I can see bigger accounts having 10× or 100× as many mails, at which point the number of chunks may become a concern from a file-system perspective. One mitigation would be reducing the chunking frequency by introducing arbitrary additional conditions, e.g., only splitting when hash('From' line) is even, as sketched below.
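As a sketch, such a thinned boundary test could look like this (again Python, illustrative only):

```python
import hashlib

def is_chunk_boundary(line: bytes) -> bool:
    # Treat a "From " line as a chunk boundary only when its hash is
    # even, roughly halving the expected number of chunks. The decision
    # depends only on the line's content, so unchanged mail still maps
    # to the same chunks across backups.
    return line.startswith(b"From ") and hashlib.md5(line).digest()[0] % 2 == 0
```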

You can find the app implementation on GitHub.

