Incremental Backups of Gmail Takeouts

原始链接: https://baecher.dev/stdout/incremental-backups-of-gmail-takeouts/

This article details the author's process of building an efficient incremental backup solution for Gmail Takeout data. While Google Takeout delivers a convenient mbox file for full backups, repeatedly backing up the *entire* file is inefficient because of its structure: new mail is not simply appended to the end of the existing file. An initial attempt focused on separating out attachments to shrink backup size, but proved too complex due to the intricacies of the mbox format and its many encoding schemes. The final solution uses a "chunking" heuristic that splits the mbox file on the "From" line preceding each email. Each chunk is content-addressed by its MD5 hash, making the approach resistant to reordering. This enables incremental backups that store only new chunks plus a small sequence record. While the author's account yields roughly 99.8K chunks, he acknowledges that larger accounts may need to tune the chunking frequency to stay within file-system limits. The code is available on GitHub.

## Incremental Gmail Backups: Hacker News Discussion

A Hacker News thread discusses the problem of backing up Gmail data, prompted by a post about a tool that creates archives easier to back up than Google Takeout's mbox files. The core issue is that Takeout's mbox files do not lend themselves to efficient incremental backups with tools like `restic`, because new mail is not simply appended.

Suggestions include using full MIME/mbox parsing libraries (GMime, MimeKit), storing messages as individual files, or combining IMAP with `git` for version control. Some users expressed concern about the recovery complexity of the original author's chunking approach. Many advocated using the Gmail API directly, with tools such as `gmvault` or `gmbackup`, to download messages as individual files.

A recurring debate centered on *how long to keep mail*. Some users periodically delete old messages, while others find long-term archives valuable for retrieving old purchase records, proving that something happened in the past, or simply for nostalgia. The discussion also touched on Google's increasingly inconvenient options for backing up Google Photos.

## Original Article
Incremental backups of Gmail takeouts


December 2025

In an earlier writeup I discussed how to create reproducible bundles of git repositories such that a file-based backup strategy can operate incrementally. My next target in this vein is Gmail Takeout: your Google account can be locked for arbitrary reasons, legitimate or otherwise, so it's imperative to have a regular backup of your mail. Google's Takeout service is a straightforward way to achieve this. In my case I have about 20 years of mail history in the account, going back all the way to the invite-only beta. Surprisingly, that amounts to only 5.7 GiB, with attachments being the driving factor, of course, all delivered in a single text-based mbox file.

This is completely fine for a one-time snapshot, but if you want to back this file up regularly with something like restic, then you will quickly end up in a world of pain: since new mails are not even appended to the end of the file, each cycle of takeout-then-backup essentially produces a new giant file. It would be nice if incremental backups only added the delta of mails that are actually new since the last backup.

I considered (and actually implemented) multiple different solutions to this problem. In one approach, I parsed the entire file and stripped out the attachments in order to store them as separate files, only leaving a link in the corresponding mail. This works reasonably well because the attachments account for the overwhelming majority of data. However, I was not particularly happy with this solution because parsing the file correctly is not trivial and resulted in a lot of complex code. The mail format is very forgiving so you end up with many special cases around peculiar behavior of mail clients. To give you an idea, consider that there is no length encoding; it's all multipart boundaries, and they can be nested, too. Actual attachment data also comes in a variety of different encodings, even file-name encoding is done in several micro formats. In the end I got the correct number of mails with my parser compared to the Gmail interface (accounting for threaded view), but the complexity of that code felt wrong.

What I eventually settled on instead is a simple chunking heuristic based on the From ... line in front of every mail. The catch is that this line can also appear in the body of a mail. This results in slight oversplitting: every mail boundary is also a chunk boundary, but not every chunk boundary is a mail boundary. In other words, one mail may be partitioned into multiple chunks. Each chunk is then saved as a file, content addressed by its MD5 sum. Content addressing makes the approach resistant against mail reordering in the mbox file. We could have used the Gmail mail id for this purpose, but the uniform distribution of the content hash enables easy creation of well-distributed subdirectories such that no single directory contains too many files. To ensure recovery of the original mbox file we finally record the sequence of chunks as encountered. With this we satisfy the requirement that new mails only add new chunks plus a new sequence of chunks, the latter being fairly negligible in size.
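The scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual implementation: the function name, the two-level directory fan-out, and the `sequence.txt` file name are my own choices; only the split-on-`From `, MD5 content addressing, and recorded chunk sequence come from the text.

```python
import hashlib
from pathlib import Path


def chunk_mbox(mbox_path, out_dir):
    """Split an mbox file on 'From ' lines, store each chunk
    content-addressed by its MD5 sum, and record the chunk order."""
    out = Path(out_dir)
    sequence = []
    chunk_lines = []

    def flush():
        if not chunk_lines:
            return
        data = b"".join(chunk_lines)
        digest = hashlib.md5(data).hexdigest()
        # The uniform hash prefix gives a well-distributed two-level
        # fan-out, so no single directory holds too many files.
        chunk_file = out / digest[:2] / digest
        chunk_file.parent.mkdir(parents=True, exist_ok=True)
        chunk_file.write_bytes(data)
        sequence.append(digest)
        chunk_lines.clear()

    with open(mbox_path, "rb") as f:
        for line in f:
            # Every 'From ' line opens a new chunk; this oversplits
            # when the pattern also occurs inside a mail body.
            if line.startswith(b"From ") and chunk_lines:
                flush()
            chunk_lines.append(line)
        flush()

    # The small sequence record is all that's needed for recovery.
    (out / "sequence.txt").write_text("\n".join(sequence) + "\n")
    return sequence
```

Recovering the original mbox is then just a matter of concatenating the chunk files in the order recorded in `sequence.txt`; an incremental backup of the output directory only picks up chunks whose hashes it has not seen before.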

With my low-traffic Gmail account, I end up with about 99.8K chunks ≈ mails from 50.6K threads. This is tolerable enough for me, but I can see bigger accounts having 10× or 100× mails, at which point the number of chunks may become a concern from a file-system perspective. One mitigation would be reducing the chunking frequency by introducing arbitrary additional conditions, e.g., only split when hash('From' line) is even.

You can find the implementation on GitHub.


