Gzpeek: Tool to Parse Gzip Metadata

Original link: https://evanhahn.com/introducing-gzpeek/

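## gzpeek: Revealing the Hidden Data in Gzip Files

Besides compressed data, gzip files carry a good deal of metadata. A new tool, **gzpeek**, lets users inspect this hidden information. The gzip header records details such as the operating system used for compression (with varying reliability: some tools hard-code the value or set it to "unknown"), the original file's modification time, a flag hinting that the data may be text, the compression level, the original file name, and even an optional comment or arbitrary extra data.

The author built gzpeek to explore these details, out of an interest in file format specifications and a desire to learn Zig. The operating system flag is not always accurate and the modification time has a limited date range, but the metadata still offers interesting insight into a file's history.

gzpeek is a command-line tool that extracts this information from gzip files. It is available now, and the author encourages users to explore and share what they find.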

Hacker News discussion: Gzpeek: Tool to Parse Gzip Metadata (evanhahn.com), 11 points, submitted by ingve, 1 comment

mzajc: Interesting stuff! It's a bit odd that gzip itself doesn't display (and sometimes doesn't even write) these values. Another tool that shows some of them is file(1), at least with Debian's libmagic-mgc, for example:

gzip compressed data, was "file.txt", last modified: March 1, 2026 20:21:34, from Unix, original size modulo 2^32 5
gzip compressed data, extra field, has comment, last modified: December 20, 2017 10:30:35, from Unix

(Oddly, it doesn't show the actual comment string.)

Original article
Introducing gzpeek, a tool to parse gzip metadata

In short: gzip streams contain metadata, like the operating system that did the compression. I built a tool to read this metadata.

I love reading specifications for file formats. They always have little surprises.

I had assumed that the gzip format was strictly used for compression. My guess was: a few bytes of bookkeeping, the compressed data, and maybe a checksum.

But then I read the spec. The gzip header holds more than I expected!

In addition to two bytes identifying the data as gzip, there’s also:

  • The operating system that did the compression. This was super surprising to me! There’s a single byte that identifies the compressor’s OS: 0 for FAT filesystems (MS-DOS, OS/2, and Windows), 1 for the Amiga, 3 for Unix, and many others I’d never heard of. Compressors can also set 255 for an “unknown” OS.

    Different tools set this value differently. zlib, the most popular gzip library, changes the flag based on the operating system. (It even defines some OSes that aren’t in the spec, like 18 for BeOS.) Many other libraries build atop zlib and inherit this behavior, such as .NET’s GZipStream, Ruby’s GzipWriter, and PHP’s gzencode.

    Java’s GZIPOutputStream, JavaScript’s CompressionStream, and Go’s compress/gzip set the OS to “unknown” regardless of operating system. Some, like Zopfli and Apache’s mod_deflate, hard-code it to “Unix” no matter what.

    All that to say: in practice, you can’t rely on this flag to determine the source OS, but it can give you a hint.

  • Modification time for the data. This can be the time that compression started or the modification time of the file. It can also be set to 0 if you don’t want to communicate a time.

    This is represented as an unsigned 32-bit integer in the Unix format. That means it can represent any moment between January 1, 1970 and February 7, 2106. I hope we devise a better compression format in the next ~80 years, because we can only represent dates in that range.

    In my testing, many implementations set this to 0. A few, including the gzip command, set it to the current time or the file’s modification time.

  • FTEXT, a boolean flag vaguely indicating that the data is “probably ASCII text”. When I say vaguely, I mean it: the spec “deliberately [does] not specify the algorithm used to set this”. This is apparently for systems which have different storage formats for ASCII and binary data.

    In all my testing, nobody sets this flag to anything but 0.

  • An extra flag indicating how hard the compressor worked. 2 signals that it was compressed with max compression (e.g., gzip -9), 4 for the fastest algorithm, and 0 for everything else.

    In practice, zlib and many others set this correctly per the spec, but some tools hard-code it to 0. And as far as I can tell, this byte is not used during decompression, so it doesn’t really matter.

  • The original file name. For example, when I run gzip my_file.txt, the name is set to my_file.txt. This field is optional, so many tools don’t set it, but the gzip command line tool does. You can disable that with gzip --no-name.

  • A comment. This optional field is seldom used, and many decompressors ignore it. But you could add a little comment if you want.

  • Extra arbitrary data. If the other metadata wasn’t enough, you can stuff whatever you want into arbitrary subfields. Each subfield has a two-byte identifier and then 0 or more bytes of additional info.
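To make the layout above concrete, here is a minimal Python sketch of reading these fields from the start of a gzip stream. It is not gzpeek's own code; the function name `parse_gzip_header`, the helper `_read_cstring`, and the shape of the returned dict are made up for illustration. The byte offsets and flag bits follow the gzip spec (RFC 1952), and the `OS_NAMES` table is deliberately partial.

```python
import struct

# Flag bits in the FLG byte, as defined by RFC 1952.
FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 0x01, 0x02, 0x04, 0x08, 0x10

# A few OS byte values from RFC 1952 (not the full table).
OS_NAMES = {0: "FAT filesystem", 1: "Amiga", 3: "Unix", 255: "unknown"}


def _read_cstring(data, pos):
    """Read a zero-terminated Latin-1 string starting at pos."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("latin-1"), end + 1


def parse_gzip_header(data):
    """Return a dict of header fields from the first gzip member in data."""
    # Fixed 10-byte header: ID1, ID2, CM, FLG, MTIME (little-endian), XFL, OS.
    id1, id2, method, flags, mtime, xfl, os_byte = struct.unpack_from(
        "<BBBBIBB", data, 0
    )
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip stream")
    if method != 8:
        raise ValueError("unknown compression method")
    info = {
        "ftext": bool(flags & FTEXT),
        "mtime": mtime,
        "xfl": xfl,
        "os": os_byte,
        "os_name": OS_NAMES.get(os_byte, "other"),
        "extra": {},
        "name": None,
        "comment": None,
    }
    pos = 10
    if flags & FEXTRA:
        # XLEN, then subfields: two-byte ID, two-byte length, payload.
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2
        end = pos + xlen
        while pos + 4 <= end:
            sub_id = data[pos:pos + 2]
            (sub_len,) = struct.unpack_from("<H", data, pos + 2)
            info["extra"][sub_id] = data[pos + 4:pos + 4 + sub_len]
            pos += 4 + sub_len
        pos = end
    if flags & FNAME:
        info["name"], pos = _read_cstring(data, pos)
    if flags & FCOMMENT:
        info["comment"], pos = _read_cstring(data, pos)
    return info
```

Calling `parse_gzip_header(open("my_file.gz", "rb").read())` should return the fields for the first member of the file. The sketch ignores the optional header CRC and does nothing with the compressed data that follows.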

That’s way more info than I expected!

I was intrigued by this metadata and I’ve been wanting to learn Zig, so I wrote gzpeek.

gzpeek is a command-line tool that lets you inspect the metadata of gzip streams. Here’s how to read metadata from a gzipped file:

gzpeek my_file.gz
# FTEXT: 0
# MTIME: 1591676406
# XFL: 2
# OS: 3 (Unix)
# NAME: my_file.txt

It extracts everything I listed above: the operating system, original file name, modification time, and more. I used it a bunch when surveying different gzip implementations.
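The MTIME line in the sample output is a raw Unix timestamp. As a small worked example, this Python sketch converts it to a calendar date and also shows where the 2106 limit of an unsigned 32-bit field comes from. The helper name `mtime_to_utc` is made up for illustration and is not part of gzpeek.

```python
from datetime import datetime, timezone


def mtime_to_utc(mtime):
    """Convert a gzip MTIME value to a UTC datetime, or None if unset."""
    if mtime == 0:
        return None  # 0 means no time was recorded
    return datetime.fromtimestamp(mtime, tz=timezone.utc)


print(mtime_to_utc(1591676406))  # the MTIME from the sample output
print(mtime_to_utc(2**32 - 1))   # largest value an unsigned 32-bit field holds
```

The first value works out to 2020-06-09 04:20:06 UTC, and the second to 2106-02-07 06:28:15 UTC, which is the upper bound mentioned earlier.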

Give it a try, and let me know what gzip metadata you find.
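If you want test streams to experiment with, one option is to generate them in memory. The sketch below uses Python's standard `gzip.GzipFile` to write the same payload at three compression levels and then prints the raw MTIME, XFL, and OS bytes. The helper name `make_gzip` is made up for illustration. What gets written for XFL and OS depends on the Python version, so treat the printed values as observations about your own install rather than as fixed facts.

```python
import gzip
import io


def make_gzip(payload, name, mtime, level):
    """Compress payload in memory, recording name and mtime in the header."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode="wb", compresslevel=level,
                       fileobj=buf, mtime=mtime) as f:
        f.write(payload)
    return buf.getvalue()


for level in (1, 6, 9):
    raw = make_gzip(b"hello\n", "my_file.txt", 1591676406, level)
    mtime = int.from_bytes(raw[4:8], "little")  # bytes 4-7: MTIME
    xfl, os_byte = raw[8], raw[9]               # byte 8: XFL, byte 9: OS
    print(f"level={level} MTIME={mtime} XFL={xfl} OS={os_byte}")
```

Writing `raw` to a file and passing that file to gzpeek is an easy way to compare its output against bytes you control.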
