In short: gzip streams contain metadata, like the operating system that did the compression. I built a tool to read this metadata.
I love reading specifications for file formats. They always have little surprises.
I had assumed that the gzip format was strictly used for compression. My guess was: a few bytes of bookkeeping, the compressed data, and maybe a checksum.
But then I read the spec. The gzip header holds more than I expected!
In addition to two bytes identifying the data as gzip, there’s also:
The operating system that did the compression. This was super surprising to me! There’s a single byte that identifies the compressor’s OS:
0for Windows,1for the Amiga,3for Unix, and many others I’d never heard of. Compressors can also set255for an “unknown” OS.Different tools set this value differently. zlib, the most popular gzip library, changes the flag based on the operating system. (It even defines some OSes that aren’t in the spec, like
18for BeOS.) Many other libraries build atop zlib and inherit this behavior, such as .NET’sGZipStream, Ruby’sGzipWriter, and PHP’sgzencode.Java’s
GZIPOutputStream, JavaScript’sCompressionStream, and Go’scompress/gzipset the OS to “unknown” regardless of operating system. Some, like Zopfli and Apache’smod_deflate, hard-code it to “Unix” no matter what.All that to say: in practice, you can’t rely on this flag to determine the source OS, but it can give you a hint.
Modification time for the data. This can be the time that compression started or the modification time of the file. It can also be set to
0if you don’t want to communicate a time.This is represented as an unsigned 32-bit integer in the Unix format. That means it can represent any moment between January 1, 1970 and February 7, 2106. I hope we devise a better compression format in the next ~80 years, because we can only represent dates in that range.
In my testing, many implementations set this to
0. A few set it to the current time or the file’s modification time—thegzipcommand is one of these.FTEXT, a boolean flag vaguely indicating that the data is “probably ASCII text”. When I say vaguely, I mean it: the spec “deliberately [does] not specify the algorithm used to set this”. This is apparently for systems which have different storage formats for ASCII and binary data.
In all my testing, nobody sets this flag to anything but
0.An extra flag indicating how hard the compressor worked.
2signals that it was compressed with max compression (e.g.,gzip -9),4for the fastest algorithm, and0for everything else.In practice, zlib and many others set this correctly per the spec, but some tools hard-code it to
0. And as far as I can tell, this byte is not used during decompression, so it doesn’t really matter.The original file name. For example, when I run
gzip my_file.txt, the name is set tomy_file.txt. This field is optional, so many tools don’t set it, but thegzipcommand line tool does. You can disable that withgzip --no-name.A comment. This optional field is seldom used, and many decompressors ignore it. But you could add a little comment if you want.
Extra arbitrary data. If the other metadata wasn’t enough, you can stuff whatever you want into arbitrary subfields. Each subfield has a two-byte identifier and then 0 or more bytes of additional info.
That’s way more info than I expected!
I was intrigued by this metadata and I’ve been wanting to learn Zig, so I wrote gzpeek.
gzpeek is a command-line tool that lets you inspect the metadata of gzip streams. Here’s how to read metadata from a gzipped file:
gzpeek my_file.gz
# FTEXT: 0
# MTIME: 1591676406
# XFL: 2
# OS: 3 (Unix)
# NAME: my_file.txt
It extracts everything I listed above: the operating system, original file name, modification time, and more. I used it a bunch when surveying different gzip implementations.
Give it a try, and let me know what gzip metadata you find.