Fabrice Bellard's TS Zip (2024)

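## ts_zip: LLM-based text compression

ts_zip is an experimental tool that uses a large language model (LLM) to compress text, with a markedly better compression ratio than conventional tools such as xz. It uses the RWKV 169M v4 model to predict the next token in a text file and encodes the result with arithmetic coding. The compression is impressive (about 1.0-1.4 bpb in the benchmarks, against 1.4-2.7 bpb for xz), but there are limitations. It needs a GPU to reach a reasonable speed (up to 1 MB/s on an RTX 4090) and 4 GB of RAM, and it currently supports only text files; binary files are hardly compressed at all. ts_zip works best on English text but also supports other languages and source code. Because it is experimental, backward compatibility between versions is not guaranteed. More details and benchmark results are available on Fabrice Bellard's website.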

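Fabrice Bellard has released "TS Zip", a compression tool built on a large language model (LLM). Unlike conventional compressors, TS Zip uses the LLM to *predict* the next token in the text and an arithmetic coder to encode each actual token according to the predicted probabilities. The compression is lossless, but it depends on the LLM being deterministic so that encoding and decoding stay consistent. The Hacker News discussion pointed out that the reported compressed sizes do not include the LLM or the runtime code, a point commenters compared with standard compression benchmarks. Users also noted that this is a fundamentally different kind of compression, one that relies on the LLM's "understanding" of language, much as humans remember and reconstruct information. One user jokingly suggested calling the compressed data "tokables". The core idea is to explore how LLMs inherently achieve compression through prediction, and whether that can be used for practical data compression.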

ts_zip: Text Compression using Large Language Models

The ts_zip utility can compress (and hopefully decompress) text files using a Large Language Model. The compression ratio is much higher than with other compression tools. There are some caveats of course:

A GPU is necessary to get a reasonable speed. 4 GB of RAM is required.
It is slower than conventional compressors (compression and decompression speed: up to 1 MB/s on an RTX 4090).
Only text files are supported. Binary files won't be compressed much.
The currently used language model (RWKV 169M v4) was trained mostly on English texts. Other languages are supported, including source code.
It is experimental, so no backward compatibility should be expected between the various versions.

See also ts_sms, which is optimized for the compression of small messages.

Compression Ratio

The compression ratio is given in bits per byte (bpb).

File	Original size (bytes)	xz (bytes)	xz (bpb)	ts_zip (bytes)	ts_zip (bpb)
alice29.txt	152089	48492	2.551	21713	1.142
book1	768771	261116	2.717	137477	1.431
enwik8	100000000	24865244	1.989	13825741	1.106
enwik9	1000000000	213370900	1.707	135443237	1.084
linux-1.2.13.tar	9379840	1689468	1.441	1196859	1.021
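
The bpb column can be recomputed from the byte counts: it is the compressed size in bits divided by the original size in bytes. The short Python sketch below does this for the rows of the table; the function name `bits_per_byte` and the `ROWS` list are illustrative names chosen here, not part of ts_zip.

```python
def bits_per_byte(original_size: int, compressed_size: int) -> float:
    """Compressed bits spent per byte of original input."""
    return 8 * compressed_size / original_size


# (file, original bytes, xz bytes, ts_zip bytes), copied from the table above
ROWS = [
    ("alice29.txt", 152089, 48492, 21713),
    ("book1", 768771, 261116, 137477),
    ("enwik8", 100000000, 24865244, 13825741),
    ("enwik9", 1000000000, 213370900, 135443237),
    ("linux-1.2.13.tar", 9379840, 1689468, 1196859),
]

for name, original, xz_size, ts_size in ROWS:
    print(f"{name:18s} xz {bits_per_byte(original, xz_size):.3f}  "
          f"ts_zip {bits_per_byte(original, ts_size):.3f}")
```

A value of 8 bpb would mean no compression at all, so lower is better.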

Results and speed for other programs on enwik8 and enwik9 are available at the Large Text Compression Benchmark.

Technical information

ts_zip uses the RWKV 169M v4 language model which is a good compromise between speed and compression ratio. The model is quantized to 8 bits per parameter and evaluated using BF16 floating point numbers.
The language model predicts the probabilities of the next token. An arithmetic coder then encodes the next token according to the probabilities.
The model is evaluated in a deterministic and reproducible way. Hence the result does not depend on the exact GPU or CPU model nor on the number of configured threads. This key point ensures that a compressed file can be decompressed on different hardware or with a different software configuration.
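
The predict-then-encode loop described above can be illustrated with a toy example. The Python sketch below is not ts_zip's implementation: it assumes a hypothetical `ToyModel` that merely counts byte frequencies in place of the RWKV language model, and it uses exact rational arithmetic (`fractions.Fraction`) instead of the fixed-precision arithmetic a practical coder would use. All names in it (`ToyModel`, `encode`, `decode`, `EOF`) are invented for illustration.

```python
from fractions import Fraction
from math import ceil

EOF = 256  # extra symbol that marks the end of the message


class ToyModel:
    """Hypothetical stand-in for the language model: adaptive byte counts."""

    def __init__(self):
        self.counts = [1] * 257  # 256 byte values + EOF, equally likely at first

    def interval(self, sym):
        """Return (start, probability) of sym inside [0, 1)."""
        total = sum(self.counts)
        return Fraction(sum(self.counts[:sym]), total), Fraction(self.counts[sym], total)

    def lookup(self, target):
        """Return the symbol whose interval contains target."""
        total, cum = sum(self.counts), 0
        for sym, count in enumerate(self.counts):
            if target < Fraction(cum + count, total):
                return sym
            cum += count
        raise ValueError("target outside [0, 1)")

    def update(self, sym):
        self.counts[sym] += 1


def encode(data: bytes):
    """Narrow [low, low + width) once per symbol, then name a point inside it."""
    model, low, width = ToyModel(), Fraction(0), Fraction(1)
    for sym in list(data) + [EOF]:
        start, prob = model.interval(sym)
        low, width = low + width * start, width * prob
        model.update(sym)
    nbits = 0
    while True:  # shortest binary fraction n / 2**nbits inside the final interval
        n = ceil(low * 2**nbits)
        if Fraction(n, 2**nbits) < low + width:
            return n, nbits
        nbits += 1


def decode(n: int, nbits: int) -> bytes:
    """Replay the same model updates to recover the symbols."""
    model, low, width = ToyModel(), Fraction(0), Fraction(1)
    value, out = Fraction(n, 2**nbits), bytearray()
    while True:
        sym = model.lookup((value - low) / width)
        if sym == EOF:
            return bytes(out)
        start, prob = model.interval(sym)
        low, width = low + width * start, width * prob
        out.append(sym)
        model.update(sym)


message = b"ab" * 200
n, nbits = encode(message)
assert decode(n, nbits) == message
print(nbits, "bits for", 8 * len(message), "bits of input")
```

The decoder only works because it rebuilds exactly the same probabilities as the encoder at every step. With exact fractions and a counting model that is automatic; with a neural model evaluated in floating point it is not, which is why deterministic evaluation matters for ts_zip.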

Fabrice Bellard - https://bellard.org/