Original text
Original link: https://news.ycombinator.com/item?id=41578516
The user describes their habit of installing dictionaries, in particular GCIDE (the GNU Collaborative International Dictionary of English), on every newly purchased laptop, and mentions that GCIDE incorporates material from the 1913 edition of Webster's dictionary. The commenter explains that they had recently been reading about compressing dictionaries, specifically with dictzip, a tool that produces gzip-compatible files supporting random seeks, which makes them suitable for use with the dictd server. Their examples compare compressing a large dictionary file in small chunks (dictzip caps its chunks at 64 KiB; the zipfile examples use 256-KiB members) against compressing it all at once, and weigh the resulting speed and size trade-offs. They also note that the zip format offers similar benefits with more flexibility, and that algorithms designed for conventional hard disks remain efficient when compressed data is broken into chunks of roughly 1 MB. Finally, they briefly describe the fortune package, whose strfile tool organizes and looks up short texts by a related offset-based method.
totally by coincidence i was looking at the dictzip man page this morning; it produces gzip-compatible files that support random seeks so you can keep the database for your dictd server compressed. (as far as i know, rik faith's dictd is still the only server implementation of the dict protocol, which is incidentally not a very good protocol.) you can see that the penalty for seekability is about 6% in this case:
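a rough sketch of how that size comparison could be reproduced, assuming gnu gzip and the dictzip tool shipped with dictd are installed; the filename gcide.dict and the -k "keep the input" flags are illustrative assumptions, not part of the original transcript:

    import os
    import shutil
    import subprocess

    # compress the same dictionary file twice, once with plain gzip and once
    # with dictzip, then compare sizes; "gcide.dict" is an illustrative name
    SRC = "gcide.dict"
    shutil.copy(SRC, "plain.dict")
    shutil.copy(SRC, "seekable.dict")

    subprocess.run(["gzip", "-9", "-k", "plain.dict"], check=True)
    subprocess.run(["dictzip", "-k", "seekable.dict"], check=True)

    gz = os.path.getsize("plain.dict.gz")
    dz = os.path.getsize("seekable.dict.dz")
    print(f"gzip: {gz} bytes, dictzip: {dz} bytes, "
          f"seekability penalty: {100 * (dz - gz) / gz:.1f}%")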
nowadays computers are fast enough that it probably isn't a big win to gzip in such small chunks (dictzip has a chunk limit of 64k) and you might as well use a zipfile, all implementations of which support random access: 256-kibibyte chunks have submillisecond decompression time (more like 2 milliseconds on my cellphone) and only about a 1.8% size penalty for seekability. and, unlike the dictzip format (which lists the chunks in an extra backward-compatible file header), zip also supports efficient appending. even in python (3.11.2) it's only about a millisecond:
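a minimal sketch of that kind of measurement with the stdlib zipfile module, assuming an uncompressed dictionary file named gcide.dict; the member naming scheme is made up for illustration:

    import time
    import zipfile

    CHUNK = 256 * 1024  # 256-KiB members, as above

    # split the uncompressed dictionary into fixed-size members of one zipfile,
    # so any member can be decompressed without touching the rest
    with open("gcide.dict", "rb") as f, zipfile.ZipFile(
            "gcide.zip", "w", compression=zipfile.ZIP_DEFLATED,
            compresslevel=9) as z:
        i = 0
        while chunk := f.read(CHUNK):
            z.writestr(f"chunk{i:05d}", chunk)
            i += 1

    # random access: decompress a single member and time it
    with zipfile.ZipFile("gcide.zip") as z:
        member = z.namelist()[len(z.namelist()) // 2]
        t0 = time.perf_counter()
        data = z.read(member)
        elapsed_ms = (time.perf_counter() - t0) * 1000
        print(f"read {len(data)} bytes from {member} in {elapsed_ms:.2f} ms")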
this kind of performance means that any algorithm that would be efficient reading data stored on a conventional spinning-rust disk will be efficient reading compressed data if you put the data into a zipfile in "files" of around a meg each. (writing is another matter; zstd may help here, with its order-of-magnitude faster compression, but info-zip zip and unzip don't support zstd yet.)

dictd keeps an index file in tsv format which uses what looks like base64 to locate the desired chunk and offset in the chunk:
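a sketch of decoding those numbers, on the assumption that each index line is headword, start, and length separated by tabs, with start and length written as base-64 digits in the alphabet A-Z, a-z, 0-9, +, /, most significant digit first; the filename in the comment at the end is illustrative:

    import string

    # dictd-style index numbers: each character is a base-64 digit from the
    # alphabet below, most significant digit first, with no padding
    DIGITS = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
    VALUE = {c: i for i, c in enumerate(DIGITS)}

    def b64_number(s: str) -> int:
        """Decode an index field like 'Hb4' into an integer offset or length."""
        n = 0
        for c in s:
            n = n * 64 + VALUE[c]
        return n

    def load_index(path: str) -> dict[str, tuple[int, int]]:
        """Read headword <TAB> start <TAB> length lines from an index file."""
        index = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, start, length = line.rstrip("\n").split("\t")[:3]
                index[word] = (b64_number(start), b64_number(length))
        return index

    # e.g. load_index("gcide.index")["hypertext"] -> (start, length) into the
    # uncompressed data, which the compressed file's chunk table then maps
    # to a particular chunk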
this is very similar to the index format used by eric raymond's volks-hypertext https://www.ibiblio.org/pub/Linux/apps/doctools/vh-1.8.tar.g... or vi ctags or emacs etags, but it supports random access into the file.

strfile from the fortune package works on a similar principle but uses a binary data file and no keys, just offsets:
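a sketch of the same principle (this imitates the idea, not strfile's actual .dat header layout): a binary table of fixed-width big-endian offsets next to a %-delimited text file, so picking a random entry costs one seek:

    import random
    import struct

    FORTUNES = "fortunes.txt"  # %-delimited text, as fortune uses
    OFFSETS = "fortunes.off"   # our own binary offset table (illustrative)

    def build(path=FORTUNES, out=OFFSETS):
        """Record the byte offset at which each entry starts."""
        offsets = [0]
        with open(path, "rb") as f:
            for line in iter(f.readline, b""):
                if line.rstrip(b"\r\n") == b"%":
                    offsets.append(f.tell())
            if offsets[-1] == f.tell():  # file ended with a bare '%'
                offsets.pop()
        with open(out, "wb") as f:
            f.write(struct.pack(f">{len(offsets)}I", *offsets))  # big-endian u32s

    def random_fortune(path=FORTUNES, off=OFFSETS) -> str:
        """One seek into the text file, guided only by the offset table."""
        with open(off, "rb") as f:
            raw = f.read()
        offsets = struct.unpack(f">{len(raw) // 4}I", raw)
        with open(path, "rb") as f:
            f.seek(random.choice(offsets))
            lines = []
            for line in iter(f.readline, b""):
                if line.rstrip(b"\r\n") == b"%":
                    break
                lines.append(line.decode("utf-8", "replace"))
        return "".join(lines)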
of course if you were using a zipfile you could keep the index in the zipfile itself, and then there's no point in using base64 for the file offsets, or limiting them to 32 bits
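a sketch of that arrangement, reusing the chunked zipfile from the earlier example; the index.tsv member and its word/member/offset/length columns in plain decimal are invented for illustration:

    import zipfile

    def lookup(archive: str, word: str) -> str | None:
        """Find a headword via an index member stored inside the zipfile itself.

        index.tsv lines are word <TAB> member <TAB> offset <TAB> length,
        with plain decimal numbers -- a made-up layout for illustration.
        """
        with zipfile.ZipFile(archive) as z:
            for line in z.read("index.tsv").decode("utf-8").splitlines():
                w, member, offset, length = line.split("\t")
                if w == word:
                    chunk = z.read(member)  # decompresses only that member
                    return chunk[int(offset):int(offset) + int(length)].decode("utf-8")
        return None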