(comments)

Original link: https://news.ycombinator.com/item?id=41578516

The user discusses how they install dict and the relevant dictionaries on pretty much every new laptop, noting that GCIDE in particular includes most of the famous 1913 Webster's dictionary. They mention that, by coincidence, they had just been reading about compressing dictionaries with "dictzip", a tool that produces gzip-compatible files supporting random seeks, which makes it suitable for keeping a "dictd" server's database compressed. The user's example splits a large dictionary file into 256 KB chunks and stores them in a zip archive, showing fast random decompression at only a small size penalty compared with compressing the whole file at once, and argues that the zip format offers advantages similar to dictzip's while being more flexible. They also point out that any algorithm designed for conventional spinning hard disks will work efficiently on compressed data stored as zip members of roughly 1 MB each. Finally, the user briefly describes "strfile" from the "fortune" package, which uses a different approach to organizing and finding strings.

Original text

dict and the relevant dictionaries are things i pretty much always install on every new laptop. gcide in particular includes most of the famous 1913 webster dictionary with its sparkling prose:
    : ~; dict glisten
    2 definitions found

    From The Collaborative International Dictionary of English v.0.48 [gcide]:

      Glisten \Glis"ten\ (gl[i^]s"'n), v. i. [imp. & p. p.
         {Glistened}; p. pr. & vb. n. {Glistening}.] [OE. glistnian,
         akin to glisnen, glisien, AS. glisian, glisnian, akin to E.
         glitter. See {Glitter}, v. i., and cf. {Glister}, v. i.]
         To sparkle or shine; especially, to shine with a mild,
         subdued, and fitful luster; to emit a soft, scintillating
         light; to gleam; as, the glistening stars.

         Syn: See {Flash}.
              [1913 Webster]
it's interesting to think about how you would implement this service efficiently under the constraints of mid-01990s computers, where a gigabyte was still a lot of disk space and multiuser unix servers commonly had about 100 mips (https://netlib.org/performance/html/dhrystone.data.col0.html)

totally by coincidence i was looking at the dictzip man page this morning; it produces gzip-compatible files that support random seeks so you can keep the database for your dictd server compressed. (as far as i know, rik faith's dictd is still the only server implementation of the dict protocol, which is incidentally not a very good protocol.) you can see that the penalty for seekability is about 6% in this case:

    : ~; ls -l /usr/share/dictd/jargon.dict.dz
    -rw-r--r-- 1 root root 587377 Jan  1  2021 /usr/share/dictd/jargon.dict.dz
    : ~; \time gzip -dc /usr/share/dictd/jargon.dict.dz|wc -c
    0.01user 0.00system 0:00.01elapsed 100%CPU (0avgtext+0avgdata 1624maxresident)k
    0inputs+0outputs (0major+160minor)pagefaults 0swaps
    1418350
    : ~; gzip -dc /usr/share/dictd/jargon.dict.dz|gzip -9c|wc -c
    556102
    : ~; units -t 587377/556102 %
    105.62397
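here's a rough sketch in python of how that seekability works, as far as i understand the format: dictzip records the compressed size of each chunk in an 'RA' subfield of the gzip extra header, and ends every chunk with a Z_FULL_FLUSH so a raw inflater can start decoding at any chunk boundary (an untested sketch, not the real dictd code; it ignores the optional FHCRC field):

    import struct, zlib

    def dictzip_read(path, offset, length):
        with open(path, 'rb') as f:
            hdr = f.read(10)                 # fixed gzip header
            assert hdr[:2] == b'\x1f\x8b' and hdr[3] & 4   # magic, FEXTRA
            (xlen,) = struct.unpack('<H', f.read(2))
            extra, pos, chlen, sizes = f.read(xlen), 0, None, None
            while pos < len(extra):          # walk the extra-field subfields
                (sublen,) = struct.unpack_from('<H', extra, pos + 2)
                if extra[pos:pos + 2] == b'RA':   # dictzip's chunk table
                    ver, chlen, chcnt = struct.unpack_from('<HHH', extra, pos + 4)
                    sizes = struct.unpack_from('<%dH' % chcnt, extra, pos + 10)
                pos += 4 + sublen
            assert sizes is not None
            if hdr[3] & 8:                   # skip FNAME if present
                while f.read(1) != b'\x00':
                    pass
            data_start = f.tell()
            # inflate only the chunks covering [offset, offset + length)
            first = offset // chlen
            last = (offset + length - 1) // chlen
            f.seek(data_start + sum(sizes[:first]))
            out = b''.join(zlib.decompressobj(-15).decompress(f.read(n))
                           for n in sizes[first:last + 1])
            return out[offset - first * chlen:][:length]

    print(dictzip_read('/usr/share/dictd/jargon.dict.dz', 0, 80))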
nowadays computers are fast enough that it probably isn't a big win to gzip in such small chunks (dictzip has a chunk limit of 64k) and you might as well use a zipfile, all implementations of which support random access:
    : ~; mkdir jargsplit
    : ~; cd jargsplit
    : jargsplit; gzip -dc /usr/share/dictd/jargon.dict.dz|split -b256K
    : jargsplit; zip jargon.zip xaa xab xac xad xae xaf 
      adding: xaa (deflated 60%)
      adding: xab (deflated 59%)
      adding: xac (deflated 59%)
      adding: xad (deflated 61%)
      adding: xae (deflated 62%)
      adding: xaf (deflated 58%)
    : jargsplit; ls -l jargon.zip 
    -rw-r--r-- 1 user user 565968 Sep 22 09:47 jargon.zip
    : jargsplit; time unzip -o jargon.zip xad
    Archive:  jargon.zip
      inflating: xad                     

    real    0m0.011s
    user    0m0.000s
    sys     0m0.011s
so you see 256-kibibyte chunks have submillisecond decompression time (more like 2 milliseconds on my cellphone) and only about a 1.8% size penalty for seekability:
    : jargsplit; units -t 565968/556102 %
    101.77413
and, unlike the dictzip format (which lists the chunks in a backward-compatible extra field in the gzip header), zip also supports efficient appending
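for instance, with python's zipfile, opening in 'a' mode leaves the existing members in place and rewrites only the central directory at the end of the file ('xag' here is a hypothetical seventh chunk):

    import zipfile

    # append mode: existing members are untouched; the new member and a
    # fresh central directory are written at the end of the file
    with zipfile.ZipFile('jargon.zip', 'a', zipfile.ZIP_DEFLATED) as z:
        z.writestr('xag', b'...the next 256 KiB of dictionary text...')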

even in python (3.11.2) it's only about a millisecond:

    In [13]: z = zipfile.ZipFile('jargon.zip')

    In [14]: [f.filename for f in z.infolist()]
    Out[14]: ['xaa', 'xab', 'xac', 'xad', 'xae', 'xaf']

    In [15]: %timeit z.open('xab').read()
    1.13 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
this kind of performance means that any algorithm that would be efficient reading data stored on a conventional spinning-rust disk will be efficient reading compressed data if you put the data into a zipfile in "files" of around a meg each. (writing is another matter; zstd may help here, with its order-of-magnitude faster compression, but info-zip zip and unzip don't support zstd yet.)
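a minimal sketch of that access pattern, assuming a hypothetical bigfile.zip whose members chunk0, chunk1, ... each hold 1 MiB of the uncompressed data:

    import zipfile

    CHUNK = 1 << 20                      # ~1 MiB logical blocks, as above

    def read_range(z, offset, length):
        # read an arbitrary byte range, inflating only the members
        # that cover it
        out = b''
        for i in range(offset // CHUNK, (offset + length - 1) // CHUNK + 1):
            out += z.read('chunk%d' % i)
        return out[offset % CHUNK:][:length]

    data = read_range(zipfile.ZipFile('bigfile.zip'), 3 << 20, 4096)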

dictd keeps an index file in tsv format which uses what looks like base64 to encode each entry's byte offset and length in the uncompressed data:

    : jargsplit; < /usr/share/dictd/jargon.index shuf -n 4 | LANG=C sort | cat -vte
    fossil^IB9xE^IL8$
    frednet^IB+q5^IDD$
    upload^IE/t5^IJ1$
    warez d00dz^IFLif^In0$
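decoding those fields is just reading base64 digits as a positional number, most significant digit first; the two numbers are the headword's byte offset and length in the uncompressed .dict file. a quick sketch:

    B64 = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
           'abcdefghijklmnopqrstuvwxyz0123456789+/')

    def b64_number(s):
        # dictd index numbers: plain base64 digits, no padding
        n = 0
        for c in s:
            n = n * 64 + B64.index(c)
        return n

    # the 'fossil' line above: its entry starts at byte 515140 of the
    # uncompressed jargon.dict and is 764 bytes long
    print(b64_number('B9xE'), b64_number('L8'))   # 515140 764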
this is very similar to the index format used by eric raymond's volks-hypertext https://www.ibiblio.org/pub/Linux/apps/doctools/vh-1.8.tar.g... or vi ctags or emacs etags, but it supports random access into the file

strfile from the fortune package works on a similar principle but uses a binary data file and no keys, just offsets:

    : ~; wget -nv canonical.org/~kragen/quotes.txt
    2024-09-22 10:44:50 URL:http://canonical.org/~kragen/quotes.txt [49884/49884] -> "quotes.txt" [1]
    : ~; strfile quotes.txt
    "quotes.txt.dat" created
    There were 87 strings
    Longest string: 1625 bytes
    Shortest string: 92 bytes
    : ~; fortune quotes.txt
      Get enough beyond FUM [Fuck You Money], and it's merely Nice To Have
        Money.

            -- Dave Long, , on FoRK, around 2000-08-16, in
               Message-ID <200008162000.NAA10898@maltesecat>
    : ~; od -i --endian=big quotes.txt.dat 
    0000000           2          87        1625          92
    0000020           0   620756992           0         933
    0000040        1460        2307        2546        3793
    0000060        3887        4149        5160        5471
    0000100        5661        6185        6616        7000
of course if you were using a zipfile you could keep the index in the zipfile itself, and then there's no point in using base64 for the file offsets, or limiting them to 32 bits
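a sketch of what that could look like, with hypothetical member names and a json index member mapping each headword to (chunk, offset, length):

    import json, zipfile

    chunks = [('xaa', b'...'), ('xab', b'...')]   # hypothetical chunk data
    index = {'glisten': ['xaa', 0, 3]}            # headword -> (chunk, offset, length)

    # writing: the chunks plus the index, all ordinary zip members
    with zipfile.ZipFile('dict.zip', 'w', zipfile.ZIP_DEFLATED) as z:
        for name, text in chunks:
            z.writestr(name, text)
        z.writestr('index.json', json.dumps(index))

    # lookup: one member read for the index, one for the chunk it names
    with zipfile.ZipFile('dict.zip') as z:
        idx = json.loads(z.read('index.json'))
        chunk, off, n = idx['glisten']
        definition = z.read(chunk)[off:off + n]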