DjVu and its connection to Deep Learning (2023)

Original link: https://scottlocklin.wordpress.com/2023/05/31/djvu-and-its-connection-to-deep-learning/

## DjVu: The Overlooked Document Format

DjVu is a file format that is clearly superior to PDF for handling scanned documents, especially books and academic papers. Although modern PDF has adopted some of DjVu's innovations, it still falls short at handling scanned content efficiently. PDF usually treats scans as plain images (such as JPEGs), which leads to bloated files and poor text quality. DjVu, by contrast, intelligently analyzes a document as a mix of text and images and discards redundant data, achieving significantly better compression.

DjVu was created by pioneers who went on to found deep learning (including Yann LeCun and Léon Bottou), and it uses advanced compression techniques such as wavelets and arithmetic coding (IW44 and JB2) to achieve impressive file sizes. It even contains elements that could be vulnerable to attack, highlighting the security risks inherent in complex formats like PDF.

Despite its technical advantages, and the possibility of building a vast online library with it, DjVu never achieved mainstream adoption because operating systems and browsers lacked native support. Today, accessing DjVu files usually requires specialized software, such as KOReader on a rooted e-reader, a frustrating obstacle for what would otherwise be an ideal portable format for scanned content. The author argues that the knowledge preserved in DjVu may even be worth more than much of today's born-digital information.


## Original Article

DjVu is a vastly superior file format to original PDF for books, mathematical papers and just about anything else you can think of (current year PDF adopted some of its innovations, but they’re only used to break into your ipotato afaik). PDF is mostly postscript with a bunch of weird metadata and layers. This is fine if the PDF is generated by LaTeX and is designed to be something that comes out of a printer. But most of the world’s useful text is still on pieces of paper that have to be scanned to be on the interbutts. DjVu is good at sharing compressed book scans, and PDF is not. It shows its superiority when someone makes a big image scan in PDF, which is just a bunch of photographic images in jpeg (which is absolute shit at representing text, in part because of how its blockwise cosine transform works) or tiff. DjVu assumes that the data is some kind of mix of text and images and as such most of the data can be safely thrown away. This is a good assumption; usually I just want the text and plots, and DjVu captures those well. PDF generally clownishly captures everything in a scan more or less as a bitmap, or using jpeg’s silly rastered cosine transform.

[Figure: Why jpeg sucks on text]
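A minimal sketch of the problem, assuming numpy and scipy are available. This is not real JPEG; zeroing the high-frequency coefficients below is just a crude stand-in for what JPEG’s coarse quantization tables do at low quality:

```python
# A hard black stroke on white, the kind of sharp edge scanned text is full of.
import numpy as np
from scipy.fft import dctn, idctn

block = np.full((8, 8), 255.0)      # white 8x8 block
block[:, 3:5] = 0.0                 # a black vertical "glyph" stroke

coeffs = dctn(block, norm="ortho")  # 8x8 blockwise DCT, the transform JPEG uses
coeffs[4:, :] = 0.0                 # crude stand-in for coarse quantization:
coeffs[:, 4:] = 0.0                 # throw away the high-frequency coefficients

out = idctn(coeffs, norm="ortho")
print(np.round(out).astype(int))
# Each row comes back roughly [300 236 146 82 82 146 236 300]: the white paper
# overshoots past 255 and the black stroke turns grey. That ringing/smearing is
# what JPEG does to text edges; wavelets localize the edge far more gracefully.
```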

Yann LeCun, Léon Bottou and Yoshua Bengio were creators of DjVu along with some other guys you’re less likely to have heard of (Patrick Haffner, Bill Riemers etc etc). All three are also fathers of Deep Learning (along with Geoffrey Hinton before he developed his peculiar fear of ballpoint pens). Leon and Yann are also creators of my favorite little programming language, Lush, the Lisp Universal Shell, where they did much of their pioneering work back in the 90s. While I know enough about programming languages to understand why current year Torch migrated from Lush to Lua to Python, it will always remain one of my all time favorite designs; as interesting in its own way as the K family, and since it never had to do crap like maintain order books … one of the comfiest languages I’ve ever used. When I retire I’ll probably revive it and use it to power superior robot vacuum cleaners or something. It’s really that good. There’s so much cool shit hanging around in it from their R&D days as well; just mind boggling stuff, like looking at Leonardo’s notebooks. I ain’t even talking about the neural stuff; all of it from the Ogre UI to the codebook editor is genius.


Since deep learning models are all the bugman talks about any more, the older work product of their creators should interest people, at least for historical perspective. It was an important problem: in 1998 the internet was still pretty new and stuff like PDFs didn’t quite work right. We mostly downloaded LZ77/Huffman coded postscript files when we wanted to use the internet for its original purpose of sharing scientific papers. Those were awful. They weren’t awful because you had to unzip the file before you could look at it (though you did), but because they were quite large (maybe 4x what PDFs delivered from compiled LaTeX), and the internet in those days was very slow. It would take minutes to download a couple of shitty jpeg files with boobs in them, let alone the 40 megs of javascript that websites make you download now so they can track you.

At the time DjVu solved an important problem, allowing very good compression ratios and even allowing scanned stuff to be efficiently shared online, potentially making the internet into a super library including all printed books as well as generated net.content. The problem was most operating systems didn’t come with a DjVu reader, but Adobe made sure everyone had a PDF reader. Finding and installing a DjVu reader was a pain in the ass. Browsers in those days mostly couldn’t display either PDFs or DjVu, so that wasn’t even an option.

One of the cool things about DjVu is that it internally uses an image format very similar to JPEG2000 for image backgrounds (called IW44). You have probably never seen a JPEG2000 image (unless in a DjVu file), but it’s a fantastic idea built on wavelet compression: if you only get the first quarter of the file, you still get a pretty nice low-resolution image. It provides a natural way of doing lossy compression; just drop the higher-order wavelet coefficients. It also compresses better than regular JPEG. The wavelet coefficients are further compressed with arithmetic coding, which is also a mighty cool idea.
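A toy illustration of that progressive property, assuming numpy; this uses a simple Haar-style average/difference decomposition rather than the actual IW44 wavelet or its arithmetic coder, so it shows the idea, not DjVu’s implementation:

```python
import numpy as np

def haar2d_level(img):
    """One level of a Haar-style 2D decomposition: (half-size coarse, details)."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    coarse = (a + b + c + d) / 4.0           # low-pass: a half-resolution image
    details = (a - b, a - c, a - d)          # high-pass: edges and texture
    return coarse, details

def haar2d_inverse(coarse, details):
    """Exact inverse of one level, rebuilding the double-size image."""
    db, dc, dd = details
    a = coarse + (db + dc + dd) / 4.0
    b, c, d = a - db, a - dc, a - dd
    out = np.empty((coarse.shape[0] * 2, coarse.shape[1] * 2))
    out[0::2, 0::2], out[0::2, 1::2] = a, b
    out[1::2, 0::2], out[1::2, 1::2] = c, d
    return out

# A synthetic "page background": a smooth gradient with a blocky figure on it.
x = np.arange(256.0)
img = np.tile(x / 255.0 * 0.5, (256, 1))
img[96:160, 96:160] += 0.5

coarse, pyramid = img, []
for _ in range(3):                           # three levels of decomposition
    coarse, details = haar2d_level(coarse)
    pyramid.append(details)

# Progressive decoding: the 32x32 coarse band alone is already a usable
# thumbnail; each detail band sent afterwards refines it, and dropping the
# finest bands is the natural lossy step.
recon = coarse
for details in reversed(pyramid):
    recon = haar2d_inverse(recon, details)
print(coarse.shape, np.allclose(recon, img))  # (32, 32) True
```

IW44 also entropy-codes those bands with the ZP-coder described below; the sketch skips that entirely.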

There is another format it uses for foreground text (it looks for text) called JB2, which is related to JBIG2, the thing in PDF which was buffer overflowed on the ipotato by Pegasus. You have to be careful with your document formats; I strongly suspect PDF has more holes like this in it, just because it has so much going on inside. JB2 is cool because it’s a sort of clustering algorithm: it looks for bitmaps which are around the size of characters, then looks for things which are geometrically similar to them, effectively doing a quick and dirty map of pixel clusters into symbols (not necessarily text symbols: the idea is textually agnostic). Then the document is compressed with arithmetic coding over those symbols.
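A rough sketch of that clustering-into-symbols idea, assuming numpy and scipy; the helper names (`blobs`, `match`) and the crude Hamming-distance test are invented for illustration, and the real JB2 matcher and coder are considerably more careful:

```python
import numpy as np
from scipy import ndimage

def blobs(binary_img):
    """Yield (position, bitmap) for each connected component (candidate glyph)."""
    labels, n = ndimage.label(binary_img)
    for i in range(1, n + 1):
        ys, xs = np.where(labels == i)
        bitmap = labels[ys.min():ys.max() + 1, xs.min():xs.max() + 1] == i
        yield (int(ys.min()), int(xs.min())), bitmap

def match(bitmap, dictionary, max_diff=2):
    """Return the index of a close-enough dictionary shape, or add a new one."""
    for idx, ref in enumerate(dictionary):
        if ref.shape == bitmap.shape and np.sum(ref ^ bitmap) <= max_diff:
            return idx
    dictionary.append(bitmap)
    return len(dictionary) - 1

# A fake "page" with three pixel blobs; two of them have identical shapes.
page = np.zeros((20, 40), dtype=bool)
page[2:8, 2:5] = True       # blob 1
page[2:8, 10:13] = True     # blob 2, same shape as blob 1 -> same symbol
page[12:18, 2:8] = True     # blob 3, a different shape -> new symbol

dictionary, placements = [], []
for pos, bmp in blobs(page):
    placements.append((match(bmp, dictionary), pos))

print(len(dictionary), placements)
# 2 [(0, (2, 2)), (0, (2, 10)), (1, (12, 2))] -- the page is now stored as
# "symbol id + offset" plus a small shape dictionary, instead of raw pixels.
```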

The arithmetic coding system used is also innovative; it’s called the ZP-coder. It’s similar to other simple run length coding systems in its use of probability tables, but oriented towards decoding performance. It is a shame the ZP-coder isn’t a universal coder; if it were, it might be able to make convincing fake documents based on the corpus in the document (aka do generative prediction the way OpenAI does with neural nets, using a considerably cheaper algorithm). Pretty cool that it works well on both the wavelets and the text though.
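To show just the probability-table idea (not the ZP-coder itself, which uses clever fixed-point approximations for speed), here is a toy adaptive binary arithmetic coder using exact fractions; it is slow but round-trips correctly, and the same adaptive model that drives the coder is exactly what you would sample from to do the generative trick described above:

```python
from fractions import Fraction

def model():
    """Adaptive order-0 model: P(bit = 1) estimated from counts seen so far."""
    counts = [1, 1]                               # Laplace-smoothed counts
    def p1():
        return Fraction(counts[1], counts[0] + counts[1])
    def update(bit):
        counts[bit] += 1
    return p1, update

def encode(bits):
    p1, update = model()
    low, high = Fraction(0), Fraction(1)
    for bit in bits:
        split = low + (high - low) * (1 - p1())   # [low, split) encodes a 0
        low, high = (split, high) if bit else (low, split)
        update(bit)
    return (low + high) / 2                       # any point inside the interval

def decode(code, n):
    p1, update = model()                          # replay the same adaptation
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(n):
        split = low + (high - low) * (1 - p1())
        bit = 1 if code >= split else 0
        low, high = (split, high) if bit else (low, split)
        update(bit)
        out.append(bit)
    return out

msg = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]        # heavily skewed toward 1s
code = encode(msg)
assert decode(code, len(msg)) == msg
print(code)
# A likelier (more skewed) message ends in a wider final interval, which is
# exactly what lets an arithmetic coder spend fewer bits on it.
```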

It’s a shame it didn’t catch on better, and there is probably an HBS case study in the full story of why the objectively superior tool failed in the market. It even failed in Internet Archive use, which it was also well suited for. DjVu still has utility for storing and reading scanned documents. The main problem with it is the problem it had of old: lack of support. Black and white e-book readers like the Kindle and the Kobo don’t support it natively despite it being just about the perfect format for scanned documents on a limited-processor greyscale e-book reader. I personally use a Kobo Forma rooted with the excellent KOReader to get access to the many useful DjVu files I have (basically all my textbooks available on the road). It’s ridiculous that I have to hack a device to get access to physically portable DjVu files, but I suppose scanned books don’t make anybody money.

I’ve long held that most of the knowledge developed since the advent of the internauts is basically anti-knowledge, meaning those scanned books in DjVu are potentially more valuable than all the PDFs in the universe. It would be nice to see it used by more mainstream publishers, but the lack of a DjVu target for things like LaTeX means it probably won’t be. I guess in the meanwhile DjVu is the most punk rock document format.

 

https://en.wikipedia.org/wiki/DjVu

 
