Should LLMs just treat text content as an image?

Original link: https://www.seangoedecke.com/text-tokens-as-image-tokens/

## Optical compression: a potential AI efficiency gain

DeepSeek's latest research highlights a surprising finding: representing text as images ("optical compression") may be more efficient for AI models than processing text directly. Their work shows that a single image token can accurately represent roughly 10 text tokens, exploiting the fact that image embeddings are continuous while text tokens are discrete. This has implications for cutting costs and fitting in more data. Much like speeding up audio to lower transcription costs, converting text into an image before feeding it to a model could allow more data to be supplied at inference time. Strategies such as gradually lowering the image resolution of older context, mimicking human memory, are also being explored. While this is currently a niche technique, some early implementations show promise, and the potential is substantial. It raises the question of whether future AI models should fundamentally treat text as a kind of image data, perhaps mirroring how humans process information. Challenges remain in training models specifically on image-based text, but the concept of optical compression offers a compelling path toward more efficient and capable AI systems.

## LLMs and image processing: a summary

A recent Hacker News discussion explored the idea of treating text as images for large language models (LLMs) to process. The core question is whether it is worthwhile to bypass traditional text tokenization and process text directly as visual data. While past attempts to convert data into images for machine learning have often failed, proponents argue that images have advantages: they can capture nuances that plain text encoding loses (such as formatting and letter shapes), and they use continuous rather than discrete tokens. Critics counter that these advantages do not *require* an image-conversion step and merely highlight the limitations of current tokenization methods. There are success stories, such as converting music to spectrograms for AI generation and DeepMind's DeepVariant (which converts genomic data into images), but these are specific cases. The debate centers on whether the added complexity of image processing outweighs the potential gains, particularly given the context and embedding information already inherent in text tokens. Ultimately, the discussion highlights the ongoing search for more efficient and information-rich data representations for LLMs, and asks whether the future lies in visual processing or in improved text encoding techniques.

## Original article

Several days ago, DeepSeek released a new OCR paper. OCR, or “optical character recognition”, is the process of converting an image of text - say, a scanned page of a book - into actual text content. Better OCR is obviously relevant to AI because it unlocks more text data to train language models on. But there’s a more subtle reason why really good OCR might have deep implications for AI models.

Optical compression

According to the DeepSeek paper, you can pull out 10 text tokens from a single image token with near-100% accuracy. In other words, a model’s internal representation of an image is ten times as efficient as its internal representation of text. Does this mean that models shouldn’t consume text at all? When I paste a few paragraphs into ChatGPT, would it be more efficient to convert that into an image of text before sending it to the model? Can we supply 10x or 20x more data to a model at inference time by supplying it as an image of text instead of text itself?
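To make the arithmetic concrete: if one image token really does stand in for about ten text tokens, the savings on a long prompt are easy to estimate. Here's a quick sketch; the ratio is the paper's headline number, everything else is illustrative:

```python
# Rough arithmetic behind the headline claim: if one image token can stand in
# for ~10 text tokens (the DeepSeek OCR paper's near-100%-accuracy regime),
# a long prompt shrinks accordingly. Prompt sizes below are illustrative.
TEXT_TOKENS_PER_IMAGE_TOKEN = 10

def image_tokens_needed(text_tokens: int, ratio: int = TEXT_TOKENS_PER_IMAGE_TOKEN) -> int:
    return -(-text_tokens // ratio)   # ceiling division

for prompt_size in (1_000, 10_000, 100_000):
    print(f"{prompt_size:>7} text tokens -> ~{image_tokens_needed(prompt_size):>6} image tokens")
```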

This is called “optical compression”. It reminds me of a funny idea from June of this year to save money on OpenAI transcriptions: before uploading the audio, run it through ffmpeg to speed it up by 2x. The model is smart enough to still pull out the text, and with one simple trick you’ve cut your inference costs and time by half. Optical compression is the same kind of idea: before uploading a big block of text, take a screenshot of it (and optionally downscale the quality) and upload the screenshot instead.
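Here's a minimal sketch of what that looks like in practice, using PIL to render a block of text to an image and optionally downscale it before upload. The page width, font, and scale factor are arbitrary choices for illustration, not anything from the paper:

```python
# A minimal sketch of the "optical compression" trick: render a block of text
# to an image (optionally at reduced resolution) so it can be sent to a
# multimodal model as a picture instead of as raw text tokens.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_text_to_image(text: str, width: int = 1024, scale: float = 0.5) -> Image.Image:
    font = ImageFont.load_default()
    lines = textwrap.wrap(text, width=100)
    line_height = 14
    img = Image.new("RGB", (width, line_height * (len(lines) + 2)), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((10, 10 + i * line_height), line, fill="black", font=font)
    # Downscale: cheaper to encode, at the cost of a blurrier page.
    if scale != 1.0:
        img = img.resize((int(img.width * scale), int(img.height * scale)))
    return img

page = render_text_to_image("Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 40)
page.save("context_page.png")   # upload this instead of the raw text
```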

Some people are already sort-of doing this with existing multimodal LLMs. There’s a company selling this as a service, an open-source project, and even a benchmark. It seems to work okay! Bear in mind that this is not an intended use case for existing models, so it’s plausible that it could get a lot better if AI labs start actually focusing on it.

The DeepSeek paper suggests an interesting way to use tighter optical compression for long-form text contexts. As the context grows, you could decrease the resolution of the oldest images so they’re cheaper to store, but are also literally blurrier. The paper suggests an analogy between this and human memory, where fresh memories are quite vivid but older ones are vaguer and have less detail.
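A rough sketch of that decay schedule, assuming you keep your context as a list of rendered pages; the specific scale steps are made up for illustration:

```python
# A sketch of the "blurry memory" idea: keep recent context pages at full
# resolution and progressively downscale older ones so they cost fewer image
# tokens. The decay schedule here is invented for illustration.
from PIL import Image

def decay_context(pages: list[Image.Image]) -> list[Image.Image]:
    """pages[0] is the oldest page, pages[-1] the newest."""
    decayed = []
    for age, page in enumerate(reversed(pages)):      # age 0 = newest page
        scale = max(0.25, 1.0 - 0.25 * age)           # newest 1.0, then 0.75, 0.5, 0.25...
        size = (int(page.width * scale), int(page.height * scale))
        decayed.append(page.resize(size))
    return list(reversed(decayed))                    # restore oldest-first order
```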

Why would this work?

Optical compression is pretty unintuitive to many software engineers. Why on earth would an image of text be expressible in fewer tokens than the text itself?

In terms of raw information density, an image obviously contains more information than its equivalent text. You can test this for yourself by creating a text file, screenshotting the page, and comparing the size of the image with the size of the text file: the image is about 200x larger. Intuitively, the word “dog” only contains a single word’s worth of information, while an image of the word “dog” contains information about the font, the background and text color, kerning, margins, and so on. How, then, could it be possible that a single image token can contain ten tokens worth of text?
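You can reproduce that comparison in a few lines, rendering the same text to a PNG and comparing byte counts. The exact multiple depends heavily on resolution, font, and compression (a full-resolution screenshot will be much larger than this low-fi render):

```python
# Reproducing the rough size comparison above: the same words stored as UTF-8
# text vs. rendered into a PNG. The exact multiple depends on font, resolution,
# and PNG compression; the point is only the order of magnitude.
import io
import textwrap
from PIL import Image, ImageDraw, ImageFont

text = "The quick brown fox jumps over the lazy dog. " * 100
text_bytes = len(text.encode("utf-8"))

img = Image.new("RGB", (1200, 800), "white")
draw = ImageDraw.Draw(img)
font = ImageFont.load_default()
for i, line in enumerate(textwrap.wrap(text, width=90)):
    draw.text((10, 10 + 14 * i), line, fill="black", font=font)

buf = io.BytesIO()
img.save(buf, format="PNG")
image_bytes = len(buf.getvalue())

print(f"text file:  {text_bytes:,} bytes")
print(f"screenshot: {image_bytes:,} bytes (~{image_bytes / text_bytes:.0f}x larger)")
```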

The first explanation is that text tokens are discrete while image tokens are continuous. Each model has a finite number of text tokens - say, around 50,000. Each of those tokens corresponds to an embedding of, say, 1000 floating-point numbers. Text tokens thus only occupy a scattering of single points in the space of all possible embeddings. By contrast, the embedding of an image token can be any sequence of those 1000 numbers. So an image token can be far more expressive than a series of text tokens.
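Here's the same intuition with toy numbers: a text token can only ever select one of the 50,000 fixed rows of an embedding table, while an image patch gets projected to an arbitrary point in the 1000-dimensional space. The random matrices below stand in for learned weights:

```python
# Toy illustration of discrete vs. continuous tokens, using the sizes from the
# paragraph above (a 50,000-entry vocabulary, 1,000-dimensional embeddings).
import numpy as np

vocab_size, d_model = 50_000, 1_000
embedding_table = np.random.randn(vocab_size, d_model).astype(np.float32)

# A text token can only ever be one of the 50,000 fixed rows of this table.
token_id = 1234
text_embedding = embedding_table[token_id]

# An image patch is projected to an arbitrary real-valued vector, so its
# embedding can land anywhere in the 1,000-dimensional space.
patch = np.random.rand(16 * 16 * 3).astype(np.float32)              # flattened 16x16 RGB patch
patch_projection = np.random.randn(16 * 16 * 3, d_model).astype(np.float32)
image_embedding = patch @ patch_projection

print(text_embedding.shape, image_embedding.shape)                   # both (1000,)
```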

Another way of looking at the same intuition is that text tokens are a really inefficient way of expressing information. This is often obscured by the fact that text tokens are a reasonably efficient way of sharing information, so long as the sender and receiver both know the list of all possible tokens. When you send a LLM a stream of tokens and it outputs the next one, you’re not passing around slices of a thousand numbers for each token - you’re passing a single integer that represents the token ID. But inside the model this is expanded into a much more inefficient representation (inefficient because it encodes some amount of information about the meaning and use of the token). So it’s not that surprising that you could do better than text tokens.
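As a rough sense of that gap, compare what gets sent around (one integer ID per token) with what the model works with internally (one embedding vector per token); toy sizes again, and the IDs below are made up:

```python
# One integer id per token crosses the "wire"; inside the model each id is
# expanded to a d_model-sized vector of floats. Same toy sizes as above.
d_model = 1_000
token_ids = [464, 3290, 318, 257, 3734]          # made-up ids for a short sentence

wire_bytes = len(token_ids) * 4                   # ~4 bytes per int32 id
internal_bytes = len(token_ids) * d_model * 4     # 4 bytes per float32 per dimension

print(f"sent between sender and receiver: {wire_bytes} bytes")
print(f"expanded inside the model:        {internal_bytes:,} bytes")
```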

Zooming out a bit, it’s plausible to me that processing text as images is closer to how the human brain works. To state the obvious, humans don’t consume text as textual content; we consume it as image content (or sometimes as audio). Maybe treating text as a sub-category of image content could unlock ways of processing text that are unavailable when you’re just consuming text content. As a toy example, emoji like :) are easily-understandable as image content but require you to “already know the trick” as text content.

Final thoughts

Of course, AI research is full of ideas that sound promising but just don't work that well. It sounds like you should be able to do this trick on current multimodal LLMs - particularly since many people just use them for OCR purposes anyway - but it hasn't worked well enough to become common practice.

Could you train a new large language model on text represented as image content? It might be tricky. Training on text tokens is easy - you can simply take a string of text and ask the model to predict the next token. How do you train on an image of text?

You could break up the image into word chunks and ask the model to generate an image of the next word. But that seems to me like it’d be really slow, and tricky to check if the model was correct or not (e.g. how do you quickly break a file into per-word chunks, how do you match the next word in the image, etc). Alternatively, you could ask the model to output the next word as a token. But then you probably have to train the model on enough tokens so it knows how to manipulate text tokens. At some point you’re just training a normal LLM with no special “text as image” superpowers.
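For what it's worth, the second setup might look something like this sketch, where the prefix goes in as pixels and the loss is still computed over text tokens. The `model`, `tokenizer`, and `render_fn` arguments are hypothetical stand-ins, not any real library's API:

```python
# A sketch of training on an image of text: the prefix is fed to the model as
# pixels, but the target is still the next *text* token. `model`, `tokenizer`,
# and `render_fn` are hypothetical stand-ins supplied by the caller.
import torch
import torch.nn.functional as F

def training_step(model, tokenizer, render_fn, optimizer, text: str) -> float:
    token_ids = tokenizer.encode(text)
    prefix_ids, target_id = token_ids[:-1], token_ids[-1]

    # The model never sees token ids on the input side: the prefix is rendered
    # to an image first (e.g. something like the PIL helper sketched earlier).
    prefix_image = render_fn(tokenizer.decode(prefix_ids))
    logits = model(prefix_image)                              # shape: (vocab_size,)

    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_id]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```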
