(comments)

Original link: https://news.ycombinator.com/item?id=41203306

About a year ago, a user shared their experience improving output from the optical character recognition (OCR) software Tesseract. They used a tool called Llama2 to enhance Tesseract's output by correcting common OCR errors. At the time, using OpenAI's GPT4 for such corrections on a book-length document would have been prohibitively expensive due to high pricing. Today, however, both OpenAI (GPT4o-mini) and Anthropic (Claude3-Haiku) offer affordable, fast APIs, letting users process small chunks of a document concurrently, finishing quickly and at negligible cost. The user notes that these newer models need only minimal cleanup to eliminate most errors, and that a multi-stage approach yields striking results. The method has two stages: first, the AI is asked to fix OCR errors, handling issues such as line breaks in the middle of words; then the model is asked to perform tasks like reformatting the text as markdown and stripping unnecessary elements such as repeated page headers and page numbers. The user demonstrates the strategy's effectiveness with the samples provided here: the original PDF, the raw OCR output, and the LLM-corrected markdown output. They also report that attempts to improve or fix things with traditional techniques (such as regular expressions) generally made results worse; instead, they recommend fine-tuning the prompts so the model better understands the instructions, and avoiding overwhelming the model with too many tasks in a single pass. This technique may be useful for anyone looking to convert digitized old books obtained from platforms like Archive.org or Google Books, making them reader-friendly on devices such as the Amazon Kindle. The user expects further progress over the coming year, bringing the system close to perfect.


Original Text
Almost exactly 1 year ago, I submitted something to HN about using Llama2 (which had just come out) to improve the output of Tesseract OCR by correcting obvious OCR errors [0]. That was exciting at the time because OpenAI's API calls were still quite expensive for GPT4, and the cost of running it on a book-length PDF would just be prohibitive. In contrast, you could run Llama2 locally on a machine with just a CPU, and it would be extremely slow, but "free" if you had a spare machine lying around.

Well, it's amazing how things have changed since then. Not only have models gotten a lot better, but the latest "low tier" offerings from OpenAI (GPT4o-mini) and Anthropic (Claude3-Haiku) are incredibly cheap and incredibly fast. So cheap and fast, in fact, that you can now break the document up into little chunks and submit them to the API concurrently (where each chunk can go through a multi-stage process, in which the output of the first stage is passed into another prompt for the next stage) and assemble it all in a shockingly short amount of time, and for basically a rounding error in terms of cost.
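To make the chunk-and-gather idea concrete, here is a minimal sketch in Python using the official OpenAI SDK. The chunking heuristic, placeholder prompts, model choice, and helper names are illustrative assumptions, not the project's actual code:

```python
import asyncio
from openai import AsyncOpenAI  # official SDK: pip install openai

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder one-line prompts; fuller illustrative versions appear further below.
STAGE_PROMPTS = [
    "Fix obvious OCR errors in this text. Return only the corrected text.",
    "Reformat this text as markdown, removing page numbers and repeated headers.",
]

def split_into_chunks(text: str, max_chars: int = 4000) -> list[str]:
    """Split raw OCR text into roughly equal chunks on paragraph boundaries."""
    chunks, current, length = [], [], 0
    for para in text.split("\n\n"):
        if length + len(para) > max_chars and current:
            chunks.append("\n\n".join(current))
            current, length = [], 0
        current.append(para)
        length += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

async def process_chunk(chunk: str) -> str:
    """Run the stages in sequence: each stage's output is the next stage's input."""
    for prompt in STAGE_PROMPTS:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "system", "content": prompt},
                      {"role": "user", "content": chunk}],
        )
        chunk = resp.choices[0].message.content
    return chunk

async def process_document(raw_ocr_text: str) -> str:
    chunks = split_into_chunks(raw_ocr_text)
    # All chunks are processed concurrently; gather preserves input order,
    # so the pieces can simply be joined back together at the end.
    pieces = await asyncio.gather(*(process_chunk(c) for c in chunks))
    return "\n\n".join(pieces)

# corrected_markdown = asyncio.run(process_document(raw_ocr_text))
```

Since each chunk moves through its stages independently, the total wall-clock time is roughly that of the slowest chunk rather than the sum of all of them.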

My original project had all sorts of complex stuff for detecting hallucinations and incorrect, spurious additions to the text (like "Here is the corrected text" preambles). But the newer models are already good enough to eliminate most of that stuff. And you can get very impressive results with the multi-stage approach. In this case, the first pass asks it to correct OCR errors and to remove line breaks in the middle of a word and things like that. The next stage takes that as the input and asks the model to do things like reformat the text using markdown, to suppress page numbers and repeated page headers, etc. Anyway, I think the samples (which take less than 1-2 minutes to generate) show the power of the approach:

Original PDF: https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main...

Raw OCR Output: https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main...

LLM-Corrected Markdown Output: https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main...
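For illustration, the two stage prompts might look roughly like the sketch below. The wording is paraphrased from the description above; the project's actual prompts differ:

```python
# Illustrative stage prompts (assumed wording, not the project's real prompts).
# Each stage does exactly one focused job, per the finding that asking the
# model to do too much in a single pass degrades quality.

FIX_OCR_PROMPT = """You are given raw OCR output from a scanned book page.
Correct obvious OCR errors (misrecognized characters, split or merged words)
and rejoin words hyphenated across line breaks. Do not rephrase, summarize,
or add anything. Return only the corrected text, with no preamble such as
"Here is the corrected text"."""

MARKDOWN_PROMPT = """Reformat the following corrected text as clean markdown,
using proper paragraphs and headings where appropriate. Suppress page numbers
and repeated page headers and footers. Do not alter the wording. Return only
the markdown, with no preamble."""
```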

One interesting thing I found was that almost all my attempts to fix/improve things using "classical" methods like regex and other rule based things made everything worse and more brittle, and the real improvements came from adjusting the prompts to make things clearer for the model, and not asking the model to do too much in a single pass (like fixing OCR mistakes AND converting to markdown format).

Anyway, this project is very handy if you have some old scanned books you want to read from Archive.org or Google Books on a Kindle or other ereader device and want things to be re-flowable and clear. It's still not perfect, but I bet within the next year the models will improve even more, so that it will get closer to 100%. Hope you like it!

[0] https://news.ycombinator.com/item?id=36976333
