Rolling your own serverless OCR in 40 lines of code

原始链接: https://christopherkrapu.com/blog/2026/ocr-textbooks-modal-deepseek/

To make a searchable version of Gelman's Bayesian Data Analysis, the author used DeepSeek's open OCR model, working around the limits of aging hardware and pricey commercial options. The OCR pipeline runs on a cloud GPU (an A100) via Modal, a serverless compute platform, with no server management and billing only for compute time. The process deploys a FastAPI server on Modal that accepts image uploads and returns markdown text, built on a container image with the necessary dependencies (PyTorch, transformers, and so on). Batched inference processes several pages at once for efficiency. The DeepSeek model's output, which includes grounding tags, is cleaned to produce searchable markdown files, one per page. Processing the roughly 600-page book took about 45 minutes and cost around $2. The resulting searchable text supports easy querying, pasting into language models, and building a search index, turning a previously image-only PDF into a usable resource. The OCR quality, especially on mathematical notation, was surprisingly good.

## Serverless OCR in 40 lines of code - summary

A recent Hacker News thread discussed this post on building a serverless optical character recognition (OCR) system with a pretrained large language model (LLM). The author set out to digitize a statistics textbook, deliberately one heavy on mathematical notation, and found DeepSeek-OCR well suited to the task. The discussion highlighted the nuance of "serverless": it does not mean there are *no* servers, but that infrastructure management is abstracted away and you pay only for the compute time you use. Tesseract remains a viable free option for simple documents, but LLM-based OCR handles complex layouts, handwriting, and mathematical formulas far better. Other open-source alternatives, such as dots and olmOCR, were suggested as possibly stronger options. Cost is a key consideration: the author spent about $2 to process 600 pages on an A100 GPU, and other solutions may be cheaper still. The core takeaway is to lean on cloud resources for OCR tasks, even if that means relying on someone else's servers.

Original article

A few months ago, I wanted to make my copy of Gelman’s Bayesian Data Analysis searchable for use in a statistics-focused agent.

There are some pretty sophisticated OCR tools out there but they tend to have usage limits or get expensive when you’re processing thousands of pages. DeepSeek recently released an open OCR model that handles mathematical notation well, and I figured I could run it myself if I had access to a GPU. Sadly, my daily driver is a decade-old Titan Xp which no longer supports the latest PyTorch versions and thus can’t run DeepSeek OCR.

I ended up using Modal for this.

What is Modal?

Modal is a serverless compute platform that lets you run Python code on cloud infrastructure without managing servers. The killer feature for machine learning work is that you can define a container image, attach a GPU, and pay only for the seconds your code is actually running.

import modal

image = modal.Image.from_registry(
    "nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04",
    add_python="3.11",
).pip_install("torch", "transformers", ...)

app = modal.App("my-gpu-app")

@app.function(image=image, gpu="A100")
def process_something():
    # This runs on an A100 with all your deps installed
    pass

The decorator pattern is what makes Modal pleasant to use. You write normal Python, sprinkle decorators on the functions that need special hardware, and Modal handles the rest: building the container, provisioning the GPU, routing your requests. For OCR, this is perfect.
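To see how a deployed function actually gets invoked, here's a minimal sketch reusing the placeholder process_something from above; in recent Modal versions, .remote() dispatches the call to Modal's cloud:

@app.local_entrypoint()
def main():
    # Runs locally via `modal run your_script.py`,
    # while process_something executes remotely on the A100
    process_something.remote()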

The OCR script

The core idea is simple: deploy a FastAPI server on Modal that accepts images and returns markdown text. Let’s walk through the important pieces.

Defining the container image

First, we build a container with all the dependencies. DeepSeek’s OCR model needs PyTorch, transformers, and a few image processing libraries:

from pathlib import Path
import modal

APP_NAME = "deepseek-ocr-books-api-batch"
ROOT = Path(__file__).resolve().parents[1]
BOOKS_DIR = ROOT / "references" / "books"
PARSED_DIR = BOOKS_DIR / "parsed"

image = (
    modal.Image.from_registry(
        "nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04",
        add_python="3.11",
    )
    .apt_install("git", "libgl1", "libglib2.0-0")
    .pip_install(
        "torch==2.6.0",
        "torchvision==0.21.0",
        "transformers==4.46.3",
        "PyMuPDF",
        "Pillow",
        "numpy",
        extra_index_url="https://download.pytorch.org/whl/cu118",
    )
)

app = modal.App(APP_NAME)

The paths at the top let us find PDFs relative to the script location, keeping configuration close to where it’s used.
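For concreteness, this is the repository layout those constants imply (the scripts/ folder name is my guess; parents[1] just means the script sits one directory below the repo root):

repo/
├── scripts/
│   └── deepseek_ocr_modal.py    # this file; ROOT resolves to repo/
└── references/
    └── books/
        ├── *.pdf                # scanned books to OCR
        └── parsed/              # per-page OCR output lands here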

The FastAPI endpoint

Here’s where the magic happens. We wrap a FastAPI server in Modal’s @modal.asgi_app() decorator, which means Modal will handle spinning up GPU instances and routing HTTP requests to them:

@app.function(image=image, gpu="A100", timeout=60 * 60 * 2)  # timeout of 2 hours
@modal.asgi_app()
def fastapi_app():
    from fastapi import FastAPI, File, UploadFile
    from PIL import Image
    import io  # used by the /ocr_batch route below to wrap uploaded bytes
    import torch
    from transformers import AutoModel, AutoTokenizer

    api = FastAPI()
    
    model_name = "deepseek-ai/DeepSeek-OCR"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
    model = model.cuda().to(torch.bfloat16).eval()

The model loads once when the container starts. Subsequent requests reuse the same loaded model, which is crucial for throughput when processing hundreds of pages.

The `trust_remote_code=True` flag is necessary for DeepSeek's model because it includes custom code in the HuggingFace repository.
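The post doesn't show the rest of fastapi_app, but since @modal.asgi_app() expects the decorated function to return an ASGI application, the remaining shape is roughly this sketch: the route handlers (like the batch endpoint below) get registered on api, and the function ends by returning it. The /health route here is an illustrative extra, not from the original script:

    @api.get("/health")
    def health():
        # Cheap liveness check to confirm the container started and the model loaded
        return {"status": "ok", "model": model_name}

    # ... /ocr_batch and any other routes are defined here ...

    return api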

Handling batched inference

OCR is embarrassingly parallel since each page is independent. We can send multiple pages through the model in a single batched generate call, which is faster than processing them one at a time:

@api.post("/ocr_batch")
async def ocr_batch(files: list[UploadFile] = File(...)) -> dict[str, list[str]]:
    images = []
    for file in files:
        image_bytes = await file.read()
        images.append(Image.open(io.BytesIO(image_bytes)).convert("RGB"))
    batch_items = [prepare_inputs(image) for image in images]
    texts = run_batch(batch_items)
    return {"texts": texts}

The run_batch function handles the actual model inference. It pads inputs to the same length, runs them through the model in one shot, and decodes the outputs:

def run_batch(batch_items):
    # Pad sequences to same length
    lengths = [item[0].size(0) for item in batch_items]
    max_len = max(lengths)
    
    # pad_id is the tokenizer's padding token id (set up earlier, not shown); the
    # elided logic also copies each sequence into input_ids and stacks the image tensors
    input_ids = torch.full((len(batch_items), max_len), pad_id, dtype=torch.long)
    # ... (padding logic)
    
    with torch.autocast("cuda", dtype=torch.bfloat16):
        with torch.no_grad():
            output_ids = model.generate(
                input_ids.cuda(),
                images=images,
                max_new_tokens=8192,
                temperature=0.0,
            )
    
    # Decode outputs
    outputs = []
    for i, out_ids in enumerate(output_ids):
        token_ids = out_ids[lengths[i]:].tolist()
        text = tokenizer.decode(token_ids, skip_special_tokens=False)
        outputs.append(text.strip())
    return outputs

Setting temperature=0.0 makes the output deterministic, so repeated runs over the same pages produce the same text.

The local client

With the server deployed on Modal, we need a client to feed it pages. The @app.local_entrypoint() decorator marks a function that runs on your local machine but can communicate with the Modal-deployed server:

@app.local_entrypoint()
def main(api_url: str, book: str = "", max_pages: int = None, batch_size: int = 1):
    import fitz  # PyMuPDF
    import requests

    if book:
        pdf_paths = [BOOKS_DIR / book]
    else:
        pdf_paths = sorted(BOOKS_DIR.glob("*.pdf"))

    for pdf_path in pdf_paths:
        with fitz.open(pdf_path) as doc:
            batch_pages = []
            for page_index in range(doc.page_count):
                page = doc[page_index]
                pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))  # 2x zoom
                batch_pages.append(pix.tobytes("png"))

                if len(batch_pages) >= batch_size:
                    # Send batch to server as a multipart upload
                    files = [
                        ("files", (f"page_{i}.png", data, "image/png"))
                        for i, data in enumerate(batch_pages)
                    ]
                    response = requests.post(f"{api_url}/ocr_batch", files=files)
                    texts = response.json()["texts"]
                    # Save results...
                    batch_pages = []

The render-at-2x trick (fitz.Matrix(2, 2)) is important. Higher resolution input means the OCR model can read smaller text and mathematical subscripts more accurately.
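PyMuPDF renders at 72 DPI by default, so fitz.Matrix(2, 2) gives roughly 144 DPI. If a scan has especially small print you can push the factor higher, trading larger uploads and slower inference for crisper glyphs:

# ~144 DPI, the setting used above
pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))

# ~216 DPI: bigger PNGs and slower round-trips, but finer subscripts survive
pix_hi = page.get_pixmap(matrix=fitz.Matrix(3, 3))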

Cleaning up the output

DeepSeek’s OCR model includes grounding tags, which are coordinates indicating where each piece of text appeared on the page. These can be useful for some applications, but I didn’t need them for searchable text:

import re

tag_pattern = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>.*?<\|/det\|>",
    flags=re.DOTALL,
)

# batch_page_indices tracks which PDF pages were in this batch (built in the client loop)
for page_idx, text in zip(batch_page_indices, texts):
    text = tag_pattern.sub(r"\1", text)  # Keep text, drop coordinates
    page_path = pages_dir / f"page_{page_idx + 1:04d}.mmd"
    page_path.write_text(text, encoding="utf-8")
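To make the regex concrete, here's a toy before/after. The <|ref|>/<|det|> tag names match the pattern above, but the coordinate payload is made up for illustration rather than verbatim model output:

raw = "<|ref|>The exponential distribution<|/ref|><|det|>[[120, 88, 940, 132]]<|/det|> is commonly used"
print(tag_pattern.sub(r"\1", raw))
# -> The exponential distribution is commonly used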

The .mmd extension stands for “multimodal markdown,” a convention for markdown that came from an OCR source and might have some artifacts.

Running it

To use this script, you first deploy the server:

modal deploy deepseek_ocr_modal.py

This gives you a URL like https://your-workspace--deepseek-ocr-books-api-batch-fastapi-app.modal.run. Then run the client:

modal run deepseek_ocr_modal.py --api-url "https://..." --book "Gelman - Bayesian Data Analysis.pdf"

For BDA's ~600 pages, with a batch size of 4, processing takes about 45 minutes on an A100. The output is a directory of markdown files, one per page, plus a concatenated document.mmd with page markers. The whole thing cost maybe two bucks, which I think is a great deal. I also now have a setup I can reuse for any PDF, including course notes, papers, and other textbooks.
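The post doesn't show the stitching step, but joining the per-page files into document.mmd is only a few lines; a sketch, assuming a per-book output folder under PARSED_DIR (the "bda"/"pages" names are placeholders) and the same page-split marker that appears in the combined output below:

pages_dir = PARSED_DIR / "bda" / "pages"  # assumed per-book layout
page_files = sorted(pages_dir.glob("page_*.mmd"))
combined = "\n<--- Page Split --->\n".join(p.read_text(encoding="utf-8") for p in page_files)
(pages_dir.parent / "document.mmd").write_text(combined, encoding="utf-8")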

The OCR quality on mathematical content is surprisingly good. Nearly all equations come through intact.

The real payoff comes from downstream use. I can now grep through BDA, paste sections into Claude and ask it to explain the notation, or build a proper search index. All of this from a PDF that was previously just a collection of images.

Here’s what the text looks like once it’s parsed and combined into a single file:


## Exponential model

The exponential distribution is commonly used to model 'waiting times' and other continuous, positive, real- valued random variables, often measured on a time scale. The sampling distribution of an outcome \(y\) , given parameter \(\theta\) , is

\[p(y|\theta) = \theta \exp (-y\theta),\mathrm{for} y > 0,\]

and \(\theta = 1 / \mathrm{E}(y|\theta)\) is called the 'rate.' Mathematically, the exponential is a special case of the gamma distribution with the parameters \((\alpha ,\beta) = (1,\theta)\) . In this case, however, it is being used as a sampling distribution for an outcome \(y\) , not a prior distribution for a parameter \(\theta\) , as in the Poisson example.

The exponential distribution has a 'memoryless' property that makes it a natural model for survival or lifetime data; the probability that an object survives an additional length of time \(t\) is independent of the time elapsed to this point: \(\operatorname *{Pr}(y > t + s\mid y > s,\theta) = \operatorname *{Pr}(y > t\mid \theta)\) for any \(s,t\) . The conjugate prior distribution for the exponential parameter \(\theta\) , as for the Poisson mean, is \(\operatorname {Gamma}(\theta |\alpha ,\beta)\) with corresponding posterior distribution \(\operatorname {Gamma}(\theta |\alpha +1,\beta +y)\) . The sampling distribution of \(n\) independent exponential observations, \(y = (y_{1},\ldots ,y_{n})\) , with constant rate \(\theta\) is

\[p(y|\theta) = \theta^{n}\exp (-n\bar{y}\theta),\mathrm{for}\bar{y}\geq 0,\]

which when viewed as the likelihood of \(\theta\) , for fixed \(y\) , is proportional to a \(\operatorname {Gamma}(n + 1,n\bar{y})\) density. Thus the \(\operatorname {Gamma}(\alpha ,\beta)\) prior distribution for \(\theta\) can be viewed as \(\alpha - 1\) exponential observations with total waiting time \(\beta\) (see Exercise 2.19).
<--- Page Split --->
image
image_caption
<center>Figure 2.6 The counties of the United States with the highest \(10\%\) age-standardized death rates for cancer of kidney/ureter for U.S. white males, 1980-1989. Why are most of the shaded counties in the middle of the country? See Section 2.7 for discussion. </center>

### 2.7 Example: informative prior distribution for cancer rates

At the end of Section 2.4, we considered the effect of the prior distribution on inference given a fixed quantity of data. Here, in contrast, we consider a large set of inferences, each based on different data but with a common prior distribution. In addition to illustrating the role of the prior distribution, this example introduces hierarchical modeling, to which we return in Chapter 5.

Compare that to the original passage from the textbook:

Original passage from Bayesian Data Analysis

I think it did a pretty good job!

If you have a collection of scanned textbooks, this approach might be worth the 5 minutes it takes to set it up.
