无限 OCR：单样本长程解析

无限 OCR：单样本长程解析
Unlimited OCR: One-shot long-horizon parsing

原始链接: https://github.com/baidu/Unlimited-OCR

**Unlimited-OCR** 是一款全新的高级文档解析模型，专为单次、长程任务而设计，其基础构建于 Deepseek-OCR 之上。该模型于 2026 年 6 月正式发布，可通过 ModelScope 和 Hugging Face 获取，擅长解析单张图像以及复杂的多页文档或 PDF。核心技术亮点包括： * **灵活推理：** 支持两种配置模式（“gundam”和“base”），可根据文档复杂度和分辨率优化性能。 * **可扩展性：** 具备 32k token 的上下文长度，并采用专门的 ngram 重复预防机制，以确保输出质量。 * **部署选项：** 用户既可以通过 Hugging Face 的 `transformers` 直接进行标准任务推理，也可以利用基于 **SGLang 的服务器**进行高性能、并发及兼容 OpenAI 的 API 调用。 * **易用性：** 资源库包含用于 PDF 转图像及批量处理的辅助脚本，使其非常适合大规模文档数字化项目。该项目作为对社区的开源贡献而开发，并致敬了 Deepseek-OCR 和 PaddleOCR 的既有成果。技术论文及完整实现细节可在 [arXiv](https://arxiv.org/abs/2606.23050) 上查阅。

百度最新的“无限OCR”（Unlimited OCR）研究提出了一种进行高质量、长篇幅文档解析的方法，解决了大语言模型（LLM）常见的内存瓶颈问题。 **问题所在：** 通常情况下，当大语言模型处理长文档（如100页的PDF）时，其“KV缓存”（短期记忆）会呈线性增长，直至耗尽显存。这迫使开发者不得不使用低效的“切割与拼接”方法，而这往往会导致文档上下文的丢失。 **解决方案：** “无限OCR”采用了“参考滑动窗口注意力机制”（R-SWA），将人工智能的关注点分为两条不同的路径： * **全局参考：** 模型保持对文档图像的完整预览，确保不会丢失结构性上下文。 * **局部生成：** 模型在文本生成时的内存被限制在一个紧凑的移动窗口内（例如最近的128个词），使其能够处理无限长的文档而不会崩溃。 **社区反应：** Hacker News上的讨论强调，OCR远未达到“被解决”的状态，特别是在处理手写文档、科学论文或非英语脚本等复杂任务时。虽然有些人质疑使用大语言模型是否大材小用，但另一些人认为，对于处理传统OCR工具难以胜任的版式布局、多栏文本和全文档推理而言，这种具备上下文感知能力的解析方式具有革命性的意义。

原文

Welcome the Era of One-shot Long-horizon Parsing.

[2026/06/23] 📄 Our paper is now available on arXiv.
[2026/06/23] 🤝 Thanks to the ModelScope community for their support. Our model is now available at ModelScope.
[2026/06/22] 🚀 We present Unlimited-OCR, aiming to push Deepseek-OCR one step further.

Inference using Huggingface transformers on NVIDIA GPUs. Requirements tested on python 3.12.3 + CUDA12.9：

torch==2.10.0
torchvision==0.25.0
transformers==4.57.1
Pillow==12.1.1
matplotlib==3.10.8
einops==0.8.2
addict==2.4.0
easydict==1.13
pymupdf==1.27.2.2
psutil==7.2.2

import os
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'baidu/Unlimited-OCR'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
)
model = model.eval().cuda()

# ── Single image supports two configs: gundam or base ──
# gundam: base_size=1024, image_size=640, crop_mode=True
# base: base_size=1024, image_size=1024, crop_mode=False
model.infer(
    tokenizer,
    prompt='<image>document parsing.',
    image_file='your_image.jpg',
    output_path='your/output/dir',
    base_size=1024, image_size=640, crop_mode=True,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=128,
    save_results=True,
)

# ── Multi page / PDF only uses base (image_size=1024) ──
model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=['page1.png', 'page2.png', 'page3.png'],
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=1024,
    save_results=True,
)

# ── PDF (convert pages to images, then multi-page parsing) ──
import tempfile, fitz  # PyMuPDF

def pdf_to_images(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix='pdf_ocr_')
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    paths = []
    for i, page in enumerate(doc):
        out = os.path.join(tmp_dir, f'page_{i+1:04d}.png')
        page.get_pixmap(matrix=mat).save(out)
        paths.append(out)
    doc.close()
    return paths

model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=pdf_to_images('your_doc.pdf', dpi=300),
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=1024,
    save_results=True,
)

Set up the environment (uv-managed virtualenv). Install the local SGLang wheel first, then pin kernels==0.9.0 and install PyMuPDF for PDF-to-image conversion:

uv venv --python 3.12
source .venv/bin/activate

uv pip install wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
uv pip install kernels==0.11.7
uv pip install pymupdf==1.27.2.2

Start the SGLang server:

python -m sglang.launch_server \
    --model baidu/Unlimited-OCR \
    --served-model-name Unlimited-OCR \
    --attention-backend fa3 \
    --page-size 1 \
    --mem-fraction-static 0.8 \
    --context-length 32768 \
    --enable-custom-logit-processor \
    --disable-overlap-schedule \
    --skip-server-warmup \
    --host 0.0.0.0 \
    --port 10000

Send streaming requests to the OpenAI-compatible API:

import base64
import json
import os
import tempfile

import fitz
import requests
from sglang.srt.sampling.custom_logit_processor import DeepseekOCRNoRepeatNGramLogitProcessor

server_url = "http://127.0.0.1:10000"

session = requests.Session()
session.trust_env = False


def pdf_to_images(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix="pdf_ocr_")
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    image_paths = []
    for i, page in enumerate(doc):
        image_path = os.path.join(tmp_dir, f"page_{i + 1:04d}.png")
        page.get_pixmap(matrix=mat).save(image_path)
        image_paths.append(image_path)
    doc.close()
    return image_paths


def encode_image(image_path):
    ext = os.path.splitext(image_path)[1].lower()
    mime = "image/jpeg" if ext in (".jpg", ".jpeg") else f"image/{ext.lstrip('.')}"
    with open(image_path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{data}"}}


def build_content(prompt, image_paths):
    return [{"type": "text", "text": prompt}] + [encode_image(path) for path in image_paths]


def generate(prompt, image_paths, image_mode, ngram_window):
    payload = {
        "model": "Unlimited-OCR",
        "messages": [{"role": "user", "content": build_content(prompt, image_paths)}],
        "temperature": 0,
        "skip_special_tokens": False,
        "images_config": {"image_mode": image_mode},
        "custom_logit_processor": DeepseekOCRNoRepeatNGramLogitProcessor.to_str(),
        "custom_params": {
            "ngram_size": 35,
            "window_size": ngram_window,
        },
        "stream": True,
    }
    response = session.post(
        f"{server_url}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=1200,
        stream=True,
    )
    response.raise_for_status()

    chunks = []
    for line in response.iter_lines(chunk_size=1, decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        event = json.loads(data)
        delta = event["choices"][0].get("delta", {}).get("content", "")
        if delta:
            print(delta, end="", flush=True)
            chunks.append(delta)
    print()
    return "".join(chunks)


# Single image supports two configs: gundam or base. Example below uses gundam.
generate("document parsing.", ["your_image.jpg"], image_mode="gundam", ngram_window=128)

# Multi image (base only)
generate("Multi page parsing.", ["page1.png", "page2.png"], image_mode="base", ngram_window=1024)

# PDF (base only)
generate("Multi page parsing.", pdf_to_images("your_doc.pdf", dpi=300), image_mode="base", ngram_window=1024)

For batch inference, infer.py starts the SGLang server automatically and sends concurrent requests for an image directory or PDF:

# Image directory
python infer.py \
    --image_dir ./examples/images \
    --output_dir ./outputs \
    --concurrency 8 \
    --image_mode gundam

# PDF pages
python infer.py \
    --pdf ./examples/document.pdf \
    --output_dir ./outputs \
    --concurrency 8 \
    --image_mode gundam

Useful options:

--model_dir baidu/Unlimited-OCR   # Local path or Hugging Face model ID
--gpu 0                           # CUDA_VISIBLE_DEVICES value
--server_log ./log/sglang_server.log

We would like to thank Deepseek-OCR, Deepseek-OCR-2, PaddleOCR for their valuable models and ideas.

@misc{yin2026unlimitedocrworks,
      title={Unlimited OCR Works}, 
      author={Youyang Yin and Huanhuan Liu and YY and Qunyi Xie and Chaorun Liu and Shiqi Yang and Shaohua Wang and Zhanlong Liu and Hao Zou and Jinyue Chen and Shu Wei and Jingjing Wu and Mingxin Huang and Zhen Wu and Guibin Wang and Tengyu Du and Lei Jia},
      year={2026},
      eprint={2606.23050},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.23050}, 
}

无限 OCR：单样本长程解析 Unlimited OCR: One-shot long-horizon parsing

Welcome the Era of One-shot Long-horizon Parsing.

无限 OCR：单样本长程解析
Unlimited OCR: One-shot long-horizon parsing