Show HN: Qwen-2.5-32B is now the best open source OCR model

Original link: https://github.com/getomni-ai/benchmark/blob/main/README.md

The Omni OCR Benchmark is an open-source tool for evaluating how accurately large multimodal models (LLMs) extract text and JSON in document understanding tasks. It compares traditional OCR providers against models such as GPT-4o and Gemini, focusing on the core Document => OCR => Extraction pipeline. JSON extraction accuracy is evaluated with a modified json-diff, computing accuracy from the differences between predicted and ground-truth fields. While JSON accuracy is the primary metric, Levenshtein distance is also reported as a text-similarity measure. The project is open source and encourages community contributions and extensions. To use the benchmark, clone the repository, install the dependencies, and prepare a document dataset (local files or a database connection). Models are configured in models.yaml by defining OCR and JSON extraction providers and setting API keys in .env. Running npm run benchmark writes results to a timestamped results.json file. A dashboard is provided for browsing results. The project supports a range of providers, including Anthropic, OpenAI, Google, Mistral, and others.

According to getomni.ai's latest benchmark over 1,000 documents, Qwen-2.5 VL (in its 72B and 32B versions) is now the top open-source OCR model, extracting JSON data from documents with roughly 75% accuracy, on par with GPT-4o. The benchmark also shows Qwen-2.5 VL outscoring Mistral-OCR, a model built specifically for OCR tasks. Surprisingly, Google's Gemma-3 (27B) reached only 42.9% accuracy despite being based on the Gemini 2.0 architecture. The benchmark is fully open source, including both the dataset and the code, making it easy to reproduce. Links to the blog post, the benchmark repository, and the dataset are provided. One commenter asked about the performance of MiniCPM-V 2.6.
Original Article

Omni OCR Benchmark

A benchmarking tool that compares the OCR and data extraction capabilities of different large multimodal models, such as gpt-4o, evaluating both text and JSON extraction accuracy. The goal of this benchmark is to publish a comprehensive benchmark of OCR accuracy across traditional OCR providers and multimodal language models. The evaluation dataset and methodologies are all open source, and we encourage expanding this benchmark to encompass any additional providers.

Open Source LLM Benchmark Results (Mar 2025) | Dataset

Benchmark Results (Feb 2025) | Dataset


The primary goal is to evaluate JSON extraction from documents. To evaluate this, the Omni benchmark runs Document ⇒ OCR ⇒ Extraction, measuring how well a model can OCR a page and return that content in a format that an LLM can parse.
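
As a rough illustration of that two-stage flow, the per-document loop looks something like the sketch below. The function and type names are hypothetical, not the repo's actual API.

```typescript
// Hypothetical sketch of the Document => OCR => Extraction flow.
// Names here are illustrative; the real implementation lives in the repo.

interface DocumentResult {
  markdown: string;   // what the OCR model read off the page
  extracted: unknown; // structured JSON pulled from that markdown
}

// Stage 1: an OCR model converts the document image to markdown text.
// Stage 2: an extraction model turns that text into JSON matching the
// ground-truth schema, so it can be diffed against the expected output.
async function runDocument(
  imageUrl: string,
  ocrModel: (image: string) => Promise<string>,
  extractionModel: (text: string, schema: object) => Promise<unknown>,
  schema: object
): Promise<DocumentResult> {
  const markdown = await ocrModel(imageUrl);
  const extracted = await extractionModel(markdown, schema);
  return { markdown, extracted };
}
```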

Methodology

We use a modified json-diff to identify differences between predicted and ground truth JSON objects. You can review the evaluation/json.ts file to see the exact implementation. Accuracy is calculated as:

$$\text{Accuracy} = 1 - \frac{\text{number of differing fields}}{\text{total fields}}$$

[Figure: json-diff example comparing predicted and ground-truth JSON]
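
As a simplified stand-in for the modified json-diff in evaluation/json.ts (the real implementation differs), the scoring idea can be sketched as counting the primitive leaf fields that disagree between ground truth and prediction:

```typescript
// Simplified scoring sketch:
//   accuracy = 1 - (number of differing fields / total fields)
// This is not the repo's evaluation/json.ts; it just counts primitive leaf
// fields that differ between the ground-truth and predicted JSON.

type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Count primitive leaf fields in a JSON value.
function countFields(value: Json): number {
  if (value !== null && typeof value === "object") {
    return Object.values(value).reduce((n, v) => n + countFields(v), 0);
  }
  return 1;
}

// Count leaf fields that differ; a field missing on either side counts as a diff.
function countDiffs(truth: Json, predicted: Json): number {
  if (
    truth === null || predicted === null ||
    typeof truth !== "object" || typeof predicted !== "object" ||
    Array.isArray(truth) !== Array.isArray(predicted)
  ) {
    // Primitives, or mismatched shapes: compare directly.
    return truth === predicted ? 0 : 1;
  }
  const t = truth as Record<string, Json>;
  const p = predicted as Record<string, Json>;
  const keys = new Set([...Object.keys(t), ...Object.keys(p)]);
  let diffs = 0;
  for (const key of keys) {
    if (t[key] === undefined || p[key] === undefined) {
      diffs += countFields(t[key] ?? p[key]); // present on only one side
    } else {
      diffs += countDiffs(t[key], p[key]);
    }
  }
  return diffs;
}

function jsonAccuracy(truth: Json, predicted: Json): number {
  const total = countFields(truth);
  return total === 0 ? 1 : 1 - countDiffs(truth, predicted) / total;
}
```

For example, if 2 of 20 ground-truth leaf fields differ, the score is 1 - 2/20 = 0.90.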

While the primary benchmark metric is JSON accuracy, we have included Levenshtein distance as a measure of text similarity between extracted and ground-truth text. Lower distance indicates higher similarity. Note that this scoring method heavily penalizes accurate text that does not conform to the exact layout of the ground-truth data.

In the example below, an LLM could decode both blocks of text without any issue. All the information is 100% accurate, but slight rearrangements of the header text (address, phone number, etc.) result in a large difference on edit distance scoring.

[Figure: text-similarity example showing edit-distance penalties for reordered text]
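
For reference, a generic dynamic-programming Levenshtein implementation (a textbook sketch, not necessarily the exact variant used here) shows why reordering is so costly: every displaced block of text must be deleted from one position and re-inserted at another.

```typescript
// Classic Levenshtein edit distance via dynamic programming.
// Lower distance = higher similarity to the ground-truth text.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,        // deletion
        dp[i][j - 1] + 1,        // insertion
        dp[i - 1][j - 1] + cost  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Identical information, different ordering: the distance is still large.
console.log(levenshtein("123 Main St\n(555) 010-0000", "(555) 010-0000\n123 Main St"));
```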

Running the benchmark

  1. Clone the repo and install dependencies: npm install
  2. Prepare your test data
    1. For local data, add individual files to the data folder.
    2. To pull from a DB, add DATABASE_URL in your .env
  3. Copy the models.example.yaml file to models.yaml. Set up API keys in .env for the models you want to test (see the example .env sketch after this list). Check out the supported models in the tables below.
  4. Run the benchmark: npm run benchmark
  5. Results will be saved in the results/<timestamp>/results.json file.
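
For illustration, a minimal .env might look like the following. The key names come from the provider tables below; every value is a placeholder, and you only need the keys for the providers you actually enable.

```
# Hypothetical .env sketch; all values are placeholders.
OPENAI_API_KEY=your-openai-key
GOOGLE_GENERATIVE_AI_API_KEY=your-google-key
ANTHROPIC_API_KEY=your-anthropic-key

# Optional: pull test documents from a database instead of the local data folder.
DATABASE_URL=your-database-connection-string
```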

To enable specific models, create a models.yaml file in the src directory. Check out the models.example.yaml file for the required variables.

```yaml
models:
  - ocr: gemini-2.0-flash-001 # The model to use for OCR
    extraction: gpt-4o # The model to use for JSON extraction

  - ocr: gpt-4o
    extraction: gpt-4o
    directImageExtraction: true # Whether to use the model's native image extraction capabilities
```

You can view configuration for each model in the src/models/ folder.

| Model Provider | Models | Required ENV Variables |
| --- | --- | --- |
| Anthropic | claude-3-5-sonnet-20241022 | ANTHROPIC_API_KEY |
| OpenAI | gpt-4o | OPENAI_API_KEY |
| Gemini | gemini-2.0-flash-001, gemini-1.5-pro, gemini-1.5-flash | GOOGLE_GENERATIVE_AI_API_KEY |
| Mistral | mistral-ocr | MISTRAL_API_KEY |
| OmniAI | omniai | OMNIAI_API_KEY, OMNIAI_API_URL |

| Model Provider | Models | Required ENV Variables |
| --- | --- | --- |
| Gemma 3 | google/gemma-3-27b-it | |
| Qwen 2.5 | qwen2.5-vl-32b-instruct, qwen2.5-vl-72b-instruct | |
| Llama 3.2 | meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo, meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo | |
| ZeroX | zerox | OPENAI_API_KEY |

| Model Provider | Models | Required ENV Variables |
| --- | --- | --- |
| AWS | aws-text-extract | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION |
| Azure | azure-document-intelligence | AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT, AZURE_DOCUMENT_INTELLIGENCE_KEY |
| Google | google-document-ai | GOOGLE_LOCATION, GOOGLE_PROJECT_ID, GOOGLE_PROCESSOR_ID, GOOGLE_APPLICATION_CREDENTIALS_PATH |
| Unstructured | unstructured | UNSTRUCTURED_API_KEY |
  • LLMs are instructed with dedicated system prompts for OCR and JSON extraction; see the prompt definitions in the repo.
  • For Google Document AI, you need to include google_credentials.json in the data folder.

Dashboard

You can use the benchmark dashboard to easily view the results of each test run. Check out the dashboard documentation for more details.

This project is licensed under the MIT License - see the LICENSE file for details.
