Qwen3-VL can scan two-hour videos and pinpoint nearly every detail

原始链接: https://the-decoder.com/qwen3-vl-can-scan-two-hour-videos-and-pinpoint-nearly-every-detail/

## Alibaba's Qwen3-VL: a capable open-source multimodal model

Alibaba has published a technical report detailing its new open multimodal model, Qwen3-VL, which shows marked progress in handling images, video, and text. The model is released with open weights under the Apache 2.0 license and performs strongly on vision-related tasks, particularly **image-based math**, beating competitors such as Gemini 2.5 Pro and GPT-5 on benchmarks like MathVista and MathVision. Qwen3-VL handles long-form content exceptionally well: with a **256,000-token context window** it can process two-hour videos or documents hundreds of pages long, and it can still pinpoint specific frames even in very long videos. It also performs strongly on **document understanding (DocVQA)** and **OCR in 39 languages**. Key improvements include "interleaved MRoPE" for better long-video handling, "DeepStack" for access to detailed visual information, and a simplified timestamp system. Trained on a trillion tokens, Qwen3-VL is not without limitations: it trails other models in general reasoning and video question answering. Its open release and strong specialized capabilities, however, are likely to accelerate further work on multimodal AI.

Hacker News discussion (37 points, submitted by thm, 2 comments):

thot_experiment: Can someone give me a rundown of the best way to get started with video understanding? I've been using qwen-30b-vl locally as my go-to model because it's so fast, and I'd like to try the video features. The visual understanding works well; I've been using it for OCR and classification.

moralestapia: I think this already counts as some kind of ASI.

Original article

A few months after launching Qwen3-VL, Alibaba has released a detailed technical report on the open multimodal model. The data shows the system excels at image-based math tasks and can analyze hours of video footage.

The system handles massive data loads, processing two-hour videos or hundreds of document pages within a 256,000-token context window.

In "needle-in-a-haystack" tests, the flagship 235-billion-parameter model located individual frames in 30-minute videos with 100 percent accuracy. Even in two-hour videos containing roughly one million tokens, accuracy held at 99.5 percent. The test works by inserting a semantically important "needle" frame at random positions in long videos, which the system must then find and analyze.

Heatmap with video lengths on the y-axis and frame positions on the x-axis. Most cells show high accuracy percentages, with perfect results for shorter videos.
The needle-in-a-haystack test measures the model's ability to locate specific frames in long videos. | Image: Alibaba

In published benchmarks, the Qwen3-VL-235B-A22B model often beats Gemini 2.5 Pro, OpenAI GPT-5, and Claude Opus 4.1 - even when competitors use reasoning features or high thinking budgets. The model dominates visual math tasks, scoring 85.8 percent on MathVista compared to GPT-5's 81.3 percent. On MathVision, it leads with 74.6 percent, ahead of Gemini 2.5 Pro (73.3 percent) and GPT-5 (65.8 percent).


Table of benchmark results for Qwen3-VL-235B, Gemini 2.5 Pro, OpenAI GPT-5, and Claude Opus 4.1.
Gemini's older 2.5 Pro model maintains a slight lead in general image understanding. | Image: Alibaba

The model also shows range in specialized benchmarks. It scored 96.5 percent on the DocVQA document comprehension test and 875 points on OCRBench, supporting 39 languages - nearly four times as many as its predecessor.

Bar chart of Qwen3-VL's OCR accuracy across 39 languages, with most bars above the 70 percent mark.
Qwen3-VL achieves over 70 percent accuracy on OCR tasks in 32 of the 39 supported languages. | Image: Alibaba

Alibaba claims the system demonstrates new capabilities in GUI agent tasks. It achieved 61.8 percent accuracy on ScreenSpot Pro, which tests navigation in graphical user interfaces. On AndroidWorld, where the system must independently operate Android apps, Qwen3-VL-32B hit 63.7 percent.

The model handles complex, multi-page PDF documents as well. It scored 56.2 percent on MMLongBench-Doc for long document analysis. On the CharXiv benchmark for scientific charts, it reached 90.5 percent on description tasks and 66.2 percent on complex reasoning questions.

It is not a clean sweep, however. In the complex MMMU-Pro test, Qwen3-VL scored 69.3 percent, trailing GPT-5's 78.4 percent. Commercial competitors also generally lead in video QA benchmarks. The data suggests Qwen3-VL is a specialist in visual math and documents, but still lags in general reasoning.

Key technical advances for multimodal AI

The technical report outlines three main architectural upgrades. First, "interleaved MRoPE" replaces the previous position embedding method. Instead of assigning separate, contiguous blocks of embedding dimensions to the temporal, horizontal, and vertical axes, the new approach interleaves the three axes across the full range of rotary frequencies. This change aims to boost performance on long videos.
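The report's exact implementation is not reproduced here, but the distinction can be sketched as how rotary frequency slots are assigned to the three position axes. In the sketch below, the dimension count, frequency base, and round-robin interleaving pattern are illustrative assumptions, not Qwen3-VL's actual configuration.

```python
import numpy as np

def mrope_angles(t, h, w, dim=64, base=10000.0, interleaved=True):
    """Sketch of grouped vs. interleaved multi-axis rotary position angles.

    Each rotary frequency slot is tied to one of the three position axes
    (temporal, horizontal, vertical). The older scheme assigns contiguous
    blocks of slots to each axis; the interleaved scheme cycles the axes
    across the frequency spectrum so every axis sees both low and high
    frequencies.
    """
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)  # fast- to slow-rotating frequencies
    pos = np.array([t, h, w], dtype=np.float64)

    if interleaved:
        axis_of_slot = np.arange(half) % 3  # t, h, w, t, h, w, ...
    else:
        axis_of_slot = np.repeat(np.arange(3), half // 3 + 1)[:half]  # ttt...hhh...www

    return pos[axis_of_slot] * freqs  # one rotation angle per frequency slot

# The interleaved layout gives the temporal axis access to both fast- and
# slow-rotating frequencies, the property aimed at long videos.
print(mrope_angles(t=12, h=3, w=5, interleaved=False)[:6])
print(mrope_angles(t=12, h=3, w=5, interleaved=True)[:6])
```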

Schematic of the Qwen3-VL architecture with the vision encoder on the left and the large language model on the right, connected by data flows and DeepStack connections.
Qwen3-VL combines a vision encoder and language model to process text, images, and videos simultaneously. DeepStack uses visual information from different processing levels. | Image: Alibaba

Second, DeepStack technology allows the model to access intermediate results from the vision encoder, not just the final output. This gives the system access to visual information at different levels of detail.
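Conceptually, this amounts to tapping the vision encoder at several depths and merging those features into the language model's visual tokens. The PyTorch sketch below is a loose, hedged illustration: the tapped layer indices, the linear projections, and the single additive fusion point are assumptions, not the reported design.

```python
import torch
import torch.nn as nn

class DeepStackSketch(nn.Module):
    """Hedged sketch of the DeepStack idea: give the language model access to
    intermediate vision-encoder features instead of only the final output.

    Features from three assumed encoder depths are projected to the LLM
    hidden size and added onto the visual token embeddings at one fusion
    point; treat this as a structural illustration only.
    """

    def __init__(self, vit_dim=1024, llm_dim=2048, tap_layers=(8, 16, 24)):
        super().__init__()
        self.tap_layers = tap_layers
        self.projections = nn.ModuleList(
            nn.Linear(vit_dim, llm_dim) for _ in tap_layers
        )

    def forward(self, vit_hidden_states, visual_tokens):
        # vit_hidden_states: list of per-layer encoder outputs, each (batch, n_patches, vit_dim)
        # visual_tokens: visual token embeddings inside the LLM, (batch, n_patches, llm_dim)
        fused = visual_tokens
        for proj, layer_idx in zip(self.projections, self.tap_layers):
            fused = fused + proj(vit_hidden_states[layer_idx])
        return fused

# Toy usage with random tensors standing in for a real encoder.
vit_states = [torch.randn(1, 196, 1024) for _ in range(25)]
visual_tokens = torch.randn(1, 196, 2048)
print(DeepStackSketch()(vit_states, visual_tokens).shape)  # torch.Size([1, 196, 2048])
```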

Third, a text-based timestamp system replaces the complex T-RoPE method found in Qwen2.5-VL. Instead of assigning a mathematical time position to every video frame, the system now inserts simple text markers like "<3.8 seconds>" directly into the input. This simplifies the process and improves the model's grasp of time-based video tasks.
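Assembling a timestamped video input then reduces to interleaving plain-text markers with each frame's visual tokens. A small sketch follows, where the marker format mirrors the "<3.8 seconds>" example above and the two-second sampling interval is an assumption:

```python
def build_timestamped_input(frame_tokens, frame_interval_s=2.0):
    """Sketch of text-based video timestamps: instead of a dedicated
    positional mechanism per frame, a plain text marker precedes each
    frame's visual tokens.

    `frame_tokens` is a list where each entry stands for one frame's
    visual tokens; marker format and sampling interval are illustrative.
    """
    sequence = []
    for i, tokens in enumerate(frame_tokens):
        sequence.append(f"<{i * frame_interval_s:.1f} seconds>")
        sequence.append(tokens)
    return sequence

print(build_timestamped_input(["[frame_0]", "[frame_1]", "[frame_2]"]))
# ['<0.0 seconds>', '[frame_0]', '<2.0 seconds>', '[frame_1]', '<4.0 seconds>', '[frame_2]']
```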

Training at scale with one trillion tokens

Alibaba trained the model in four phases on up to 10,000 GPUs. After learning to link images and text, the system underwent full multimodal training on about one trillion tokens. Data sources included web scrapes, 3 million PDFs from Common Crawl, and over 60 million STEM tasks.

In later phases, the team gradually expanded the context window from 8,000 to 32,000 and finally to 262,000 tokens. The "Thinking" variants received specific chain-of-thought training, allowing them to explicitly map out reasoning steps for better results on complex problems.

Open weights under Apache 2.0

All Qwen3-VL models released since September are available under the Apache 2.0 license with open weights on Hugging Face. The lineup includes dense variants ranging from 2B to 32B parameters, as well as mixture-of-experts models: the 30B-A3B and the massive 235B-A22B.
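For anyone who wants to try the weights, loading them should follow the usual Hugging Face pattern. The sketch below assumes the Qwen3-VL checkpoints work with transformers' generic AutoProcessor and AutoModelForImageTextToText classes, and the repository id and image URL are placeholders; check the model card for the exact classes, repo names, and any extra dependencies.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# Assumed repository id; smaller dense and larger MoE checkpoints are also published.
model_id = "Qwen/Qwen3-VL-30B-A3B-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image
        {"type": "text", "text": "What does this chart show?"},
    ],
}]

# Recent transformers versions can return model-ready tensors directly from the chat template.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```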

While features like extracting frames from long videos aren't new - Google's Gemini 1.5 Pro handled this in early 2024 - Qwen3-VL offers competitive performance in an open package. With the previous Qwen2.5-VL already common in research, the new model is likely to drive further open-source development.
