Step 3.7 Flash

Original link: https://static.stepfun.com/blog/step-3.7-flash/

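Step 3.7 Flash is an agentic foundation model that reaches strong visual performance through test-time scaling rather than parameter count alone. By calling dedicated tools, the model compensates for its small size and matches the performance of models five times larger.

Key capabilities:

* **Visual search:** Integrating external search strengthens recognition, bringing performance level with much larger models.
* **Python integration:** A unified code interface (zooming, cropping, pixel-level operations) handles complex, high-resolution reasoning tasks.
* **GUI operation:** Robust long-horizon control of smartphone apps, outperforming larger models on the Android Daily benchmark.

A major breakthrough is the model's **emergent compositional generalization**. Step 3.7 Flash autonomously combines visual and non-visual tools (for example, writing code and then using the GUI to verify its output) without being explicitly trained to do so. This ability to iterate and self-correct across domains marks a step forward for agentic reasoning and lets the model carry out complex real-world tasks beyond standard text interaction.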

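The Hacker News discussion focuses on **Step-3.7 Flash**, the new AI model released by Stepfun. Users report strong performance and note that the Q4_K_S GGUF build runs efficiently on Apple Silicon, reaching very high tokens-per-second throughput.

Main points from the thread:

* **Ease of use:** Users suggest downloading the GGUF files from Hugging Face and running the model locally with Ollama.
* **Capabilities:** Early adopters praise the model's reasoning, especially on visual recognition tasks, where it outperforms other models in its class.
* **Usability problems:** Non-Chinese speakers find the platform hard to use. The site's localization is described as "half-finished": the "English" interface option is incomplete or missing, so users often fall back on browser translation, which breaks the page layout.
* **Community reaction:** Although the model's technical output is highly praised, some users got into a tangential and dismissive argument about the company name "Stepfun".

Overall, commenters see Step-3.7 Flash as a highly competitive model that delivers impressive results, but the vendor's experience for international users remains a significant barrier.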

Original text

We establish Step 3.7 Flash as an agentic foundation model with vision input support, shifting perception and recognition from parametric capacity to test-time scaling with visual tools. As the first of these, we strengthen its ability to invoke the Visual Search tool, thereby compensating for the parametric knowledge deficiencies caused by Step 3.7 Flash's limited model size. As shown in the table below, on visual recognition tasks, Step 3.7 Flash with Visual Search achieves performance on par with models five times its size.
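The idea of moving recognition out of the parameters and into a test-time tool call can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Stepfun's implementation: `toy_model`, `visual_search`, `recognize`, and the contents of `SEARCH_INDEX` and `PARAMETRIC_KNOWLEDGE` are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for an external visual search backend:
# maps an image id to text retrieved for that image.
SEARCH_INDEX = {"img-001": "Photo of the Atomium in Brussels"}

# Hypothetical stand-in for what a small model can recall from its weights.
PARAMETRIC_KNOWLEDGE = {"img-002": "Eiffel Tower"}


@dataclass
class Step:
    kind: str      # "tool_call" or "answer"
    content: str   # image id for a tool call, final text for an answer


def visual_search(image_id: str) -> str:
    """Toy search tool: return retrieved text, or an empty string on a miss."""
    return SEARCH_INDEX.get(image_id, "")


def toy_model(image_id: str, evidence: Optional[str]) -> Step:
    """Toy policy: answer from memory if possible, otherwise request a search."""
    if image_id in PARAMETRIC_KNOWLEDGE:
        return Step("answer", PARAMETRIC_KNOWLEDGE[image_id])
    if evidence is None:
        return Step("tool_call", image_id)
    return Step("answer", evidence or "unknown")


def recognize(image_id: str, max_tool_calls: int = 1) -> str:
    """Alternate between the model and the search tool until it answers."""
    evidence: Optional[str] = None
    for _ in range(max_tool_calls + 1):
        step = toy_model(image_id, evidence)
        if step.kind == "answer":
            return step.content
        evidence = visual_search(step.content)
    return "unknown"
```

In this sketch, an image the toy model already "knows" is answered directly, while an unknown image triggers one search call whose result becomes the answer.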

Visual Recognition with Visual Search

Column groups: Flash Level, Pro Level

| Benchmarks | Step 3.7 Flash | Kimi K2.6 | GLM 5V Turbo | GPT 5.5 |
|---|---|---|---|---|
| SimpleVQA | 79.16% | 78.24%* | 78.20% | 79.11%* |
| WorldVQA | 58.10% | 55.98%* | 47.81%* | 54.58%* |
| BC-VL | 58.96% | 57.12%* | 51.90%* | 65.68%* |
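To read the table above as margins rather than raw scores, the gap between Step 3.7 Flash and the best of the other listed models can be computed per benchmark. This is a small arithmetic sketch; the scores are copied from the table with the `*` markers dropped, and the function name `gaps_to_best_other` is only illustrative.

```python
# Scores in percent, copied from the Visual Recognition table.
SCORES = {
    "SimpleVQA": {"Step 3.7 Flash": 79.16, "Kimi K2.6": 78.24,
                  "GLM 5V Turbo": 78.20, "GPT 5.5": 79.11},
    "WorldVQA": {"Step 3.7 Flash": 58.10, "Kimi K2.6": 55.98,
                 "GLM 5V Turbo": 47.81, "GPT 5.5": 54.58},
    "BC-VL": {"Step 3.7 Flash": 58.96, "Kimi K2.6": 57.12,
              "GLM 5V Turbo": 51.90, "GPT 5.5": 65.68},
}


def gaps_to_best_other(scores, model="Step 3.7 Flash"):
    """Return model score minus the best competing score, in percentage points."""
    gaps = {}
    for bench, row in scores.items():
        best_other = max(v for name, v in row.items() if name != model)
        gaps[bench] = round(row[model] - best_other, 2)
    return gaps


if __name__ == "__main__":
    for bench, gap in gaps_to_best_other(SCORES).items():
        print(f"{bench}: {gap:+.2f} pp")
```

A positive value means Step 3.7 Flash leads the other three models on that benchmark; a negative value means at least one of them scores higher.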

For a broader set of challenging vision tasks that demand fine-grained perception over high-resolution images or visual reasoning capabilities—such as V*, HR-Bench, and VisualProbe—we grant the model an enriched action space to interact with images, including cropping, zooming in and out, and drawing pixels or bounding boxes. These tools are implemented as a unified code interface, commonly referred to in the field as the Python tool. With Python, Step 3.7 Flash achieves exceptionally strong performance on these benchmarks.
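The kind of image actions listed above (crop, zoom in, zoom out, draw a bounding box) can be sketched in plain Python. This is only an illustration of such an action space, not the actual tool interface: the function names, the nested-list image representation, and the box convention are assumptions, and a real tool would presumably operate on actual image files through an image library.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop(img: Image, box: Box) -> Image:
    """Return the sub-image inside the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in img[top:bottom]]


def zoom_in(img: Image, factor: int) -> Image:
    """Nearest-neighbour upscaling by an integer factor."""
    return [
        [px for px in row for _ in range(factor)]
        for row in img
        for _ in range(factor)
    ]


def zoom_out(img: Image, factor: int) -> Image:
    """Downscale by keeping every `factor`-th pixel in each direction."""
    return [row[::factor] for row in img[::factor]]


def draw_box(img: Image, box: Box, value: int = 255) -> Image:
    """Return a copy of the image with the outline of the box drawn."""
    left, top, right, bottom = box
    out = [row[:] for row in img]
    for x in range(left, right):
        out[top][x] = value
        out[bottom - 1][x] = value
    for y in range(top, bottom):
        out[y][left] = value
        out[y][right - 1] = value
    return out
```

A model with access to these actions could, for instance, crop a region of a 4K image and zoom in on it before answering a question about a small object, instead of reasoning over the full frame at once.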

Visual Perception with Python Tool

Column groups: Flash Level, Pro Level

| Benchmarks | Step 3.7 Flash | Kimi K2.6 | GLM 5V Turbo | Gemini 3 Flash |
|---|---|---|---|---|
| V* | 95.29% | 96.90% | 89.00% | 96.30% |
| HR-Bench 4K | 89.13% | 91.25%* | 84.62% | 94.50% |
| HR-Bench 8K | 86.34% | 90.13%* | 83.12% | 94.80% |
| VisualProbe | 65.05% | 64.47%* | 53.01% | 69.90% |

One particularly interesting finding is the emergent ability of compositional generalization across visual and other tools. During testing, Step 3.7 Flash seamlessly combined visual tools with non-visual ones to accomplish complex tasks, despite never having been explicitly guided toward such compositional tool use during training.

Visual Reasoning with Python Tool

Compositional Usage across Visual and Non-visual Tools

Operating graphical user interfaces (GUI) is another foundational visual capability for an agentic model: many real-world tasks live beyond the chatbox and the CLI and require the agent to see, click, and verify. We extend Step 3.7 Flash with GUI operation, in particular for the Phone-use stack, so that it can complete long-horizon tasks across multiple apps. On the Android Daily benchmark, Step 3.7 Flash substantially improves on last year's Step-GUI in stability, robustness, and long-horizon completion, and scores ahead of other, larger models.
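The "see, click, and verify" cycle can be illustrated with a toy simulation. Everything here is hypothetical: `ToyPhone`, its screen graph, and `run_task` are made up for the sketch, and the click plan is fixed in advance, whereas a real agent would choose each action from a screenshot.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyPhone:
    """Hypothetical simulated device: each screen maps button labels to next screens."""
    screens: Dict[str, Dict[str, str]]
    current: str = "home"
    history: List[str] = field(default_factory=list)

    def observe(self) -> List[str]:
        """'See': return the button labels visible on the current screen."""
        return sorted(self.screens[self.current])

    def click(self, label: str) -> None:
        """'Click': follow the button to its target screen."""
        self.history.append(label)
        self.current = self.screens[self.current][label]


def run_task(phone: ToyPhone, plan: List[str], goal: str) -> bool:
    """Execute the planned clicks, checking the screen before each action."""
    for label in plan:
        if label not in phone.observe():   # verify the target is actually on screen
            return False
        phone.click(label)
    return phone.current == goal           # verify the final state


SCREENS = {
    "home": {"Settings": "settings", "Mail": "mail"},
    "settings": {"Wi-Fi": "wifi", "Back": "home"},
    "wifi": {"Back": "settings"},
    "mail": {"Back": "home"},
}
```

For example, `run_task(ToyPhone(SCREENS), ["Settings", "Wi-Fi"], "wifi")` succeeds, while a plan that tries to click "Wi-Fi" from the home screen fails at the pre-click check instead of acting blindly.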

Score of Android Daily Benchmark

The same compositional pattern we observed across visual tools also surfaces here: in the following case, after writing a piece of frontend code, the model autonomously turned to the GUI to test the page it had just produced — inspecting the rendered output, exercising interactive elements, and iterating on its own code based on what it saw. Again, this code-and-GUI compositional behavior was never explicitly demonstrated or rewarded during training, yet emerges robustly in test-time use.
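The write-then-verify pattern described here can be sketched as a loop in which each code draft is inspected and rejected until one passes. This is a toy illustration under loud assumptions: the list of drafts stands in for the model's successive revisions, and `inspect_page` parses the HTML with the standard library rather than driving a real GUI.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class ButtonFinder(HTMLParser):
    """Counts <button> elements that carry an onclick handler."""

    def __init__(self) -> None:
        super().__init__()
        self.clickable = 0

    def handle_starttag(self, tag, attrs):
        if tag == "button" and dict(attrs).get("onclick"):
            self.clickable += 1


def inspect_page(html: str) -> List[str]:
    """Toy stand-in for GUI testing: list the problems found on the page."""
    finder = ButtonFinder()
    finder.feed(html)
    return [] if finder.clickable else ["no clickable button found"]


def write_and_verify(drafts: List[str]) -> Tuple[str, int]:
    """Return the first draft that passes inspection and its round number."""
    for round_no, html in enumerate(drafts, start=1):
        if not inspect_page(html):
            return html, round_no
    raise ValueError("no draft passed inspection")


DRAFTS = [
    "<button>Save</button>",                    # first attempt: button does nothing
    '<button onclick="save()">Save</button>',   # revision after inspecting the page
]
```

Here the first draft is rejected because its button has no handler, and the second draft is accepted on round 2, mirroring in miniature the inspect-and-iterate behavior the post describes.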

GUI Operation
