Qwen3.7-Max Ran for 35 Hours on Unknown Hardware and Achieved a 10× Speedup

Original link: https://firethering.com/alibaba-qwen3-7-max-autonomous-agent/

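Alibaba's Qwen3.7-Max demonstrated strong autonomous problem-solving by optimizing a production "Extend Attention" kernel for unfamiliar hardware (the T-Head ZW-M890 PPU). With no prior documentation or architectural knowledge, the model autonomously made 1,158 tool calls over 35 hours, repeatedly writing, analyzing, and refining code, and ultimately achieved a 10× speedup, significantly outperforming competitors such as DeepSeek and Kimi. This success marks a shift from traditional static benchmarks toward "environment scaling," in which the model is trained across diverse agentic tasks rather than on specific datasets. The approach yields stronger generalization, letting the model adapt its strategy across different frameworks without overfitting to a single environment. Although Qwen3.7-Max is on par with top models such as Claude Opus 4.6 and DeepSeek V4 Pro in reasoning and coding, it remains a closed-source API model with no open-weight release. In addition, because the benchmark data is self-published by the vendor, independent third-party verification is still needed. Overall, Qwen3.7-Max is a strong candidate for complex, long-running agentic workflows, but teams that need local deployment and care about privacy will need to look elsewhere.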

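A recent Hacker News discussion highlighted the experiment, in which Qwen3.7-Max made a kernel 10× faster through 35 hours of autonomous iteration. The AI ran 432 evaluation cycles, during which it repeatedly wrote code, analyzed profiling output, diagnosed compilation errors, and redesigned the architecture when incremental gains plateaued. Commenters compared this self-improvement loop to traditional genetic algorithms, noting that the AI was effectively performing non-differentiable optimization. Although some users debated whether the process is a genuine leap or a sophisticated evolution of automated trial and error, the experiment shows that the model can identify performance bottlenecks through iterative runtime feedback rather than relying only on existing knowledge.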

Original article


Alibaba gave Qwen3.7-Max a kernel optimization task on a hardware platform the model had never encountered before. No documentation or profiling data. No example kernels for the architecture. Just a task description, an existing implementation, and an evaluation script.

The model ran for 35 hours. It made 1,158 tool calls. It wrote, compiled, profiled, and rewrote the kernel repeatedly, diagnosing failures, fixing bugs, identifying bottlenecks, and redesigning the architecture multiple times without anyone watching. After 30 hours it was still finding meaningful improvements.

The final result was a 10x speedup over the reference implementation.

For context: GLM 5.1 ran the same task and reached 7.3x. Kimi K2.6 reached 5x. DeepSeek V4 Pro reached 3.3x. The models that stopped early did so after issuing no tool calls for five consecutive rounds: they had concluded they couldn’t make further progress. Qwen3.7-Max didn’t stop.
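The stopping rule described above is simple to state in code. The sketch below is only an assumption about how a harness might count idle rounds; the function name `rounds_until_stop` and the `patience` parameter are hypothetical and not taken from Alibaba's evaluation setup.

```python
def rounds_until_stop(tool_calls_per_round, patience=5):
    """Return the 1-based round at which a run is ended, or None if it never is.

    tool_calls_per_round: number of tool calls the model issued in each round.
    patience: consecutive rounds with zero tool calls that end the run
              (five, per the article; the parameter name is hypothetical).
    """
    idle = 0
    for round_no, n_calls in enumerate(tool_calls_per_round, start=1):
        # Any tool call resets the idle streak; a silent round extends it.
        idle = idle + 1 if n_calls == 0 else 0
        if idle >= patience:
            return round_no
    return None


# Illustrative inputs only, not real run logs.
print(rounds_until_stop([3, 0, 0, 0, 0, 0, 2]))  # five idle rounds in a row
print(rounds_until_stop([1, 0, 0, 0, 0, 1, 0]))  # streak broken before five
```

Under a rule like this, a model that keeps issuing tool calls never triggers the cutoff, which is consistent with the article's description of Qwen3.7-Max's run.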

What the task actually was

The kernel in question is Extend Attention, a production component in SGLang, a widely used inference framework. Specifically it handles attention between newly generated tokens and a prefix KV-cache of up to 32K entries, a memory-bound, latency-critical operation that directly affects how fast LLMs serve responses.
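To make the operation concrete, here is a minimal single-head reference in plain Python. It is a sketch under stated assumptions, not SGLang's implementation: the function name `extend_attention`, the list-of-lists layout, and the choice to let each new token attend to the whole prefix plus the new tokens up to and including itself are all illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def extend_attention(q_new, k_new, v_new, k_prefix, v_prefix):
    """Naive single-head attention for new tokens over a cached prefix.

    Assumption (not from the article): new token i attends to every prefix
    entry and, causally, to new tokens 0..i. Each argument is a list of
    equal-length float vectors.
    """
    scale = 1.0 / math.sqrt(len(q_new[0]))
    out = []
    for i, q in enumerate(q_new):
        keys = k_prefix + k_new[: i + 1]
        values = v_prefix + v_new[: i + 1]
        # Scaled dot-product scores against every visible key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


# Tiny illustrative call: one new token, one cached prefix entry.
print(extend_attention([[0.0, 0.0]], [[1.0, 0.0]], [[3.0, 0.0]],
                       [[0.0, 1.0]], [[1.0, 2.0]]))
```

With a zero query every visible key gets the same weight, so the output is simply the mean of the visible value vectors. The loop also shows why the operation is memory-bound: every new token reads the entire cached prefix, which the article says can reach 32K entries.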

The hardware was T-Head ZW-M890 PPUs, a processor architecture that wasn’t in any training data. The model had no prior knowledge of how it behaved. It started cold.

Over 35 hours it performed 432 kernel evaluations. Each cycle meant writing code, compiling it, running it, reading the profiling output, deciding what to change, and trying again. The model diagnosed compilation failures it hadn’t seen before, identified performance bottlenecks through runtime feedback rather than prior knowledge, and redesigned the kernel architecture multiple times when incremental improvements stopped working.
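One way to picture the bookkeeping behind those evaluation cycles is a small tracker that records each attempt and reports the best speedup so far. This is a hedged sketch: the `EvalResult` type, the `best_speedup` function, and the latencies in the usage lines are hypothetical, and speedup is assumed to mean reference latency divided by candidate latency.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    """Outcome of one kernel evaluation (hypothetical structure)."""
    compiled: bool
    latency_ms: Optional[float] = None  # None when the build failed


def best_speedup(reference_ms, results):
    """Return the best speedup over the reference among successful runs.

    Speedup is assumed to be reference latency / candidate latency.
    Failed compilations are skipped; with no successes the result is 1.0,
    meaning the reference implementation is still the best known kernel.
    """
    best = 1.0
    for r in results:
        if not r.compiled or r.latency_ms is None:
            continue
        best = max(best, reference_ms / r.latency_ms)
    return best


# Made-up latencies for illustration only, not measurements from the run.
history = [
    EvalResult(compiled=False),
    EvalResult(compiled=True, latency_ms=5.0),
    EvalResult(compiled=True, latency_ms=2.0),
]
print(best_speedup(10.0, history))
```

A plateau in this number across many cycles is the kind of signal that, per the article, prompted the model to redesign the kernel rather than keep making incremental changes.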

This matters because it tests something different from standard benchmarks. Most evaluations measure whether a model can produce a correct answer given a well-defined problem. This one measured whether a model could sustain a coherent strategy across more than a thousand tool calls on an open-ended optimization problem with no human guidance. Those are different skills, and most models don’t have the second one.

What the Benchmarks Show

Qwen3.7-Max benchmarks (via Qwen Blog)

The numbers below are from Alibaba’s own evaluation.

Benchmark	Qwen3.7-Max	Claude Opus 4.6	DeepSeek V4 Pro
SWE-Verified	80.4	80.8	80.6
Terminal Bench 2.0	69.7	65.4	67.9
GPQA Diamond	92.4	91.3	90.1
HLE	41.4	40.0	37.7
HMMT 2026 Feb	97.1	96.2	95.2
BFCL-V4	75.0	76.7	70.6

On coding agents it trades blows with Opus 4.6 and DeepSeek rather than clearly beating either. Terminal Bench is the exception, where it leads. The reasoning numbers are where the gap opens up more consistently: GPQA Diamond, HLE, and HMMT all show Qwen3.7-Max at or above the strongest available comparison points.
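The reading above can be checked directly against the table. The short script below copies the reported scores and finds the leader on each benchmark; the variable and function names are illustrative, and the numbers are Alibaba's self-reported figures as listed in the table.

```python
# Scores copied from the table above (Alibaba's own evaluation).
scores = {
    "SWE-Verified":       {"Qwen3.7-Max": 80.4, "Claude Opus 4.6": 80.8, "DeepSeek V4 Pro": 80.6},
    "Terminal Bench 2.0": {"Qwen3.7-Max": 69.7, "Claude Opus 4.6": 65.4, "DeepSeek V4 Pro": 67.9},
    "GPQA Diamond":       {"Qwen3.7-Max": 92.4, "Claude Opus 4.6": 91.3, "DeepSeek V4 Pro": 90.1},
    "HLE":                {"Qwen3.7-Max": 41.4, "Claude Opus 4.6": 40.0, "DeepSeek V4 Pro": 37.7},
    "HMMT 2026 Feb":      {"Qwen3.7-Max": 97.1, "Claude Opus 4.6": 96.2, "DeepSeek V4 Pro": 95.2},
    "BFCL-V4":            {"Qwen3.7-Max": 75.0, "Claude Opus 4.6": 76.7, "DeepSeek V4 Pro": 70.6},
}


def leader(row):
    """Return the model with the highest score in one benchmark row."""
    return max(row, key=row.get)


leaders = {bench: leader(row) for bench, row in scores.items()}
qwen_leads = [bench for bench, model in leaders.items() if model == "Qwen3.7-Max"]

for bench, model in leaders.items():
    print(f"{bench}: {model}")
print("Qwen3.7-Max leads on:", qwen_leads)
```

On these figures Qwen3.7-Max leads on Terminal Bench 2.0 and the three reasoning benchmarks, while Claude Opus 4.6 leads on SWE-Verified and BFCL-V4.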

The remaining benchmarks give you a clear idea of whether it’s the right model for your use case.

Why this model trains differently

Most models get better by seeing more text. Qwen3.7-Max got better by seeing more situations.

Alibaba calls it environment scaling. Instead of optimizing for specific benchmarks, they built a large and diverse set of agentic training environments (different tasks, different tools, different harnesses) and trained the model across all of them. The idea is the same as why a model trained on diverse text generalizes better than one trained on narrow text: diversity of experience produces capability that transfers.

The practical result is cross-harness generalization. Qwen3.7-Max performs consistently whether it’s running through Claude Code, Qwen Code, or a custom tool-use framework. It learned to solve problems rather than to match the patterns of a specific scaffold. That’s rarer than it sounds: most agentic models quietly overfit to the evaluation setup they were trained on.

Limitations

It’s a proprietary API model. No open weights, no local deployment or self-hosting. For teams with data privacy requirements or anyone who wants to run models on their own infrastructure, that’s a hard stop regardless of the benchmark numbers.

The instruction-following gap is also real. IFBench at 79.1 is strong but lower than some competitors on complex multi-step instruction adherence. For workflows that require strict formatting or a precise output structure across long sessions, that’s worth testing before committing.

And the benchmark table is entirely self-reported. Alibaba ran these evaluations. Independent reproduction is still needed to complete the picture.

Who is this for?

A model that ran autonomously for 35 hours on hardware it had never seen, kept improving past the 30-hour mark, and finished 10x faster than the reference implementation is not a normal result. The benchmark numbers are competitive with the best available models. The kernel run is in a different category of evidence entirely.

If you’re building agentic workflows and can work within a proprietary API, this is worth serious evaluation. If open weights are a requirement, it isn’t an option yet. That’s the whole story, honestly told.