MAI-Code-1-Flash

Original link: https://microsoft.ai/news/introducingmai-code-1-flash/

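MAI-Code-1-Flash is a coding model designed for real developer workflows rather than for synthetic benchmarks alone. By training it with the GitHub Copilot harnesses used in production, the team aimed to make the model strong at repository-level tasks, code refactoring, and agentic coding in real software development environments.

A key feature is adaptive solution length control, which lets the model adjust its reasoning depth dynamically. It stays concise on simple queries and goes deeper on complex problems, using up to 60% fewer tokens to complete the same task. That efficiency translates into lower latency, lower cost, and a smoother, faster experience for developers.

Evaluated against Claude Haiku 4.5 in the production harness, MAI-Code-1-Flash scored higher on every core benchmark tested, including a 16-point lead on SWE-Bench Pro. According to Microsoft, the result shows that high accuracy and compute efficiency are not mutually exclusive, and that the model is a better tool for production coding environments.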

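Microsoft has released **MAI-Code-1-Flash**, a new model that reaches 51% on the SWE-Bench Pro benchmark with only 5 billion active parameters. Microsoft presents it as a major efficiency breakthrough, but the announcement drew mixed reactions on Hacker News.

Supporters pointed out that its performance is competitive with leading alternatives such as Anthropic's Claude Haiku 4.5. Skeptics questioned whether the benchmark results are inflated by "hill-climbing" (training on the test data) and objected that the model weights have not been released.

The discussion also reflected broader doubts about AI coding agents: users argued that "mostly correct" code is not good enough, because the debugging burden cancels out the productivity gained from using AI. Some commenters suggested that the industry should shift its focus from raw coding ability to higher-level system design. Finally, the launch website was criticized for its poor user experience, with some users joking that its navigation problems are as frustrating as the limitations of current AI coding assistants.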

Original article

Build for developers, not benchmarks

Coding models are most useful when they perform well in the same environment developers use every day. That is why we built MAI-Code-1-Flash with production workflows at the center, rather than optimizing only for benchmarks. The model was trained directly with GitHub Copilot harnesses used in production. This allows it to learn how to interact with surrounding tools and systems in agentic coding tasks, making it uniquely well suited to real-world Copilot workflows compared to other available models.

During training, we evaluated checkpoints across core software engineering tasks, repository question answering, refactoring, and telemetry-grounded tasks adapted from real GitHub Copilot usage. This alignment between training, evaluation, and production helps offline improvements translate into real-world developer quality.

Designed to maximize value per token

MAI-Code-1-Flash was trained with adaptive solution length control, which helps the model adjust the depth of its response to the task. It can stay concise for simpler requests and spend more reasoning budget when a problem requires deeper analysis or broader code changes. In practice, this means developers start seeing useful output sooner. We see MAI-Code-1-Flash solving harder problems with up to 60% fewer tokens. This helps reduce latency, lower cost, improve return on token, and make interactive workflows feel smoother.

Benchmark results in the production harness

To understand both quality and efficiency, we evaluated MAI-Code-1-Flash against Claude Haiku 4.5 on SWE-Bench Verified, SWE-Bench Pro, SWE-Bench Multilingual, and Terminal Bench 2 using the same production harness that developers use for their everyday coding tasks. We measured task success and the average number of solution tokens required to complete each task.
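The two measurements described above, task success and average solution tokens, can be sketched as a small aggregation over per-task results. This is only an illustration: the `TaskRun` record, its field names, and the sample values are hypothetical, and averaging tokens over all tasks (rather than only solved ones) is an assumption, since the article does not specify it.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRun:
    """One benchmark task attempt (hypothetical record layout)."""
    task_id: str
    passed: bool          # did the model's solution pass the task's checks?
    solution_tokens: int  # tokens the model produced for this task


def summarize(runs):
    """Return (pass_rate, average_solution_tokens) for a list of runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    pass_rate = sum(r.passed for r in runs) / len(runs)
    # Assumption: tokens are averaged over every task, solved or not.
    avg_tokens = mean(r.solution_tokens for r in runs)
    return pass_rate, avg_tokens


# Made-up sample data, not results from the article.
runs = [
    TaskRun("task-1", True, 1200),
    TaskRun("task-2", False, 3400),
    TaskRun("task-3", True, 800),
    TaskRun("task-4", True, 2600),
]

pass_rate, avg_tokens = summarize(runs)
print(f"pass rate: {pass_rate:.1%}, avg solution tokens: {avg_tokens:.0f}")
```

Reporting both numbers side by side is what lets a comparison separate "solves more tasks" from "solves them with less output".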

MAI-Code-1-Flash outperforms Claude Haiku 4.5 across all core coding benchmarks tested, with higher pass rates on all 4 evaluations, including a +16-point lead on the diverse, real-world tasks of SWE-Bench Pro (51.2% vs. 35.2%). It’s not just smarter; it’s leaner, solving harder problems with up to 60% fewer tokens on SWE-Bench Verified, proving that higher accuracy and greater efficiency are no longer a trade-off.
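The arithmetic behind these figures is worth spelling out, because a percentage-point lead and a relative improvement are different quantities. The sketch below uses the two SWE-Bench Pro pass rates quoted above; the helper names and the 10,000-token baseline are hypothetical, and "up to 60%" is an upper bound, so the token figure is a best case rather than a typical one.

```python
def point_lead(a_pct, b_pct):
    """Absolute difference in percentage points."""
    return a_pct - b_pct


def relative_gain(a_pct, b_pct):
    """Improvement of a over b, as a fraction of b."""
    return (a_pct - b_pct) / b_pct


def tokens_after_reduction(baseline_tokens, reduction):
    """Tokens remaining after cutting `reduction` (a fraction from 0 to 1)."""
    return baseline_tokens * (1 - reduction)


# Pass rates on SWE-Bench Pro as quoted in the article.
lead = point_lead(51.2, 35.2)
gain = relative_gain(51.2, 35.2)

# Hypothetical baseline of 10,000 solution tokens, cut by the quoted maximum of 60%.
best_case_tokens = tokens_after_reduction(10_000, 0.60)

print(f"lead: {lead:.1f} points, relative gain: {gain:.1%}")
print(f"best-case tokens: {best_case_tokens:.0f}")
```

Under these inputs, the 16-point lead corresponds to roughly a 45% relative improvement over the lower pass rate, and a 60% reduction leaves 40% of the baseline tokens.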
