Kimi K2.6: Advancing Open-Source Coding

Original link: https://www.kimi.com/blog/kimi-k2-6

## Kimi K2.6: A Leap Forward for Open-Source AI Capability

Kimi K2.6 is Kimi's latest open-source model, making significant advances in coding, long-horizon execution, and agentic capability that rival leading closed-source models. It is accessible via Kimi.com, the Kimi App, the API, and Kimi Code.

Key improvements include enhanced long-horizon coding that generalizes reliably to languages such as Rust, Go, and Python; for example, it autonomously optimized a financial engine (a 185% throughput improvement) and deployed a model locally on a Mac. K2.6 excels at complex multi-step tasks, improving code-generation accuracy by 12% over K2.5 and achieving a 96.6% tool-call success rate.

In addition, K2.6 introduces a powerful "Agent Swarm" capable of parallel task decomposition with up to 300 sub-agents, as well as "Claw Groups" for collaborative human-AI workflows. Benchmarks show large gains in coding (over 50% improvement on Next.js), reasoning, and tool-augmented tasks; K2.6 consistently outperforms its predecessor and achieves state-of-the-art results in many areas. It is a major step toward reliable, scalable, and collaborative AI systems.

Hacker News discussion (45 points, posted by meetpateltech):

nickandbro: Wow, if the benchmarks match how it feels to use, this is almost another Deepseek moment: Chinese AI is now on par with the top models built by US labs.

irthomasthomas: Beats Opus 4.6! They missed the window to claim a breakthrough by a few days.

NitpickLawyer: While I'm skeptical of any "beats Opus" claim (many have made it, none have delivered), I still think it's crazy that a small team can now run a near-frontier model locally on roughly $100k of hardware, 100% certain the data stays local. For teams working in privacy-critical domains this should be the obvious choice.

Original article
Try Kimi K2.6

We are open sourcing our latest model, Kimi K2.6, featuring state-of-the-art coding, long-horizon execution, and agent swarm capabilities. Kimi K2.6 is now available via Kimi.com, the Kimi App, the API, and Kimi Code.
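
For API access, a request can be sketched as follows. The model identifier, endpoint behavior, and message fields here are assumptions modeled on common OpenAI-style chat APIs, not confirmed details of the Kimi API.

```python
import json

# Hypothetical request sketch: "kimi-k2.6" as a model name and the
# chat-completions message shape are assumptions, not documented values.

def build_chat_request(prompt: str, model: str = "kimi-k2.6") -> dict:
    """Assemble a chat-completion request body for an OpenAI-style endpoint."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": prompt},
        ],
        # Sampling settings matching those reported in the testing notes below.
        "temperature": 1.0,
        "top_p": 1.0,
    }

payload = build_chat_request("Port this Go service's hot path to Rust.")
body = json.dumps(payload)  # ready to POST to a chat-completions endpoint
```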

General Agents

Humanity's Last Exam (Full) w/ tools

Coding

Terminal-Bench 2.0 (Terminus-2)

Long-Horizon Coding

Kimi K2.6 shows strong improvements in long-horizon coding tasks, with reliable generalization across programming languages (e.g., Rust, Go, and Python) and tasks (e.g., front-end, devops, and performance optimization). On Kimi Code Bench, our internal coding benchmark covering diverse complicated end-to-end tasks, Kimi K2.6 demonstrates significant improvements over Kimi K2.5.

Kimi Code Bench

Kimi K2.6 demonstrates strong long-horizon coding in complex engineering tasks:

Kimi K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac. By implementing and optimizing model inference in Zig—a highly niche programming language—it demonstrated exceptional out-of-distribution generalization. Across 4,000+ tool calls, over 12 hours of continuous execution, and 14 iterations, Kimi K2.6 dramatically improved throughput from ~15 to ~193 tokens/sec, ultimately achieving speeds ~20% faster than LM Studio.

K2.6 Qwen3.5-0.8B Mac inference optimization case

Kimi K2.6 autonomously overhauled exchange-core, an 8-year-old open-source financial matching engine. Over a 13-hour execution, the model iterated through 12 optimization strategies, initiating over 1,000 tool calls to precisely modify more than 4,000 lines of code. Acting as an expert systems architect, Kimi K2.6 analyzed CPU and allocation flame graphs to pinpoint hidden bottlenecks and reconfigured the core thread topology (from 4ME+2RE to 2ME+1RE). Despite the engine already operating near its performance limits, Kimi K2.6 delivered a 185% leap in medium throughput (from 0.43 to 1.24 MT/s) and a 133% gain in performance throughput (from 1.23 to 2.86 MT/s).

K2.6 exchange-core coding showcase

In beta tests, K2.6 performed well on long-horizon coding tasks in enterprise evaluations (listed in alphabetical order):

Coding-Driven Design

Building on these strong coding capabilities, Kimi K2.6 can turn simple prompts into complete front-end interfaces, generating structured layouts with deliberate design choices such as aesthetic hero sections, interactive elements, and rich animations, including scroll-triggered effects. With strong proficiency in leveraging image and video generation tools, Kimi K2.6 can generate visually coherent assets that contribute to higher-quality, more striking hero sections.

Moreover, Kimi K2.6 expands beyond static front-end development to simple full-stack workflows, spanning authentication, user interaction, and database operations for lightweight use cases such as transaction logging or session management.

We established an internal Kimi Design Bench, organized into four categories: Visual Input Tasks, Landing Page Construction, Full-Stack Application Development, and General Creative Programming. In comparison with Google AI Studio, Kimi K2.6 shows promising results and performs well across these categories.

Kimi Design Bench

Below are examples generated by K2.6 Agent from a single prompt, with preconfigured harnesses and tools:

Aesthetic: Beautiful front-end design with rich interaction

Functionality: With built-in database and authentication

Tool use: Use image/video gen tools to create a polished website

Agent Swarms, Elevated

Scaling out, not just up. An Agent Swarm dynamically decomposes tasks into heterogeneous subtasks executed concurrently by self-created domain-specialized agents.

Based on the K2.5 Agent Swarm research preview, Kimi K2.6 Agent Swarm demonstrates a qualitative leap in the agent swarm experience. It seamlessly coordinates heterogeneous agents to combine complementary skills: broad search layered with deep research, large-scale document analysis fused with long-form writing, and multi-format content generation executed in parallel. This compositional intelligence enables the swarm to deliver end-to-end outputs—spanning documents, websites, slides, and spreadsheets—within a single autonomous run.

The architecture scales horizontally to 300 sub-agents executing simultaneously across 4,000 coordinated steps, a substantial expansion from K2.5's 100 sub-agents and 1,500 steps. This massive parallelization fundamentally reduces end-to-end latency while significantly enhancing output quality and expanding the operational boundaries of Agent Swarms.
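
The scale-out pattern described above can be sketched as a coordinator that decomposes a task into heterogeneous subtasks and fans them out to concurrent workers. The decomposition and agent bodies below are illustrative stand-ins, not Kimi's implementation.

```python
import asyncio

async def sub_agent(role: str, subtask: str) -> str:
    """A domain-specialized worker; a real agent would call models and tools here."""
    await asyncio.sleep(0)  # yield to the event loop, standing in for real work
    return f"[{role}] done: {subtask}"

async def run_swarm(task: str, max_agents: int = 300) -> list:
    """Decompose `task` into heterogeneous subtasks and execute them concurrently."""
    # Trivial decomposition for illustration; a real swarm derives subtasks
    # (search, analysis, writing, ...) dynamically from the task itself.
    subtasks = [
        ("research", f"{task}: gather sources"),
        ("analysis", f"{task}: analyze findings"),
        ("writing", f"{task}: draft the report"),
    ]
    jobs = [sub_agent(role, s) for role, s in subtasks[:max_agents]]
    return await asyncio.gather(*jobs)  # all sub-agents run concurrently

results = asyncio.run(run_swarm("semiconductor market study"))
```

The `max_agents` cap stands in for the 300-sub-agent ceiling; horizontal scaling here is simply widening the list handed to `asyncio.gather`.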

It can also turn high-quality files such as PDFs, spreadsheets, slides, and Word documents into Skills. Kimi K2.6 captures and maintains a document's structural and stylistic DNA, enabling you to reproduce the same quality and format in future tasks.

Here are some examples:

Designed and executed 5 quantitative strategies across 100 global semiconductor assets, deriving a McKinsey-style slide deck as a reusable skill and delivering detailed modeling spreadsheets plus a full executive presentation.
Turned a high-quality astrophysics paper with rich visual data into a reusable academic skill, deriving its reasoning flow and visualization methods, and produced a 40-page, 7,000-word research paper, a structured dataset with 20,000+ entries, and 14 astronomy-grade charts.
Based on the uploaded CV, K2.6 spawned 100 sub-agents to match 100 relevant roles in California, delivering a structured dataset of opportunities and 100 fully customized resumes.
Identified 30 retail stores in Los Angeles without official websites from Google Maps, and generated high-converting landing pages for each, demonstrating opportunity discovery and end-to-end execution.

Proactive Agents

K2.6 demonstrates strong performance in autonomous, proactive agents such as OpenClaw and Hermes, which operate across multiple applications with continuous, 24/7 execution.

Unlike simple chat-based interactions, these workflows require AI to proactively manage schedules, execute code, and orchestrate cross-platform operations as a persistent background agent.

Our RL infra team used a K2.6-backed agent that operated autonomously for 5 days, managing monitoring, incident response, and system operations, demonstrating persistent context, multi-threaded task handling, and full-cycle execution from alert to resolution. Here is K2.6's worklog (anonymized to remove sensitive information):

K2.6 Agent Trace — 5-day autonomous engineering worklog

Kimi K2.6 delivers measurable improvements in real-world reliability: more precise API interpretation, more stable long-running performance, and enhanced safety awareness during extended research tasks.

Performance gains are quantified by our internal Claw Bench, an evaluation suite spanning five domains: Coding Tasks, IM Ecosystem Integration, Information Research & Analysis, Scheduled Task Management, and Memory Utilization. Across all metrics, Kimi K2.6 significantly outperforms Kimi K2.5 in task completion rates and tool invocation accuracy, particularly in workflows requiring sustained autonomous operation without human oversight.

Kimi Claw Bench

Bring Your Own Agents

Building on its robust orchestration capabilities, Kimi K2.6 extends your proactive agents into Claw Groups, a research preview of a new instantiation of the Agent Swarm architecture.

Claw Groups embrace an open, heterogeneous ecosystem: Multiple agents and humans operate as true collaborators. Users can onboard agents from any device, running any model, each carrying their own specialized toolkits, skills and persistent memory contexts. Whether deployed on local laptops, mobile devices, or cloud instances, these diverse agents integrate seamlessly into a shared operational space.

At the center of this swarm, Kimi K2.6 serves as an adaptive coordinator. It dynamically matches tasks to agents based on their specific skill profiles and available tools, optimizing for capability fit. When an agent encounters failure or stalls, the coordinator detects the interruption, automatically reassigns the task or regenerates subtasks, and actively manages the full lifecycle of deliverables—from initiation through validation to completion.
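
The reassignment behavior described above can be sketched as a retry loop over capability-matched agents. Skill matching and failure detection are simplified assumptions here, not the actual Claw Groups coordinator.

```python
# Illustrative coordinator sketch: match a task to agents by skill profile
# and, when an agent fails or stalls, reassign to the next capable candidate.

def assign_with_fallback(task: dict, agents: list) -> str:
    """Run `task` on capability-matched agents until one succeeds."""
    candidates = [a for a in agents if task["skill"] in a["skills"]]
    for agent in candidates:
        try:
            return agent["run"](task["payload"])
        except Exception:
            continue  # failure detected: reassign to the next candidate
    raise RuntimeError("no capable agent completed the task")

def flaky_run(payload):
    raise IOError("agent stalled")  # simulate a failed or stalled agent

flaky = {"skills": {"scrape"}, "run": flaky_run}
steady = {"skills": {"scrape", "write"}, "run": lambda p: f"scraped {p}"}

result = assign_with_fallback({"skill": "scrape", "payload": "storefront"},
                              [flaky, steady])
# result == "scraped storefront"
```

A real coordinator would additionally regenerate subtasks and validate deliverables before marking the lifecycle complete; this sketch covers only the detect-and-reassign step.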

We also want to thank the K2.6-powered agents in Claw Groups: we have been dogfooding Claw Groups within our own marketing team, refining human-agent workflows in practice. Using Claw Groups, we run end-to-end content production and launch campaigns, with specialized agents like Demo Makers, Benchmark Makers, Social Media Agents, and Video Makers working together. K2.6 coordinates the process, enabling agents to share intermediate results and turn ideas into consistent, fully packaged deliverables.

We are moving beyond simply asking AI a question or assigning it a task, and entering a phase where humans and AI collaborate as genuine partners, combining strengths to solve problems collectively. Claw Groups marks our latest effort toward a future where the boundaries between "my agent," "your agent," and "our team" dissolve into a collaborative system.

Benchmark Table

[Table: per-benchmark scores comparing Kimi K2.6, Kimi K2.5, Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro across general-agent, coding, reasoning, tool-augmented, and vision benchmarks, including Terminal-Bench 2.0 (Terminus-2); asterisks mark results re-evaluated in-house.]

To reproduce official Kimi-K2.6 benchmark results, we recommend using the official API. For third-party providers, refer to Kimi Vendor Verifier (KVV) to choose high-accuracy services. Details: https://kimi.com/blog/kimi-vendor-verifier

1. General Testing Details

  • We report results for Kimi K2.6 and Kimi K2.5 with thinking mode enabled, Claude Opus 4.6 with max effort, GPT-5.4 with xhigh reasoning effort, and Gemini 3.1 Pro with a high thinking level.
  • Unless otherwise specified, all Kimi K2.6 experiments were conducted with temperature = 1.0, top-p = 1.0, and a context length of 262,144 tokens.
  • Benchmarks without publicly available scores were re-evaluated under the same conditions used for Kimi K2.6 and are marked with an asterisk (*). Except where noted with an asterisk, all other results are cited from official reports.

2. Reasoning Benchmarks

  • IMO-AnswerBench scores for GPT-5.4 and Claude 4.6 were obtained from https://z.ai/blog/glm-5.1.
  • Humanity's Last Exam (HLE) and other reasoning tasks were evaluated with a maximum generation length of 98,304 tokens. By default, we report results on the HLE full set. For the text-only subset, Kimi K2.6 achieves 36.4% accuracy without tools and 55.5% with tools.

3. Tool-Augmented / Agentic Tasks

  • Kimi K2.6 was equipped with search, code-interpreter, and web-browsing tools for HLE with tools, BrowseComp, DeepSearchQA, and WideSearch.
  • For HLE-Full with tools, the maximum generation length is 262,144 tokens with a per-step limit of 49,152 tokens. We employ a simple context management strategy: once the context window exceeds the threshold, only the most recent round of tool-related messages is retained.
  • For BrowseComp, we report scores obtained with context management using the same discard-all strategy as Kimi K2.5 and DeepSeek-V3.2.
  • For DeepSearchQA, no context management was applied to Kimi K2.6 tests, and tasks exceeding the supported context length were directly counted as failed. Scores for Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro on DeepSearchQA are cited from the Claude Opus 4.7 System Card.
  • For WideSearch, we report results under the "hide tool result" context management setting. Once the context window exceeds the threshold, only the most recent round of tool-related messages is retained.
  • The test system prompts are identical to those used in the Kimi K2.5 technical report.
  • Claw Eval was conducted using version 1.1 with max-tokens-per-step = 16384.
  • For APEX-Agents, we evaluate 452 tasks from the public 480-task release, as done by Artificial Analysis (excluding Investment Banking Worlds 244 and 246, which have external runtime dependencies).
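
The context-management strategy described above, retaining only the most recent round of tool-related messages once the context exceeds a threshold, can be sketched as follows. Message shapes and the token counter are simplified assumptions for illustration.

```python
def trim_context(messages: list, count_tokens, threshold: int) -> list:
    """Keep only the latest round of tool messages when over the token budget."""
    if count_tokens(messages) <= threshold:
        return messages  # under budget: keep the full history
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    if not tool_idx:
        return messages
    last_round = tool_idx[-1]  # index of the most recent tool message
    return [m for i, m in enumerate(messages)
            if m["role"] != "tool" or i >= last_round]

def count_tokens(messages):  # crude proxy: characters stand in for tokens
    return sum(len(m["content"]) for m in messages)

history = [
    {"role": "user", "content": "profile the matching engine"},
    {"role": "tool", "content": "flame graph round 1 ..."},
    {"role": "assistant", "content": "found an allocation hotspot"},
    {"role": "tool", "content": "flame graph round 2 ..."},
]
trimmed = trim_context(history, count_tokens, threshold=40)
```

Non-tool messages survive intact, so the agent keeps its own reasoning trail while discarding bulky stale tool output.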

4. Coding Tasks

  • Terminal-Bench 2.0 scores were obtained with the default agent framework (Terminus-2) and the provided JSON parser, operating in preserve thinking mode.
  • For the SWE-Bench series of evaluations (including Verified, Multilingual, and Pro), we used an in-house evaluation framework adapted from SWE-agent. This framework includes a minimal set of tools—bash tool, createfile tool, insert tool, view tool, strreplace tool, and submit tool.
  • All reported scores for coding tasks are averaged over 10 independent runs.

5. Vision Benchmarks

  • Max-tokens = 98,304, averaged over three runs (avg@3).
  • Settings with Python tool use: max-tokens-per-step = 65,536 and max-steps = 50 for multi-step reasoning.
  • MMMU-Pro follows the official protocol, preserving input order and prepending images.