Kimi vendor verifier – verify accuracy of inference providers

原始链接: https://www.kimi.com/blog/kimi-vendor-verifier

Kimi is open-sourcing the Kimi Vendor Verifier (KVV) project to address a key problem in the open-source AI model ecosystem: ensuring that implementations across different platforms are consistent and *correct*. They found that wide variation in benchmark results stemmed not from model defects but from mishandled parameters and infrastructure issues during deployment. KVV provides six benchmarks, including tests for parameter enforcement, multimodal pipelines, long-output generation, tool use, and agentic coding, to systematically verify inference accuracy. Its focus is on separating "engineering implementation deviations" from genuine model defects. Kimi is actively working with communities such as vLLM and SGLang to fix root causes, and offers pre-release model access for vendor validation. A public leaderboard will track vendor performance, improving transparency and accountability. The goal is to build trust in open-source models by guaranteeing that they work as intended everywhere.

## Kimi's Vendor Verifier: Summary

Kimi has released a "vendor verifier" intended to ensure the accuracy of inference providers, the services that run AI models on users' behalf. The tool tests whether a provider delivers results consistent with the original model's capabilities.

The Hacker News discussion highlighted both the promise and the limits of this approach. While seen as a step toward transparency, commenters noted that the verifier may not stop malicious providers from deliberately serving cheaper, lower-quality models while detecting and evading the tests. There were also concerns about scalability, since a full run takes up to 15 hours and requires substantial resources. Overall sentiment was positive, however: users noted that providers quietly degrading model quality without informing users (for example, through aggressive quantization) is a common problem, so a standard verification process is valuable. Other AI labs were encouraged to build similar tools.

Original Article

Alongside the release of the Kimi K2.6 model, we are open-sourcing the Kimi Vendor Verifier (KVV) project, designed to help users of open-source models verify the accuracy of their inference implementations.

Not as an afterthought, but because we learned the hard way that open-sourcing a model is only half the battle. The other half is ensuring it runs correctly everywhere else.

## Official Evaluation Results

The official K2VV evaluation results for the Kimi API, used to calculate the F1 score, are linked from the original post.

## Why We Built KVV

### From Isolated Incidents to Systemic Issues

Since the release of K2 Thinking, we have received frequent feedback from the community regarding anomalies in benchmark scores. Our investigation confirmed that a significant portion of these cases stemmed from the misuse of decoding parameters. To mitigate this immediately, we built our first line of defense at the API level: enforcing Temperature=1.0 and TopP=0.95 in Thinking mode, with mandatory validation that thinking content is correctly passed back.
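The API-level guard described above can be sketched roughly as follows. This is an illustrative minimal sketch, not Kimi's actual implementation: the function name `validate_request`, the request dict layout, and the `reasoning_content` field are assumptions for demonstration.

```python
# Hypothetical sketch of an API-level guard: in Thinking mode, decoding
# parameters are pinned to the required values, and the request is rejected
# if thinking content was not passed back on prior assistant turns.

REQUIRED = {"temperature": 1.0, "top_p": 0.95}

def validate_request(request: dict) -> dict:
    """Enforce Thinking-mode decoding parameters on an incoming request."""
    if request.get("mode") != "thinking":
        return request
    fixed = dict(request)
    for key, value in REQUIRED.items():
        fixed[key] = value  # override whatever the caller sent
    # Every assistant turn in a multi-turn history must carry its
    # reasoning content back, or downstream quality silently degrades.
    for msg in fixed.get("messages", []):
        if msg.get("role") == "assistant" and not msg.get("reasoning_content"):
            raise ValueError("thinking content missing from assistant turn")
    return fixed
```

The point of enforcing rather than merely documenting these constraints is that users often copy decoding settings from other models, and the resulting score drops are indistinguishable from model defects.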

However, more subtle anomalies soon triggered our alarm. In a specific evaluation on LiveBenchmark, we observed a stark contrast between third-party APIs and the official API. After extensive testing across infrastructure providers, we found that this discrepancy is widespread.

This exposed a deeper problem in the open-source model ecosystem: The more open the weights are, and the more diverse the deployment channels become, the less controllable the quality becomes.

If users cannot distinguish between "model capability defects" and "engineering implementation deviations," trust in the open-source ecosystem will inevitably collapse.

## Our Solution

Six Critical Benchmarks (selected to expose specific infra failures):

  1. Pre-Verification: Validates that API parameter constraints (temperature, top_p, etc.) are correctly enforced. All tests must pass before proceeding to benchmark evaluation.
  2. OCRBench: A 5-minute smoke test for multimodal pipelines.
  3. MMMU Pro: Verifies vision input preprocessing by testing diverse visual inputs.
  4. AIME2025: A long-output stress test. Catches KV-cache bugs and quantization degradation that short benchmarks hide.
  5. K2VV ToolCall: Measures trigger consistency (F1) and JSON Schema accuracy. Tool errors compound in agents; we catch them early.
  6. SWE-Bench: A full agentic coding test. (Not open-sourced due to its sandbox dependency.)
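To make the trigger-consistency metric concrete, here is one plausible way an F1 score over tool-call triggering could be computed. This is a generic sketch, not the K2VV scoring code: it assumes each sample is reduced to a boolean "did the model call a tool" label compared against a reference.

```python
def trigger_f1(reference: list[bool], observed: list[bool]) -> float:
    """F1 of tool-call triggering against a reference run.

    Treats the reference label 'should call a tool' as the positive
    class; `observed` is whether the vendor's output called a tool.
    """
    tp = sum(r and o for r, o in zip(reference, observed))        # correct triggers
    fp = sum((not r) and o for r, o in zip(reference, observed))  # spurious triggers
    fn = sum(r and (not o) for r, o in zip(reference, observed))  # missed triggers
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

An implementation that silently drops or garbles tool-call tokens shows up here as depressed recall, even when plain-text benchmarks look normal, which is why this test is listed separately from the others.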

Upstream Fixes: We work directly with the vLLM, SGLang, and KTransformers communities to fix root causes, not just detect symptoms.

Pre-Release Validation: Rather than waiting for post-deployment complaints, we provide early access to test models. This lets infrastructure providers validate their stacks before users encounter issues.

Continuous Benchmarking: We will maintain a public leaderboard of vendor results. This transparency encourages vendors to prioritize accuracy.

## Testing Cost Estimation

We completed a full evaluation workflow validation on two NVIDIA H20 8-GPU servers, with sequential execution taking approximately 15 hours. To improve evaluation efficiency, the scripts have been optimized for long-running inference scenarios, with streaming inference, automatic retry, and checkpoint resumption mechanisms.
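The retry and checkpoint-resumption pattern mentioned above can be sketched as follows. This is an illustrative sketch, not the KVV scripts: the function name, the JSONL checkpoint format, and the `infer` callable are all assumptions for demonstration.

```python
import json
import time
from pathlib import Path

def run_with_checkpoint(samples, infer, checkpoint="results.jsonl",
                        max_retries=3):
    """Run inference over samples, skipping ones completed in a previous
    run and retrying transient failures with exponential backoff.

    `infer` is any callable that takes a sample dict and returns a
    JSON-serializable result.
    """
    path = Path(checkpoint)
    done = set()
    if path.exists():
        # Each completed sample is one JSON line; reload their IDs.
        for line in path.read_text().splitlines():
            done.add(json.loads(line)["id"])
    with path.open("a") as out:
        for sample in samples:
            if sample["id"] in done:
                continue  # resume: already answered in an earlier run
            for attempt in range(max_retries):
                try:
                    result = infer(sample)
                    break
                except Exception:
                    if attempt == max_retries - 1:
                        raise
                    time.sleep(2 ** attempt)  # exponential backoff
            out.write(json.dumps({"id": sample["id"], "result": result}) + "\n")
```

For a 15-hour sequential run, appending each result as soon as it completes means a crash or rate-limit failure costs only the in-flight sample rather than the whole evaluation.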

## An Open Invitation

Weights are open. The knowledge to run them correctly must be too.

We are expanding vendor coverage and seeking lighter agentic tests. Contact Us: [email protected]
