Why eval startups fail (2025)

Original link: https://thomasliao.com/eval-startups

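Independent AI eval startups struggle to succeed for three main reasons:

1. **Talent attrition:** The skills that effective evals require (data collection and analysis) create more value and carry more influence in post-training or application development. Top researchers gravitate toward those higher-value, more prestigious areas.
2. **No market:** The potential customer base is mismatched. Developers who can use APIs and interpret eval metrics can usually build their own eval harness, while non-technical users usually want "solutions", not performance metrics.
3. **Goodhart's Law and big-lab pressure:** Eval startups operate in an adversarial environment. To top the leaderboards, major AI labs often "game" benchmarks through data contamination or optimization tricks. Because an eval startup's only product is its evaluation results, its value evaporates once the integrity of those metrics is called into question.

The one exception is **safety evals**, where ideological alignment, the need for third-party audits, and potential regulatory requirements provide a more stable business model. Otherwise, compared with the fast-moving work of model training, selling evals remains a difficult, low-moat business.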

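```
Hacker News: Why eval startups fail (2025) (thomasliao.com)
13 points by jxmorris12, 2 hours ago | 4 comments

GL26, 2 minutes ago:
  The problem with evals is that the information doesn't refresh fast enough to meet your need for up-to-date model performance benchmarks. Bloomberg succeeds because it sells information that expires within the hour.

bitlad, 5 minutes ago:
  Everything dies eventually. Nothing is constant, not even evals.

theteapot, 12 minutes ago:
  What is an eval?

  choult, 7 minutes ago (reply to theteapot):
    An evaluation of different implementations of a technology. It's a kind of meta service layer on top of an industry, e.g. "Which frontier model is best?" I do agree the author didn't do a good job of introducing the term.
```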

Original text

May 8th, 2025

Why are there so few independent eval startups?

Whenever there's a new AI trend, like agents, or voice, or voice agents, developers are faced with a flurry of options, and a subset of them are convinced that there's a business opportunity in identifying the best models and selling that knowledge to other developers—that is, selling evals. I've seen this in every wave of generative AI, since before we were calling it generative AI. I haven't seen any succeed, outside the safety evals niche.

I have a few theories why independent eval startups die. First, people who can design and run good evals can make more money and have more influence in other parts of the model development stack, so talent attrits. Second, eval startups have a hard time finding customers, because clients have to be technical developers who want to build with APIs, but also not technical enough to run their own evals. And third, eval startups face immense optimization pressure that renders their evals useless, both from garden-variety hill climbing and through pressure from model developers.

Eval talent is better used elsewhere

Good eval talent moves to other parts of the stack because the same skills that are needed for good evals are useful for post-training and for application development, and these areas capture more value, i.e. make more money, and have more direct influence on model development, i.e. are more prestigious and interesting.

For example, building a good eval requires collecting high-quality data, whether from operating a human feedback pipeline or through synthetic data. Collecting high-quality data is a major bottleneck for post-training. The amount of data in an eval is always smaller than the amount of data collected for post-training, by orders of magnitude, so in a real sense the value you generate from collecting data for evals is capped compared to the value you generate from collecting data for post-training, assuming the value per datapoint is equal. Additionally, the financial return on a good post-train is potentially very high, up to a few hundred million or billions of dollars, whereas the financial return on an eval is capped at the size of your largest eval contract, which is nowhere close. This dynamic is readily apparent to smart young researchers who incidentally understand the notion of opportunity cost. An illustrative example is provided by three researchers who quit their jobs at Epoch AI evaluating agents to instead start a startup building post-training tools for agents [0].

Not enough eval customers

Even if an eval startup retains talent, it still has a hard time finding customers, because the Venn diagram intersection of the two circles "building on model API" and "unable to evaluate models" has negligible area.

When you look at charts comparing vendors by Gartner, a market research firm, the X-axes are fantastical and the Y-axes are fictional; in short, the charts are made to be interpreted by toddlers, who have technical caliber comparable to the corporate executives those charts are printed for. If you think I'm exaggerating I encourage you to Google "Gartner Magic Quadrant AI" then report them to the Department of Chart Crimes. This same quagmire ensnares AI eval startups. Any customer that is post-training models is definitely building evals themselves. A developer who understands the meaning and implication of a 10% improvement on AIME 2024, without tool use, computed with best of N, is not far from just running that eval themselves. If they don't understand the difference between GPT-4o and GPT-4.1, they're the kind of customer that wants solutions, not features, and certainly not an explanation of Elo. Gartner can dumb down for execs, who are deciding on large contracts with cloud providers, but eval startups always seem to want to sell to developers. Thus I am skeptical the market for eval startups is very large, even as the demand for AI services grows.
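To make concrete what "computed with best of N" changes about a headline number, here is a minimal sketch. It is not taken from any real eval harness: the function names and the toy data are hypothetical, and it assumes an oracle-style best of N, where a problem counts as solved if any of the first N samples is correct. Other definitions (for example, picking one sample with a reward model) would give lower numbers.

```python
from typing import List


def single_sample_accuracy(results: List[List[bool]]) -> float:
    """Fraction of problems whose first sample is correct."""
    return sum(samples[0] for samples in results) / len(results)


def best_of_n_accuracy(results: List[List[bool]], n: int) -> float:
    """Fraction of problems where any of the first n samples is correct.

    Assumption: oracle selection, i.e. an upper bound on what a real
    best-of-N selector could achieve.
    """
    return sum(any(samples[:n]) for samples in results) / len(results)


# Hypothetical toy data: 5 problems, 4 samples each (True = correct answer).
toy_results = [
    [True, True, False, True],
    [False, True, False, False],
    [False, False, False, True],
    [False, False, False, False],
    [True, False, True, True],
]

single = single_sample_accuracy(toy_results)
best4 = best_of_n_accuracy(toy_results, n=4)
print(f"single sample: {single:.0%}, best of 4: {best4:.0%}")
# prints "single sample: 40%, best of 4: 80%"
```

The same model on the same problems reports 40% or 80% depending only on the scoring rule, which is why a number quoted without its sampling setup says little.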

Big labs Goodhart evals

An eval startup that overcomes these two hurdles now has to face down the big labs themselves, who are highly incentivized to climb the public eval and apply pressure and tricks to improve their numbers. Once benchmarks are targeted, models can improve rapidly on them, through anything from benign adjustments like including more diverse data to outright training on test data, which Meta did for Llama 1 [1] and is rumored to have done for Llama 4 [2]. So eval startups have to be wary about a potentially adversarial relationship with big labs, who don't want to lose their own customers and will play their unfair advantages. Other kinds of tricks big labs employ include asking employees to vote for their own models on public leaderboards, poaching employees from eval startups, dangling free compute in return for better results, and asking for private insights about model performance; the list of shenanigans is long.

A principled team can resist these gambits, but the pall of suspicion is hard to dispel. For two years every researcher has asked themselves: why is every new model release always at the top of the LMSys Chatbot Arena leaderboard? A new report led by Cohere suggests the cause is systematic gaming, claiming that Meta tested twenty-seven unique model variants before releasing Llama 4 [3]. Meta, by the way, advertised that its tiny Llama 4 Maverick model outperformed GPT-4.5, before revealing that the result was achieved with a version optimized specifically for Chatbot Arena, and not the released version, which ranked abysmally. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. And all eval startups have to sell are measures.
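The selection effect behind testing many private variants can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of Chatbot Arena itself: every variant has the same hypothetical true score, each private test adds independent Gaussian noise, and the lab publishes only the best measurement. The true score of 1200 and noise of 15 are made-up illustrative values; only the count of 27 echoes the claim cited above.

```python
import random
import statistics


def published_score(num_variants: int, true_score: float, noise_sd: float,
                    rng: random.Random) -> float:
    """Measure equally good variants with noise and return the best measurement."""
    return max(rng.gauss(true_score, noise_sd) for _ in range(num_variants))


def mean_published_score(num_variants: int, trials: int = 2000, seed: int = 0) -> float:
    """Average published score over many simulated release cycles."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return statistics.fmean(
        published_score(num_variants, true_score=1200.0, noise_sd=15.0, rng=rng)
        for _ in range(trials)
    )


one = mean_published_score(1)    # publish the only variant tested
many = mean_published_score(27)  # publish the best of 27 identical-quality variants
print(f"1 variant: {one:.1f}, best of 27: {many:.1f}")
```

Under these assumptions the best-of-27 score sits well above the true score even though no variant is actually better, so the leaderboard ends up measuring how many attempts a lab could afford rather than model quality.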

Safety evals are an exception

I believe eval startups can work when they're targeting safety benchmarks specifically. Researchers who want to work on safety evals tend to be ideologically opposed to working on capabilities, which means they don't migrate to post-training or applications due to monetary incentives. (This is how the internal safety eval divisions of the big labs retain talent.) They can provide services to technical clients who are capable of replicating those services, because it's specifically important for safety evals that those services are provided by an external vendor and not only done internally. They can also sell to policymakers, or have business assured by regulation if proposals for external model audits are passed. Safety eval startups would still be vulnerable to Goodharting, but if labs are Goodharting safety evals, there are other things to be worried about. So safety evals have particular characteristics that make them more amenable to a startup business than other evals.

I've presented three reasons why it's hard for eval startups to survive. The most pernicious of these is the first, which is that there are better opportunities available for any company or engineer who is good at evals, but the other two pose serious headwinds as well. I have nothing against eval startups, and I am rooting for them, but I am not counting on them.

❖ ❖ ❖

Additional comments

The above is for application-focused evals, i.e. evals for developers who want to build on top of model APIs. There are also startups that want to sell research evals to big labs. These will fail, because the primary point of research evals is to set research directions, and big labs will never outsource setting their research agenda. Also, outsourcing research evals adds a ton of latency to model iteration, and velocity is everything.

Added: May 21st, 2025. There's a difference between selling evals and selling evals tooling. In the same way that selling human labels is different from selling tooling to collect human labels (one is an ops business with ops margins, the other is a SaaS business with SaaS margins), selling evals and selling evals tooling have very different economics. LM Arena, the organization behind Chatbot Arena, today announced a $100M seed round [4]. That's a very large sum of money. For comparison, Mistral, the French company aiming to train frontier models, raised only slightly more in their seed in 2023 [5]. LM Arena has the advantage of millions of volunteers labelling for free, effectively compensated with access to otherwise-expensive frontier models, but I still don't think that makes selling evals a great business for them. I think that if they do well it will be through offering supplementary services, like selling software or selling access to data streams.

❖ ❖ ❖

Links referenced

❖ ❖ ❖

Thomas Liao's toucan seal