LLMs Are Complicated Now

Original link: https://ianbarber.blog/2026/06/19/llms-are-complicated-now/

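Machine learning architectures have evolved from the clean, uniform structure of the early Transformer into increasingly complex and heterogeneous systems. Techniques such as Mixture-of-Experts, a range of attention mechanisms, and multimodal integration have introduced architectural complexity along a trajectory similar to that of recommendation systems, where efficiency and capability requirements became tightly coupled.

As the pace of research accelerates, the industry faces a paradox: we rely on AI agents to optimize code, yet lack the composable baselines needed to verify those optimizations. Researchers cannot wait for custom fused kernels before testing a new architectural variant; they need to iterate quickly without sacrificing baseline performance.

The solution is to prioritize **composability** in system design. Tools such as PyTorch's FlexAttention represent this shift: they let developers generate efficient kernels from templates, and those kernels are designed to be verifiable and modular. Ultimately, the ability to break architectures down into flexible, composable components is as important to AI progress as the agent frameworks built on top of them. Rather than relying on automated optimization alone, the field must design for a future in which architecture exploration and performance are intertwined from the start.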

Original article

Back in 2022 and 2023 there were two big branches of machine learning happening at Meta. The LLM work that led to Llama was a clean, smooth stack of repeated Transformer modules; the recommendation systems graphs were, by contrast, terrifying. Luckily, the industry has remedied that state of affairs by making LLMs a lot more complicated.

Seb Raschka maintains an excellent gallery of model architectures. You can use it to diff two of the best open models of their respective eras, Llama 3 and Nemotron 3 Ultra.

Attention might be all you need, but modern models certainly use a lot of different variants of it: grouped-query, compressed, sparse, linear, sliding-window and more. Mixture-of-Experts added selective routing to feed-forward layers, and we have since started routing just about everything else too, from attention blocks to the residual stream. Vision and audio encoders have gone from bolted-on to mixed-in, and models have scaled to run at inference time across multiple GPUs, which throws in communication ops that add extra boundaries in the middle of your model.
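
To make "selective routing" concrete, here is a minimal pure-Python sketch of top-k expert routing for a single token. The function names (`route_top_k`, `moe_forward`), the toy experts, and the choice to renormalize the selected weights so they sum to one are all assumptions for illustration; real MoE layers differ in how they score, balance, and combine experts.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize over the selected experts only.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, router_logits, experts, k=2):
    """Run only the k selected experts and mix their outputs."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Hypothetical toy experts: each one scales its input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

token = [1.0, -1.0]
logits = [0.0, 5.0, 0.0, 5.0]  # this router prefers experts 1 and 3
print(moe_forward(token, logits, experts, k=2))
```

Only two of the four experts run for this token, which is the efficiency win. It is also what makes the compute graph data-dependent, and so harder to fuse.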

This is not too different from what happened with recsys. The basic architecture of recommendation systems, for the best part of a decade, was a relatively straightforward two-tower sparse neural net. The complexity came from the tension between the need to continually increase capabilities and the need to stay efficient, particularly for inference.

It’s tempting to assume that agents will Fix This: that you’ll hand your PyTorch or JAX definition to Claude Telenovela or whatever and have it generate optimally fused kernels. To make that work you need a fixed, usable baseline to make sure that what is generated is… right.

What happened with recsys was that the gap between performance being an optimization and performance being a necessity became very, very small. Conceptually you can keep a pure model definition that gives you a baseline; in practice, training and testing a model takes significant resources and performance improvements become load-bearing.

If you want to swap attention variant A for variant B, you can afford for B to be ten percent slower. You probably can’t afford for it to be an order-of-magnitude worse. If A is fused and optimized, you need at least a partially fused and optimized version of B before you can even tell whether it’s worth exploring. The research iteration loop demands a different kind of flexibility than just “optimize this known quantity”. You can’t hand-fuse your way back without investing significant time that might not be worth it, and you can’t generate your way forward without a baseline to check. The only way out is to design for composability up front.
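
As a rough illustration of "a baseline to check", the sketch below compares a candidate against a readable reference on random inputs. Everything here is hypothetical and pure Python: `attention_reference` is plain softmax attention for one query, `attention_streaming` is a one-pass online-softmax version standing in for an optimized or generated kernel, and `matches_baseline` is an assumed helper, not part of any library.

```python
import math
import random

def attention_reference(q, keys, values):
    """Plain softmax attention for one query: the readable baseline."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def attention_streaming(q, keys, values):
    """Same math in a single pass with a running max (online softmax)."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = scale * sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, s)
        correction = math.exp(running_max - new_max)  # rescale earlier partial sums
        p = math.exp(s - new_max)
        denom = denom * correction + p
        acc = [a * correction + p * vi for a, vi in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

def matches_baseline(candidate, reference, trials=100, tol=1e-9, seed=0):
    """Check the candidate against the reference on seeded random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        n, d = rng.randint(1, 16), rng.randint(1, 8)
        q = [rng.uniform(-3, 3) for _ in range(d)]
        keys = [[rng.uniform(-3, 3) for _ in range(d)] for _ in range(n)]
        values = [[rng.uniform(-3, 3) for _ in range(d)] for _ in range(n)]
        want = reference(q, keys, values)
        got = candidate(q, keys, values)
        if any(abs(g - w) > tol for g, w in zip(got, want)):
            return False
    return True

print(matches_baseline(attention_streaming, attention_reference))
```

The check is only as good as the reference. If the reference is too slow to run at realistic sizes, this kind of comparison stops being practical.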

One of my favorite kernel developments of the last few years was FlexAttention in PyTorch, which took a whole class of attention operations and allowed you to generate kernels for them via Triton templates. It built on a huge body of work in attention kernels, and it was designed to be composable and verifiable up front: you can explore with only a very mild impact on performance.
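
The composability idea can be sketched without any kernels at all. The pure-Python example below is not the FlexAttention API. It is a toy that is loosely modeled on the index-based mask functions FlexAttention accepts, with hypothetical names (`causal`, `sliding_window`, `both`, `masked_attention`) and a slow reference loop where the real thing would generate a fused kernel.

```python
import math

def causal(q_idx, kv_idx):
    # A query may only look at itself and earlier positions.
    return kv_idx <= q_idx

def sliding_window(width):
    # A query may only look at positions fewer than `width` steps away.
    def mask(q_idx, kv_idx):
        return abs(q_idx - kv_idx) < width
    return mask

def both(*masks):
    # Compose masks: a position is visible only if every mask allows it.
    def mask(q_idx, kv_idx):
        return all(m(q_idx, kv_idx) for m in masks)
    return mask

def masked_attention(queries, keys, values, mask):
    """Reference attention over whichever positions `mask` allows.

    Assumes every query can see at least one key, which holds for the
    masks above because each position can always see itself.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for qi, q in enumerate(queries):
        visible = [ki for ki in range(len(keys)) if mask(qi, ki)]
        scores = [scale * sum(a * b for a, b in zip(q, keys[ki]))
                  for ki in visible]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[ki][d] for w, ki in zip(weights, visible)) / total
            for d in range(len(values[0]))
        ])
    return outputs

# A new variant is one line of composition, not a new kernel.
local_causal = both(causal, sliding_window(2))
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out = masked_attention(seq, seq, seq, local_causal)
for q in range(4):
    print([int(local_causal(q, k)) for k in range(4)])
```

The variant is described as a small predicate, separate from the machinery that executes it. The same predicate can drive a slow reference like this one and an optimized path, which is what makes the result checkable.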

Andrej Karpathy recently joined Anthropic, in part to develop richer auto-research-style loops at the frontier. As he has spent the last few years showing, though, being able to cut architectures to their essence and make them composable is as important as a clever agentic setup in climbing that kind of hill.
