Principles and Methodologies for Serial Performance Optimization

原始链接: https://danglingpointers.substack.com/p/principles-and-methodologies-for

## Serial Performance Optimization and the Rise of SysGPT

This post distills 477 research papers into eight core techniques for optimizing *serial* code (the portions of a program that cannot be parallelized): amortization, memoization, precomputation, deferring, approximation, specialization, hardware acceleration, and reducing abstraction overhead. The author initially presents these techniques as fundamental skills of the performance engineer, neatly reinforcing the reader's existing expertise.

The post then introduces SysGPT, a model based on GPT-4o and fine-tuned on optimization cases drawn from SOSP and OSDI papers, categorized according to the eight techniques above. SysGPT clearly outperforms stock GPT-4o at proposing effective optimizations, and its suggestions often match the solutions found in the published research.

This work highlights a potential shift: skills traditionally built up through experience can now be replicated by AI. While the paper focuses on serial optimization, the author suggests extending these methods to parallelizable problems and automating the initial problem/observation stage using code analysis and profiling data. The work raises a thought-provoking question about the future role of human expertise in performance engineering.


Original Article

Principles and Methodologies for Serial Performance Optimization Sujin Park, Mingyu Guan, Xiang Cheng, and Taesoo Kim OSDI'25

This paper is a psychological trojan horse for computer nerds of a certain vintage. Every paragraph of sections 3 and 4 inflates the ego a bit more. One arrives at section 5 feeling good about their performance optimization skillset, and then one learns that these skills can be replaced by an LLM. A faint hissing sound reaches one’s ears as one’s ego deflates into a shriveled piece of plastic on the ground.

The authors reviewed 477 papers containing specific instances of serial performance optimization and boiled the techniques down into eight categories. Serial is the operative word here: this paper is about optimizing portions of code that cannot be parallelized. However, some of the techniques are applicable in computations that comprise a serial portion (the critical path) and a parallel portion.

Here are the techniques:

**Amortization**: spread a fixed cost over many items. For example, refactoring code so that some computation can be hoisted out of a (per-item) loop.
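Loop hoisting can be sketched as follows (a minimal illustration; the function names and log format are invented, not from the paper):

```python
import re

def find_errors_naive(lines):
    # Anti-pattern: the regex compilation cost is paid on every iteration.
    return [ln for ln in lines if re.compile(r"ERROR: \w+").search(ln)]

def find_errors_hoisted(lines):
    # The fixed compilation cost is paid once, amortized over all lines.
    pattern = re.compile(r"ERROR: \w+")
    return [ln for ln in lines if pattern.search(ln)]

logs = ["ok", "ERROR: disk", "ok", "ERROR: net"]
assert find_errors_naive(logs) == find_errors_hoisted(logs) == ["ERROR: disk", "ERROR: net"]
```

(CPython happens to cache compiled regexes internally, but the pattern generalizes to any fixed per-loop setup cost.)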

**Caching**: store the results of a computation in memory so that they can be reused later. Memoization is a good example.
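A memoization sketch using the standard library (Fibonacci is just a stand-in workload; it is not an example from the paper):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Each distinct n is computed once; subsequent calls hit the cache,
    # turning an exponential-time recursion into a linear-time one.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

assert fib(40) == 102334155  # completes instantly with the cache
```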

**Precomputation**: compute something earlier than is necessary (possibly speculatively). This works in programs which alternate between serial and parallel portions. If the precomputation can be done during a parallel portion, then it can be thought of as “free” because it is off of the critical path.
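A small sketch of the idea: build a lookup table up front (off the critical path), so that the hot loop does a table lookup instead of a computation. The popcount workload is illustrative, not from the paper.

```python
# Precomputed once, ahead of time: popcount for every possible byte value.
POPCOUNT = [bin(b).count("1") for b in range(256)]

def popcount_bytes(data: bytes) -> int:
    # On the critical path, each byte costs only a table lookup.
    return sum(POPCOUNT[b] for b in data)

assert popcount_bytes(b"\xff\x0f\x00") == 12
```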

**Deferring**: act like one of my high schoolers: don't do work now when it could be done later. This works great in late spring, when a teacher realizes they have assigned too much work for the year and cancels an assignment that most students (not mine) have already completed. Deferring interacts with other techniques:

  • Similar to precomputing, deferring can move work off of the critical path

  • Deferring can enable batching by deferring work until a large batch has been accumulated
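The deferring-enables-batching interaction can be sketched like this (the class and its parameters are hypothetical, not an API from the paper):

```python
class BatchedWriter:
    """Defer individual writes and flush them to the sink as one batch."""

    def __init__(self, sink, batch_size=4):
        self.sink = sink
        self.batch_size = batch_size
        self.buf = []
        self.flushes = 0  # counts how many (expensive) sink operations occurred

    def write(self, item):
        self.buf.append(item)            # defer: no per-item sink call
        if len(self.buf) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buf:
            self.sink.extend(self.buf)   # one amortized sink call per batch
            self.buf.clear()
            self.flushes += 1

out = []
w = BatchedWriter(out, batch_size=4)
for i in range(10):
    w.write(i)
w.flush()
assert out == list(range(10)) and w.flushes == 3  # 10 writes, only 3 sink calls
```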

**Approximation**: compute a quick and dirty approximation of the right answer rather than the exactly right answer.
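One concrete form of this trade-off is sampling: estimate an aggregate from a strided subset instead of touching every element (a toy sketch, not an example from the paper):

```python
def approx_mean(values, stride=1000):
    # Quick and dirty: average every stride-th item instead of all of them,
    # doing ~1/stride of the work for a small loss in accuracy.
    sample = values[::stride]
    return sum(sample) / len(sample)

data = list(range(1_000_000))            # exact mean: 499999.5
estimate = approx_mean(data)             # inspects only 1000 items
assert abs(estimate - 499999.5) < 1000   # within 0.2% of the exact answer
```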

**Specialization**: make a generic component more efficient for a specific use case. For example, a library user could pass hints at runtime that give the library more information to make tradeoffs. Or profile-guided optimization could enable the compiler to make better decisions when compiling the generic library.
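The runtime-hint flavor might look like this (the function and hint name are invented for illustration): the caller asserts a property the generic code cannot assume, unlocking a faster specialized path.

```python
import bisect

def contains(items, x, sorted_hint=False):
    """Generic membership test; callers may hint that items are sorted."""
    if sorted_hint:
        # Specialized path: binary search, O(log n) instead of O(n).
        i = bisect.bisect_left(items, x)
        return i < len(items) and items[i] == x
    return x in items  # generic path: linear scan

data = list(range(0, 100, 2))  # sorted even numbers
assert contains(data, 42) == contains(data, 42, sorted_hint=True) == True
assert contains(data, 43, sorted_hint=True) is False
```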

**Hardware acceleration**: use a hardware accelerator to avoid paying the Turing Tax.

**Reducing abstraction overhead**: chip away at performance inefficiencies caused by the abstractions in a layered software architecture. For example, DPDK allows applications to bypass many networking abstractions.
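A toy analogue of layer bypass (not DPDK itself): paying a per-byte call through an I/O abstraction versus reading in bulk and processing the raw buffer directly.

```python
import io

def checksum_layered(stream):
    # Pays per-call abstraction overhead: one read() call per byte.
    total = 0
    while (b := stream.read(1)):
        total = (total + b[0]) & 0xFFFFFFFF
    return total

def checksum_direct(stream):
    # Bypasses the per-byte layer: one bulk read, then tight arithmetic.
    return sum(stream.read()) & 0xFFFFFFFF

payload = bytes(range(256)) * 10
assert checksum_layered(io.BytesIO(payload)) == checksum_direct(io.BytesIO(payload))
```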

After finishing section 4 of the paper, I felt pretty good. My ego happily agreed with a world view that says that serial performance optimization can be expressed in terms of 8 “basis vectors”, all of which I had extensive experience with.

And then came section 5.

The authors fine-tuned GPT-4o with a dataset derived from analyzing SOSP and OSDI papers. Each item in the dataset comprises a problem description, observations, and solutions (in terms of the 8 techniques described earlier). The authors made data and scripts available here. The fine-tuned model is called SysGPT.

Table 4 shows example inputs (problems + observations), the output from GPT-4o, the output from SysGPT, and the actual solution from a real-world paper. Here is an example:


Table 5 has quantitative results, where each model is only asked to choose a subset of the 8 optimization strategies for a given problem (N represents the number of example problems given to the baseline GPT-4o model in a prompt):

It would be interesting to extend these recipes to also include problems which can be parallelized.

This paper assumes that the problem and observations are produced by humans. It would be fascinating to see how much of that can be automated. For example, a model could have access to source code and profiling information, and output a set of observations about system performance before optimization.
