Abstraction boundaries are optimization boundaries

原始链接: https://blog.snork.dev/posts/abstraction-boundaries-are-optimization-boundaries.html

The N+1 query problem arises when code inefficiently makes one database query per item in a collection, instead of a single bulk query. Traditional solutions involve explicitly instructing the ORM to fetch data in bulk. However, an alternative approach is to elevate the abstraction boundary, integrating the ORM more deeply into the language itself. This allows the compiler to understand and optimize ORM operations by defining rewrite rules, effectively merging the N queries into one. This is similar to Haskell's stream fusion optimization using rewrite rules for lists. The key principle is that by raising the abstraction boundary, we simultaneously expand the scope for optimization. This works best in declarative or pure languages where low-level execution details are abstracted away, enabling more aggressive compiler optimizations across the entire system, including database interactions.

这篇 Hacker News 讨论串探讨了抽象边界和优化之间的关系，尤其是在对象关系映射器 (ORM) 和数据库查询的背景下。原文认为，抽象边界可能会阻碍优化，因为像 ORM 这样的工具可能无法预测和优化复杂的查询模式，从而导致 N+1 查询等问题。评论者们同意核心观点，但提出了细微之处。一些人指出，像 Haskell 这样的语言由于其声明式特性，可以在模块边界进行优化。SQL 本身就被强调为一种声明式语言，允许关系数据库管理系统 (RDBMS) 执行大量的优化。另一些人认为，即使在像 Python 这样的语言中使用 Django 的 ORM，提高抽象级别也可以通过允许延迟计算和查询联合提供更多优化机会。讨论还涉及编译器在理解数据存储细节方面的局限性，以及在 ORM 设计中平衡封装和效率的挑战。一些人建议，更好的抽象接口应该关注能力而不是实现，并且像 Haxl 这样的替代方案可以减轻这些问题。

The N+1 query problem occurs when your application code sends one SQL query per element in a collection. The N queries are redundant; since all of the data is in the database already, a single query should be enough.

This problem is usually caused by a leaky abstraction; the ORM, or whatever database abstraction you are using, can’t anticipate that it would need to send N queries, so it can’t automatically optimize this down to a single query. The solution is usually to move the abstraction boundary down, and explicitly tell the ORM that you will need to fetch a set of rows in bulk, rather than repeatedly fetching single rows.

But what if we could do it the other way around? Can we solve the problem by moving the abstraction boundary up instead?

The problem with the ORM example is that the compiler doesn’t know what the ORM does, and as such it can’t optimize it; to the compiler it’s just another piece of userland code. However, what if we raise the abstraction boundary and make the ORM part of the language? This means that we could formulate rewrite rules for the ORM, allowing it to eg merge the N queries into a single query.

There are other examples of this: Haskell has support for adding rewrite rule pragmas to libraries; among other things this is used to optimize list operations using stream fusion. However, this only works since Haskell is declarative / pure; the low-level operational semantics (like evaluation order) are abstracted away from the programmer, and as such are amenable to optimization.

I think there’s an interesting pattern here: By raising the abstraction boundary we have also raised the optimization boundary.