It's been a while since I've written about CPython internals and its optimizations. My last article on garbage collection was written 8 years ago.
A lot of small optimizations have been added since then. In this article, I will highlight a new reference counting optimization that relies on static lifetime analysis.
Background on reference counting in CPython
Reference counting is the primary memory management technique used in CPython.
In short, every Python object (the actual value behind a variable) has a reference counter field that tracks how many references point to it. When an object's reference count drops to zero, the memory occupied by that object is immediately deallocated.
For hot loops, this can lead to significant overhead due to the frequent incrementing and decrementing of reference counts. The counter must be updated whenever a reference to an object is created or dropped, which hurts performance and trashes CPU caches.
So, when you read a variable in Python, you actually write to memory as well.
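You can watch the counter change with sys.getrefcount (a minimal illustration; the exact numbers may vary between CPython versions and builds, and getrefcount itself adds one temporary reference for its argument):

import sys

x = object()                 # a fresh object, referenced only by x
print(sys.getrefcount(x))    # typically 2: x plus getrefcount's argument

y = x                        # a second reference to the same object
print(sys.getrefcount(x))    # typically 3

del y                        # dropping a reference decrements the count again
print(sys.getrefcount(x))    # back to 2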
For more details, you can read my article on garbage collection.
New bytecode instruction
Since Python 3.14, there is a new bytecode instruction called LOAD_FAST_BORROW.
This is an optimized instruction that avoids incrementing the reference count when loading local variables.
Let's look at this simple function:
def advance(vx, vy, steps):
    x, y = 1, 1
    for _ in range(steps):
        x *= vx
        y *= vy
    return x, y
Since we read the values of vx and vy on every iteration, their reference counts are incremented and decremented repeatedly.
Now, let's look at the bytecode output for different Python versions.
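The listings below come from the dis module; assuming the advance function defined above, you can reproduce them with:

import dis

dis.dis(advance)    # prints the bytecode of the function defined above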
Python 3.13:
7 RESUME 0
8 LOAD_CONST 1 ((1, 1))
UNPACK_SEQUENCE 2
STORE_FAST_STORE_FAST 52 (x, y)
9 LOAD_GLOBAL 1 (range + NULL)
LOAD_FAST 2 (steps)
CALL 1
GET_ITER
L1: FOR_ITER 11 (to L2)
STORE_FAST 5 (_)
10 LOAD_FAST_LOAD_FAST 48 (x, vx)
BINARY_OP 18 (*=)
STORE_FAST 3 (x)
11 LOAD_FAST_LOAD_FAST 65 (y, vy)
BINARY_OP 18 (*=)
STORE_FAST 4 (y)
JUMP_BACKWARD 13 (to L1)
9 L2: END_FOR
POP_TOP
12 LOAD_FAST_LOAD_FAST 52 (x, y)
BUILD_TUPLE 2
RETURN_VALUE
Python 3.15:
7 RESUME 0
8 LOAD_SMALL_INT 1
LOAD_SMALL_INT 1
STORE_FAST_STORE_FAST 67 (y, x)
9 LOAD_GLOBAL 1 (range + NULL)
LOAD_FAST_BORROW 2 (steps)
CALL 1
GET_ITER
L1: FOR_ITER 19 (to L2)
STORE_FAST 5 (_)
10 LOAD_FAST_BORROW_LOAD_FAST_BORROW 48 (x, vx)
BINARY_OP 18 (*=)
STORE_FAST 3 (x)
11 LOAD_FAST_BORROW_LOAD_FAST_BORROW 65 (y, vy)
BINARY_OP 18 (*=)
STORE_FAST 4 (y)
JUMP_BACKWARD 21 (to L1)
9 L2: END_FOR
POP_ITER
12 LOAD_FAST_BORROW_LOAD_FAST_BORROW 52 (x, y)
BUILD_TUPLE 2
RETURN_VALUE
The key difference for our purposes is that the LOAD_FAST instructions were replaced with LOAD_FAST_BORROW.
LOAD_FAST_BORROW_LOAD_FAST_BORROW is just a combined instruction that loads two variables with a single opcode.
The LOAD_FAST opcode means "push a local variable onto the stack and increment its reference count".
The LOAD_FAST_BORROW opcode does the same, but without incrementing the reference count.
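Conceptually, the difference looks like the toy model below. This is only an illustrative sketch in Python; the real implementation lives in CPython's C interpreter loop and operates on internal stack references, not on Python objects like these.

# Toy model of the two opcodes, not CPython's actual implementation.
class Obj:
    def __init__(self, value):
        self.value = value
        self.refcount = 1        # the reference held by the frame's local slot

def load_fast(local, stack):
    local.refcount += 1          # LOAD_FAST: the stack gets its own strong reference
    stack.append(local)

def load_fast_borrow(local, stack):
    stack.append(local)          # LOAD_FAST_BORROW: borrowed, the count is untouched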
This optimization can only be applied when the following criteria are met:
/*
* Strength reduce LOAD_FAST{_LOAD_FAST} instructions into faster variants that
* load borrowed references onto the operand stack.
*
* This is only safe when we can prove that the reference in the frame outlives
* the borrowed reference produced by the instruction. We make this tractable
* by enforcing the following lifetimes:
*
* 1. Borrowed references loaded onto the operand stack live until the end of
* the instruction that consumes them from the stack. Any borrowed
* references that would escape into the heap (e.g. into frame objects or
* generators) are converted into new, strong references.
*
* 2. Locals live until they are either killed by an instruction
* (e.g. STORE_FAST) or the frame is unwound. Any local that is overwritten
* via `f_locals` is added to a tuple owned by the frame object.
*
* To simplify the problem of detecting which supporting references in the
* frame are killed by instructions that overwrite locals, we only allow
* borrowed references to be stored as a local in the frame if they were passed
* as an argument. {RETURN,YIELD}_VALUE convert borrowed references into new,
* strong references.
*
* Using the above, we can optimize any LOAD_FAST{_LOAD_FAST} instructions
* that meet the following criteria:
*
* 1. The produced reference must be consumed from the stack before the
* supporting reference in the frame is killed.
*
* 2. The produced reference cannot be stored as a local.
*
* We use abstract interpretation to identify instructions that meet these
* criteria. For each basic block, we simulate the effect the bytecode has on a
* stack of abstract references and note any instructions that violate the
* criteria above. Once we've processed all the instructions in a block, any
* non-violating LOAD_FAST{_LOAD_FAST} can be optimized.
*/
Which, in simpler terms, means (see the example after this list):
- The value must only be used within one basic block (e.g. no conditionals in between).
- The value is used for simple operations (e.g. math) and not assigned to other variables.
- The source value must not be changed during the execution.
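To see which loads get the borrowed variant in your interpreter, you can disassemble a small function yourself. The following is a hypothetical example of my own; the exact opcodes depend on your CPython version, so treat the comments as expectations rather than guarantees.

import dis

def alias(a):
    b = a           # the loaded reference is stored into another local,
                    # so this load is expected to stay a plain LOAD_FAST
    return b + 1    # this load of b is consumed immediately by the addition,
                    # so it is a candidate for LOAD_FAST_BORROW

dis.dis(alias)      # compare LOAD_FAST vs LOAD_FAST_BORROW in the output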
For those who are familiar with Rust lifetimes, this will sound very similar. Python implements a basic version of lifetime analysis and immutable borrowing. To do so, CPython uses control flow graphs to analyze the lifetime of local variables at compile time.
In my code example, you may notice that the x and y variables are also borrowed, even though they are reassigned on every iteration.
This is safe because the borrowed reference is consumed by the multiplication before the following STORE_FAST overwrites the local, so the reference held by the frame outlives the borrowed one.
Each variable is only borrowed once per iteration, to load the current value for the multiplication.
When borrowing happens, the reference pushed onto the stack is tagged as "borrowed", so that reference counting operations on it have no effect.
This is necessary because the reference count is not incremented by LOAD_FAST_BORROW, and the decrement that many opcodes perform when they consume their operands could otherwise destroy the object prematurely.
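As a rough sketch of that mechanism (in CPython it is a tag bit inside the internal stack reference type, not a Python attribute), only owned references are decremented when a stack slot is closed:

# Sketch only: "borrowed" is modelled as a boolean next to a fake refcount.
class FakeObject:
    def __init__(self):
        self.refcount = 1            # the reference owned by the frame's local slot

class StackRef:
    def __init__(self, obj, borrowed):
        self.obj = obj
        self.borrowed = borrowed

    def close(self):
        # Popping an owned reference decrements the count; popping a borrowed
        # one is a no-op, so the object cannot be freed while the frame's
        # local still points at it.
        if not self.borrowed:
            self.obj.refcount -= 1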
Other improvements
There is also an ongoing effort to eliminate redundant reference counting in JIT-compiled code. Unlike regular bytecode, the current JIT implementation only optimizes loops and is disabled by default. When the JIT is enabled, the analysis of a specific loop is only triggered after 4000 iterations of that loop.
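If you want to experiment, you can check whether your build ships the experimental JIT. The snippet below assumes Python 3.14+, where the private sys._jit interface was added; being private and experimental, it may change or be absent in your build:

import sys

jit = getattr(sys, "_jit", None)     # private, experimental interface
if jit is not None and jit.is_available():
    # is_enabled() reports whether the JIT was turned on for this process,
    # e.g. via the PYTHON_JIT=1 environment variable on builds that support it.
    print("JIT compiled in; enabled:", jit.is_enabled())
else:
    print("this build has no experimental JIT")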