It's been a while since I've written about CPython internals and its optimizations. My last article on garbage collection was written 8 years ago.
A lot of small optimizations have been added since then. In this article, I will highlight a new reference counting optimization that relies on static lifetime analysis.
Background on reference counting in CPython
Reference counting is the primary memory management technique used in CPython.
In short, every Python object (the actual value behind a variable) has a reference counter field that tracks how many references point to it. When an object's reference count drops to zero, the memory occupied by that object is immediately deallocated.
For hot loops, this can lead to significant overhead due to the frequent incrementing and decrementing of reference counts. The counter must be updated whenever a reference to an object is created or dropped, which hurts performance and trashes CPU caches.
So, when you read a variable in Python, you actually write to memory as well.
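You can watch the counter change with sys.getrefcount (a minimal illustration; the exact numbers may vary between CPython versions and builds, and getrefcount itself adds one temporary reference for its argument):

import sys

x = object()                 # a fresh object, referenced only by x
print(sys.getrefcount(x))    # typically 2: x plus getrefcount's argument

y = x                        # a second reference to the same object
print(sys.getrefcount(x))    # typically 3

del y                        # dropping a reference decrements the count again
print(sys.getrefcount(x))    # back to 2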
For more details, you can read my article on garbage collection.
New bytecode instruction
Since Python 3.14, there is a new bytecode instruction called LOAD_FAST_BORROW.
This is an optimized instruction that avoids incrementing the reference count when loading local variables.
Let's look at this simple function:
def advance(vx, vy, steps):
    x, y = 1, 1
    for _ in range(steps):
        x *= vx
        y *= vy
    return x, y
Since we read the values of vx and vy on every iteration, their reference counts are incremented and decremented repeatedly.
Now, let's look at the bytecode output for different Python versions.
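The listings below come from the dis module; assuming the advance function defined above, you can reproduce them with:

import dis

dis.dis(advance)    # prints the bytecode of the function defined above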
Python 3.13:
7 RESUME 0
8 LOAD_CONST 1 ((1, 1))
UNPACK_SEQUENCE 2
STORE_FAST_STORE_FAST 52 (x, y)
9 LOAD_GLOBAL 1 (range + NULL)
LOAD_FAST 2 (steps)
CALL 1
GET_ITER
L1: FOR_ITER 11 (to L2)
STORE_FAST 5 (_)
10 LOAD_FAST_LOAD_FAST 48 (x, vx)
BINARY_OP 18 (*=)
STORE_FAST 3 (x)
11 LOAD_FAST_LOAD_FAST 65 (y, vy)
BINARY_OP 18 (*=)
STORE_FAST 4 (y)
JUMP_BACKWARD 13 (to L1)
9 L2: END_FOR
POP_TOP
12 LOAD_FAST_LOAD_FAST 52 (x, y)
BUILD_TUPLE 2
RETURN_VALUE
Python 3.15:
7 RESUME 0
8 LOAD_SMALL_INT 1
LOAD_SMALL_INT 1
STORE_FAST_STORE_FAST 67 (y, x)
9 LOAD_GLOBAL 1 (range + NULL)
LOAD_FAST_BORROW 2 (steps)
CALL 1
GET_ITER
L1: FOR_ITER 19 (to L2)
STORE_FAST 5 (_)
10 LOAD_FAST_BORROW_LOAD_FAST_BORROW 48 (x, vx)
BINARY_OP 18 (*=)
STORE_FAST 3 (x)
11 LOAD_FAST_BORROW_LOAD_FAST_BORROW 65 (y, vy)
BINARY_OP 18 (*=)
STORE_FAST 4 (y)
JUMP_BACKWARD 21 (to L1)
9 L2: END_FOR
POP_ITER
12 LOAD_FAST_BORROW_LOAD_FAST_BORROW 52 (x, y)
BUILD_TUPLE 2
RETURN_VALUE
The key difference for our purposes is that the LOAD_FAST instructions were replaced with LOAD_FAST_BORROW.
LOAD_FAST_BORROW_LOAD_FAST_BORROW is just a combined instruction that loads two variables with a single opcode.
The LOAD_FAST opcode means "push a local variable onto the stack and increment its reference count".
The LOAD_FAST_BORROW opcode does the same, but without incrementing the reference count.
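Conceptually, the difference looks like the toy model below. This is only an illustrative sketch in Python; the real implementation lives in CPython's C interpreter loop and operates on internal stack references, not on Python objects like these.

# Toy model of the two opcodes, not CPython's actual implementation.
class Obj:
    def __init__(self, value):
        self.value = value
        self.refcount = 1        # the reference held by the frame's local slot

def load_fast(local, stack):
    local.refcount += 1          # LOAD_FAST: the stack gets its own strong reference
    stack.append(local)

def load_fast_borrow(local, stack):
    stack.append(local)          # LOAD_FAST_BORROW: borrowed, the count is untouched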
This optimization can only be applied when the following criteria are met:
/*
* Strength reduce LOAD_FAST{_LOAD_FAST} instructions into faster variants that
* load borrowed references onto the operand stack.
*
* This is only safe when we can prove that the reference in the frame outlives
* the borrowed reference produced by the instruction. We make this tractable
* by enforcing the following lifetimes:
*
* 1. Borrowed references loaded onto the operand stack live until the end of
* the instruction that consumes them from the stack. Any borrowed
* references that would escape into the heap (e.g. into frame objects or
* generators) are converted into new, strong references.
*
* 2. Locals live until they are either killed by an instruction
* (e.g. STORE_FAST) or the frame is unwound. Any local that is overwritten
* via `f_locals` is added to a tuple owned by the frame object.
*
* To simplify the problem of detecting which supporting references in the
* frame are killed by instructions that overwrite locals, we only allow
* borrowed references to be stored as a local in the frame if they were passed
* as an argument. {RETURN,YIELD}_VALUE convert borrowed references into new,
* strong references.
*
* Using the above, we can optimize any LOAD_FAST{_LOAD_FAST} instructions
* that meet the following criteria:
*
* 1. The produced reference must be consumed from the stack before the
* supporting reference in the frame is killed.
*
* 2. The produced reference cannot be stored as a local.
*
* We use abstract interpretation to identify instructions that meet these
* criteria. For each basic block, we simulate the effect the bytecode has on a
* stack of abstract references and note any instructions that violate the
* criteria above. Once we've processed all the instructions in a block, any
* non-violating LOAD_FAST{_LOAD_FAST} can be optimized.
*/
Which, in simpler terms, means (see the example after this list):
- The value must only be used within one basic block (e.g. no conditionals in between).
- The value is used for simple operations (e.g. math) and not assigned to other variables.
- The source value must not be changed during the execution.
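To see which loads get the borrowed variant in your interpreter, you can disassemble a small function yourself. The following is a hypothetical example of my own; the exact opcodes depend on your CPython version, so treat the comments as expectations rather than guarantees.

import dis

def alias(a):
    b = a           # the loaded reference is stored into another local,
                    # so this load is expected to stay a plain LOAD_FAST
    return b + 1    # this load of b is consumed immediately by the addition,
                    # so it is a candidate for LOAD_FAST_BORROW

dis.dis(alias)      # compare LOAD_FAST vs LOAD_FAST_BORROW in the output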
For those who are familiar with Rust lifetimes, this will sound very similar. Python implements a basic version of lifetime analysis and immutable borrowing. To do so, CPython uses control flow graphs to analyze the lifetime of local variables at compile time.
In my code example, you may notice that the x and y variables are also borrowed, even though they are reassigned on every iteration.
This is safe because the borrowed reference is consumed by the multiplication before the following STORE_FAST overwrites the local, so the reference held by the frame outlives the borrowed one.
Each variable is only borrowed once per iteration, to load the current value for the multiplication.
When borrowing happens, the reference pushed onto the stack is tagged as "borrowed", so that reference counting operations on it have no effect.
This is necessary because the reference count is not incremented by LOAD_FAST_BORROW, and the decrement that many opcodes perform when they consume their operands could otherwise destroy the object prematurely.
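As a rough sketch of that mechanism (in CPython it is a tag bit inside the internal stack reference type, not a Python attribute), only owned references are decremented when a stack slot is closed:

# Sketch only: "borrowed" is modelled as a boolean next to a fake refcount.
class FakeObject:
    def __init__(self):
        self.refcount = 1            # the reference owned by the frame's local slot

class StackRef:
    def __init__(self, obj, borrowed):
        self.obj = obj
        self.borrowed = borrowed

    def close(self):
        # Popping an owned reference decrements the count; popping a borrowed
        # one is a no-op, so the object cannot be freed while the frame's
        # local still points at it.
        if not self.borrowed:
            self.obj.refcount -= 1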
Other improvements
There is also an ongoing effort to eliminate redundant reference counting in JIT-compiled code. Unlike regular bytecode, the current JIT implementation only optimizes loops and is disabled by default. When the JIT is enabled, the analysis of a specific loop is only triggered after 4000 iterations of that loop.
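If you want to experiment, you can check whether your build ships the experimental JIT. The snippet below assumes Python 3.14+, where the private sys._jit interface was added; being private and experimental, it may change or be absent in your build:

import sys

jit = getattr(sys, "_jit", None)     # private, experimental interface
if jit is not None and jit.is_available():
    # is_enabled() reports whether the JIT was turned on for this process,
    # e.g. via the PYTHON_JIT=1 environment variable on builds that support it.
    print("JIT compiled in; enabled:", jit.is_enabled())
else:
    print("this build has no experimental JIT")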