Show HN: Python at the Speed of Rust

Original link: https://blog.fxn.ai/python-at-the-speed-of-rust/

This article introduces a tool called "Function" that compiles Python code into native code, addressing Python's slow execution and the difficulty of embedding it into cross-platform applications. It walks through compiling the compute-heavy "fused multiply-add" (FMA) function used in AI. Function uses symbolic tracing to capture a function's operations as an intermediate representation (IR) graph. Combined with type annotations, the graph is lowered to native code (C in the example), allowing types to be propagated. The generated native code can then be cross-compiled for any platform. Benchmarks show the compiled Python FMA function approaching Rust's performance, with a constant-factor slowdown due to invocation overhead. Although only a proof of concept, Function promises to accelerate scientific computing, data processing, and AI workloads across many more devices, enabling Python-powered applications in areas such as monocular depth estimation, realtime pose detection, and on-device LLM inference.

The Function (fxn.ai) project aims to bridge the gap between Python models (e.g. those built in PyTorch) and their deployment, offering Rust-like performance without requiring developers to learn Rust. It does so by compiling existing Python code into standalone native binaries with a Python compiler. The key is leveraging PyTorch's symbolic tracing, which records an intermediate representation (IR) graph of a function's operations. This IR graph is then lowered to C++ (soon Rust) code, with large language models (LLMs) assisting in writing and verifying the required operations. To use Function, developers decorate a function with the `@compile` decorator and run a CLI command. While Function does not yet support all Python features (e.g. classes, lambda expressions) or external libraries that are not pure Python, the developers plan to use LLMs to automatically reimplement library functionality. Unlike Mojo, Function does not require learning a new language, and it stands to benefit from the growing capability of LLMs to generate production code. The project focuses on compiling individual functions, with possible future expansion to whole programs.

Original article

Python is the most popular programming language in the world. It is an extremely simple and accessible language, making it the go-to choice for developers across numerous domains. It is used in everything from introductory computer science classes to powering the AI revolution we're all living through.

However, Python's convenience comes with two significant drawbacks: First, running an interpreted language results in much slower execution compared to native languages like C or Rust. Second, it is incredibly difficult to embed Python-powered functions (e.g. NumPy, PyTorch) into cross-platform consumer applications (e.g. web apps, mobile).

But what if we could compile Python into raw native code?

Compiling a Toy Function

Artificial Intelligence, and particularly Large Language Models (LLMs), rely heavily on matrix multiplications. These matrix operations, at their core, utilize a fundamental operation known as fused multiply-add (FMA):

def fma(x, y, z):
    """
    Perform a fused multiply-add.
    """
    return x * y + z

Hardware vendors like Nvidia provide specialized instructions that perform the FMA in a single step, reducing computational overhead and improving numerical precision. Given that LLMs perform billions of these operations, even minor performance variations can significantly affect overall efficiency.

result = fma(x=3, y=-1, z=2)
print(result)
# -1
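
As an aside, recent Python versions expose a correctly rounded FMA directly: math.fma, added in Python 3.13, performs the multiply and the add with a single rounding. A quick sketch of the precision difference (assumes Python 3.13+):

import math

a = 1.0 + 2.0**-30
print(a * a - 1.0)           # 1.862645149230957e-09; the tiny 2**-60 term of a*a is lost to rounding
print(math.fma(a, a, -1.0))  # slightly larger; a single rounding preserves that term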

Let's explore how to compile the fma function, allowing it to run at native speed, cross-platform.

Tracing the Function

We begin by capturing all operations performed within the function as a computation graph. We call this an Intermediate Representation (IR). This IR graph explicitly represents every operation—arithmetic operations, method calls, and data accesses—making it a powerful abstraction for compilation.

To build this graph, we leverage CPython's frame evaluation API to perform Symbolic Tracing. This allows us to introspect Python bytecode execution, capturing each instruction's inputs, operations, and outputs dynamically as the function executes. By tracing each Python operation in real-time, we construct an accurate IR of the function’s logic. For example:

from torch._dynamo.eval_frame import set_eval_frame

# Define a tracer
class Tracer:

    def __call__(self, frame, _):
        print(frame.f_code, frame.f_func, frame.f_locals)

# Set the frame evaluation handler
tracer = Tracer()
set_eval_frame(tracer)

# Call the function
result = fma(x=3, y=-1, z=2)
print(result)
# <code object fma at 0x106c51ca0, file "fma.py", line 9> <function fma at 0x106ba4860> {'x': 3, 'y': -1, 'z': 2}
# -1

Skipping a few steps ahead, we end up with a graph that looks like this:

type           name       target        args
-------------- ---------- ------------- --------
input          x          x             ()
input          y          y             ()
input          z          z             ()
call_function  mul_result _operator.mul (x, y)
call_function  add_result _operator.add (mul_result, z)
output         output     output        (add_result,)
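
As a point of reference, PyTorch's public torch.fx tracer produces a very similar table for the same function. This is just a sketch for comparison, not Function's actual tracer, and printing the table requires the tabulate package:

import torch.fx

def fma(x, y, z):
    return x * y + z

# Trace the function with proxy inputs, recording each operation as a graph node
gm = torch.fx.symbolic_trace(fma)
gm.graph.print_tabular()  # prints opcode/name/target/args rows like the table above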

An astute reader might notice that in order to build the IR graph above, we need to actually invoke the fma function. And to do that, we need to pass inputs of the correct types to the function. We can simply add type annotations to our fma function and generate fake inputs to invoke it:

def fma(x: float, y: float, z: float) -> float:
    """
    Perform a fused multiply-add.
    """
    return x * y + z

Lowering to Native Code

Now the real fun begins! With our IR graph and annotated input types, we start the process of lowering the IR graph into native code. Let's take the first operation in the graph, x * y.

We can write (*ahem* generate) a corresponding implementation of the _operator.mul operation in native code. For example, here's a C implementation:

float _operator_mul (float x, float y) {
    return x * y;
}

Notice that because of the return type of the native implementation above, the type of mul_result is now constrained to be a float. Zooming out, this means that given inputs with known types (i.e. from type annotations in Python) along with a native implementation of a Python operation, we can fully determine the native type of the operation's outputs. By repeating this process for subsequent operations in our IR graph, we can propagate native types through our entire Python function.
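
To make that concrete, here is a toy sketch of the propagation pass: walk the IR nodes in order and assign each value a native type from a table of native operation signatures. The NATIVE_SIGNATURES table and the graph encoding are hypothetical, purely for illustration:

import operator

# Native return type for each supported operation, keyed by (op, input types)
NATIVE_SIGNATURES = {
    (operator.mul, ("float", "float")): "float",
    (operator.add, ("float", "float")): "float",
}

def propagate_types(graph, input_types):
    """Walk the IR nodes in order, assigning a native type to every value."""
    types = dict(input_types)
    for name, target, args in graph:
        arg_types = tuple(types[arg] for arg in args)
        types[name] = NATIVE_SIGNATURES[(target, arg_types)]
    return types

# The fma graph from earlier, encoded as (name, target, args) tuples
graph = [
    ("mul_result", operator.mul, ("x", "y")),
    ("add_result", operator.add, ("mul_result", "z")),
]
print(propagate_types(graph, {"x": "float", "y": "float", "z": "float"}))
# {'x': 'float', 'y': 'float', 'z': 'float', 'mul_result': 'float', 'add_result': 'float'}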

We can now cross-compile this native implementation for any platform we want (WebAssembly, Linux, Android, and much more). And that's how we get Python to run as fast as Rust—and run everywhere!

Compiling the Function

Let's use Function to compile the fma function based on the above process. First, install Function for Python:

# Run this in Terminal
$ pip install --upgrade fxn

Next, decorate the fma function with @compile:

from fxn import compile

@compile(
    tag="@yusuf/fma",
    description="Fused multiply-add."
)
def fma(x: float, y: float, z: float) -> float:
    """
    Perform a fused multiply-add.
    """
    return x * y + z

To compile a function with Function, use the @compile decorator.

Finally, compile the function using the Function CLI:

# Run this in Terminal
$ fxn compile fma.py

Compiling the function with the Function CLI.
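
Once compiled, the function can be invoked from Python through fxn's prediction API. The sketch below reflects my reading of the fxn SDK; treat the exact calls as an assumption:

from fxn import Function

# The client reads the FXN_ACCESS_KEY environment variable by default
fxn = Function()

# Run the compiled function by tag (assumed API: fxn.predictions.create)
prediction = fxn.predictions.create(
    tag="@yusuf/fma",
    inputs={ "x": 3.0, "y": -1.0, "z": 2.0 }
)
print(prediction.results[0])
# -1.0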

Let's Benchmark!

First, let's modify our fma function to perform the fused multiply-add repeatedly:

def fma(x: float, y: float, z: float, n_iter: int) -> float:
    result = 0.0
    for _ in range(n_iter):
        result = x * y + z
    return result

Next, we'll create an equivalent implementation in Rust:

use std::os::raw::c_int;

#[no_mangle]
pub extern "C" fn fma(x: f32, y: f32, z: f32, n_iter: c_int) -> f32 {
    let mut result = 0.0;
    for _ in 0..n_iter {
        result = x * y + z;
    }
    result
}
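
For a rough reproduction of the comparison without Function's scaffolding, the Rust function can be built as a cdylib and timed against the pure-Python loop via ctypes. A sketch; the library path is hypothetical and platform-dependent (e.g. .dylib on macOS, .so on Linux):

import ctypes
import timeit

# Assumes `cargo build --release` with crate-type = ["cdylib"] in Cargo.toml
lib = ctypes.CDLL("./target/release/libfma.dylib")  # hypothetical path
lib.fma.argtypes = [ctypes.c_float, ctypes.c_float, ctypes.c_float, ctypes.c_int]
lib.fma.restype = ctypes.c_float

def fma(x: float, y: float, z: float, n_iter: int) -> float:
    result = 0.0
    for _ in range(n_iter):
        result = x * y + z
    return result

N = 1_000_000
py_time = timeit.timeit(lambda: fma(3.0, -1.0, 2.0, N), number=10)
rs_time = timeit.timeit(lambda: lib.fma(3.0, -1.0, 2.0, N), number=10)
print(f"Python: {py_time:.3f}s  Rust: {rs_time:.3f}s")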

After compiling both, I benchmarked them on my MacBook Pro.

The compiled Python benchmark is slower than Rust by a constant factor because Function has extra scaffolding to invoke a prediction function, whereas the Rust implementation uses a direct call. You can inspect the generated native code and reproduce the benchmark with this repository:

GitHub - olokobayusuf/python-vs-rust: Python at the speed of Rust using the Function compiler.


Wrapping Up

The prospect of being able to compile Python is very exciting to us. It means that we can accelerate scientific computing, realtime data processing, and AI workloads to run on many more devices—all from the convenience of Python.

Our compiler is still a proof-of-concept, but with it our design partners have been shipping applications into production, powering everything from monocular depth estimation to realtime pose detection. Up next? On-device LLM inference. Join the conversation:

Join the Function Discord Server!

