Show HN: cuTile Rust: Safe, data-race-free GPU kernels in Rust

Original link: https://github.com/nvlabs/cutile-rs

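**cuTile Rust** is a research project for writing memory-safe, data-race-free GPU kernels in idiomatic Rust. It extends Rust's ownership model across the GPU boundary, keeping memory access safe through tensor partitioning and explicit sharing. The `#[cutile::module]` macro captures the Rust AST (abstract syntax tree) of each kernel, which is then JIT-compiled through CUDA Tile IR into a GPU cubin. Synchronous, asynchronous, and CUDA graph execution models are all supported. Performance is competitive: on an NVIDIA B200 GPU, cuTile Rust reaches about 91% of peak memory bandwidth and 92% of dense FP16 peak, matching the low-level implementation with no measurable safety overhead. The project is under active development and includes real applications such as **Grout**, a high-performance Qwen3 inference engine. It mainly targets advanced users (sm_80+ hardware is required and CUDA 13.3 is recommended), but it shows that modern, safe language abstractions can deliver top-tier GPU performance. The project is open source (Apache 2.0, except for the cuda-bindings crate, which is under the NVIDIA Software License) and welcomes community feedback to help shape its future in the Rust ecosystem. For details, see the paper "Fearless Concurrency on the GPU" (arXiv:2606.15991).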

Original text

cuTile Rust (cutile-rs) is a tile-based system for writing memory-safe, data-race-free GPU kernels in idiomatic Rust. It extends Rust's ownership discipline across the GPU launch boundary: mutable tensors are partitioned into disjoint pieces before launch, immutable tensors are shared, and generated launchers preserve ownership while GPU work is in flight. The same model supports synchronous launches, asynchronous pipelines, and CUDA graph replay. The #[cutile::module] macro embeds a captured Rust AST for each kernel in the host binary; when a kernel is needed, cuTile Rust JIT-compiles that AST through CUDA Tile IR into a GPU cubin. Local opt-outs remain available when lower-level control is needed.

We are excited to release this research project as a demonstration of how GPU programming can be made available in the Rust ecosystem. The software is in an early stage and under active development: you should expect bugs, incomplete features, and API breakage as we work to improve it. That being said, we hope you'll be interested to try it in your work and help shape its direction by providing feedback on your experience.

Please check out CONTRIBUTING.md if you're interested in contributing.

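```rust
use cutile::prelude::*;

#[cutile::module]
mod kernel {
    use cutile::core::*;

    // `z` is the exclusive mutable output; `x` and `y` are shared read-only inputs.
    #[cutile::entry()]
    fn add<const B: i32>(
        z: &mut Tensor<f32, { [B] }>,
        x: &Tensor<f32, { [-1] }>,
        y: &Tensor<f32, { [-1] }>,
    ) {
        // Load the input tiles that match this output partition.
        let tx = load_tile_like(x, z);
        let ty = load_tile_like(y, z);
        z.store(tx + ty);
    }
}

fn main() -> Result<(), Error> {
    let x = api::ones::<f32>(&[1024]);
    let y = api::ones::<f32>(&[1024]);
    // Partition the mutable output into 128-element chunks.
    let z = api::zeros::<f32>(&[1024]).partition([128]);

    // `.sync()` JIT-compiles and executes the kernel, then returns the tensors.
    let (_z, _x, _y) = kernel::add(z, x, y).sync()?;
    Ok(())
}
```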

The #[cutile::module] macro transforms add into a GPU kernel and generates a host-side launcher. The host code constructs lazy tensor operations, partitions the mutable output into 128-element chunks, and calls .sync() to JIT-compile and execute the kernel.
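The way the launcher takes the tensors by value and hands them back from `.sync()` can be mimicked on the CPU. The following is a minimal sketch using only the standard library, not cuTile Rust's implementation: `launch_add` is a hypothetical helper, and a plain thread stands in for in-flight GPU work.

```rust
use std::thread;

/// Hypothetical stand-in for a generated launcher: it takes ownership of every
/// buffer, runs the work on another thread, and returns the buffers on join.
fn launch_add(
    mut z: Vec<f32>,
    x: Vec<f32>,
    y: Vec<f32>,
) -> thread::JoinHandle<(Vec<f32>, Vec<f32>, Vec<f32>)> {
    thread::spawn(move || {
        for ((zi, xi), yi) in z.iter_mut().zip(&x).zip(&y) {
            *zi = xi + yi;
        }
        (z, x, y)
    })
}

fn main() {
    let x = vec![1.0_f32; 1024];
    let y = vec![1.0_f32; 1024];
    let z = vec![0.0_f32; 1024];

    let pending = launch_add(z, x, y);
    // `z`, `x`, and `y` were moved into the launch, so using them here would be
    // a compile error rather than a data race.

    // Joining plays the role of `.sync()`: ownership comes back with the result.
    let (z, _x, _y) = pending.join().expect("worker thread panicked");
    assert!(z.iter().all(|&v| v == 2.0));
    println!("z[0] = {}", z[0]);
}
```

Because the buffers are moved rather than borrowed, the host cannot read or mutate them while the work is pending, which is the property the launcher is described as preserving.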

The kernel signature carries the access discipline into device code: z is the exclusive mutable output, while x and y are shared read-only inputs. The body loads input tiles matching the output partition, adds them, and stores the result. The launch grid (8, 1, 1) is inferred from the partition: 1024÷128 = 8 tiles.
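The partitioning and grid arithmetic described above have a close CPU analogue in `chunks_mut`, which also splits a mutable buffer into disjoint pieces. This sketch assumes nothing about cuTile Rust's internals; `grid_dim` and `add_tiled` are illustrative names, not part of its API.

```rust
/// Number of tiles along one dimension (ceiling division).
fn grid_dim(len: usize, tile: usize) -> usize {
    assert!(tile > 0, "tile size must be non-zero");
    len.div_ceil(tile)
}

/// CPU analogue of the `add` kernel: `z` is split into disjoint mutable tiles,
/// while `x` and `y` are only ever borrowed immutably.
fn add_tiled(z: &mut [f32], x: &[f32], y: &[f32], tile: usize) {
    assert_eq!(z.len(), x.len());
    assert_eq!(z.len(), y.len());
    for (tile_idx, z_tile) in z.chunks_mut(tile).enumerate() {
        let start = tile_idx * tile;
        let end = start + z_tile.len();
        for ((zi, xi), yi) in z_tile.iter_mut().zip(&x[start..end]).zip(&y[start..end]) {
            *zi = xi + yi;
        }
    }
}

fn main() {
    let x = vec![1.0_f32; 1024];
    let y = vec![1.0_f32; 1024];
    let mut z = vec![0.0_f32; 1024];

    // 1024 elements in 128-element tiles give a grid of (8, 1, 1).
    let grid = (grid_dim(z.len(), 128), 1, 1);
    assert_eq!(grid, (8, 1, 1));

    add_tiled(&mut z, &x, &y, 128);
    assert!(z.iter().all(|&v| v == 2.0));
    println!("grid = {:?}", grid);
}
```

Each tile is a separate `&mut [f32]`, so no two tiles can alias; the borrow checker enforces on the CPU the same disjointness the partition provides for the kernel's output.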

Run a similar example via cargo run -p cutile-examples --example saxpy.
More kernels and usage examples of the host-side API can be found here.

The cuTile Rust paper, Fearless Concurrency on the GPU, is available here. On NVIDIA B200, cuTile Rust reaches 7 TB/s for element-wise operations and 2 PFlop/s for GEMM, about 91% of peak memory bandwidth and 92% of dense f16 peak, respectively. The GEMM result is competitive with cuBLAS, and the B200 safety-overhead microbenchmarks show that cuTile Rust adds safety without measurable runtime overhead: safe Rust persistent GEMM reaches 2.07 PFlop/s at M=N=K=8192 (92% of the B200 dense f16 peak), within 0.3% of the corresponding low-level Tile IR variant.
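As a quick consistency check on the GEMM figures, the arithmetic can be worked through directly. This sketch uses only the numbers quoted above (2.07 PFlop/s, 92%, M=N=K=8192) plus the standard 2·M·N·K operation count for a dense GEMM; the helper names are illustrative.

```rust
/// Floating-point operations in a dense M x N x K GEMM
/// (one multiply and one add per inner-product term).
fn gemm_flops(m: u64, n: u64, k: u64) -> f64 {
    2.0 * (m as f64) * (n as f64) * (k as f64)
}

/// Peak rate implied by an achieved rate and the fraction of peak it represents.
fn implied_peak(achieved_flops_per_s: f64, fraction_of_peak: f64) -> f64 {
    achieved_flops_per_s / fraction_of_peak
}

/// Time for one GEMM at a sustained rate.
fn seconds_per_gemm(flops: f64, flops_per_s: f64) -> f64 {
    flops / flops_per_s
}

fn main() {
    let flops = gemm_flops(8192, 8192, 8192); // 2 * 8192^3 = 2^40 FLOP
    let achieved = 2.07e15; // FLOP/s, as quoted in the text
    let peak = implied_peak(achieved, 0.92); // about 2.25e15 FLOP/s
    let ms = seconds_per_gemm(flops, achieved) * 1e3; // about 0.53 ms

    println!("FLOP per GEMM : {:.4e}", flops);
    println!("implied peak  : {:.3e} FLOP/s", peak);
    println!("time per GEMM : {:.3} ms", ms);
}
```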

The paper also evaluates Grout, a Qwen3 inference engine built with cuTile Rust in collaboration with Hugging Face. In batch-1 Qwen3 decode, Grout reaches 171 tokens/s for Qwen3-4B on NVIDIA GeForce RTX 5090 and 82 tokens/s for Qwen3-32B on B200, showing competitive state-of-the-art performance on memory-bound inference tasks as measured by our HBM roofline analysis.
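The roofline framing for batch-1 decode can be illustrated with a back-of-the-envelope bound: if each generated token has to stream every weight from device memory once, throughput cannot exceed memory bandwidth divided by weight bytes. The sketch below is an illustration under loud assumptions, not the paper's analysis: the bandwidth (1.792 TB/s), parameter count (4 billion), and 2-byte weights are placeholder inputs that do not come from the source, and KV-cache and activation traffic are ignored.

```rust
/// Upper bound on batch-1 decode throughput when every token must read
/// all model weights from device memory exactly once.
fn roofline_tokens_per_s(bandwidth_bytes_per_s: f64, weight_bytes: f64) -> f64 {
    bandwidth_bytes_per_s / weight_bytes
}

/// Fraction of the bandwidth bound that a measured throughput reaches.
fn fraction_of_bound(measured_tokens_per_s: f64, bound_tokens_per_s: f64) -> f64 {
    measured_tokens_per_s / bound_tokens_per_s
}

fn main() {
    // ASSUMPTIONS for illustration only (not taken from the paper):
    let bandwidth = 1.792e12; // bytes/s of device memory bandwidth
    let params = 4.0e9; // parameter count
    let bytes_per_param = 2.0; // 16-bit weights

    let bound = roofline_tokens_per_s(bandwidth, params * bytes_per_param);
    let measured = 171.0; // tokens/s for Qwen3-4B, as quoted in the text

    println!("bandwidth bound : {:.0} tokens/s", bound);
    println!("fraction reached: {:.1}%", 100.0 * fraction_of_bound(measured, bound));
}
```

Swapping in the actual device bandwidth and checkpoint size gives the corresponding bound for any other model and GPU pair.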

Reproducibility artifacts for the paper evaluation are available here. The paper-facing measurements were run against cuTile Rust 0.2.0, and the version of Grout used for the paper is available here.

If you use cuTile Rust in research, please cite the paper:

```bibtex
@misc{elibol2026fearlessconcurrencygpu,
  title = {Fearless Concurrency on the GPU},
  author = {Elibol, Melih and Roesch, Jared and Gelado, Isaac and Buehler, Eric and Garland, Michael},
  year = {2026},
  eprint = {2606.15991},
  archivePrefix = {arXiv},
  primaryClass = {cs.PL},
  url = {https://arxiv.org/abs/2606.15991}
}
```

Related Projects and References

Grout: Qwen3 inference engine in Rust by Hugging Face, built with cuTile Rust and useful as a reference for production kernel call sites.
cuTile Python: Python kernel programming with CUDA Tile.
TileGym: CUDA Tile kernel examples and tuning patterns.
cuda-oxide: NVlabs experimental Rust-to-CUDA compiler for writing SIMT-style GPU kernels in Rust.
CUDA Tile IR documentation: CUDA Tile IR reference documentation.
CUDA documentation: CUDA toolkit documentation.
Rust NVPTX backend: rustc's target support for generating PTX for NVIDIA GPUs.

cuTile Rust targets tile-based kernels that lower through CUDA Tile IR, with APIs built around tensor partitions and tensor-core-oriented operations.

- NVIDIA GPU with compute capability sm_80 or higher.
  - sm_100+ is supported by CUDA 13.1+.
  - sm_8x support was added in CUDA 13.2.
  - sm_90 support was added in CUDA 13.3, so CUDA 13.3 covers everything from sm_80 up.
- CUDA 13.3 recommended (full sm_80+ support, plus CUDA Tile IR 13.3 features such as FP4 packing and block-scaled MMA).
- Rust 1.89+
- Linux (tested on Ubuntu 24.04)
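The CUDA version and architecture rules listed above can be encoded as a small lookup. This is a sketch of that list only, assuming nothing beyond it; `is_supported` is an illustrative helper, not something cuTile Rust provides.

```rust
/// Whether a CUDA toolkit version `(major, minor)` can target compute
/// capability `sm` (e.g. 80, 86, 90, 100), according to the list above.
fn is_supported(cuda: (u32, u32), sm: u32) -> bool {
    let min_cuda = match sm {
        s if s >= 100 => (13, 1), // sm_100+ since CUDA 13.1
        90..=99 => (13, 3),       // sm_90 since CUDA 13.3
        80..=89 => (13, 2),       // sm_8x since CUDA 13.2
        _ => return false,        // below the sm_80 minimum
    };
    // Tuples compare lexicographically: major first, then minor.
    cuda >= min_cuda
}

fn main() {
    for (cuda, sm) in [((13, 1), 100), ((13, 2), 86), ((13, 2), 90), ((13, 3), 90)] {
        println!(
            "CUDA {}.{} + sm_{}: {}",
            cuda.0,
            cuda.1,
            sm,
            if is_supported(cuda, sm) { "supported" } else { "not supported" }
        );
    }
}
```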

To install Rust:

```sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup default stable
```

Install CUDA 13.3 for your OS by following the official instructions: https://developer.nvidia.com/cuda-downloads

Set CUDA_TOOLKIT_PATH to your CUDA 13.3 install directory.

Example .cargo/config.toml:

```toml
[env]
CUDA_TOOLKIT_PATH = { value = "/usr/local/cuda-13", relative = false }
```
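Before building, it can help to confirm that the variable actually points at a directory. The following is a hypothetical standalone check, not how cuTile Rust itself reads `CUDA_TOOLKIT_PATH`; `resolve_toolkit` is an illustrative name.

```rust
use std::env;
use std::path::PathBuf;

/// Validate an optional `CUDA_TOOLKIT_PATH` value and return it as a path.
fn resolve_toolkit(value: Option<&str>) -> Result<PathBuf, String> {
    let raw = value.ok_or("CUDA_TOOLKIT_PATH is not set")?;
    if raw.trim().is_empty() {
        return Err("CUDA_TOOLKIT_PATH is empty".to_string());
    }
    let path = PathBuf::from(raw);
    if !path.is_dir() {
        return Err(format!("{} is not a directory", path.display()));
    }
    Ok(path)
}

fn main() {
    // Read the variable from the process environment, if present.
    let value = env::var("CUDA_TOOLKIT_PATH").ok();
    match resolve_toolkit(value.as_deref()) {
        Ok(path) => println!("using CUDA toolkit at {}", path.display()),
        Err(msg) => eprintln!("warning: {msg}"),
    }
}
```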

Run the hello world example:

```sh
cargo run -p cutile-examples --example hello_world
```

If everything works, you should see: Hello, I am tile <0, 0, 0> in a kernel with <1, 1, 1> tiles.

We provide a Nix flake for easy setup and development. Flakes must be enabled in your Nix configuration; if they are not already, add the following to ~/.config/nix/nix.conf:

```
experimental-features = nix-command flakes
```

Run a command directly:

```sh
nix develop -c cargo run -p cutile-examples --example saxpy
```

Or open an interactive shell:

```sh
nix develop
# cutile-rs dev shell
#  ✓ CUDA  /nix/store/...-cuda-toolkit-13.3
#  ✓ Rust  1.90.0-nightly
```

The flake automatically locates host NVIDIA driver libraries on both NixOS and non-NixOS systems.

cuTile IR: cargo test --package cutile-ir
cuTile Rust Compiler: cargo test --package cutile-compiler
cuTile Rust Library: cargo test --package cutile
Examples: run one at a time, e.g. cargo run -p cutile-examples --example async_gemm
Benchmarks: cargo bench
Everything: ./scripts/run_all.sh (or pipe to a log file: ./scripts/run_all.sh 2>&1 | tee test_run.log)

```
cutile                 User-facing crate for authoring and executing tile kernels
├── cutile-macro
├── cutile-compiler
├── cuda-async
└── cuda-core

cutile-kernels         Reusable cuTile Rust kernels
└── cutile

cutile-macro           cuTile Rust proc-macro
└── cutile-compiler

cutile-compiler        Compiles cuTile Rust kernels to executables
├── cutile-ir
├── cuda-async
└── cuda-core

cutile-ir              Pure Rust Tile IR builder and bytecode writer

cuda-async             Async CUDA execution via async Rust
└── cuda-core

cuda-core              Idiomatic safe CUDA API
└── cuda-bindings

cuda-bindings          NVIDIA CUDA bindings
```

The cuda-bindings crate is licensed under the NVIDIA Software License (see LICENSE-NVIDIA). All other crates are licensed under the Apache License, Version 2.0: https://www.apache.org/licenses/LICENSE-2.0
