CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through RL

Original link: https://github.com/deepreinforce-ai/CUDA-L2

## CUDA-L2: AI-Optimized Matrix Multiplication

CUDA-L2 is a novel system that uses large language models and reinforcement learning to automatically optimize CUDA kernels for half-precision general matrix multiplication (HGEMM). Across 1,000 A100 configurations it clearly outperforms existing solutions, including PyTorch's `matmul` and NVIDIA's closed-source libraries such as cuBLAS.

The kernels are hardware-specific: kernels optimized for the A100 should be run on A100 GPUs to guarantee a speedup, and releases targeting other architectures are planned. Users whose problem sizes fall outside the provided configurations can either pad up to an existing configuration or request a new kernel release via GitHub.

**Key requirements:** Python, PyTorch (2.6.0+), and NVIDIA CUTLASS (v4.2.1 – *be sure to download the correct version*). The environment variables `CUTLASS_DIR` and `TORCH_CUDA_ARCH_LIST` must be configured correctly.

Evaluation runs through the `eval_one_file.sh` script, which supports both offline batch processing and a queries-per-second (QPS) targeted server mode. Support requests and questions can be submitted to the developers via GitHub issues or email ([email protected]).

## CUDA-L2: Matrix Multiplication with Reinforcement Learning – A Skeptical View

A new project called CUDA-L2 claims to surpass cuBLAS performance for matrix multiplication using reinforcement learning. The initial reaction from the Hacker News community, however, was largely skeptical. Many commenters argued that the supposedly "discovered" techniques are not novel and may simply repackage existing GPU optimization advice (such as that found in "GPU Gems").

The discussion emphasized that genuine breakthroughs in fundamental algorithms are rare; improvements usually come from cleverly applying known techniques. One commenter noted that existing tools (such as LLVM) could likely achieve significant speedups given a focused optimization effort.

While the project shows promise in exploiting specific hardware configurations for measurable gains, concerns were raised about its reliance on code specialization and its limited input support (currently FP16 only, compared against FP32 solvers). Some questioned whether the reinforcement learning actually *discovered* anything new or merely rearranged existing knowledge. The need for rigorous numerical validation of the claimed performance gains was also discussed.

## Original

CUDA-L2 is a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. CUDA-L2 systematically outperforms the major matmul baselines to date, from the widely used torch.matmul to state-of-the-art NVIDIA closed-source libraries (cuBLAS, cuBLASLt-heuristic, cuBLASLt-AutoTuning). [Paper]
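For context, the baseline being beaten is the ordinary half-precision matmul path. A minimal sketch of that baseline (the tensor names and problem size below are illustrative, not taken from the repo):

```python
# Baseline HGEMM via torch.matmul, which dispatches to cuBLAS/cuBLASLt.
# This is the op the CUDA-L2 kernels are benchmarked against.
import torch

M, N, K = 64, 4096, 64  # one of the released problem sizes
a = torch.randn(M, K, dtype=torch.float16, device="cuda")
b = torch.randn(K, N, dtype=torch.float16, device="cuda")

c = torch.matmul(a, b)  # half-precision general matrix multiply
```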


  • [Dec 2, 2025] Released A100-optimized HGEMM kernels across 1,000 configurations.

Q: Do A100 kernels apply to other machines like RTX 3090 or H100?

A: Ideally, kernels trained on the A100 should only be used on an A100 if you are targeting a speedup. They might yield a speedup on other machines, but this is not guaranteed. We will progressively release kernels trained on different machines.
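One way to guard against running a mismatched kernel is to check the device's compute capability at load time; the snippet below is an illustrative sketch, not part of the repo (the A100 is compute capability 8.0, i.e., sm_80):

```python
# Hypothetical guard: warn if the current GPU is not the architecture
# the released kernels were trained on (A100 = compute capability 8.0).
import torch

major, minor = torch.cuda.get_device_capability(0)
if (major, minor) != (8, 0):
    print(f"Warning: kernels were tuned on A100 (sm_80); this GPU is "
          f"sm_{major}{minor}, so a speedup is not guaranteed.")
```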

Q: What if I need matrix dimensions (M, N, K) not found in your configurations?

A: Two options:
  1. Find the nearest-neighbor configuration (larger than yours) and pad your matrices with zeros (see the sketch below).
  2. Feel free to post your dimensions on GitHub issues. We are happy to release kernels for your configuration.
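A sketch of option 1 with illustrative sizes (the `pad_to` helper and the shapes are hypothetical, and `torch.matmul` stands in for the released kernel). Zero-padding is safe for GEMM because the padded rows and columns contribute nothing to the top-left M×N block of the result:

```python
# Pad inputs with zeros up to the nearest released configuration, run
# the optimized kernel, then slice the result back to the original shape.
import torch
import torch.nn.functional as F

def pad_to(x: torch.Tensor, rows: int, cols: int) -> torch.Tensor:
    # F.pad pads the last dim by (left, right), then the second-to-last
    # by (top, bottom); here we pad only on the right and bottom.
    return F.pad(x, (0, cols - x.shape[1], 0, rows - x.shape[0]))

M, N, K = 60, 4000, 60     # your actual problem size (hypothetical)
M2, N2, K2 = 64, 4096, 64  # nearest released configuration (>= yours)

a = torch.randn(M, K, dtype=torch.float16, device="cuda")
b = torch.randn(K, N, dtype=torch.float16, device="cuda")

a_pad = pad_to(a, M2, K2)
b_pad = pad_to(b, K2, N2)
c = torch.matmul(a_pad, b_pad)[:M, :N]  # swap in the CUDA-L2 kernel here
```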

  • Python: Ensure you have a working Python environment.
  • PyTorch: This project requires PyTorch version 2.6.0 or higher.
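A quick way to confirm the PyTorch requirement (an illustrative check, not a script shipped with the repo):

```python
# Assert the installed PyTorch is at least 2.6.0.
import torch

version = tuple(int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert version >= (2, 6), f"PyTorch 2.6.0+ required, found {torch.__version__}"
```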

This project depends on NVIDIA CUTLASS. You must clone the specific tag v4.2.1 into a directory named cutlass:

```bash
git clone -b v4.2.1 https://github.com/NVIDIA/cutlass.git cutlass
```

⚠️ Warning: Please ensure you download the correct CUTLASS version (v4.2.1) and set the CUTLASS_DIR environment variable correctly. Incorrect CUTLASS setup may cause the project to fail silently or produce no results.

Before building or running the project, you must configure the following environment variables:

  • CUTLASS_DIR: Points to the directory where you cloned CUTLASS.
  • TORCH_CUDA_ARCH_LIST: Specifies the target GPU compute capability (e.g., "8.0" for the NVIDIA Ampere A100; RTX 30 series cards are "8.6").

Run the following commands:

```bash
export CUTLASS_DIR=/path/to/your/cutlass
export TORCH_CUDA_ARCH_LIST="8.0"
```
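Given the warning above about silent failures, a quick sanity check before building can save time. A minimal sketch (it assumes the standard CUTLASS checkout layout; it is not a script from the repo):

```python
# Verify CUTLASS_DIR is set and points at a CUTLASS checkout, and that
# the target architecture list is visible to the build.
import os

cutlass_dir = os.environ.get("CUTLASS_DIR")
assert cutlass_dir, "CUTLASS_DIR is not set"
assert os.path.isdir(os.path.join(cutlass_dir, "include", "cutlass")), \
    "CUTLASS_DIR does not look like a CUTLASS checkout (missing include/cutlass)"
print("TORCH_CUDA_ARCH_LIST =", os.environ.get("TORCH_CUDA_ARCH_LIST"))
```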

To run the evaluation, use the eval_one_file.sh script. Below is an example command for offline mode:

```bash
./eval_one_file.sh --mnk 64_4096_64 --warmup_seconds 5 --benchmark_seconds 10 --base_dir ./results --gpu_device_id 7 --mode offline
```

For server mode, you need to specify --target_qps:

```bash
./eval_one_file.sh --mnk 64_4096_64 --warmup_seconds 5 --benchmark_seconds 10 --base_dir ./results --gpu_device_id 7 --mode server --target_qps 100
```
| Argument | Description |
| --- | --- |
| `--mnk` | Specifies the problem size (e.g., `64_4096_64`). |
| `--warmup_seconds` | Duration of warmup in seconds before timing. |
| `--benchmark_seconds` | Duration of benchmarking in seconds. |
| `--base_dir` | Directory to save the compile and output results. |
| `--gpu_device_id` | The ID of the GPU to use (e.g., `7`). |
| `--mode` | Execution mode: `offline` runs the evaluation in offline/batch processing mode; `server` runs it in server mode (simulating request-based scenarios). |
| `--target_qps` | Target queries per second (QPS) for server mode. Required if `--mode` is `server`. |
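For intuition, the two duration arguments map onto the usual warmup-then-measure pattern. The loop below is a hedged reconstruction of that pattern in PyTorch (using `torch.matmul` as the kernel under test); it is not the repo's actual measurement code:

```python
# Run untimed iterations for warmup_seconds, then count how many
# iterations complete in benchmark_seconds and report throughput.
import time
import torch

def bench(fn, warmup_seconds=5, benchmark_seconds=10):
    end = time.time() + warmup_seconds
    while time.time() < end:  # warmup: stabilize clocks and caches
        fn()
    torch.cuda.synchronize()

    iters, start = 0, time.time()
    while time.time() - start < benchmark_seconds:
        fn()
        iters += 1
    torch.cuda.synchronize()  # drain queued kernels before stopping the clock
    elapsed = time.time() - start
    return iters / elapsed    # throughput in calls per second

M, N, K = 64, 4096, 64
a = torch.randn(M, K, dtype=torch.float16, device="cuda")
b = torch.randn(K, N, dtype=torch.float16, device="cuda")
print(f"{bench(lambda: torch.matmul(a, b)):.1f} matmuls/s")
```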

If you have any questions, please open a GitHub issue or reach out to us at [email protected].
