CUDA Tile Open Sourced

Original link: https://github.com/NVIDIA/cuda-tile

## CUDA Tile: Summary

CUDA Tile is an open-source, MLIR-based infrastructure for optimizing CUDA kernels, with a particular focus on tile-based computation and NVIDIA Tensor Cores. Released alongside CUDA Toolkit 13.1, it simplifies the development of high-performance CUDA kernels by providing abstractions for tiling patterns, memory management, and GPU-specific optimizations.

The project includes a CUDA Tile dialect (a domain-specific MLIR language), Python bindings for IR manipulation, an efficient bytecode representation, and a conformance test suite. Development is guided by a formal IR specification that defines its operations and type system.

**Key features:**

* **MLIR foundation:** Built on the flexible MLIR compiler infrastructure.
* **Tile-first:** Optimized for tiled computation patterns.
* **Python API:** Supports programmatic IR construction and transformation.
* **Bytecode format:** Supports serialization and deserialization.

**Building CUDA Tile requires:** CMake 3.20+, a C++17 compiler, Python 3.6+, and MLIR/LLVM sources at a compatible commit. It can be integrated into projects either through pre-built libraries or by including the sources directly. External contributions are not currently accepted, but users are encouraged to provide feedback through GitHub Issues.

## CUDA Tile Open Sourced - Hacker News Summary

NVIDIA has open-sourced "CUDA Tile," an intermediate representation (IR) project released under the Apache License v2.0 with LLVM Exceptions. Reactions in the Hacker News discussion were mixed: some saw it as a positive step, while others worried about potential vendor lock-in, echoing CUDA's history.

Many commenters pointed to the difficulties AMD and Intel have had with competing APIs and stressed the need for broader industry adoption, contrasting this with the open approach of projects like OpenCL. There was also debate about the role of Mojo, a new programming language aimed at simplifying GPU programming, with some speculating that NVIDIA might acquire it.

A recurring theme was the importance of developer experience and frustration with closed-source components, particularly compilers. The conversation also touched on the potential of RISC-V vector instructions and the need for better tooling and community building, citing Google's XLA and IREE as successful examples. Finally, a lighthearted debate emerged about the practice of using acronyms without first defining them.

## Original Text

CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns and optimizations targeting NVIDIA Tensor Core units. The project provides a comprehensive ecosystem for expressing and optimizing tiled computations for NVIDIA GPUs, simplifying the development of high-performance CUDA kernels through abstractions for common tiling patterns, memory hierarchy management, and GPU-specific optimizations.

This open-source release is aligned with the CUDA Toolkit 13.1 release. For more information about CUDA Tile, visit https://developer.nvidia.com/cuda/tile.

CUDA Tile is composed of:

  • CUDA Tile Dialect: A domain-specific MLIR dialect that provides first-class operations and types for tile-based computations
  • Python Bindings: Complete Python API for programmatic IR construction, manipulation, and transformation
  • Bytecode: Efficient binary representation with support for serialization and deserialization between the CUDA Tile dialect and the binary format
  • Conformance Test Suite: Comprehensive tests ensuring compliance with the CUDA Tile specification and validation of dialect semantics

CUDA Tile development is driven by the CUDA Tile IR specification, which defines the formal semantics, operations, and type system for tile-based computations on NVIDIA GPUs. For detailed information about the CUDA Tile IR specification, including dialect operations, type system, and transformation passes, please refer to the CUDA Tile Specification.

Requirements

  • CMake 3.20.0 or higher
  • C++17 compatible compiler
  • Python 3.6+ (for Python bindings)
  • MLIR/LLVM sources or pre-built libraries at a compatible commit (see cmake/IncludeLLVM.cmake for the exact version)
  • Ninja build system (optional)

For a quick start, use the following commands from the top of the repository to configure and build a release version of CUDA Tile with Python bindings enabled. MLIR/LLVM sources will be automatically downloaded from https://github.com/llvm/llvm-project:

# Configure
cmake -G Ninja -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_ASSERTIONS=OFF \
  -DCUDA_TILE_ENABLE_BINDINGS_PYTHON=ON

# Build
cmake --build build

# Run tests
cmake --build build --target check-cuda-tile

Build Configuration Options

MLIR/LLVM Build Configuration

CUDA Tile requires MLIR/LLVM at a specific compatible commit. The exact commit hash is specified in cmake/IncludeLLVM.cmake. CUDA Tile can be built with MLIR/LLVM in three different ways:

  1. Automatic Download from GitHub (Default): CMake automatically downloads MLIR/LLVM sources from the official GitHub repository and builds them at the compatible commit. This is the slowest option but requires no manual LLVM setup.

    cmake -G Ninja -S . -B build -DCMAKE_BUILD_TYPE=Release
  2. Use Local LLVM Sources: CMake builds MLIR/LLVM from existing sources on your system. The commit hash of the source must be compatible with the commit specified in cmake/IncludeLLVM.cmake.

    cmake -G Ninja -S . -B build \
      -DCMAKE_BUILD_TYPE=Release \
      -DCUDA_TILE_USE_LLVM_SOURCE_DIR=/path/to/llvm/sources
  3. Use Pre-built LLVM Libraries: CMake links against pre-compiled LLVM libraries. The libraries must be built from a commit compatible with the one specified in cmake/IncludeLLVM.cmake.

    cmake -G Ninja -S . -B build \
      -DCMAKE_BUILD_TYPE=Release \
      -DCUDA_TILE_USE_LLVM_INSTALL_DIR=/path/to/llvm/install

CUDA Tile provides Python bindings for programmatic IR manipulation (disabled by default). To enable them, add the -DCUDA_TILE_ENABLE_BINDINGS_PYTHON=ON flag to your cmake configuration:

cmake -G Ninja -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCUDA_TILE_ENABLE_BINDINGS_PYTHON=ON

When building MLIR/LLVM from sources, MLIR Python bindings will be automatically enabled. However, when using pre-built LLVM libraries, you must ensure they were built with -DMLIR_ENABLE_BINDINGS_PYTHON=ON.
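
Once built, a quick smoke test of the bindings might look like the following. This is a minimal sketch assuming the bindings follow upstream MLIR Python conventions; the import path below is the upstream mlir package, and how CUDA Tile namespaces its own bindings (and registers the cuda_tile dialect) is an assumption to verify against the build output:

# Minimal round-trip through the IR using the upstream MLIR Python API.
# (Sketch only: the cuda_tile dialect's Python registration entry point
# is not shown here and is an assumption to check in the bindings docs.)
from mlir.ir import Context, Module

with Context():
    # Parse a trivial builtin module and print it back to verify the
    # bindings load and the parser/printer round-trip correctly.
    module = Module.parse("module {}")
    print(module)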

To build with ccache enabled, add -DCUDA_TILE_ENABLE_CCACHE=ON to your cmake configuration:

cmake -G Ninja -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCUDA_TILE_ENABLE_CCACHE=ON

When building LLVM from sources, this setting is automatically propagated to the LLVM build.

CUDA Tile uses LLVM's lit testing infrastructure for comprehensive testing. Testing is enabled by default (-DCUDA_TILE_ENABLE_TESTING=ON). To run the test suite:

cmake --build build --target check-cuda-tile
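
During development, tests can also be invoked directly through lit. The following is a sketch assuming the conventional LLVM build layout; the exact lit binary location and test-tree path in this repository are assumptions to verify against your build directory:

# Run the configured test tree verbosely with lit (paths assumed).
./build/bin/llvm-lit -v build/test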

Integrating CUDA Tile Into Your Project

CUDA Tile can be integrated into your project in two ways, depending on your build system and requirements.

Option 1: Using Pre-built CUDA Tile Libraries

To use pre-built CUDA Tile libraries in your project, include the necessary headers and link against the required libraries based on your use case. For example:

include_directories(${CUDA_TILE_INSTALL_DIR}/include)

# CUDA Tile dialect
target_link_libraries(your_target PRIVATE
  CudaTileDialect           # CUDA Tile dialect operations and types
)

# Bytecode support.
target_link_libraries(your_target PRIVATE
  CudaTileBytecodeReader    # Read bytecode format
  CudaTileBytecodeWriter    # Write bytecode format
)
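
For reference, a minimal complete CMakeLists.txt around the snippets above might look like the following sketch. The project and target names are placeholders, and CUDA_TILE_INSTALL_DIR is assumed to point at a pre-built CUDA Tile install prefix; depending on how the libraries were built, you may also need to link the MLIR/LLVM libraries they depend on:

cmake_minimum_required(VERSION 3.20)
project(my_tile_tool CXX)  # placeholder project name

# Placeholder: pass -DCUDA_TILE_INSTALL_DIR=/path/to/install at configure time.
set(CUDA_TILE_INSTALL_DIR "" CACHE PATH "Pre-built CUDA Tile install prefix")

include_directories(${CUDA_TILE_INSTALL_DIR}/include)
link_directories(${CUDA_TILE_INSTALL_DIR}/lib)

add_executable(your_target main.cpp)
target_link_libraries(your_target PRIVATE
  CudaTileDialect           # CUDA Tile dialect operations and types
  CudaTileBytecodeReader    # Read bytecode format
  CudaTileBytecodeWriter    # Write bytecode format
)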

Option 2: Integrating CUDA Tile Sources

To build CUDA Tile from source as part of your project:

  1. Integrate CUDA Tile sources into your project with CMake's FetchContent, Git submodules, or any other integration method. Example using FetchContent:
include(FetchContent)

# Define CUDA Tile directories
set(CUDA_TILE_SOURCE_DIR ${CMAKE_BINARY_DIR}/_deps/cuda_tile-src)
set(CUDA_TILE_BINARY_DIR ${CMAKE_BINARY_DIR}/_deps/cuda_tile-build)

FetchContent_Declare(
  cuda_tile
  GIT_REPOSITORY https://github.com/NVIDIA/cuda-tile.git
  GIT_TAG        main
  SOURCE_DIR     ${CUDA_TILE_SOURCE_DIR}
  BINARY_DIR     ${CUDA_TILE_BINARY_DIR}
)
  2. Configure CUDA Tile build options (before calling FetchContent_MakeAvailable, if using FetchContent):
set(CUDA_TILE_USE_LLVM_INSTALL_DIR ${YOUR_LLVM_INSTALL_DIR} CACHE PATH "")
set(CUDA_TILE_ENABLE_BINDINGS_PYTHON ON CACHE BOOL "")
set(CUDA_TILE_ENABLE_TESTING OFF CACHE BOOL "")

FetchContent_MakeAvailable(cuda_tile)
  3. Include headers from the source and build directories, then link libraries as in Option 1:
include_directories(${CUDA_TILE_SOURCE_DIR}/include)
include_directories(${CUDA_TILE_BINARY_DIR}/include)

Example: Writing and Running a CUDA Tile IR Program

The following shows how to compile and run a simple Tile IR kernel that prints data from a pointer.

Tile IR bytecode can be produced from an MLIR program using the cuda-tile-translate tool. This can be loaded directly using the CUDA driver API, which will JIT compile the program automatically. To compile ahead of time, you can use the tileiras tool from the CUDA Toolkit to compile the bytecode into a cubin for a particular GPU target. This example shows the latter to illustrate the extra step, but the driver launch API is the same in either case (just substitute the path to the bytecode file).
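
Concretely, the only host-side difference between the two paths is the file passed to cuModuleLoad (file names match the steps later in this example):

// AoT path: load a cubin pre-compiled with tileiras.
cuModuleLoad(&cuModule, "example.cubin");
// JIT path: load the Tile IR bytecode directly; the driver compiles it at load time.
cuModuleLoad(&cuModule, "example.tilebc");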

This example assumes you have built the CUDA Tile IR dialect tools according to the instructions above.

You will need a supported CUDA device, CUDA Toolkit 13.1+, and a compatible driver.

Save the following into a file example.mlir.

cuda_tile.module @example_module {
    entry @example_kernel(%data_pr : tile<ptr<f32>>) {
        print "Running example module\n"
        // Build a tile of 128 consecutive element offsets: [0, 1, ..., 127].
        %offsets = iota : tile<128xi32>
        // Expand the scalar pointer into a 128-wide tile of pointers.
        %data_ptr_reshaped = reshape %data_pr : tile<ptr<f32>> -> tile<1xptr<f32>>
        %data_ptr_broadcasted = broadcast %data_ptr_reshaped : tile<1xptr<f32>> -> tile<128xptr<f32>>
        // Offset each pointer by its lane index so the tile addresses data[0..127].
        %data_ptr_tensor = offset %data_ptr_broadcasted, %offsets : tile<128xptr<f32>>, tile<128xi32> -> tile<128xptr<f32>>
        // Load the 128 floats (weak memory ordering) and print them.
        %data, %token = load_ptr_tko weak %data_ptr_tensor : tile<128xptr<f32>> -> tile<128xf32>, token
        print "Data: %f\n", %data : tile<128xf32>
        return
    }
}

Save the following into a file example_host.cpp.

#include <cuda.h>
#include <cuda_runtime_api.h>
#include <stdio.h>
#include <stdlib.h>

// Macro to check for errors from CUDA driver API calls.
#define CUDA_CHECK(call)                                                       \
  do {                                                                         \
    CUresult err = call;                                                       \
    if (err != CUDA_SUCCESS) {                                                 \
      const char *errStr;                                                      \
      cuGetErrorString(err, &errStr);                                          \
      fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__,         \
              errStr);                                                         \
      exit(1);                                                                 \
    }                                                                          \
  } while (0)

// Data tile to be passed to the kernel.
float data[] = {0,   5,   10,  15,  20,  25,  30,  35,  40,  45,  50,  55,  60,
                65,  70,  75,  80,  85,  90,  95,  100, 105, 110, 115, 120, 125,
                130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190,
                195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255,
                260, 265, 270, 275, 280, 285, 290, 295, 300, 305, 310, 315, 320,
                325, 330, 335, 340, 345, 350, 355, 360, 365, 370, 375, 380, 385,
                390, 395, 400, 405, 410, 415, 420, 425, 430, 435, 440, 445, 450,
                455, 460, 465, 470, 475, 480, 485, 490, 495, 500, 505, 510, 515,
                520, 525, 530, 535, 540, 545, 550, 555, 560, 565, 570, 575, 580,
                585, 590, 595, 600, 605, 610, 615, 620, 625, 630, 635};

int main() {
  // Declare and initialize CUDA driver API handles.
  CUdevice cuDevice;
  CUcontext cuContext;
  CUmodule cuModule;
  CUfunction example_kernel;
  CUstream stream;

  CUDA_CHECK(cuInit(0));
  CUDA_CHECK(cuDeviceGet(&cuDevice, 0));
  CUDA_CHECK(cuCtxCreate(&cuContext, NULL, 0, cuDevice));
  CUDA_CHECK(cuStreamCreate(&stream, CU_STREAM_DEFAULT));

  // Load the compiled cubin file and get the entry CUDA Tile IR function.
  // CUDA Tile IR bytecode can also be directly loaded (JIT compilation).
  CUDA_CHECK(cuModuleLoad(&cuModule, "example.cubin"));
  CUDA_CHECK(cuModuleGetFunction(&example_kernel, cuModule, "example_kernel"));

  // Allocate memory on the device and copy the input data to it.
  CUdeviceptr data_ptr;
  CUDA_CHECK(cuMemAlloc(&data_ptr, sizeof(data)));
  CUDA_CHECK(cuMemcpyHtoD(data_ptr, data, sizeof(data)));

  // Launch the kernel.
  void *kernel_args[] = {&data_ptr};
  CUDA_CHECK(cuLaunchKernel(example_kernel, // function
                            1, 1, 1,        // grid dims: must be (1,1,1)
                            1, 1, 1,        // block dims
                            0,              // shared memory bytes: must be 0
                            stream,         // cuda stream
                            kernel_args,    // kernel arguments
                            NULL            // extra parameters
                            ));
  CUDA_CHECK(cuCtxSynchronize());

  // Clean up.
  CUDA_CHECK(cuMemFree(data_ptr));
  CUDA_CHECK(cuModuleUnload(cuModule));
  CUDA_CHECK(cuCtxDestroy(cuContext));

  return 0;
}
  1. Compile the textual MLIR program to CUDA Tile IR bytecode: cuda-tile-translate example.mlir --bytecode-version=13.1 --mlir-to-cudatilebc --no-implicit-module -o example.tilebc.
  2. For AoT compilation, compile the bytecode file to a cubin: tileiras --gpu-name sm_100 example.tilebc -o example.cubin.
    1. Substitute sm_100 with your supported target architecture.
    2. To JIT compile the bytecode at launch time, skip this step and replace example.cubin with example.tilebc in example_host.cpp.
  3. Compile the host program: g++ example_host.cpp -o example -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcuda.
    1. Substitute g++ with your C++ compiler, and the paths with the correct paths to your CUDA headers and libraries.
  4. Execute: ./example.
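
The four steps above can be collected into a small script (same tool names, flags, and paths as in the steps; adjust the compiler, CUDA paths, and the sm_100 target for your system):

#!/bin/sh
set -e

# 1. Textual MLIR -> CUDA Tile IR bytecode.
cuda-tile-translate example.mlir --bytecode-version=13.1 \
  --mlir-to-cudatilebc --no-implicit-module -o example.tilebc

# 2. AoT only: compile the bytecode to a cubin for the target GPU.
tileiras --gpu-name sm_100 example.tilebc -o example.cubin

# 3. Build the host program against the CUDA driver API.
g++ example_host.cpp -o example \
  -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcuda

# 4. Run.
./example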

You should see the following terminal output:

Running example module
Data: [0.000000, 5.000000, 10.000000, 15.000000, 20.000000, 25.000000, 30.000000, 35.000000, 40.000000, 45.000000, 50.000000, 55.000000, 60.000000, 65.000000, 70.000000, 75.000000, 80.000000, 85.000000, 90.000000, 95.000000, 100.000000, 105.000000, 110.000000, 115.000000, 120.000000, 125.000000, 130.000000, 135.000000, 140.000000, 145.000000, 150.000000, 155.000000, 160.000000, 165.000000, 170.000000, 175.000000, 180.000000, 185.000000, 190.000000, 195.000000, 200.000000, 205.000000, 210.000000, 215.000000, 220.000000, 225.000000, 230.000000, 235.000000, 240.000000, 245.000000, 250.000000, 255.000000, 260.000000, 265.000000, 270.000000, 275.000000, 280.000000, 285.000000, 290.000000, 295.000000, 300.000000, 305.000000, 310.000000, 315.000000, 320.000000, 325.000000, 330.000000, 335.000000, 340.000000, 345.000000, 350.000000, 355.000000, 360.000000, 365.000000, 370.000000, 375.000000, 380.000000, 385.000000, 390.000000, 395.000000, 400.000000, 405.000000, 410.000000, 415.000000, 420.000000, 425.000000, 430.000000, 435.000000, 440.000000, 445.000000, 450.000000, 455.000000, 460.000000, 465.000000, 470.000000, 475.000000, 480.000000, 485.000000, 490.000000, 495.000000, 500.000000, 505.000000, 510.000000, 515.000000, 520.000000, 525.000000, 530.000000, 535.000000, 540.000000, 545.000000, 550.000000, 555.000000, 560.000000, 565.000000, 570.000000, 575.000000, 580.000000, 585.000000, 590.000000, 595.000000, 600.000000, 605.000000, 610.000000, 615.000000, 620.000000, 625.000000, 630.000000, 635.000000]

Contributions and Support

Note: We are currently not accepting external contributions.

While CUDA Tile is an open-source project, we are not accepting external contributions at this time. The project is under active development with a focused roadmap. We encourage you to use GitHub Issues to report bugs, provide feedback, and share your experiences with CUDA Tile. Your input helps us improve the project and prioritize future development.

CUDA Tile IR is licensed under the Apache License v2.0 with LLVM Exceptions.
