BarraCUDA 开源 CUDA 编译器，目标是 AMD GPU。

BarraCUDA 开源 CUDA 编译器，目标是 AMD GPU。
BarraCUDA Open-source CUDA compiler targeting AMD GPUs

原始链接: https://github.com/Zaneham/BarraCUDA

## BarraCUDA：适用于AMD GPU的CUDA编译器 BarraCUDA是一个15,000行的开源编译器，使用C99编写，可直接将CUDA C代码（.cu文件）转换为AMD GPU可执行二进制文件（.hsaco），*无需*依赖LLVM或HIP翻译层。它面向AMD的RDNA 3（GFX11）架构，旨在为在AMD硬件上运行CUDA应用程序提供直接途径。该编译器具有完整的流水线——预处理、词法分析、解析、语义分析、中间表示（BIR）、指令选择、寄存器分配和二进制编码——全部从头开始构建。它支持广泛的CUDA特性，包括共享内存、原子操作、warp内联函数和模板。虽然功能齐全，但BarraCUDA是一个个人项目，缺乏一些特性，例如无符号整数支持和常量内存处理。然而，其核心架构被设计为可扩展的，计划支持其他架构，如Tenstorrent和Intel Arc。该项目优先考虑简单性和直接的构建过程（仅使用`make`），并采用Apache 2.0许可协议。

## BarraCUDA：面向AMD GPU的开源CUDA编译器 BarraCUDA是一个新的开源项目，旨在为AMD GPU带来CUDA兼容性，*无需*依赖LLVM或AI生成的代码。该项目由一名新西兰程序员开发，直接编码指令，这是一项重要的、因其技术深度和独立于当前AI趋势而受到赞扬的工程。虽然项目所有者利用LLM进行测试摘要等任务，但核心开发有意避免对其的过度依赖。讨论强调了这一点的重要性，因为AMD历史上缺乏强大的CUDA支持，这促使用户转向NVIDIA。一些评论员指出ZLUDA和HIP等现有项目作为替代方案，但BarraCUDA独特的实现方法和对基于C99的干净实现的关注正在引起关注。然而，它目前的目标是GFX11（RDNA3）架构，并且在扩展到企业级GPU时可能会面临挑战。尽管如此，该项目被视为增加GPU市场竞争和潜在开源开发胜利的重要一步。

原文

An open-source CUDA compiler that targets AMD GPUs, with more architectures planned. Written in 15,000 lines of C99. Zero LLVM dependency. Compiles .cu files straight to GFX11 machine code and spits out ELF .hsaco binaries that AMD GPUs can actually run.

This is what happens when you look at NVIDIA's walled garden and think "how hard can it be?" The answer is: quite hard, actually, but I did it anyway.

Takes CUDA C source code, the same .cu files you'd feed to nvcc, and compiles them to AMD RDNA 3 (gfx1100) binaries. No LLVM. No HIP translation layer. No "convert your CUDA to something else first." Just a lexer, a parser, an IR, and roughly 1,700 lines of hand-written instruction selection that would make a compiler textbook weep.

┌──────────────────────────────────────────────────────────────┐
│                     BarraCUDA Pipeline                        │
├──────────────────────────────────────────────────────────────┤
│  Source (.cu)                                                │
│       ↓                                                      │
│  Preprocessor → #include, #define, macros, conditionals      │
│       ↓                                                      │
│  Lexer → Tokens                                              │
│       ↓                                                      │
│  Parser (Recursive Descent) → AST                            │
│       ↓                                                      │
│  Semantic Analysis → Type checking, scope resolution         │
│       ↓                                                      │
│  BIR (BarraCUDA IR) → SSA form, typed instructions           │
│       ↓                                                      │
│  mem2reg → Promotes allocas to SSA registers                  │
│       ↓                                                      │
│  Instruction Selection → AMDGPU machine instructions         │
│       ↓                                                      │
│  Register Allocation → VGPR/SGPR assignment                  │
│       ↓                                                      │
│  Binary Encoding → GFX11 instruction words                   │
│       ↓                                                      │
│  ELF Emission → .hsaco ready for the GPU                     │
│       ↓                                                      │
│  Your kernel runs on ya silicon                              │
└──────────────────────────────────────────────────────────────┘

Every single encoding has been validated against llvm-objdump with zero decode failures. I didn't use LLVM to compile, but I did use it to check my homework.

# It's C99. It builds with gcc. There are no dependencies.
make

# That's it. No cmake. No autoconf. No 47-step build process.
# If this doesn't work, your gcc is broken, not the Makefile.

A C99 compiler (gcc, clang, whatever you've got)
A will to live (optional but recommended)
LLVM is NOT required. BarraCUDA does its own instruction encoding like an adult.

# Compile to AMD GPU binary
./barracuda --amdgpu-bin kernel.cu -o kernel.hsaco

# Dump the IR (for debugging or curiosity)
./barracuda --ir kernel.cu

# Just parse and dump the AST
./barracuda --ast kernel.cu

# Run semantic analysis
./barracuda --sema kernel.cu

The following CUDA features compile to working GFX11 machine code:

__global__, __device__, __host__ function qualifiers
threadIdx, blockIdx, blockDim, gridDim builtins
Structs, enums, typedefs, namespaces
Pointers, arrays, pointer arithmetic
All C control flow: if/else, for, while, do-while, switch/case, goto/label
Short-circuit && and ||
Ternary operator
Templates (basic instantiation)
Multiple return paths, continue, break

__shared__ memory (allocated from LDS, properly tracked)
__syncthreads() → s_barrier
Atomic operations: atomicAdd, atomicSub, atomicMin, atomicMax, atomicExch, atomicCAS, atomicAnd, atomicOr, atomicXor
Warp intrinsics: __shfl_sync, __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync
Warp votes: __ballot_sync, __any_sync, __all_sync
Vector types: float2, float3, float4, int2, int3, int4 with .x/.y/.z/.w access
Half precision: __half, __float2half(), __half2float()
__launch_bounds__ (parsed, propagated, enforces VGPR caps)
Cooperative groups: cooperative_groups::this_thread_block() with .sync(), .thread_rank(), .size()
Operator overloading

Full C preprocessor: #include, #define/#undef, function-like macros, #ifdef/#ifndef/#if/#elif/#else/#endif, #pragma, #error, -I/-D flags
Error recovery (reports multiple errors without hanging)
Source location tracking in IR dumps
Struct pass-by-value

__global__ void vector_add(float *c, float *a, float *b, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        c[idx] = a[idx] + b[idx];
}

$ ./barracuda --amdgpu-bin vector_add.cu -o vector_add.hsaco
wrote vector_add.hsaco (528 bytes code, 1 kernels)

No LLVM required :-)

File	Lines	What It Does
`lexer.c`	747	Tokenises CUDA C source
`preproc.c`	1,370	C preprocessor (macros, includes, conditionals)
`parser.c`	1,500	Recursive descent parser → AST
`sema.c`	1,725	Type checking, scope resolution, overload resolution
`bir.c` + `bir_lower.c`	3,032	SSA intermediate representation + AST→BIR lowering
`bir_mem2reg.c`	965	Promotes stack allocas to SSA registers
`bir_print.c`	579	IR pretty printer with source location annotations
`amdgpu_isel.c`	1,788	Instruction selection: BIR → AMDGPU machine ops
`amdgpu_emit.c`	1,735	Register allocation + GFX11 binary encoding + ELF emission
`main.c`	317	CLI driver
Total	15,117

All data structures use pre-allocated fixed-size arrays. No malloc in hot paths. No recursion. Bounded loops everywhere. The kind of code that would make JPL's coding standards committee nod approvingly before going back to landing things on Mars.

Being honest about limitations is important. Here's what's missing:

unsigned as a bare type specifier (use unsigned int or just int)
+=, -=, >>= and friends (compound assignment, spell it out for now)
const qualifier
__constant__ memory
2D array declarations in shared memory (__shared__ float a[16][16], flatten to 1D)
Integer literal suffixes (0xFFu, 1ULL)
Parameter reassignment in __device__ functions (use local variables)
Textures and surfaces
Dynamic parallelism (device-side kernel launch)
Multiple translation units
Host code generation (only device code is compiled)

None of these are architectural blockers. They're all "haven't got round to it yet" items.

14 test files, 35+ kernels, ~1,700 BIR instructions, ~27,000 bytes of machine code:

vector_add.cu - The "hello world" of GPU computing
cuda_features.cu - Atomics, warp ops, barriers, gotos, switch, short-circuit
test_tier12.cu - Vectors, shared memory, operator overloading
notgpt.cu - AI-generated CUDA with extremely sarcastic comments (tiled SGEMM, reductions, histograms, prefix scan, stencils, half precision, cooperative groups, and the "kitchen sink" kernel)
stress.cu - N-body simulation, nested control flow, bit manipulation, struct pass-by-value, chained function calls
canonical.cu - Canonical patterns from NVIDIA samples adapted for the parser
test_errors.cu - Deliberate syntax errors to verify error recovery
test_launch_bounds.cu - __launch_bounds__ parsing and VGPR cap enforcement
test_coop_groups.cu - Cooperative groups lowering
Plus preprocessor tests, template tests, unsigned integer tests

Fix the known gaps: compound assignment operators, bare unsigned, integer literal suffixes, const, parameter reassignment. These are all small parser/lowerer changes. The goal is to compile real-world .cu files without modifications.

Medium Term: Optimisation

The generated code works but isn't winning any benchmarks. Priorities:

Instruction scheduling (hide memory latency)
Better register allocation (currently linear scan, consider graph colouring)
Constant folding and dead code elimination
Loop-invariant code motion
Occupancy tuning based on register pressure

Long Term: More Architectures

The IR (BIR) is target-independent. The backend is cleanly separated. Adding a new target means writing a new isel + emit pair. Candidates:

Tenstorrent - RISC-V based AI accelerators. Open ISA. Very different execution model (tile-based, not SIMT) but the IR maps well.
Intel Arc - Xe architecture. Would give BarraCUDA coverage across all three major GPU vendors.
RISC-V Vector Extension - For when GPUs are too mainstream and you want to run CUDA on a softcore.

GFX11 Encoding Notes (For The Brave)

If you're considering writing your own AMDGPU backend, here are the things that will ruin your afternoon:

SOP1 prefix is 0xBE800000, not what you'd expect from the docs
SOPC prefix is 0xBF000000
VOP3 VDST is at bits [7:0], not [15:8] like a sensible person would assume
Null SADDR is 0x7C for global memory, 0xFC for scratch
RDNA 3 is Wave32 by default, not Wave64 like GCN
The ISA manual is 500 pages and contradicts itself at least twice

All 1,735 lines of amdgpu_emit.c are a testament to reading those pages so you don't have to.

Found a bug? Want to discuss the finer points of AMDGPU instruction encoding? Need someone to commiserate with about the state of GPU computing?

[email protected]

Open an issue if theres anything you want to discuss. Or don't. I'm not your mum.

Based in New Zealand, where it's already tomorrow and the GPUs are just as confused as everywhere else.

Apache 2.0. Do whatever you want. If this compiler somehow ends up in production, I'd love to hear about it, mostly so I can update my LinkedIn with something more interesting than wrote a CUDA compiler for fun.