AMD GPU Debugger

Original link: https://thegeeko.me/blog/amd-gpu-debugging/

This article walks through the author's process of building a GPU debugger, motivated by the lack of a tool comparable to CPU debuggers, while the GPU's inherently complex, concurrent execution model makes it hard to debug. The effort initially looked at AMD's ROCm environment but expanded to a more general approach. The core idea is to talk directly to the GPU's kernel mode driver (KMD) through the Direct Rendering Manager (DRM) interface: opening `/dev/dri/cardX`, allocating GPU memory buffers (for code and commands), and submitting commands to the GPU's Command Processor (CP). One key challenge is managing uncached memory access. The author leans on existing tools such as RADV (a Vulkan implementation) and its ACO compiler to compile shaders. A crucial step is enabling a trap handler, a mechanism that pauses execution, by manipulating privileged GPU registers through debugfs; this involves finding the right virtual memory ID (VMID) and writing registers such as TBA and TMA. The ultimate goal is to enable breakpoints, stepping, and variable inspection. The author outlines plans for these features, including using the hardware's watchpoint support and integrating with RADV for richer debug information such as source mapping. The article ends with preliminary code for walking page tables to translate virtual addresses into physical ones, a basic step toward memory inspection.


I’ve always wondered why we don’t have a GPU debugger similar to the one used for CPUs. A tool that allows pausing execution and examining the current state. This capability feels essential, especially since the GPU’s concurrent execution model is much harder to reason about. After searching for solutions, I came across rocgdb, a debugger for AMD’s ROCm environment. Unfortunately, its scope is limited to that environment. Still, this shows it’s technically possible. I then found a helpful series of blog posts by Marcell Kiss, detailing how he achieved this, which inspired me to try to recreate the process myself.

The best place to start learning about this is RADV. By tracing what it does, we can figure out how to do it ourselves. Our goal here is to run the most basic shader, nop 0, without using Vulkan (aka RADV, in our case).

First of all, we need to open the DRM file to establish a connection with the KMD, using a simple open("/dev/dri/cardX"). Tracing RADV, we find that it calls amdgpu_device_initialize, a function defined in libdrm, the library that acts as middleware between user mode drivers (UMD) like RADV and kernel mode drivers (KMD) like the amdgpu driver. Then, to do some actual work, we have to create a context, which is done by calling amdgpu_cs_ctx_create, again from libdrm. Next, we need to allocate 2 buffers: one for our code and the other for writing our commands into. We do this by calling a couple of functions; here's how I do it:
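A minimal sketch of that setup using libdrm's amdgpu wrapper (error handling omitted; alloc_buffer is my own helper, sketched next, and the buffer names are mine):

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <amdgpu.h>   // libdrm_amdgpu

// allocation helper sketched in the next snippet
amdgpu_bo_handle alloc_buffer(amdgpu_device_handle dev, uint64_t size, bool cpu_visible);

void init_gpu(void)
{
	int fd = open("/dev/dri/card0", O_RDWR);            // pick your card index

	uint32_t drm_major, drm_minor;
	amdgpu_device_handle dev;
	amdgpu_device_initialize(fd, &drm_major, &drm_minor, &dev); // handshake with the KMD

	amdgpu_context_handle ctx;
	amdgpu_cs_ctx_create(dev, &ctx);                     // a GPU context for our submissions

	// two buffers: one for the shader code, one for the PM4 command stream
	amdgpu_bo_handle code_bo = alloc_buffer(dev, 4096, true);
	amdgpu_bo_handle cmd_bo  = alloc_buffer(dev, 4096, true);
	(void)ctx; (void)code_bo; (void)cmd_bo;
}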

Here we're choosing the domain and assigning flags based on the params; some buffers we will need uncached, as we will see:
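Roughly, the helper could look like this (a sketch, not the post's exact code; the flag set may differ):

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

// Pick the memory domain and creation flags from the parameters.
amdgpu_bo_handle alloc_buffer(amdgpu_device_handle dev, uint64_t size, bool cpu_visible)
{
	struct amdgpu_bo_alloc_request req = {
		.alloc_size     = size,
		.phys_alignment = 4096,
		// GTT (system memory) for buffers the CPU writes, VRAM for GPU-only data
		.preferred_heap = cpu_visible ? AMDGPU_GEM_DOMAIN_GTT : AMDGPU_GEM_DOMAIN_VRAM,
		.flags          = cpu_visible ? AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED : 0,
	};

	amdgpu_bo_handle bo = NULL;
	amdgpu_bo_alloc(dev, &req, &bo); // error handling omitted
	return bo;
}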

Now that we have the memory, we need to map it. I opt to map anything that can be CPU-mapped for ease of use. We have to map the memory to both the GPU and the CPU virtual space. The KMD creates the page table when we open the DRM file, as shown here.

So we map it to the GPU VM and, if possible, to the CPU VM as well. There is a libdrm function, amdgpu_bo_va_op, that does all of this setup for us and maps the memory, but I found that even when specifying AMDGPU_VM_MTYPE_UC, it doesn't always tag the page as uncached. I'm not quite sure if it's a bug in my code or something in libdrm; anyway, I opted to do it manually here and issue the IOCTL call myself:
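A sketch of issuing the GEM_VA ioctl directly, assuming the GPU virtual address was reserved beforehand (e.g. with amdgpu_va_range_alloc):

#include <stdint.h>
#include <sys/ioctl.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

// Map a buffer into the GPU VM at gpu_va, tagged uncached, by issuing
// DRM_IOCTL_AMDGPU_GEM_VA ourselves instead of going through amdgpu_bo_va_op().
void map_buffer(int drm_fd, amdgpu_bo_handle bo, uint64_t gpu_va, uint64_t size, void **cpu_out)
{
	uint32_t kms_handle;
	amdgpu_bo_export(bo, amdgpu_bo_handle_type_kms, &kms_handle); // GEM handle for the ioctl

	struct drm_amdgpu_gem_va va = {
		.handle       = kms_handle,
		.operation    = AMDGPU_VA_OP_MAP,
		.flags        = AMDGPU_VM_PAGE_READABLE |
		                AMDGPU_VM_PAGE_WRITEABLE |
		                AMDGPU_VM_PAGE_EXECUTABLE |
		                AMDGPU_VM_MTYPE_UC,        // ask for an uncached mapping
		.va_address   = gpu_va,
		.offset_in_bo = 0,
		.map_size     = size,
	};
	ioctl(drm_fd, DRM_IOCTL_AMDGPU_GEM_VA, &va);

	// CPU side: amdgpu_bo_cpu_map() gives us a pointer we can memcpy into
	amdgpu_bo_cpu_map(bo, cpu_out);
}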

Now we have the context and 2 buffers. Next, fill those buffers and send our commands to the KMD, which will then forward them to the Command Processor (CP) in the GPU for processing.

Let's compile our code. We can use clang's assembler for that.

A small bash script compiles the code. We're only interested in the actual machine code, so we use objdump to figure out the offset and the size of the code section and copy those bytes to a new file called asmc.bin. Then we can just load the file and write its bytes to the CPU-mapped address of the code buffer.
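Once asmc.bin exists, uploading it is just a file read plus a memcpy into the code buffer's CPU mapping; a sketch:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Read the raw machine code and copy it into the CPU mapping of the code
// buffer obtained from amdgpu_bo_cpu_map(). Returns the number of bytes copied.
size_t upload_shader(const char *path, void *code_cpu_ptr)
{
	FILE *f = fopen(path, "rb");
	fseek(f, 0, SEEK_END);
	long size = ftell(f);
	fseek(f, 0, SEEK_SET);

	void *bytes = malloc(size);
	fread(bytes, 1, size, f);
	fclose(f);

	memcpy(code_cpu_ptr, bytes, size); // the GPU will fetch instructions from here
	free(bytes);
	return (size_t)size;
}

// e.g. upload_shader("asmc.bin", code_cpu_ptr);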

Next up, filling in the commands. This was extremely confusing for me because it's not well documented; it was mostly learning how RADV does things and trying to do similar things. Also, a shout-out to the folks on the Graphics Programming Discord server for helping me, especially Picoduck. The commands are encoded in a special format called PM4 Packets, which has multiple types. We only care about Type 3: each packet has an opcode and a count of the DWORDs it contains.

The first thing we need to do is program the GPU registers, then dispatch the shader. Some of those registers are rsrc[1-3], which are responsible for a number of configurations; pgm_[lo/hi], which hold the pointer to the code buffer; and num_thread_[x/y/z], which control the number of threads inside a work group. All of those are set using the set-shader-register packets, and here is how to encode them:
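A sketch of the encoding. The opcode and register offsets below are taken from Mesa's sid.h (PKT3_SET_SH_REG = 0x76, SH register space starting at 0xB000); verify them against the headers for your ASIC:

#include <stdint.h>

// PM4 type-3 header: [31:30] = 3, [29:16] = count (dwords after the header minus 1),
// [15:8] = opcode.
#define PKT3(op, count) ((3u << 30) | (((count) & 0x3FFF) << 16) | (((op) & 0xFF) << 8))

#define PKT3_SET_SH_REG   0x76
#define SI_SH_REG_OFFSET  0x0000B000
// Compute shader registers (offsets from Mesa's sid.h)
#define R_00B81C_COMPUTE_NUM_THREAD_X 0x00B81C  // ..._Y at 0xB820, ..._Z at 0xB824
#define R_00B830_COMPUTE_PGM_LO       0x00B830  // PGM_HI at 0xB834
#define R_00B848_COMPUTE_PGM_RSRC1    0x00B848  // RSRC2 at 0xB84C

// Emit "set n consecutive SH registers starting at reg" into the command stream.
static uint32_t *set_sh_regs(uint32_t *cs, uint32_t reg, const uint32_t *values, uint32_t n)
{
	*cs++ = PKT3(PKT3_SET_SH_REG, n);       // n values + 1 offset dword => count = n
	*cs++ = (reg - SI_SH_REG_OFFSET) >> 2;  // register offset in dwords
	for (uint32_t i = 0; i < n; i++)
		*cs++ = values[i];
	return cs;
}

// Example: point the CP at our code buffer (256-byte-aligned VA, shifted by 8):
// uint32_t pgm[2] = { (uint32_t)(code_va >> 8), (uint32_t)(code_va >> 40) };
// cs = set_sh_regs(cs, R_00B830_COMPUTE_PGM_LO, pgm, 2);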

It’s worth mentioning that we can set multiple registers in 1 packet if they’re consecutive.

Then we append the dispatch command:
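A sketch of the dispatch packet (DISPATCH_DIRECT, opcode 0x15 in Mesa's sid.h), reusing the PKT3 macro from the previous snippet; the initiator value here just sets the COMPUTE_SHADER_EN bit:

#define PKT3_DISPATCH_DIRECT 0x15

// Launch x * y * z work groups. The last dword is the COMPUTE_DISPATCH_INITIATOR;
// bit 0 (COMPUTE_SHADER_EN) must be set for the dispatch to do anything.
static uint32_t *dispatch_direct(uint32_t *cs, uint32_t x, uint32_t y, uint32_t z)
{
	*cs++ = PKT3(PKT3_DISPATCH_DIRECT, 3); // 4 body dwords => count = 3
	*cs++ = x;
	*cs++ = y;
	*cs++ = z;
	*cs++ = 1;                             // COMPUTE_SHADER_EN
	return cs;
}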

Now we want to write those commands into our buffer and send them to the KMD:
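A sketch of the submission path through libdrm: put every buffer the GPU touches in a BO list, describe the command buffer as an indirect buffer (IB), submit on the compute ring, and wait on the returned fence (cmd_va and num_dw are assumed to be the command buffer's GPU address and the number of dwords written):

#include <stdint.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

void submit_and_wait(amdgpu_device_handle dev, amdgpu_context_handle ctx,
                     amdgpu_bo_handle code_bo, amdgpu_bo_handle cmd_bo,
                     uint64_t cmd_va, uint32_t num_dw)
{
	// every buffer the GPU touches must be in the BO list
	amdgpu_bo_handle bos[] = { code_bo, cmd_bo };
	amdgpu_bo_list_handle bo_list;
	amdgpu_bo_list_create(dev, 2, bos, NULL, &bo_list);

	struct amdgpu_cs_ib_info ib = {
		.ib_mc_address = cmd_va,   // GPU VA of the PM4 stream
		.size          = num_dw,   // size in dwords
	};
	struct amdgpu_cs_request req = {
		.ip_type       = AMDGPU_HW_IP_COMPUTE,
		.resources     = bo_list,
		.number_of_ibs = 1,
		.ibs           = &ib,
	};
	amdgpu_cs_submit(ctx, 0, &req, 1);

	// wait for the fence so we know the shader actually ran
	struct amdgpu_cs_fence fence = {
		.context = ctx,
		.ip_type = AMDGPU_HW_IP_COMPUTE,
		.fence   = req.seq_no,     // filled in by amdgpu_cs_submit
	};
	uint32_t expired = 0;
	amdgpu_cs_query_fence_status(&fence, AMDGPU_TIMEOUT_INFINITE, 0, &expired);
}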

This is a good point to write a more complex shader that outputs something, for example one that writes 1 to a buffer.

No GPU hangs?! Nothing happened?! Cool, cool. Now we have a shader that runs on the GPU; what's next? Let's try to hang the GPU by pausing the execution, aka making the GPU trap.

The RDNA3 ISA manual does mention 2 registers, TBA and TMA; here's how it describes them, respectively:

Holds the pointer to the current trap handler program address. Per-VMID register. Bit [63] indicates if the trap handler is present (1) or not (0) and is not considered part of the address (bit[62] is replicated into address bit[63]). Accessed via S_SENDMSG_RTN.

Temporary register for shader operations. For example, it can hold a pointer to memory used by the trap handler.

You can configure the GPU to enter the trap handler when encountering certain exceptions listed in the RDNA3 ISA manual.

We know from Marcell Kiss's blog posts that we need to compile a trap handler, which is a normal shader the GPU switches to when it encounters an s_trap instruction. The TBA register has a special bit that indicates whether the trap handler is enabled.

Since these are privileged registers, we cannot write to them from user space. To bridge this gap for debugging, we can use the debugfs interface. Luckily, we have UMR, which uses that debugfs interface and is open source, so we can copy AMD's homework here, which is great.

The amdgpu KMD exposes a couple of files in debugfs under /sys/kernel/debug/dri/{PCI address}; one of them is regs2, which is an interface to amdgpu_debugfs_regs2_write in the kernel, which writes to the registers. It works by simply opening the file, seeking to the register's offset, and then writing; it also performs some synchronisation so the value is written correctly. We need to provide more parameters about the register before writing to the file, though, and we do that with an ioctl call. Here are the ioctl arguments:
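The structures below mirror the kernel's amdgpu_umr.h as I understand it; treat the exact layout and the ioctl number as something to verify against your kernel tree:

#include <linux/types.h>
#include <sys/ioctl.h>

// Banking parameters for the regs2 debugfs interface (see amdgpu_umr.h).
struct amdgpu_debugfs_regs2_iocdata {
	__u32 use_srbm, use_grbm, pg_lock;
	struct {
		__u32 se, sh, instance;      // GRBM banking: shader engine / array / instance
	} grbm;
	struct {
		__u32 me, pipe, queue, vmid; // SRBM banking: micro engine / pipe / queue / VMID
	} srbm;
};

// nr 0 corresponds to AMDGPU_DEBUGFS_REGS2_CMD_SET_STATE in the kernel header;
// double-check this against your kernel version.
#define AMDGPU_DEBUGFS_REGS2_IOC_SET_STATE \
	_IOWR(0x20, 0, struct amdgpu_debugfs_regs2_iocdata)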

The 2 structs are because there are 2 types of registers, GRBM and SRBM, each of which is banked by different constructs; you can learn more about some of them here in the Linux kernel documentation.

Turns out our registers here are SRBM registers and banked by VMIDs, meaning each VMID has its own TBA and TMA registers. Cool, now we need to figure out the VMID of our process. As far as I understand, VMIDs are a way for the GPU to identify a specific process context, including the page table base address, so the address translation unit can translate virtual memory addresses. The context is created when we open the DRM file, but VMIDs get assigned dynamically at dispatch time, which is a problem for us; we want to write to those registers before dispatch.

We could obtain the VMID of the dispatched process by querying the HW_ID2 register with s_getreg_b32, but that only works from inside a running shader, so I do a hack here: I enable the trap handler in every VMID. There are 16 of them; the first is special and used by the KMD, and the last 8 are allocated to the amdkfd driver, so we loop over the remaining VMIDs and write to their registers. This can cause issues for other processes using those VMIDs, since we enable trap handlers for them and point them at the virtual address of our trap handler, which is only valid within our virtual memory address space. It's relatively safe though, since most other processes won't cause a trap.

Now we can write to TMA and TBA; here's the code:
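A sketch of driving regs2, using the iocdata struct and ioctl from the previous snippet. The TBA/TMA register byte offsets are ASIC-specific, so they're passed in here; UMR's register database is a good place to look them up:

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

// Write one 32-bit privileged register through debugfs, banked to a given VMID.
// regs2_path is the regs2 file under /sys/kernel/debug/dri/{PCI address}/.
// reg_byte_offset is the register's byte offset for your ASIC.
void write_banked_reg(const char *regs2_path, uint32_t vmid,
                      uint64_t reg_byte_offset, uint32_t value)
{
	int fd = open(regs2_path, O_RDWR);

	struct amdgpu_debugfs_regs2_iocdata io = {0};
	io.use_srbm  = 1;    // TBA/TMA are SRBM-banked
	io.srbm.vmid = vmid; // select which VMID's copy of the register we touch
	ioctl(fd, AMDGPU_DEBUGFS_REGS2_IOC_SET_STATE, &io);

	// seek to the register and write the new value
	pwrite(fd, &value, sizeof(value), (off_t)reg_byte_offset);
	close(fd);
}

// Usage sketch: program TBA_LO/HI with the trap handler's GPU address and
// TMA_LO/HI with the scratch buffer's address for VMIDs 1..7.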

And here's how we write to TMA and TBA. If you noticed, I'm using bitfields; I use them because working with them is much easier than macros, and while the bit-field layout is not guaranteed by the C spec, it is guaranteed by the System V ABI, which Linux adheres to.
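As an illustration (not the post's exact structs), here's a bitfield for the high dword of TBA, where the "trap handler present" bit 63 described in the ISA manual lands in bit 31 of the hi word:

#include <stdint.h>

// Under the System V ABI on little-endian, the first bitfield member occupies
// the least significant bits, so trap_en ends up in bit 31 of the hi dword,
// i.e. bit 63 of the full 64-bit register value.
union tba_hi {
	struct {
		uint32_t addr_hi : 31; // upper bits of the trap handler's virtual address
		uint32_t trap_en : 1;  // 1 = trap handler present
	};
	uint32_t raw;
};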

Anyway, now that we can write to those registers, if we enable the trap handler correctly, the GPU should hang when we launch our shader, provided we added an s_trap instruction to it or enabled the TRAP_ON_START bit in the rsrc3 register.

Now, let’s try to write a trap handler.

If you wrote a different shader that outputs to a buffer, you can try writing to that buffer from the trap handler, which is a nice way to make sure it's actually being run.

We need 2 things: our trap handler and some scratch memory to use when needed, whose address we will store in the TMA register.

The trap handler is just a normal program running in a privileged state, meaning we have access to special registers like TTMP[0-15]. When we enter the trap handler, we first need to ensure that the state of the GPU registers is saved, just as the kernel does for CPU processes when context-switching, by saving a copy of the stable registers, the program counter, etc. The problem, though, is that we don't have a stable ABI for GPUs, or at least not one I'm aware of, and compilers use all the registers they can, so we need to save everything.

AMD GPUs' Command Processors (CPs) have context-switching functionality, and the amdkfd driver does implement some context-switching shaders. The problem is they're not documented, and we have to figure them out from the amdkfd driver source and from other parts of the driver stack that interact with it, which is a pain in the ass. I kinda did a workaround here, since I didn't have much luck understanding how it works, plus some other reasons I'll discuss later in the post.

The workaround here is to use only TTMP registers and a combination of specific instructions to copy the values of a few registers first, freeing them up so we can use more instructions to copy the remaining registers. The main idea is to make use of the global_store_addtid_b32 instruction, which adds the index of the current thread within the wave to the write address, aka:

ID_thread * 4 + address