我的第一个 Linux 内核补丁

我的第一个 Linux 内核补丁
My first patch to the Linux kernel

原始链接: https://pooladkhay.com/posts/first-kernel-patch/

## 从令人头疼的bug到Linux内核补丁一位开发者的虚拟化之旅，导致了一次令人沮丧的调试经历，并最终促成了他们对Linux内核的首次贡献。在构建Type-2虚拟机监控程序时，他们在CPU核心之间切换时遇到了看似随机的系统崩溃。问题源于一段从KVM自检中借用的代码片段中的一个微妙的符号扩展错误，用于检索任务状态段（TSS）的基础地址。这段代码依赖于对较小的整数类型进行位移，从而触发隐式整数提升和随后的C语言中的符号扩展。这破坏了计算出的TSS地址，导致在上下文切换期间发生崩溃——具体来说，在处理中断时发生双重错误。经过数天的调查，排除了他们自己的代码作为罪魁祸首，开发者通过将有问题代码与KVM本身处理TSS地址检索的方式进行比较，发现了问题。一个简单的修复方法——在位移之前将值显式转换为无符号类型——阻止了符号扩展。最终的补丁被接受并合并到Linux内核中。有趣的是，虽然AI工具帮助分析了日志，但它们未能识别根本原因，这凸显了人类调试技能的持续重要性。

## 第一个 Linux 内核补丁与 C 语言的陷阱一位开发者详细描述了提交第一个 Linux 内核补丁的经历，重点关注一个与 C 语言中按位左移时符号扩展相关的微妙错误。该问题导致间歇性系统崩溃，表现为“僵尸状态”的主板。调试非常具有挑战性，即使使用 LLM 等工具也错误地将问题归因于硬件。根本原因在于 C 标准中关于左移有符号整数的未定义行为，特别是当将 `uint8_t` 提升到 `int` 时。讨论强调了 C 语言隐式转换的复杂性以及可能出现的意外行为。评论者强调了严格编码实践、编译器警告（如 `-Wsign-conversion`）以及对 C 标准的深入理解，以避免此类陷阱的重要性。许多人认为 C 语言的隐式转换是一个长期存在的设计缺陷。该帖子还涉及了在开源项目中导航不成文规则的困难以及为内核做出贡献的成就感。

How a sign-extension bug in C made me pull my hair out for days but became my first patch to the Linux kernel!

Intro

A while ago, I started dipping my toe into virtualization. It's a topic that many people have heard of or are using on a daily basis but a few know and think about how it works under the hood.

I like to learn by reinventing the wheel, and naturally, to learn virtualization I started by trying to build a Type-2 hypervisor. This approach is similar to how KVM (Linux) or bhyve (FreeBSD) are built.

Since virtualization is hardware assisted these days , the hypervisor needs to communicate directly with the CPU by running certain privileged instructions; which means a Type-2 hypervisor is essentially a Kernel Module that exposes an API to the user-space where a Virtual Machine Monitor (VMM) like QEMU or Firecracker is running and orchestrating VMs by utilizing that API.

In this post, I want to describe exactly how I found that bug. But to make it a bit more educational, I'm going to set the stage first and talk about a few core concepts so you can see exactly where the bug emerges.

x86 Task State Segment (TSS)

The x86 architecture in protected mode (32-bit mode) envisions a task switching mechanism that is facilitated by the hardware. The architecture defines a Task State Segment (TSS) which is a region in the memory that holds information about a task (General purpose registers, segment registers, etc.). The idea was that any given task or thread would have its own TSS, and when the switch happens, a specific register (Task Register or TR) would get updated to point to the new task .

This was abandoned in favor of software-defined task switching which gives more granular control and portability to the operating system kernel.

But the TSS was not entirely abandoned. In modern days (64-bit systems) the kernel uses a TSS-per-core approach where the main job of TSS is to hold a few stack pointers that are very critical for the kernel and CPU's normal operation. More specifically, it holds the kernel stack of the current thread which is used when the system wants to switch from user-space to the kernel-space.
It also holds a few known good stacks for critical events like Non-Maskable Interrupts (NMIs) and Double Faults. These are events that if not handled correctly, can cause a triple fault and crash a CPU core or cause an immediate system reboot.

We know that memory access is generally considered to be expensive and caching values somewhere on the CPU die is the preferred approach if possible. This is where the TR register comes into the picture. It has a visible part which is a 16-bit offset that we have already discussed as well as a hidden part that holds direct information about the TSS (Base address, Limit, and Access rights). This saves the CPU the trouble of indexing into the GDT to eventually find the TSS every time it's needed.

Why do hypervisors care?

A hypervisor is essentially a task switcher where tasks are operating systems. In order for multiple operating systems to run on the same silicon chip, the hypervisor must swap the entire state of the CPU which includes updating the hidden part of the TR register as well.

In a previous blog post I described how Intel implemented their virtualization extension (VT-x) and how each vCPU (vCore) is given its own VMCS (Virtual Machine Control Structure) block where its state is saved to or restored from by the hardware when switching between host and guest OSes.

I suggest reading that post if you're interested in the topic but VMCS consists of four main areas:

Host-state area
Guest-state area
Control fields
VM-exit information area

Host-state area has two fields which correspond to the visible part and one of the hidden parts (base address) the TR register:

HOST_TR_SELECTOR (16 bits)
HOST_TR_BASE (natural width )

While guest-state area has four (one visible plus all three hidden parts):

GUEST_TR_SELECTOR (16 bits)
GUEST_TR_BASE (natural width )
GUEST_TR_LIMIT (32 bits)
GUEST_TR_ACCESS_RIGHTS (32 bits)

The reason is that the hardware assumes the host OS to be a modern 64-bit operating system where TR limit and Access Rights are fixed known values (0x67 and 0x11 respectively). But the guest OS can be virtually any operating system with any constraints.

Naturally, it is the hypervisor's job to set these values on initial run and to update them when needed (e.g. when the kernel thread that is running a vCPU is migrated to another physical CPU core, the hypervisor must update the host state to match the new core).

To set these values, I "borrowed" some code from the linux kernel tree (KVM selftests):

vmwrite(HOST_TR_BASE,
		get_desc64_base((struct desc64 *)(get_gdt().address + get_tr())));

This piece of code does the following:

Gets the address of GDT.
Indexes into it using the value of TR register.
Parses the TSS segment descriptor and extracts the memory address of TSS.
Writes the address into the HOST_TR_BASE section of the VMCS using the special VMWRITE instruction .

So far, so good!

If for any reason this operation fails to extract and write the correct address, upon the next context switch from user-space to kernel-space (or next NMI or next Double fault), when the CPU hardware tries to read the kernel stack from the TSS to update the Stack Pointer register, it either receives garbage or an unmapped address. Either way, the CPU will eventually face a double fault (a fault that happens when trying to handle another fault like a page fault) and when trying to use one of the known good stacks for handling the double fault, it will fail again which will make it a triple fault and BOOM! The core dies or we get a sudden reboot.

Symptoms

Now lets talk about the issue that I was facing.

I started developing my hypervisor on a virtualized instance of Fedora, to avoid crashing my machine in case something went wrong. By the time I realized something is indeed wrong, I had already developed the ability to put the CPU in VMX operation, run a hardcoded loop in VMX non-root mode that would use the VMCALL instruction to trap into the hypervisor (VMX root) and ask it to print a message, then resume the loop (VMRESUME).

Additionally, VMCS was programmed to trap external interrupts (e.g. timer ticks). Upon an exit, the hypervisor would check if we (the current kernel thread) needs to be rescheduled, keeping the kernel scheduler happy.

I was using preempt notifier api which lets threads provide two custom functions (sched_in and sched_out) that are called by the scheduler when it's about to deschedule the thread as well as right before rescheduling it. These functions are then responsible for cleanups and initialization work that is required.

In my case, sched_out would unload the VMCS from the current core, and sched_in would load it on the new core while reinitializing it using a series of VMWRITEs to match the new core's state.

On my virtualized dev environment with only three vCPUs, everything was working just fine. Until I decided to give it a try on my main machine where the hypervisor would talk to an actual physical CPU.

And BOOM!

Seconds after running the loop, the system crashed, in a very unpredictable way. I was logging the core switches and didn't find any meaningful correlation between the last core number and the crash. Additionally, sometimes it would last longer and sometimes it was immediate. After investigating kernel logs a few times, I saw a pattern in the sequence of events that caused the system to eventually hang:

The Fatal VM-Exit: An NMI triggered a VM-Exit on CPU 5 and naturally the hardware tried to locate a valid kernel stack from TSS to handle the privilege transition.
Core Death: CPU 5 hit a fatal Page Fault attempting to read an unmapped memory address, resulting in a Kernel Oops. CPU 5 was left completely paralyzed with interrupts disabled.
IPI Lockup: CPU 6 attempted a routine system-wide update (kernel text patching) requiring an Inter-Processor Interrupt (IPI) acknowledgment from all cores. CPU 6 became permanently stuck in an infinite loop waiting for the dead CPU 5 to respond.
Cascading Paralysis: As other cores (3, 8, 11, etc.) attempted standard cross-core communications (like memory map TLB flushes and RCU synchronizations), they too fell into the IPI trap, waiting indefinitely for CPU 5.
Terminal State: The RCU subsystem starved, peripheral drivers (like Wi-Fi) crashed from timeouts, and the system entered a total, unrecoverable deadlock.

So why no triple faults?!

The Kernel Oops killed the active task and halted operations on CPU 5. However, it left CPU 5 in a "zombie" state. Alive enough to keep the motherboard powered on, but with its interrupts disabled, making it entirely unresponsive to the rest of the system.

The quest

Soon I realized that the hypervisor works absolutely fine when pinned to one core (e.g. via taskset command), so there must be something happening while moving between cores. Additionally, I didn't dare to question the code I stole from the Linux kernel source, and I was trying hard to find an issue in the code I wrote myself. This eventually led to rewriting a portion of the hypervisor code with an alternative method which would achieve the same goal.

For example, from reading Intel's Software Developer Manual (SDM) , I knew that when moving from core A to core B, core A must run the VMCLEAR instruction to unload the VMCS, and only then can core B load the VMCS using the VMPTRLD to be able to execute the guest code. For that, I was using smp_call_function_single which relies on IPIs to run a piece of code on another CPU, that I replaced with the preempt notifiers.

Eventually, (while pulling my hair out) I realized I have eliminated all possible parts of the hypervisor that played a role in moving between cores.

Then there was another clue!

While running the hypervisor on my virtual dev environment (QEMU + Fedora) I observed that by increasing the number of vCores, I can reproduce the issue and there is also a new behavior. Sometimes the VM reboots immediately (instead of freezing) and after the reboot, there is no trace of any logs related to the previous session. And I concluded that a triple fault has happened.

This turned my attention to the TR and TSS. I started looking for alternative ways of setting the HOST_TR_BASE and realized that the KVM itself (not KVM selftests) uses a different method:





vmcs_writel(HOST_TR_BASE, (unsigned long)&get_cpu_entry_area(cpu)->tss.x86_tss);

And that was it! Using this method to set HOST_TR_BASE fixed my hypervisor and helped me keep whatever sanity I had left.

The smoking gun

Remember that piece of code I took from the kernel source. It used the get_desc64_base function to extract and write the address of TSS into the HOST_TR_BASE. This function has this definition:

static inline uint64_t get_desc64_base(const struct desc64 *desc)
{
    return ((uint64_t)desc->base3 << 32) |
		(desc->base0 | ((desc->base1) << 16) | ((desc->base2) << 24));
}

TSS segment descriptor has four fields that must be stitched together to form the address of the TSS .

base0 is uint16_t.
base1 is uint8_t.
base2 is uint8_t.
base3 is uint32_t.

The C standard dictates Integer Promotion. Whenever a type smaller than an int is used in an expression, the compiler automatically promotes it to a standard int (which is a 32-bit signed integer on modern x86-64 architectures) before performing the operation.

If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.

This promotion has a consequence: if the resulting value after promotion has a 1 in its most significant bit (32nd bit), this value considered negative by the compiler and if casted to a larger type like a uint64_t in our case, sign extension happens.

Lets see an example:

We have an 8-bit unsigned integer (uint8_t) with 11001100 bit pattern. If we left-shift it by 24, it still can be represented by an int which is 32 bits long. So the compiler generates this value: 11001100000000000000000000000000 and considers it to be an int which is a signed type.

Now if we try to perform any operation on this value, it would follow the protocol for signed values. In our case, we are ORing it with a uint64_t. So the compiler would cast our int (a 32-bit signed value) into uint64_t (a 64-bit unsigned value), which is where the sign-extension happens which would turn our value to 11111111111111111111111111111111_11001100000000000000000000000000 before OR happens.

Saw the problem?
Because the upper 32 bits are sign-extended to all 1s (Hex: 0xFFFFFFFF), the bitwise OR operation completely destroys base3 (In a bitwise OR, 1 | X equals 1). Therefore, whatever data was in base3 is permanently overwritten by the 1s from the sign extension.

Here is an actual example with "real" addresses:

base0 = 0x5000
base1 = 0xd6
base2 = 0xf8
base3 = 0xfffffe7c

Expected return: 0xfffffe7cf8d65000
Actual return:   0xfffffffff8d65000

This also explains when the problem would happen: Only and only if base2 has a 1 as its most significant bit. Any other value would not corrupt the resulting address.

The Kernel patch

The fix is actually very simple. We must cast values to unsigned types before the bit-shift operation:

static inline uint64_t get_desc64_base(const struct desc64 *desc)
{
    return (uint64_t)desc->base3 << 32 |
           (uint64_t)desc->base2 << 24 |
           (uint64_t)desc->base1 << 16 |
           (uint64_t)desc->base0;
}

This will prevent the sign-extension from happening.

Finally, this is the patch I sent, which was approved and merged:
https://lore.kernel.org/kvm/[email protected]/

Outro

I can't finish this post without talking about AI!

You may wonder whether I tried asking an LLM for help or not. Well, I did. In fact it was very helpful in some tasks like summarizing kernel logs [^13] and extracting the gist of them. But when it came to debugging based on all the clues that were available, it concluded that my code didn't have any bugs, and that the CPU hardware was faulty.

CASE CLOSED.