Spinlocks vs. Mutexes: When to Spin and When to Sleep

Original link: https://howtech.substack.com/p/spinlocks-vs-mutexes-when-to-spin

## Synchronization Primitives: Mutexes vs. Spinlocks

Choosing the right synchronization primitive is critical for performance. Mutexes and spinlocks both protect critical sections, but they fail in opposite ways: a mutex *sleeps* (introducing syscall overhead), while a spinlock *burns CPU* while it waits. A spinlock uses an atomic compare-and-swap in userspace, avoiding syscalls, but it consumes 100% CPU until the lock becomes available, bouncing the cache line between cores and wasting energy. A mutex relies on the `futex()` syscall, which under contention means context switches and scheduler involvement. Spinlocks are dangerous on preemptible systems: a preempted thread holding a spinlock can leave other threads spinning indefinitely. Modern mutexes have a fast path that is surprisingly efficient when uncontended.

**Guidelines:**

* **<100ns, low contention:** Spinlock. Spinning is cheaper than a context switch.
* **100ns-10μs, moderate contention:** Hybrid/adaptive mutex (spin briefly, then sleep).
* **>10μs or high contention:** Regular mutex. Let the scheduler manage the threads.

**Profiling is key:** Use `perf stat` to monitor context switches and cache misses, `strace -c` to count syscalls, and `/proc/PID/status` to analyze context switch types. The best choice depends on your specific critical section duration and contention level. Measure, don't guess!


Original Article

You’re staring at perf top showing 60% CPU time in pthread_mutex_lock. Your latency is in the toilet. Someone suggests “just use a spinlock” and suddenly your 16-core server is pegged at 100% doing nothing useful. This is the synchronization primitive trap, and most engineers step right into it because nobody explains when each primitive actually makes sense.

Mutexes sleep. Spinlocks burn CPU. Both protect your critical section, but they fail in opposite ways. A mutex that sleeps for 3 microseconds is a disaster when your critical section is 50 nanoseconds. A spinlock that burns CPU for 10 milliseconds is a waste when you could’ve let the thread do other work.

Here’s what actually happens. A spinlock does a LOCK CMPXCHG in userspace—an atomic compare-and-swap that keeps looping until it wins. Zero syscalls, but 100% CPU usage while waiting. Every failed attempt bounces the cache line between CPU cores at ~40-80ns per bounce. With four threads fighting over one lock, you’re just burning electricity.

A mutex tries userspace first with a futex fast path, but when contention hits, it calls futex(FUTEX_WAIT). That’s a syscall (~500ns), a context switch (~3-5μs), and your thread goes to sleep. When the lock releases, another syscall wakes you up. The thread scheduler gets involved. You pay the full cost of sleeping and waking.

Spinlocks are dangerous in preemptible contexts. Your thread holds the spinlock, gets preempted by the scheduler, and now three other threads are spinning for the full timeslice (100ms on Linux). They’re burning CPU waiting for a thread that isn’t even running. This is why the Linux kernel disables preemption around spinlocks—but you can’t do that in userspace.

Mutex fast paths are actually pretty fast. Glibc’s pthread mutex does an atomic operation first, just like a spinlock. Only when that fails does it call futex(). An uncontended mutex costs 25-50ns, not the microseconds you’d expect from syscall overhead. The syscall only happens under contention.
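
To make that fast-path/slow-path split concrete, here is a minimal sketch of a futex-backed lock in C, in the spirit of what glibc does. It is deliberately simplified: a real mutex tracks a third "locked, with waiters" state so the unlocker can skip the wake syscall when nobody is sleeping.

c

#define _GNU_SOURCE
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_word = 0;   // 0 = unlocked, 1 = locked

static long futex_call(atomic_int *uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void sketch_lock(void) {
    int expected = 0;
    // Fast path: one atomic CAS in userspace, no syscall when uncontended.
    while (!atomic_compare_exchange_weak(&lock_word, &expected, 1)) {
        // Slow path: ask the kernel to sleep until lock_word changes from 1.
        futex_call(&lock_word, FUTEX_WAIT, 1);
        expected = 0;
    }
}

void sketch_unlock(void) {
    atomic_store(&lock_word, 0);
    // Simplification: always wake one waiter, even if nobody is sleeping.
    futex_call(&lock_word, FUTEX_WAKE, 1);
}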

Priority inversion will bite you. Low-priority thread holds a spinlock. High-priority thread starts spinning. Low-priority thread never gets CPU time because the high-priority thread is hogging the CPU spinning. Deadlock. Priority Inheritance (PI) mutexes solve this by temporarily boosting the lock holder’s priority. Spinlocks can’t do that.
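
POSIX exposes priority inheritance as a mutex attribute. A minimal sketch of creating a PI mutex (error checking omitted; compile with -pthread):

c

#define _GNU_SOURCE
#include <pthread.h>

pthread_mutex_t pi_lock;

void init_pi_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // With PTHREAD_PRIO_INHERIT, a low-priority holder is temporarily boosted
    // to the priority of the highest-priority thread waiting on the lock.
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&pi_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}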

Cache line bouncing is your enemy. Every atomic operation invalidates the cache line on all other CPUs. Put two different spinlocks on the same 64-byte cache line (false sharing) and they contend even though they’re protecting different data. Modern allocators pad locks to cache line boundaries—alignas(64) in C++ or __attribute__((aligned(64))) in C.
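
Here's what that padding looks like in C11, as a minimal sketch: each lock occupies its own 64-byte cache line, so two hot locks can never false-share.

c

#include <stdalign.h>
#include <stdatomic.h>

// Pad each lock out to a full 64-byte cache line to avoid false sharing.
struct padded_spinlock {
    alignas(64) atomic_int lock;          // starts on a cache line boundary
    char pad[64 - sizeof(atomic_int)];    // fills out the rest of the line
};

struct padded_spinlock locks[2];          // locks[0] and locks[1] never share a line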

Critical section under 100ns, low contention (2-4 threads): Spinlock. You’ll waste less time spinning than you would on a context switch.

Critical section 100ns-10μs, moderate contention: Hybrid mutex (glibc adaptive mutex spins briefly then sleeps). PostgreSQL’s LWLock does exactly this. (A sketch of requesting an adaptive mutex follows these guidelines.)

Critical section over 10μs or high contention: Regular mutex. Let the scheduler do its job. Spinning wastes CPU that could run other threads.

Real-time requirements: Priority Inheritance mutex on a PREEMPT_RT kernel. Spinlocks cause priority inversion. Bounded latency matters more than average-case performance.
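
The "hybrid mutex" mentioned above maps to glibc's adaptive mutex type, which spins a bounded number of times in userspace before falling back to the futex sleep. A minimal sketch of requesting it (PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension, not portable POSIX):

c

#define _GNU_SOURCE
#include <pthread.h>

pthread_mutex_t adaptive_lock;

void init_adaptive_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // Spin briefly on contention, then sleep via futex(FUTEX_WAIT).
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init(&adaptive_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}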

Run perf stat -e context-switches,cache-misses on your process. High context-switches with low CPU usage means mutex overhead might be killing you—consider adaptive mutexes. High cache-misses with 100% CPU usage means cache line bouncing—your locks are too contended or you have false sharing.
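
If the process is already running, you can attach by PID for a fixed window instead of launching it under perf. Here, myserver is a placeholder for your process name:

bash

# Count events for an existing process for 10 seconds ("myserver" is a placeholder).
perf stat -e context-switches,cache-misses -p $(pgrep myserver) -- sleep 10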

Use strace -c to count syscalls. Every futex() call is a contended mutex. If you’re seeing millions per second, you have a hot lock that might benefit from sharding or lock-free techniques.
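
Attaching to a running process works too; note that ptrace-based tracing slows the target significantly, so do this on a test box rather than in production:

bash

# Attach to a running process ("myserver" is a placeholder); press Ctrl-C for the summary.
strace -c -f -p $(pgrep myserver)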

Check /proc/PID/status for voluntary vs involuntary context switches. Voluntary switches are threads yielding or blocking—normal. Involuntary switches mean the scheduler is preempting threads, possibly while they hold spinlocks.

Redis uses spinlocks for its tiny job queue because critical sections are under 50ns. PostgreSQL’s buffer pool uses spinlocks for lookup operations (nanoseconds) but mutexes for I/O operations (milliseconds). Nginx avoids the problem entirely with a multi-process architecture—no shared memory means no locks.

The Linux kernel learned this the hard way. Early 2.6 kernels used spinlocks everywhere, wasting 10-20% CPU on contended locks because preemption would stretch what should’ve been 100ns holds into milliseconds. Modern kernels use mutexes for most subsystems.

Your job is knowing your hold time and contention level. Profile with perf, measure with rdtsc, and choose the primitive that wastes the least time. A spinlock that spins for 200ns is fine. A spinlock that spins for 10ms is catastrophic. A mutex that sleeps for 5μs is fine. A mutex that sleeps to protect 20ns worth of work is wasteful.
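
A minimal sketch of that rdtsc measurement, using the __rdtsc intrinsic on x86 (divide the cycle count by your TSC frequency in GHz to get nanoseconds):

c

#include <stdio.h>
#include <pthread.h>
#include <x86intrin.h>   // __rdtsc (x86 only, GCC/Clang)

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;

int main(void) {
    unsigned long long start = __rdtsc();

    pthread_mutex_lock(&lock);
    shared_counter++;                    // the critical section under test
    pthread_mutex_unlock(&lock);

    unsigned long long cycles = __rdtsc() - start;
    printf("lock + work + unlock: %llu cycles (counter=%ld)\n", cycles, shared_counter);
    return 0;
}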

The right answer isn’t “always use X.” It’s “measure your critical section, count your threads, and pick the tool that matches your problem.”

Let’s prove everything we just discussed with working code. You’ll build two programs—one using a spinlock, one using a mutex—and watch them behave exactly as described above.

https://github.com/sysdr/howtech/tree/main/spinlock_vs_mutexes/generated

Three programs that work together:

A spinlock test that keeps your CPU at 100% while threads fight for the lock. A mutex test where threads politely sleep when blocked. A monitor that watches both programs live and shows you the CPU usage and context switch differences in real time.
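
The repository ships its own monitor; if you just want a quick look without it, a rough stand-in is a shell loop over ps and /proc (this assumes the binaries are named spinlock_test and mutex_test, as below):

bash

# Rough stand-in for the monitor: print CPU% and context-switch counters each second.
while true; do
    for name in spinlock_test mutex_test; do
        pid=$(pgrep -x "$name") || continue
        cpu=$(ps -o %cpu= -p "$pid")
        ctxt=$(grep ctxt "/proc/$pid/status" | tr '\n' ' ')
        echo "$name (pid $pid): cpu=${cpu}% $ctxt"
    done
    sleep 1
done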

First, we build a custom spinlock using C11 atomic operations. The key is atomic_compare_exchange_weak—it reads the lock, checks if it’s 0 (unlocked), and if so, sets it to 1 (locked) all in one atomic instruction. If someone else holds the lock, it keeps trying in a loop.

Save this as spinlock_test.c:

c

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>
#include <sched.h>

#define NUM_THREADS 4
#define ITERATIONS 1000000
#define HOLD_TIME_NS 100

typedef struct {
    atomic_int lock;
    long counter;
} spinlock_t;

void spinlock_acquire(spinlock_t *s) {
    int expected;
    do {
        expected = 0;
    } while (!atomic_compare_exchange_weak(&s->lock, &expected, 1));
}

void spinlock_release(spinlock_t *s) {
    atomic_store(&s->lock, 0);
}

void* worker_thread(void *arg) {
    spinlock_t *lock = (spinlock_t *)arg;
    
    for (long i = 0; i < ITERATIONS; i++) {
        spinlock_acquire(lock);
        lock->counter++;
        // Simulate roughly HOLD_TIME_NS (~100ns) of work with a short busy loop
        for (volatile int j = 0; j < 10; j++);
        spinlock_release(lock);
    }
    return NULL;
}

int main() {
    pthread_t threads[NUM_THREADS];
    spinlock_t lock = { .lock = 0, .counter = 0 };
    struct timespec start, end;
    
    clock_gettime(CLOCK_MONOTONIC, &start);
    
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_create(&threads[i], NULL, worker_thread, &lock);
    }
    
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    
    clock_gettime(CLOCK_MONOTONIC, &end);
    
    double elapsed = (end.tv_sec - start.tv_sec) + 
                     (end.tv_nsec - start.tv_nsec) / 1e9;
    
    printf("SPINLOCK Results:\n");
    printf("  Final counter: %ld\n", lock.counter);
    printf("  Time: %.3f seconds\n", elapsed);
    printf("  Operations/sec: %.0f\n", (NUM_THREADS * ITERATIONS) / elapsed);
    printf("  CPU usage: 100%% (busy-waiting)\n");
    
    return 0;
}

The atomic_compare_exchange_weak does the magic. It’s a single CPU instruction (LOCK CMPXCHG on x86) that atomically checks and sets the lock. The “weak” version might fail spuriously on some architectures, but that’s fine—we’re looping anyway.
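
One common refinement the test program skips: instead of hammering LOCK CMPXCHG in a tight loop, spin on a plain load and issue a pause hint between attempts, retrying the atomic only when the lock looks free. A sketch of that "test-and-test-and-set" acquire, meant to be dropped into spinlock_test.c (x86-only because of _mm_pause):

c

#include <immintrin.h>   // _mm_pause

// Test-and-test-and-set: spin on a cheap load, pause between checks, and only
// retry the expensive atomic CAS once the lock appears to be free.
void spinlock_acquire_ttas(spinlock_t *s) {
    for (;;) {
        int expected = 0;
        if (atomic_compare_exchange_weak(&s->lock, &expected, 1))
            return;                      // won the lock
        while (atomic_load(&s->lock) != 0)
            _mm_pause();                 // back off without bouncing the cache line
    }
}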

Now the mutex version. This uses pthread_mutex_t which calls futex() when contention happens. Threads sleep instead of spinning.

Save this as mutex_test.c:

c

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <time.h>

#define NUM_THREADS 4
#define ITERATIONS 1000000

typedef struct {
    pthread_mutex_t mutex;
    long counter;
} mutex_lock_t;

void* worker_thread(void *arg) {
    mutex_lock_t *lock = (mutex_lock_t *)arg;
    
    for (long i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&lock->mutex);
        lock->counter++;
        for (volatile int j = 0; j < 10; j++);
        pthread_mutex_unlock(&lock->mutex);
    }
    return NULL;
}

int main() {
    pthread_t threads[NUM_THREADS];
    mutex_lock_t lock = { .mutex = PTHREAD_MUTEX_INITIALIZER, .counter = 0 };
    struct timespec start, end;
    
    clock_gettime(CLOCK_MONOTONIC, &start);
    
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_create(&threads[i], NULL, worker_thread, &lock);
    }
    
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    
    clock_gettime(CLOCK_MONOTONIC, &end);
    
    double elapsed = (end.tv_sec - start.tv_sec) + 
                     (end.tv_nsec - start.tv_nsec) / 1e9;
    
    printf("MUTEX Results:\n");
    printf("  Final counter: %ld\n", lock.counter);
    printf("  Time: %.3f seconds\n", elapsed);
    printf("  Operations/sec: %.0f\n", (NUM_THREADS * ITERATIONS) / elapsed);
    printf("  CPU efficient (threads sleep when blocked)\n");
    
    pthread_mutex_destroy(&lock.mutex);
    return 0;
}

Compile both programs with these flags:

bash

gcc -Wall -Wextra -O2 -pthread spinlock_test.c -o spinlock_test
gcc -Wall -Wextra -O2 -pthread mutex_test.c -o mutex_test

The -pthread flag links the pthread library. -O2 enables optimizations without messing up our timing measurements.

Run the spinlock test and watch your CPU usage spike to 100%:

bash

./spinlock_test

While it’s running, open another terminal and check CPU usage:

bash

top -p $(pgrep spinlock_test)

You’ll see all four threads at 100% CPU. They’re spinning, waiting for the lock.

Now run the mutex test:

bash

./mutex_test

Check its CPU usage the same way. You’ll see much lower CPU usage because threads sleep when blocked instead of spinning.

This is where it gets interesting. Use strace to see what syscalls each program makes:

bash

strace -c ./spinlock_test

Look at the syscall summary. You’ll see almost no futex calls. The spinlock never talks to the kernel—it’s all userspace atomic operations.

Now trace the mutex version:

bash

strace -c ./mutex_test

Count those futex calls. Thousands of them. Every time a thread can’t acquire the lock, it calls futex(FUTEX_WAIT) to sleep. When the lock is released, another futex(FUTEX_WAKE) wakes it up. That’s the syscall overhead we talked about.

Look at the /proc filesystem to see context switches:

bash

# While spinlock_test is running in another terminal:
cat /proc/$(pgrep spinlock_test)/status | grep ctxt

# Then for mutex_test:
cat /proc/$(pgrep mutex_test)/status | grep ctxt

Spinlock will show very few voluntary context switches—threads never voluntarily give up the CPU. Mutex will show thousands of voluntary switches—threads sleeping and waking.

If you have perf installed, this shows cache behavior:

bash

perf stat -e cache-misses,cache-references ./spinlock_test
perf stat -e cache-misses,cache-references ./mutex_test

Spinlock will show a higher cache miss rate. That’s the cache line bouncing between CPUs as threads fight for the lock.

Try changing NUM_THREADS to 2, 8, or 16. Watch how spinlock performance degrades with more threads—more CPUs spinning means more wasted cycles. Mutex handles it better because only one thread runs at a time while others sleep.
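
One way to script that sweep: the snippet below assumes you wrap the NUM_THREADS define in both .c files in #ifndef NUM_THREADS ... #endif so the value can be overridden from the compiler command line.

bash

# Assumes NUM_THREADS in both sources is guarded with #ifndef NUM_THREADS ... #endif.
for n in 2 4 8 16; do
    echo "=== $n threads ==="
    gcc -Wall -Wextra -O2 -pthread -DNUM_THREADS=$n spinlock_test.c -o spinlock_test
    gcc -Wall -Wextra -O2 -pthread -DNUM_THREADS=$n mutex_test.c   -o mutex_test
    ./spinlock_test
    ./mutex_test
done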

Change HOLD_TIME_NS by adjusting the busy-wait loop. Make it longer (more iterations) and watch the spinlock waste even more CPU. This proves the point: spinlocks are only good for very short critical sections.

You built working implementations of both primitives and saw the exact behavior we described. Spinlocks burn CPU but have low latency for short holds. Mutexes use syscalls and context switches but keep your CPU available for real work. The choice depends on your critical section duration and contention level.

Now you understand why Redis uses spinlocks for nanosecond operations but PostgreSQL uses mutexes for longer database operations. You’ve seen the futex syscalls, measured the context switches, and watched the CPU usage yourself. This isn’t theory—it’s how production systems actually work.
