Finding and Fixing a 50k Goroutine Leak That Nearly Killed Production

Original link: https://skoredin.pro/blog/golang/goroutine-leak-debugging

## Goroutine Leak Summary

This article describes a major production incident caused by a goroutine leak in a WebSocket notification system. What began as mild performance degradation (sluggish APIs, rising timeouts, creeping memory usage) quickly escalated to a critical state, peaking at more than 50,000 goroutines and 47GB of memory. The root cause was threefold: `context.Context` objects whose `cancel()` was never called, leaked `time.Ticker` instances (which are not garbage collected automatically), and channels that were never closed. Although the code passed review, the leak went unnoticed until Uber's `goleak` tool revealed goroutines still running after WebSocket disconnects. The fix added proper cleanup in `Subscribe` and `Unsubscribe`: cancelling contexts, stopping tickers, and closing channels. A staged rollout, including an emergency goroutine limit and a one-off cleanup script, gradually restored system health. Key lessons: manage goroutine lifecycles with `context.Context`, proactively monitor `runtime.NumGoroutine()`, and add leak-detection tests (such as `goleak`) to CI/CD. The incident highlights the silent, steadily compounding nature of goroutine leaks and the heavy cost, both financial and reputational, of ignoring them.


Key Takeaways

  • Goroutine leaks are silent killers - they grow slowly until critical
  • Always use context.Context for goroutine lifecycle management
  • Monitor runtime.NumGoroutine() in production
  • Unbuffered channels without readers are the #1 cause of leaks (see the sketch after this list)
  • Use pprof and runtime/trace for diagnosis
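
As a quick illustration of the unbuffered-channel cause above (a toy example, not code from the incident): a goroutine that sends on an unbuffered channel nobody ever reads from blocks forever and is never collected.

// A goroutine stuck sending on an unbuffered channel with no reader.
package main

import (
    "fmt"
    "runtime"
    "time"
)

func lookup() chan int {
    ch := make(chan int) // unbuffered
    go func() {
        ch <- 42 // blocks forever if the caller never receives
    }()
    return ch
}

func main() {
    for i := 0; i < 100; i++ {
        _ = lookup() // caller walks away without reading
    }
    time.Sleep(100 * time.Millisecond)
    fmt.Println("goroutines:", runtime.NumGoroutine()) // ~101 instead of 1
}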

The Symptoms That Everyone Ignored

It started innocently. A developer mentioned the API felt "sluggish" during sprint review. QA reported timeouts were "slightly higher." DevOps noted memory was "trending up but within limits."

Everyone had a piece of the puzzle. Nobody saw the picture.

Here's what we were looking at:

Week 1: 1,200 goroutines, 2.1GB RAM, 250ms p99 latency
Week 2: 3,400 goroutines, 3.8GB RAM, 380ms p99 latency  
Week 3: 8,900 goroutines, 7.2GB RAM, 610ms p99 latency
Week 4: 19,000 goroutines, 14GB RAM, 1.4s p99 latency
Week 5: 34,000 goroutines, 28GB RAM, 8.3s p99 latency
Week 6: 50,847 goroutines, 47GB RAM, 32s p99 latency ← You are here

Classic exponential growth. Classic "someone else's problem."

The Code That Looked Perfectly Fine

The leak was in our WebSocket notification system. Here's the simplified version:

func (s *NotificationService) Subscribe(userID string, ws *websocket.Conn) {
    ctx, cancel := context.WithCancel(context.Background())
    
    sub := &subscription{
        userID: userID,
        ws:     ws,
        cancel: cancel,
    }
    
    s.subscribers[userID] = sub
    
    // Start the message pump
    go s.pumpMessages(ctx, sub)
    
    // Start the heartbeat
    go s.heartbeat(ctx, sub)
}

func (s *NotificationService) pumpMessages(ctx context.Context, sub *subscription) {
    for {
        select {
        case <-ctx.Done():
            return // cleanup on Done()
        case msg := <-sub.messages:
            if err := sub.ws.WriteJSON(msg); err != nil {
                return
            }
        }
    }
}

Looks reasonable, right? Context for cancellation. Cleanup on Done(). This passed code review from three senior engineers.

Enter goleak: Uber's Secret Weapon

After staring at code for 2 hours, I remembered Uber's blog posts on hunting goroutine leaks. Their goleak library is like pprof, but built specifically to flag goroutines that are still alive when a test finishes.

Installation took 30 seconds:

import (
    "testing"
    "time"

    "go.uber.org/goleak"
)

func TestNotificationService_NoLeak(t *testing.T) {
    defer goleak.VerifyNone(t)
    
    // Your test here
    service := NewNotificationService()
    ws := mockWebSocket()
    service.Subscribe("user123", ws)
    
    // Simulate disconnect
    ws.Close()
    
    time.Sleep(100 * time.Millisecond) // Let cleanup happen
}

The test failed immediately:

found unexpected goroutines:
[Goroutine 18 in state select, with NotificationService.pumpMessages on top of the stack:
goroutine 18 [select]:
    service.(*NotificationService).pumpMessages(0xc0001d4000, {0x1038c20, 0xc0001d6000}, 0xc0001d8000)
        /app/notification.go:45 +0x85
    created by service.(*NotificationService).Subscribe
        /app/notification.go:32 +0x1a5

Goroutine 19 in state select, with NotificationService.heartbeat on top of the stack:
goroutine 19 [select]:
    service.(*NotificationService).heartbeat(0xc0001d4000, {0x1038c20, 0xc0001d6000}, 0xc0001d8000)
        /app/notification.go:58 +0x92
]

Two goroutines still running after the WebSocket closed. But why?

The Three Bugs That Created a Perfect Storm

Bug #1: Nobody Called Cancel

func (s *NotificationService) Subscribe(userID string, ws *websocket.Conn) {
    ctx, cancel := context.WithCancel(context.Background())
    
    // We save the subscription...
    sub := &subscription{
        userID: userID,
        ws:     ws,
        cancel: cancel,  // But who calls this?
    }
}

When the WebSocket disconnected, we never called cancel(). The goroutines lived forever, waiting for a context that would never close.
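
The usual guarantee here (a sketch of the general pattern, not the eventual fix, which comes later) is to tie cancel() to whatever notices the disconnect, such as the read loop every WebSocket handler needs anyway:

// readLoop is a hypothetical reader goroutine: when the peer goes away,
// ReadMessage returns an error and the deferred cancel tears everything down.
func (s *NotificationService) readLoop(cancel context.CancelFunc, sub *subscription) {
    defer cancel()
    for {
        if _, _, err := sub.ws.ReadMessage(); err != nil {
            return
        }
    }
}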

Bug #2: The Heartbeat Ticker Memory Leak

func (s *NotificationService) heartbeat(ctx context.Context, sub *subscription) {
    ticker := time.NewTicker(30 * time.Second)
    // WHERE IS ticker.Stop() ???

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            sub.ws.WriteControl(websocket.PingMessage, nil, time.Now().Add(time.Second))
        }
    }
}

Every leaked goroutine held a ticker, and a ticker keeps its timer registered with the runtime until you call Stop. 50,000 live tickers = very unhappy runtime.
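
For contrast, a sketch of the heartbeat with the ticker tied to the goroutine's lifetime (assuming the same ping-on-tick behavior as above):

func (s *NotificationService) heartbeat(ctx context.Context, sub *subscription) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop() // release the runtime timer when this goroutine exits

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            sub.ws.WriteControl(websocket.PingMessage, nil, time.Now().Add(time.Second))
        }
    }
}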

Bug #3: The Channel That Never Closed

type subscription struct {
    userID   string
    ws       *websocket.Conn
    messages chan Message  // Who closes this?
    cancel   context.CancelFunc
}

Writers kept sending to sub.messages for subscribers that were long gone. Nobody ever closed the channel, so sends backed up, and the goroutines stuck behind them piled up. Memory grew. Pain grew.
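
One way to keep publishers from getting stuck behind a dead subscriber (a sketch, not code from the incident; it assumes the sender has access to the subscription's context) is to make every send either deliver or give up on cancellation:

// publish delivers a message to one subscriber without ever blocking forever.
func (s *NotificationService) publish(ctx context.Context, sub *subscription, msg Message) {
    select {
    case sub.messages <- msg:
        // delivered (or buffered)
    case <-ctx.Done():
        // subscriber is being cleaned up; drop the message
    }
}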

The Production Debugging Session From Hell

We couldn't just restart production. Too many active users. We needed to find and fix it live.

Step 1: Get a goroutine dump:

curl http://api-server:6060/debug/pprof/goroutine?debug=2 > goroutines.txt
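
This assumes the pprof handlers are already exposed on port 6060. If they aren't, the standard way to mount them is the blank import plus a small internal-only HTTP server (a sketch):

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func init() {
    go func() {
        // Keep this port internal; pprof output is sensitive.
        log.Println(http.ListenAndServe(":6060", nil))
    }()
}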

Step 2: Analyze the patterns:

# Count goroutines by the function at the top of each stack
grep -A1 '^goroutine ' goroutines.txt \
  | grep -v -e '^goroutine ' -e '^--' \
  | sed 's/(0x.*//' \
  | sort | uniq -c | sort -rn

# Results:
# 25,423 NotificationService.pumpMessages
# 25,423 NotificationService.heartbeat
#     12 http.(*conn).serve
#      8 runtime.gcBgMarkWorker
#     ... normal stuff ...

50,846 goroutines in our notification service. We had about 1,000 active WebSocket connections.

Step 3: Find the pattern:

// Added emergency diagnostics endpoint
http.HandleFunc("/debug/subscriptions", func(w http.ResponseWriter, r *http.Request) {
    s.mu.Lock()
    defer s.mu.Unlock()
    
    active := 0
    for _, sub := range s.subscribers {
        // Try to ping the connection
        err := sub.ws.WriteControl(websocket.PingMessage, nil, time.Now().Add(time.Second))
        if err == nil {
            active++
        }
    }
    
    fmt.Fprintf(w, "Total subscriptions: %d\n", len(s.subscribers))
    fmt.Fprintf(w, "Active connections: %d\n", active)
    fmt.Fprintf(w, "Leaked goroutines: ~%d\n", (len(s.subscribers) - active) * 2)
})

Result:

Total subscriptions: 25,423
Active connections: 1,047
Leaked goroutines: ~48,752

Bingo. We were keeping subscriptions for dead connections.

The Fix That Saved the Weekend

Here's the fixed version:

func (s *NotificationService) Subscribe(userID string, ws *websocket.Conn) {
    ctx, cancel := context.WithCancel(context.Background())
    
    sub := &subscription{
        userID:   userID,
        ws:       ws,
        messages: make(chan Message, 10),
        cancel:   cancel,
    }
    
    s.mu.Lock()
    s.subscribers[userID] = sub
    s.mu.Unlock()
    
    // Critical: Setup cleanup handler
    ws.SetCloseHandler(func(code int, text string) error {
        s.Unsubscribe(userID)
        return nil
    })
    
    // Start goroutines
    go s.pumpMessages(ctx, sub)
    go s.heartbeat(ctx, sub)
    
    // Monitor the connection
    go s.monitorConnection(ctx, sub)
}

func (s *NotificationService) Unsubscribe(userID string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    
    if sub, exists := s.subscribers[userID]; exists {
        sub.cancel()                    // Stop goroutines
        close(sub.messages)              // Close channel
        delete(s.subscribers, userID)   // Remove reference
    }
}

func (s *NotificationService) monitorConnection(ctx context.Context, sub *subscription) {
    defer s.Unsubscribe(sub.userID)  // Cleanup on exit
    for {
        select {
        case <-ctx.Done():
            return
        case <-time.After(30 * time.Second):
            // A failed ping means the client is gone; returning triggers cleanup
            if sub.ws.WriteControl(websocket.PingMessage, nil, time.Now().Add(time.Second)) != nil {
                return
            }
        }
    }
}

The Gradual Recovery

We couldn't just deploy and pray. 50,000 goroutines don't just disappear.

Phase 1: Stop the bleeding (deployed immediately):

// Emergency goroutine limiter
if runtime.NumGoroutine() > 10000 {
    http.Error(w, "Server overloaded", 503)
    return
}
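
That check has to sit on the request path; one way to wire it in is as middleware (a sketch; the handler wiring is illustrative, not from the original service):

// goroutineLimit sheds load once the process has too many goroutines.
func goroutineLimit(max int, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if runtime.NumGoroutine() > max {
            http.Error(w, "Server overloaded", http.StatusServiceUnavailable)
            return
        }
        next.ServeHTTP(w, r)
    })
}

// Usage: http.ListenAndServe(":8080", goroutineLimit(10000, mux))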

Phase 2: Clean up existing leaks (ran manually):

// One-time cleanup script: ping every subscription and drop the dead ones
func (s *NotificationService) emergencyCleanup() {
    s.mu.RLock()
    var dead []string
    for userID, sub := range s.subscribers {
        err := sub.ws.WriteControl(websocket.PingMessage, nil, time.Now().Add(time.Second))
        if err != nil {
            // Dead connection
            dead = append(dead, userID)
        }
    }
    s.mu.RUnlock()

    // Unsubscribe outside the read lock (Unsubscribe takes the write lock)
    for _, userID := range dead {
        s.Unsubscribe(userID)
    }
}

Phase 3: Monitor the recovery:

3:00 AM: 50,847 goroutines, 47GB RAM
3:15 AM: 45,231 goroutines, 43GB RAM (cleanup running)
3:30 AM: 32,109 goroutines, 31GB RAM
4:00 AM: 15,443 goroutines, 18GB RAM
5:00 AM: 3,221 goroutines, 5.4GB RAM
6:00 AM: 1,098 goroutines, 2.1GB RAM ← Normal

The Monitoring We Should Have Had

We added these alerts immediately:

// Prometheus metrics
var (
    goroutineGauge = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "go_goroutines_count",
            Help: "Current number of goroutines",
        },
    )

    subscriptionGauge = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "websocket_subscriptions_total",
            Help: "Total WebSocket subscriptions",
        },
    )

    activeConnectionsGauge = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "websocket_connections_active",
            Help: "Active WebSocket connections",
        },
    )
)

func init() {
    prometheus.MustRegister(goroutineGauge, subscriptionGauge, activeConnectionsGauge)
}

// Update every 10 seconds (time.Tick is acceptable here: this goroutine lives for the whole process)
go func() {
    for range time.Tick(10 * time.Second) {
        goroutineGauge.Set(float64(runtime.NumGoroutine()))
        
        s.mu.RLock()
        subscriptionGauge.Set(float64(len(s.subscribers)))
        s.mu.RUnlock()
        
        activeConnectionsGauge.Set(float64(s.countActiveConnections()))
    }
}()
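
countActiveConnections isn't shown above; a sketch of what it might look like, reusing the same ping check as the diagnostics endpoint:

// countActiveConnections pings every subscriber and counts the ones still answering.
func (s *NotificationService) countActiveConnections() int {
    s.mu.RLock()
    defer s.mu.RUnlock()

    active := 0
    for _, sub := range s.subscribers {
        if sub.ws.WriteControl(websocket.PingMessage, nil, time.Now().Add(time.Second)) == nil {
            active++
        }
    }
    return active
}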

Alert configuration:

- alert: GoroutineLeakSuspected
  expr: go_goroutines_count > 5000
  for: 10m
  annotations:
    summary: "Possible goroutine leak detected"
    
- alert: WebSocketLeakDetected  
  expr: websocket_subscriptions_total > websocket_connections_active * 1.5
  for: 5m
  annotations:
    summary: "WebSocket subscriptions exceeding active connections"

The Testing Strategy That Would Have Caught This

func TestWebSocketNoGoroutineLeak(t *testing.T) {
    // Baseline goroutines
    runtime.GC()
    baseline := runtime.NumGoroutine()

    service := NewNotificationService()

    // Simulate 100 connections that connect and then drop
    for i := 0; i < 100; i++ {
        ws := mockWebSocket()
        service.Subscribe(fmt.Sprintf("user%d", i), ws)
        ws.Close()
    }
    time.Sleep(500 * time.Millisecond) // let cleanup run

    // Goroutine count should be back near the baseline
    if final := runtime.NumGoroutine(); final > baseline+10 {
        t.Fatalf("goroutine leak: baseline %d, now %d", baseline, final)
    }
}

And the continuous leak detector:

func TestContinuousWebSocketLoad(t *testing.T) {
    defer goleak.VerifyNone(t,
        goleak.IgnoreTopFunction("net/http.(*Server).Serve"),
    )

    service := NewNotificationService()

    // Simulate realistic usage: 24 "hours" of connections that come and go
    for hour := 0; hour < 24; hour++ {
        for i := 0; i < 100; i++ {
            ws := mockWebSocket()
            service.Subscribe(fmt.Sprintf("user%d_%d", hour, i), ws)
            ws.Close()
        }

        if runtime.NumGoroutine() > 1000 {
            t.Fatalf("Goroutine leak: %d goroutines after %d hours", 
                runtime.NumGoroutine(), hour)
        }
    }
}

Security Considerations

Security Implications of Goroutine Leaks

  • DoS Attack Vector: Attackers can trigger goroutine creation to exhaust resources
  • Memory Exhaustion: Leads to OOM kills and service unavailability
  • Timing Attacks: Degraded performance can expose timing vulnerabilities
  • Resource Starvation: Can prevent legitimate requests from being processed

Secure Goroutine Management

// Rate limit goroutine creation
type GoroutinePool struct {
    sem    chan struct{}
    wg     sync.WaitGroup
    ctx    context.Context
    cancel context.CancelFunc
}

func NewGoroutinePool(maxGoroutines int) *GoroutinePool {
    ctx, cancel := context.WithCancel(context.Background())
    return &GoroutinePool{
        sem:    make(chan struct{}, maxGoroutines),
        ctx:    ctx,
        cancel: cancel,
    }
}

func (p *GoroutinePool) Go(fn func()) error {
    select {
    case p.sem <- struct{}{}: // acquire a slot
        p.wg.Add(1)
        go func() {
            defer p.wg.Done()
            defer func() { <-p.sem }() // release the slot
            fn()
        }()
        return nil
    case <-p.ctx.Done():
        return p.ctx.Err()
    default:
        return errors.New("goroutine limit reached")
    }
}
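
Tearing the pool down and a typical call site might look like this (a sketch; Stop and the workload shown are illustrative, not from the original post):

// Stop cancels the pool's context and waits for in-flight work to finish.
func (p *GoroutinePool) Stop() {
    p.cancel()
    p.wg.Wait()
}

// Usage:
//   pool := NewGoroutinePool(1000)
//   defer pool.Stop()
//   if err := pool.Go(func() { handleSubscription() }); err != nil {
//       // over the limit: shed load instead of leaking
//   }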

Enhanced Testing Strategy

1. Leak Detection Tests

func TestNoGoroutineLeaks(t *testing.T) {
    defer goleak.VerifyNone(t,
        goleak.IgnoreTopFunction("database/sql.(*DB).connectionOpener"),
    )

    // Your test code
    service := NewService()
    service.Start()
    defer service.Stop()

    // Simulate load
    for i := 0; i < 100; i++ {
        service.HandleRequest(fmt.Sprintf("request-%d", i)) // hypothetical entry point
    }
}

2. Load Testing for Leaks

func TestGoroutineGrowthUnderLoad(t *testing.T) {
    initial := runtime.NumGoroutine()

    service := NewService()
    defer service.Stop()

    // Generate load, checking goroutine growth as we go
    for i := 0; i < 10000; i++ {
        service.HandleRequest(fmt.Sprintf("request-%d", i)) // hypothetical entry point

        if i%1000 == 0 {
            if growth := runtime.NumGoroutine() - initial; growth > 100 {
                t.Fatalf("Excessive goroutine growth: %d", growth)
            }
        }
    }
}

3. Benchmark with Goroutine Tracking

func BenchmarkServiceWithGoroutineTracking(b *testing.B) {
    initial := runtime.NumGoroutine()
    service := NewService()
    defer service.Stop()

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        service.HandleRequest("bench-request") // hypothetical entry point
    }

    if final := runtime.NumGoroutine(); final > initial+10 {
        b.Fatalf("Goroutine leak detected: %d -> %d", initial, final)
    }
}

Lessons Burned Into My Brain

1. Every Goroutine Needs an Exit Strategy

// Bad: Fire and forget
go doSomething()

// Good: Controlled lifecycle
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
go doSomething(ctx)

2. Tickers Are Not Garbage Collected

// This leaks
ticker := time.NewTicker(time.Second)

// This doesn't
ticker := time.NewTicker(time.Second)
defer ticker.Stop()

3. Monitor Goroutines Like Memory

If you monitor memory usage, monitor goroutine count. They're equally important.
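
If full metrics feel like too much for a small service, even a periodic log line makes a slow leak visible (a minimal sketch):

// Log the goroutine count every 30 seconds so a slow leak shows up in the logs.
go func() {
    for range time.Tick(30 * time.Second) {
        log.Printf("goroutines=%d", runtime.NumGoroutine())
    }
}()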

4. Test for Leaks, Not Just Correctness

// Add to every concurrent test
defer goleak.VerifyNone(t)

5. WebSockets Are Goroutine Factories

Every WebSocket connection typically spawns 2-3 goroutines. 10,000 connections = 30,000 goroutines. Plan accordingly.

The Cost of This Bug

  • 6 weeks of degraded performance
  • ~500 customer complaints
  • 3 engineers × 20 hours debugging
  • $4,000 in extra AWS costs (RAM scaling)
  • 1 very stressed on-call rotation
  • Immeasurable reputation damage

All because we forgot to call cancel() and ticker.Stop().

The Bottom Line

Goroutine leaks are memory leaks with extra steps. They're harder to spot, harder to debug, and cause weird cascading failures.

But they're easy to prevent:

  1. Every go needs a way to stop
  2. Every NewTicker needs a Stop()
  3. Every make(chan) needs a close()
  4. Every Subscribe needs an Unsubscribe

And for the love of all that is holy, use goleak in your tests.


P.S. We now have a pre-commit hook that looks for time.NewTicker without defer ticker.Stop(). It's rejected 17 PRs so far. Each one could have been another 3 AM wake-up call. Worth every false positive.
