Key Takeaways
- Goroutine leaks are silent killers: counts grow slowly until the service falls over
- Always use context.Context for goroutine lifecycle management
- Monitor runtime.NumGoroutine() in production
- Channel sends that no one will ever receive are the #1 cause of leaks
- Use pprof and runtime/trace for diagnosis
The Symptoms That Everyone Ignored
It started innocently. A developer mentioned the API felt "sluggish" during sprint review. QA reported timeouts were "slightly higher." DevOps noted memory was "trending up but within limits."
Everyone had a piece of the puzzle. Nobody saw the picture.
Here's what we were looking at:
Week 1: 1,200 goroutines, 2.1GB RAM, 250ms p99 latency
Week 2: 3,400 goroutines, 3.8GB RAM, 380ms p99 latency
Week 3: 8,900 goroutines, 7.2GB RAM, 610ms p99 latency
Week 4: 19,000 goroutines, 14GB RAM, 1.4s p99 latency
Week 5: 34,000 goroutines, 28GB RAM, 8.3s p99 latency
Week 6: 50,847 goroutines, 47GB RAM, 32s p99 latency ← You are here
Classic exponential growth. Classic "someone else's problem."
The Code That Looked Perfectly Fine
The leak was in our WebSocket notification system. Here's the simplified version:
func (s *NotificationService) Subscribe(userID string, ws *websocket.Conn) {
    ctx, cancel := context.WithCancel(context.Background())
    sub := &subscription{
        userID: userID,
        ws:     ws,
        cancel: cancel,
    }
    s.subscribers[userID] = sub
    // Start the message pump
    go s.pumpMessages(ctx, sub)
    // Start the heartbeat
    go s.heartbeat(ctx, sub)
}
func (s *NotificationService) pumpMessages(ctx context.Context, sub *subscription) {
    for {
        select {
        case <-ctx.Done():
            return // the only exit path
        case msg := <-sub.messages:
            sub.ws.WriteJSON(msg) // write errors silently ignored
        }
    }
}
Looks reasonable, right? Context for cancellation. Cleanup on Done(). This passed code review from three senior engineers.
Enter goleak: Uber's Leak Detector
After staring at the code for two hours, I remembered Uber's writing on goroutine leak detection. Their goleak library is like pprof's goroutine profile turned into a test assertion: it fails any test that leaves unexpected goroutines behind.
Installation took 30 seconds:
import (
    "go.uber.org/goleak"
)

func TestNotificationService_NoLeak(t *testing.T) {
    defer goleak.VerifyNone(t)
    // Your test here
    service := NewNotificationService()
    ws := mockWebSocket()
    service.Subscribe("user123", ws)
    // Simulate disconnect
    ws.Close()
    time.Sleep(100 * time.Millisecond) // Let cleanup happen
}
The test failed immediately:
found unexpected goroutines:
[Goroutine 18 in state select, with service.(*NotificationService).pumpMessages on top of the stack:
goroutine 18 [select]:
service.(*NotificationService).pumpMessages(0xc0001d4000, {0x1038c20, 0xc0001d6000}, 0xc0001d8000)
        /app/notification.go:45 +0x85
created by service.(*NotificationService).Subscribe
        /app/notification.go:32 +0x1a5

Goroutine 19 in state select, with service.(*NotificationService).heartbeat on top of the stack:
goroutine 19 [select]:
service.(*NotificationService).heartbeat(0xc0001d4000, {0x1038c20, 0xc0001d6000}, 0xc0001d8000)
        /app/notification.go:58 +0x92
]
Two goroutines still running after the WebSocket closed. But why?
The Three Bugs That Created a Perfect Storm
Bug #1: Nobody Called Cancel
func (s *NotificationService) Subscribe(userID string, ws *websocket.Conn) {
    ctx, cancel := context.WithCancel(context.Background())
    // We save the subscription...
    sub := &subscription{
        userID: userID,
        ws:     ws,
        cancel: cancel, // But who calls this?
    }
}
When the WebSocket disconnected, we never called cancel(). The goroutines lived forever, waiting for a context that would never close.
Bug #2: The Heartbeat Ticker Memory Leak
func (s *NotificationService) heartbeat(ctx context.Context, sub *subscription) {
    ticker := time.NewTicker(30 * time.Second)
    // WHERE IS ticker.Stop() ???
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            sub.ws.WriteControl(websocket.PingMessage, nil, time.Now().Add(time.Second)) // ping errors ignored too
        }
    }
}
Every leaked goroutine held a ticker. Each ticker is an entry in the runtime's timer heap, firing every 30 seconds. 50,000 live timers = very unhappy runtime.
Bug #3: The Channel That Never Closed
type subscription struct {
userID string
ws *websocket.Conn
messages chan Message // Who closes this?
cancel context.CancelFunc
}
Writers kept sending to sub.messages for users who were long gone. A Go channel never grows past its buffer; once it filled, every send blocked forever, pinning the sending goroutine and its message in memory. Memory grew. Pain grew.
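One containment pattern would have limited the damage even with the other bugs present: never send without an escape hatch. This helper is mine, not the original code's; it assumes the caller has the subscriber's context handy:

// Hypothetical helper: a send that can't block forever. If the subscriber's
// context is cancelled, the send aborts instead of stranding the sender.
func trySend(ctx context.Context, ch chan<- Message, msg Message) error {
    select {
    case ch <- msg:
        return nil
    case <-ctx.Done():
        return ctx.Err() // subscriber gone; drop the message
    }
}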
The Production Debugging Session From Hell
We couldn't just restart production. Too many active users. We needed to find and fix it live.
Step 1: Get a goroutine dump:
curl http://api-server:6060/debug/pprof/goroutine?debug=2 > goroutines.txt
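That endpoint only exists because our binary mounts net/http/pprof; if yours doesn't, it's a few lines (a sketch; keep the debug port off the public network):

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on the default mux
)

func init() {
    go func() {
        // Debug-only listener; bind it to localhost or a private interface
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}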
Step 2: Analyze the patterns:
# Count goroutines by the function at the top of each stack
grep -A1 "^goroutine " goroutines.txt | grep -v -e "^goroutine " -e "^--" | sed 's/([^(]*$//' | sort | uniq -c | sort -rn
# Results:
#   25423 service.(*NotificationService).pumpMessages
#   25423 service.(*NotificationService).heartbeat
#      12 net/http.(*conn).serve
#       8 runtime.gcBgMarkWorker
#   ... normal stuff ...
50,846 goroutines in our notification service. We had about 1,000 active WebSocket connections.
Step 3: Find the pattern:
// Added emergency diagnostics endpoint
http.HandleFunc("/debug/subscriptions", func(w http.ResponseWriter, r *http.Request) {
    s.mu.Lock()
    defer s.mu.Unlock()
    active := 0
    for _, sub := range s.subscribers {
        // Try to ping the connection
        err := sub.ws.WriteControl(websocket.PingMessage, nil, time.Now().Add(time.Second))
        if err == nil {
            active++
        }
    }
    fmt.Fprintf(w, "Total subscriptions: %d\n", len(s.subscribers))
    fmt.Fprintf(w, "Active connections: %d\n", active)
    fmt.Fprintf(w, "Leaked goroutines: ~%d\n", (len(s.subscribers)-active)*2)
})
Result:
Total subscriptions: 25,423
Active connections: 1,047
Leaked goroutines: ~48,752
Bingo. We were keeping subscriptions for dead connections.
The Fix That Saved the Weekend
Here's the fixed version:
func (s *NotificationService) Subscribe(userID string, ws *websocket.Conn) {
    ctx, cancel := context.WithCancel(context.Background())
    sub := &subscription{
        userID:   userID,
        ws:       ws,
        messages: make(chan Message, 10),
        cancel:   cancel,
    }
    s.mu.Lock()
    s.subscribers[userID] = sub
    s.mu.Unlock()
    // Critical: Setup cleanup handler
    ws.SetCloseHandler(func(code int, text string) error {
        s.Unsubscribe(userID)
        return nil
    })
    // Start goroutines
    go s.pumpMessages(ctx, sub)
    go s.heartbeat(ctx, sub)
    // Monitor the connection
    go s.monitorConnection(ctx, sub)
}

func (s *NotificationService) Unsubscribe(userID string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if sub, exists := s.subscribers[userID]; exists {
        sub.cancel()                  // Stop goroutines
        close(sub.messages)           // Close channel
        delete(s.subscribers, userID) // Remove reference
    }
}
func (s *NotificationService) monitorConnection(ctx context.Context, sub *subscription) {
    defer s.Unsubscribe(sub.userID) // Cleanup on exit
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            // Any write error means the connection is dead
            err := sub.ws.WriteControl(websocket.PingMessage, nil, time.Now().Add(time.Second))
            if err != nil {
                return
            }
        }
    }
}
Unsubscribe now runs from both the close handler and the monitor's deferred call; the exists check inside it makes the second invocation a harmless no-op.
The Gradual Recovery
We couldn't just deploy and pray. 50,000 goroutines don't just disappear.
Phase 1: Stop the bleeding (deployed immediately):
// Emergency goroutine limiter, applied as middleware on every request
func overloadGuard(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if runtime.NumGoroutine() > 10000 {
            http.Error(w, "Server overloaded", http.StatusServiceUnavailable)
            return
        }
        next.ServeHTTP(w, r)
    })
}
Phase 2: Clean up existing leaks (ran manually):
// One-time cleanup: ping every subscription, unsubscribe the dead ones
func (s *NotificationService) emergencyCleanup() {
    // Snapshot under the read lock so Unsubscribe can take the write lock
    s.mu.RLock()
    subs := make(map[string]*subscription, len(s.subscribers))
    for userID, sub := range s.subscribers {
        subs[userID] = sub
    }
    s.mu.RUnlock()
    for userID, sub := range subs {
        err := sub.ws.WriteControl(websocket.PingMessage, nil, time.Now().Add(time.Second))
        if err != nil {
            // Dead connection
            s.Unsubscribe(userID)
        }
    }
}
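One way to run something like this "manually" is an admin-only debug endpoint, in the same spirit as /debug/subscriptions (a hypothetical sketch, not the exact trigger we used):

http.HandleFunc("/debug/emergency-cleanup", func(w http.ResponseWriter, r *http.Request) {
    go s.emergencyCleanup() // pinging thousands of sockets takes a while; don't block the request
    fmt.Fprintln(w, "cleanup started")
})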
Phase 3: Monitor the recovery:
3:00 AM: 50,847 goroutines, 47GB RAM
3:15 AM: 45,231 goroutines, 43GB RAM (cleanup running)
3:30 AM: 32,109 goroutines, 31GB RAM
4:00 AM: 15,443 goroutines, 18GB RAM
5:00 AM: 3,221 goroutines, 5.4GB RAM
6:00 AM: 1,098 goroutines, 2.1GB RAM ← Normal
The Monitoring We Should Have Had
We added these alerts immediately:
// Prometheus metrics
// (Note: the default Go collector already exports go_goroutines;
// a custom gauge like this works too if you run your own registry.)
var (
    goroutineGauge = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "go_goroutines_count",
            Help: "Current number of goroutines",
        },
    )
    subscriptionGauge = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "websocket_subscriptions_total",
            Help: "Total WebSocket subscriptions",
        },
    )
    activeConnectionsGauge = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "websocket_connections_active",
            Help: "Active WebSocket connections",
        },
    )
)

func init() {
    prometheus.MustRegister(goroutineGauge, subscriptionGauge, activeConnectionsGauge)
}

// Update every 10 seconds
go func() {
    ticker := time.NewTicker(10 * time.Second) // practice what we preach:
    defer ticker.Stop()                        // no naked time.Tick here
    for range ticker.C {
        goroutineGauge.Set(float64(runtime.NumGoroutine()))
        s.mu.RLock()
        subscriptionGauge.Set(float64(len(s.subscribers)))
        s.mu.RUnlock()
        activeConnectionsGauge.Set(float64(s.countActiveConnections()))
    }
}()
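We never showed countActiveConnections; a minimal sketch, reusing the ping trick from the diagnostics endpoint:

func (s *NotificationService) countActiveConnections() int {
    s.mu.RLock()
    defer s.mu.RUnlock()
    active := 0
    for _, sub := range s.subscribers {
        // A successful ping write means the connection is still alive
        deadline := time.Now().Add(time.Second)
        if sub.ws.WriteControl(websocket.PingMessage, nil, deadline) == nil {
            active++
        }
    }
    return active
}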
Alert configuration:
- alert: GoroutineLeakSuspected
  expr: go_goroutines_count > 5000
  for: 10m
  annotations:
    summary: "Possible goroutine leak detected"

- alert: WebSocketLeakDetected
  expr: websocket_subscriptions_total > websocket_connections_active * 1.5
  for: 5m
  annotations:
    summary: "WebSocket subscriptions exceeding active connections"
The Testing Strategy That Would Have Caught This
func TestWebSocketNoGoroutineLeak(t *testing.T) {
    // Baseline goroutines
    runtime.GC()
    baseline := runtime.NumGoroutine()
    service := NewNotificationService()
    // Simulate 100 connections
    for i := 0; i < 100; i++ {
        ws := mockWebSocket()
        service.Subscribe(fmt.Sprintf("user%d", i), ws)
        ws.Close()
    }
    time.Sleep(100 * time.Millisecond) // Let cleanup handlers run
    runtime.GC()
    if got := runtime.NumGoroutine(); got > baseline+2 {
        t.Fatalf("goroutine leak: baseline %d, now %d", baseline, got)
    }
}
And the continuous leak detector:
func TestContinuousWebSocketLoad(t *testing.T) {
    defer goleak.VerifyNone(t,
        goleak.IgnoreTopFunction("net/http.(*Server).Serve"),
    )
    service := NewNotificationService()
    // Simulate realistic usage
    for hour := 0; hour < 24; hour++ {
        simulateHourOfTraffic(service)
        if runtime.NumGoroutine() > 1000 {
            t.Fatalf("Goroutine leak: %d goroutines after %d hours",
                runtime.NumGoroutine(), hour)
        }
    }
}
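simulateHourOfTraffic is whatever churn best mirrors your production traffic; a minimal stand-in (the helper name and numbers are illustrative):

func simulateHourOfTraffic(service *NotificationService) {
    // Connect, subscribe, disconnect: the churn that exposed our leak
    for i := 0; i < 500; i++ {
        ws := mockWebSocket()
        service.Subscribe(fmt.Sprintf("load-user-%d", i), ws)
        ws.Close()
    }
    time.Sleep(10 * time.Millisecond) // let close handlers run
}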
Security Considerations
Security Implications of Goroutine Leaks
- DoS Attack Vector: Attackers can trigger goroutine creation to exhaust resources
- Memory Exhaustion: Leads to OOM kills and service unavailability
- Timing Attacks: Degraded performance can expose timing vulnerabilities
- Resource Starvation: Can prevent legitimate requests from being processed
Secure Goroutine Management
// Rate limit goroutine creation
type GoroutinePool struct {
    sem    chan struct{}
    wg     sync.WaitGroup
    ctx    context.Context
    cancel context.CancelFunc
}

func NewGoroutinePool(maxGoroutines int) *GoroutinePool {
    ctx, cancel := context.WithCancel(context.Background())
    return &GoroutinePool{
        sem:    make(chan struct{}, maxGoroutines),
        ctx:    ctx,
        cancel: cancel,
    }
}

func (p *GoroutinePool) Go(fn func()) error {
    // Block until a slot frees up, or bail out if the pool is shutting down
    select {
    case p.sem <- struct{}{}:
    case <-p.ctx.Done():
        return p.ctx.Err()
    }
    p.wg.Add(1)
    go func() {
        defer p.wg.Done()
        defer func() { <-p.sem }() // release the slot
        fn()
    }()
    return nil
}
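A pool like this also needs a way to drain on shutdown; a minimal sketch (Shutdown is my addition, and handleSubscription/conn are illustrative names):

// Shutdown stops accepting new work and waits for in-flight goroutines.
func (p *GoroutinePool) Shutdown() {
    p.cancel()
    p.wg.Wait()
}

Usage then replaces a bare go statement:

pool := NewGoroutinePool(1000)
if err := pool.Go(func() { handleSubscription(conn) }); err != nil {
    log.Printf("rejected work: %v", err) // shed load instead of leaking it
}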
Enhanced Testing Strategy
1. Leak Detection Tests
func TestNoGoroutineLeaks(t *testing.T) {
    defer goleak.VerifyNone(t,
        goleak.IgnoreTopFunction("database/sql.(*DB).connectionOpener"),
    )
    // Your test code
    service := NewService()
    service.Start()
    defer service.Stop()
    // Simulate load
    for i := 0; i < 100; i++ {
        service.Process(fmt.Sprintf("request-%d", i)) // Process stands in for your workload
    }
}
2. Load Testing for Leaks
func TestGoroutineGrowthUnderLoad(t *testing.T) {
    initial := runtime.NumGoroutine()
    service := NewService()
    defer service.Stop()
    // Generate load
    for i := 0; i < 10000; i++ {
        service.Process(fmt.Sprintf("request-%d", i)) // Process stands in for your workload
        // Spot-check growth every 1,000 requests
        if i%1000 == 0 {
            if growth := runtime.NumGoroutine() - initial; growth > 100 {
                t.Fatalf("Excessive goroutine growth: %d", growth)
            }
        }
    }
}
3. Benchmark with Goroutine Tracking
func BenchmarkServiceWithGoroutineTracking(b *testing.B) {
    initial := runtime.NumGoroutine()
    service := NewService()
    defer service.Stop()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        service.Process("request") // Process stands in for your workload
    }
    b.StopTimer()
    if final := runtime.NumGoroutine(); final > initial+10 {
        b.Fatalf("Goroutine leak detected: %d -> %d", initial, final)
    }
}
Lessons Burned Into My Brain
1. Every Goroutine Needs an Exit Strategy
// Bad: Fire and forget
go doSomething()
// Good: Controlled lifecycle
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
go doSomething(ctx)
2. Tickers Are Not Garbage Collected
// This leaks
ticker := time.NewTicker(time.Second)
// This doesn't
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
(Go 1.23 finally made unreferenced tickers collectible, but a leaked goroutine still references its ticker, so Stop() still matters.)
3. Monitor Goroutines Like Memory
If you monitor memory usage, monitor goroutine count. They're equally important.
4. Test for Leaks, Not Just Correctness
// Add to every concurrent test
defer goleak.VerifyNone(t)
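Or register it once per package instead of remembering the defer everywhere; goleak ships a TestMain hook for exactly this:

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m) // fails the package if any test leaks goroutines
}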
5. WebSockets Are Goroutine Factories
Every WebSocket connection typically spawns 2-3 goroutines. 10,000 connections = 30,000 goroutines. Plan accordingly.
The Cost of This Bug
- 6 weeks of degraded performance
- ~500 customer complaints
- 3 engineers × 20 hours debugging
- $4,000 in extra AWS costs (RAM scaling)
- 1 very stressed on-call rotation
- Immeasurable reputation damage
All because we forgot to call cancel() and ticker.Stop().
The Bottom Line
Goroutine leaks are memory leaks with extra steps. They're harder to spot, harder to debug, and cause weird cascading failures.
But they're easy to prevent:
- Every `go` needs a way to stop
- Every `NewTicker` needs a `Stop()`
- Every `make(chan)` needs a `close()`
- Every `Subscribe` needs an `Unsubscribe`
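Put together, all four rules fit in one worker skeleton (a template, assuming a Job type and some periodic work; adapt to taste):

// Template: context for lifecycle, ticker with Stop, channel with an owner.
type Job struct{}

func startWorker(ctx context.Context, jobs <-chan Job) (stop func()) {
    ctx, cancel := context.WithCancel(ctx)
    done := make(chan struct{})
    go func() {
        defer close(done)
        ticker := time.NewTicker(time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                // periodic work here
            case job, ok := <-jobs:
                if !ok {
                    return // producer closed the channel
                }
                _ = job // handle the job
            }
        }
    }()
    return func() { cancel(); <-done } // stop cancels and waits for exit
}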
And for the love of all that is holy, use goleak in your tests.
P.S. We now have a pre-commit hook that looks for time.NewTicker without defer ticker.Stop(). It's rejected 17 PRs so far. Each one could have been another 3 AM wake-up call. Worth every false positive.