Show HN: Chaos Monkey but for Audio Video Testing (WebRTC and UDP)

Original link: https://github.com/MdSadiqMd/AV-Chaos-Monkey

## Distributed Chaos Engineering for Video Conferencing Systems

The platform load-tests video conferencing systems by simulating 500-1500 WebRTC participants with H.264/Opus streams and injecting realistic network chaos, with the goal of validating system resilience under degraded conditions.

**Core components:** an efficient media-processing pipeline that encodes and caches media (cutting CPU usage by roughly 90%), a control plane that manages tests via a REST API, and a scalable participant pool. Participants generate RTP streams with a unique ID embedded in the header.

**Chaos injection:** five spike types (packet loss, jitter, bitrate reduction, frame drops, and bandwidth limiting) applied via configurable strategies (even, random, front/back-loaded).

**Deployment options:** the platform supports local development (Go), Docker Compose (up to 500 participants), and Kubernetes (production scale). Kubernetes uses automatic partitioning and a UDP relay chain to handle large participant counts.

**Observability:** optional Prometheus/Grafana integration provides real-time metrics such as participant count, packet loss, and MOS scores.

**Key features:** Kubernetes auto-configuration, cross-platform builds with Nix, and a comprehensive API for creating tests, running them, and retrieving metrics. It supports UDP stream aggregation and direct WebRTC connections for testing SFUs/MCUs.

From the Show HN submission by MdSadiqMd:

It takes an input video and converts it into H.264/Opus RTP streams that you can send to your video calling system (WebRTC, SFU, etc.). It also injects network chaos such as packet loss, jitter, and bitrate limits to see how the system falls apart.

It scales from 1 to n participants depending on the host system's compute and memory.

The best part? It is packaged with Nix, so it builds the same way everywhere (Linux, macOS, ARM, x86). No dependency hell.

It supports UDP (with a relay chain for Kubernetes) and WebRTC (with a containerized TURN server). Chaos spikes can be distributed evenly, randomly, or front/back-loaded for different test scenarios; to change this, just edit the values in a single config file.

Distributed chaos engineering platform for load testing video conferencing systems. It simulates 1500+ WebRTC participants with H.264/Opus streams and injects network chaos spikes to validate system resilience under degraded conditions.

[architecture diagram]
  1. Media Processing Pipeline:

    • FFmpeg converts input video to H.264 Annex-B and Ogg/Opus at startup
    • NAL Reader parses H.264 stream (SPS/PPS/IDR/Slices)
    • Opus Reader extracts 20ms audio frames from Ogg container
    • Frames cached in memory, shared across all participants (zero-copy; see the sketch after this overview)
    • Reduces CPU by ~90% vs per-participant encoding
  2. Control Plane:

    • HTTP Server (:8080) manages test lifecycle via REST API
    • Spike Scheduler distributes chaos events (even/random/front/back/legacy)
    • Network Degrader applies chaos: packet loss (1-25%), jitter (10-50ms), bitrate reduction (30-80%), frame drops (10-60%)
    • Loaded chaos configuration applied to participant pool
  3. Participant Pool:

    • Auto-partitioned across pods using: participant_id % total_partitions = partition_id
    • Each participant generates RTP streams (PT=96 video, PT=111 audio)
    • Participant ID embedded in RTP extension header (ID=1)
    • Pool size: 1-100 (local), 100-500 (Docker), 500-1500 (Kubernetes)
  4. Kubernetes Auto-Configuration:

    • Pods auto-detect the partition ID from the pod name: orchestrator-3 → PARTITION_ID=3
    • Port allocation: base_port + (partition_id × 10000) + participant_index
    • Example: Partition 0 uses 5000-14999, Partition 1 uses 15000-24999
    • StatefulSet with 10 replicas, each handling ~150 participants
    • Resources: 1-4 CPU, 2-4Gi memory per pod
    • Auto-configures based on host machine specs
  5. UDP Relay Chain (Kubernetes only):

    Orchestrator Pods (10×) → UDP :5000 → udp-relay Pod (Python)
    → Length-Prefixed TCP :5001 → kubectl port-forward 15001:5001
    → tools/udp-relay (Go) → UDP :5002 → Your Receiver
    
    • Why: kubectl port-forward only supports TCP, not UDP
    • In-cluster relay: Python script aggregates UDP from all pods, streams as TCP with 2-byte length prefix
    • Local relay: Go tool converts TCP stream back to UDP packets
    • Aggregates 1500 participant streams into single connection
  6. WebRTC Infrastructure:

    • Coturn StatefulSet: 3 initial replicas, HPA scales 1-10 based on load (~500 participants/replica)
    • coturn-lb Service: Load balances TURN traffic across replicas
    • webrtc-connector: Optional proxy layer (Deployment + HPA 2-10 replicas), handles SDP signaling
    • Docker Mode: Single Coturn container for local testing
    • Ports: 3478 (TURN), 49152-65535 (relay range)
    • Credentials: webrtc/webrtc123
  7. Client Integration:

    • UDP Receiver: Receives aggregated RTP stream from all participants via relay chain
    • WebRTC Receiver: Establishes 1:1 WebRTC connections via SDP exchange through TURN servers
    • Both forward to your video call system under test (SFU/MCU/Mesh)
  8. Observability Stack (Optional):

    • Prometheus: Scrapes /metrics endpoint from all orchestrator pods every 5s
    • Grafana: Visualizes metrics via pre-configured dashboard (admin/admin)
    • Metrics exposed: participant count, packets sent, bytes sent, active spikes, packet loss %, jitter, MOS score
    • Access: Prometheus on :30090, Grafana on :30030 (NodePort)
    • Orchestrator pods annotated for auto-discovery: prometheus.io/scrape: "true"
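
A rough illustration of the zero-copy cache from step 1 (a sketch with hypothetical types, not the project's actual code):

// FrameCache holds media encoded once at startup; every participant
// reads the same backing slices instead of re-encoding per stream.
type FrameCache struct {
    video [][]byte // H.264 NAL units in decode order
    audio [][]byte // 20 ms Opus frames
}

// VideoFrame returns frame i without copying. Callers must treat the
// returned slice as read-only, since all participants share it.
func (c *FrameCache) VideoFrame(i int) []byte {
    return c.video[i%len(c.video)]
}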

Each virtual participant generates real media streams:

  • Video: H.264 NAL units from actual video files, packetized per RFC 6184
  • Audio: Opus frames from Ogg containers, packetized per RFC 7587
  • RTP: Standards-compliant headers with participant ID extensions
  • Timing: Frame-accurate timing (30fps video, 20ms audio packets)
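
The clock arithmetic behind that timing, assuming the standard RTP clock rates (90 kHz for H.264 per RFC 6184, 48 kHz for Opus per RFC 7587):

package main

import "fmt"

func main() {
    const videoClock = 90000 // Hz, H.264 RTP clock
    const audioClock = 48000 // Hz, Opus RTP clock
    fmt.Println(videoClock / 30)        // 3000 timestamp ticks per frame at 30 fps
    fmt.Println(audioClock * 20 / 1000) // 960 ticks per 20 ms audio packet
}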

Five spike types simulate real-world network conditions:

  • Packet Loss: Drops RTP packets at application layer (1-100%)
  • Network Jitter: Adds latency variation (base + gaussian jitter)
  • Bitrate Reduction: Throttles video encoding (30-80% reduction)
  • Frame Drops: Skips video frames (10-60% drop rate)
  • Bandwidth Limiting: Caps total throughput

Spikes are distributed across test duration using configurable strategies:

  • Even: Uniform spacing with jitter (predictable load)
  • Random: Unpredictable timing (realistic chaos)
  • Front-loaded: Dense spikes early (recovery testing)
  • Back-loaded: Baseline then chaos (comparison testing)
  • Legacy: Fixed interval ticker (runtime injection)
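
As a sketch of how the "even" strategy could be computed (uniform spacing plus ±jitter noise; this mirrors the described behavior, not the scheduler's actual code):

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// evenSpikeOffsets spaces `count` spikes uniformly across `duration`,
// then perturbs each by up to ±jitterPct of the spacing.
func evenSpikeOffsets(duration time.Duration, count int, jitterPct float64) []time.Duration {
    offsets := make([]time.Duration, count)
    step := duration / time.Duration(count)
    for i := range offsets {
        noise := (rand.Float64()*2 - 1) * jitterPct / 100
        offsets[i] = step*time.Duration(i) + time.Duration(float64(step)*noise)
        if offsets[i] < 0 {
            offsets[i] = 0 // clamp: the first spike can't fire before the test starts
        }
    }
    return offsets
}

func main() {
    fmt.Println(evenSpikeOffsets(5*time.Minute, 20, 15))
}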

Kubernetes deployments use participant partitioning for horizontal scaling:

  • Each pod handles participant_id % total_partitions == partition_id
  • Port allocation: base_port + (partition_id * 10000) + participant_index
  • Automatic load distribution across 1-10 pods
  • Scales to 1500+ participants (150 per pod)
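
In code form, the partitioning and port math above (a minimal sketch):

package main

import "fmt"

// partitionFor and portFor implement the formulas described above.
func partitionFor(participantID, totalPartitions int) int {
    return participantID % totalPartitions
}

func portFor(basePort, partitionID, participantIndex int) int {
    return basePort + partitionID*10000 + participantIndex
}

func main() {
    fmt.Println(partitionFor(1003, 10)) // participant 1003 → partition 3
    fmt.Println(portFor(5000, 1, 42))   // partition 1, index 42 → port 15042
}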

1. Local Development (Native Go)

Best for: Development, debugging, small-scale tests (1-100 participants)

# Start orchestrator
go run cmd/main.go

# In another terminal: Start UDP receiver
go run examples/go/udp_receiver.go 5002

# Edit config/config.json to set num_participants: 10
# Run chaos test
go run tools/chaos-test/main.go -config config/config.json

What happens:

  • Single orchestrator process on :8080
  • Participants send UDP to 127.0.0.1:5002
  • Chaos spikes injected via HTTP API
  • Real-time metrics displayed every 2s

Configuration (config/config.json):

{
  "base_url": "http://localhost:8080",
  "media_path": "public/rick-roll.mp4",
  "num_participants": 10,
  "duration_seconds": 300,
  "spikes": {
    "count": 20,
    "interval_seconds": 5,
    "types": { "rtp_packet_loss": {...}, "network_jitter": {...} }
  },
  "spike_distribution": {
    "strategy": "random",
    "min_spacing_seconds": 5,
    "jitter_percent": 15
  }
}

2. Docker Compose (Containerized)

Best for: Isolated testing, CI/CD, medium-scale tests (100-500 participants)

Prerequisites:

  • Docker Desktop with 8-16GB memory allocation
  • docker-compose installed

# Build and start orchestrator container
./scripts/start_everything.sh build

# In another terminal: Start UDP receiver
go run examples/go/udp_receiver.go 5002

# Edit config/config.json to set num_participants: 100
# Run chaos test (targets container)
go run tools/chaos-test/main.go -config config/config.json

Resource Limits (edit docker-compose.yaml):

services:
  orchestrator:
    deploy:
      resources:
        limits:
          cpus: "14.0"
          memory: 6G  # Increase for more participants

Scaling Guide:

Docker Memory   Max Participants   CPU Cores
8 GB            ~100               4
16 GB           ~250               8
24 GB           ~400               12
32 GB           ~500               14

3. Kubernetes with Nix (Production Scale)

Best for: Large-scale tests (500-1500 participants), horizontal scaling, production validation

Prerequisites:

  • Nix with flakes enabled
  • Docker Desktop or kind cluster
  • kubectl configured

Step 1: Enter Nix Environment

# Nix provides: Go, Docker, kubectl, kind, ffmpeg
nix develop

# Or use direnv for auto-activation
echo "use flake" > .envrc
direnv allow

Step 2: Deploy to Kubernetes

# Auto-deploy with optimal settings (detects system resources)
./scripts/start_everything.sh run -config config/config.json

# Or specify custom media files
./scripts/start_everything.sh run --media=path/to/video.mp4 -config config/config.json

What happens:

  1. Builds Docker image with Nix-provided Go toolchain
  2. Creates/uses kind cluster
  3. Deploys StatefulSet with 10 orchestrator pods
  4. Deploys UDP relay pod
  5. Sets up kubectl port-forward for UDP relay
  6. Starts local TCP→UDP relay
  7. Runs chaos test across all pods

Step 3: Receive Aggregated UDP Stream

Option A: UDP Receiver (Recommended for Kubernetes)

# Receives aggregated stream from all 1500 participants
go run ./examples/go/udp_receiver.go 5002

Option B: WebRTC Receiver (Multiple Participants)

# Connect to up to 150 participants via WebRTC
go run ./examples/go/webrtc_receiver.go http://localhost:8080 <test_id> 150

Architecture Flow:

1500 Participants across 10 pods
  → Each pod: 150 participants
  → Partition by participant_id % 10
  → All send UDP to udp-relay:5000
  → UDP relay aggregates → TCP :5001
  → kubectl port-forward 15001:5001
  → Local relay converts TCP → UDP :5002
  → Your receiver gets all 1500 streams
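
For intuition, a hedged sketch of the final TCP→UDP leg (what tools/udp-relay does conceptually): read 2-byte length prefixes off the TCP stream and re-emit each payload as a UDP datagram. The big-endian framing and the addresses are assumptions.

package main

import (
    "encoding/binary"
    "io"
    "log"
    "net"
)

func main() {
    tcp, err := net.Dial("tcp", "127.0.0.1:15001") // kubectl port-forward endpoint
    if err != nil {
        log.Fatal(err)
    }
    udp, err := net.Dial("udp", "127.0.0.1:5002") // local receiver
    if err != nil {
        log.Fatal(err)
    }
    hdr := make([]byte, 2)
    buf := make([]byte, 65535)
    for {
        if _, err := io.ReadFull(tcp, hdr); err != nil {
            log.Fatal(err) // stream ended or broke
        }
        n := binary.BigEndian.Uint16(hdr)
        if _, err := io.ReadFull(tcp, buf[:n]); err != nil {
            log.Fatal(err)
        }
        if _, err := udp.Write(buf[:n]); err != nil {
            log.Fatal(err)
        }
    }
}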

Note: The start_everything.sh script automatically sets up:

  • kubectl port-forward (udp-relay 15001:5001)
  • Local TCP→UDP relay (tools/udp-relay)
  • You only need to run the receiver

Manual deployment (without the start_everything.sh script):

# Build and load image
docker build -t chaos-monkey-orchestrator:latest .
kind load docker-image chaos-monkey-orchestrator:latest

# Deploy
kubectl apply -f k8s/orchestrator/orchestrator.yaml
kubectl apply -f k8s/udp-relay/udp-relay.yaml

# Wait for pods
kubectl wait --for=condition=ready pod -l app=orchestrator --timeout=300s

# Port-forward UDP relay
kubectl port-forward udp-relay 15001:5001 &

# Start local TCP→UDP relay
go run tools/udp-relay/main.go &

# In another terminal: Start receiver
go run ./examples/go/udp_receiver.go 5002

# In another terminal: Run chaos test
go run tools/chaos-test/main.go -config config/config.json

Cleanup:

# Delete Kubernetes resources
./scripts/cleanup.sh

# Or delete entire cluster
kind delete cluster --name av-chaos-monkey

Cross-Platform Builds with Nix

# Build for Linux x86_64 (most common)
nix build .#packages.x86_64-linux.av-chaos-monkey

# Build for ARM64 (Raspberry Pi, AWS Graviton)
nix build .#packages.aarch64-linux.av-chaos-monkey

# Build for macOS Intel
nix build .#packages.x86_64-darwin.av-chaos-monkey

# Build for macOS Apple Silicon
nix build .#packages.aarch64-darwin.av-chaos-monkey

# Binary location
./result/bin/main

Test Lifecycle API:

# Create test
POST /api/v1/test/create
{
  "test_id": "optional_id",
  "num_participants": 100,
  "video": {...},
  "audio": {...},
  "duration_seconds": 600,
  "spikes": [...],
  "spike_distribution": {
    "strategy": "even",
    "min_spacing_seconds": 5,
    "jitter_percent": 15
  }
}

# Start test
POST /api/v1/test/{test_id}/start

# Get metrics
GET /api/v1/test/{test_id}/metrics

# Stop test
POST /api/v1/test/{test_id}/stop
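
A minimal Go client sketch for this lifecycle, using only the endpoints shown above (the request body is trimmed to a few of the documented fields):

package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

func main() {
    base := "http://localhost:8080/api/v1"

    // Create a test with an explicit ID so later calls can reference it.
    create := []byte(`{"test_id":"demo","num_participants":100,"duration_seconds":600}`)
    resp, err := http.Post(base+"/test/create", "application/json", bytes.NewReader(create))
    if err != nil {
        panic(err)
    }
    resp.Body.Close()

    // Start it, then fetch metrics once.
    if resp, err = http.Post(base+"/test/demo/start", "application/json", nil); err != nil {
        panic(err)
    }
    resp.Body.Close()

    if resp, err = http.Get(base + "/test/demo/metrics"); err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body))
}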

WebRTC SDP Exchange:

# Get SDP offer
GET /api/v1/test/{test_id}/sdp/{participant_id}

# Set SDP answer
POST /api/v1/test/{test_id}/sdp/{participant_id}
{"sdp_answer": "v=0..."}

Spike Injection:

# Inject spike
POST /api/v1/test/{test_id}/spike
{
  "spike_id": "unique_id",
  "type": "rtp_packet_loss",
  "duration_seconds": 30,
  "participant_ids": [1001, 1002],
  "params": {"loss_percentage": "15"}
}

Type              Parameters                           Effect
rtp_packet_loss   loss_percentage (0-100)              Drops packets at RTP layer
network_jitter    base_latency_ms, jitter_std_dev_ms   Adds delay variation
bitrate_reduce    new_bitrate_kbps                     Throttles video encoding
frame_drop        drop_percentage (0-100)              Skips video frames
bandwidth_limit   bandwidth_kbps                       Caps total throughput

Spike distribution configuration:
{
  "spike_distribution": {
    "strategy": "even",
    "min_spacing_seconds": 5,
    "jitter_percent": 15,
    "respect_min_offset": true
  }
}
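
Continuing the Go client sketch from the test-lifecycle section, injecting one spike at runtime would look roughly like this (same `base`, `bytes`, and `http` as before):

// Inject a 30 s, 15% packet-loss spike into the running "demo" test.
spike := []byte(`{"spike_id":"spike_1","type":"rtp_packet_loss",
  "duration_seconds":30,"params":{"loss_percentage":"15"}}`)
resp, err := http.Post(base+"/test/demo/spike", "application/json", bytes.NewReader(spike))
if err != nil {
    panic(err)
}
resp.Body.Close()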

UDP Receiver:

# Provided receiver with RTP parsing
go run examples/go/udp_receiver.go 5002

Output:

Listening for RTP packets on UDP port 0.0.0.0:5002
Packet #100 from 127.0.0.1:xxxxx:
  Participant ID: 1001
  Payload Type: 96 (H.264 video)
  Sequence: 1234
  Timestamp: 90000
  SSRC: 1001000
  Payload Size: 1200 bytes

═══════════════════════════════════════════════════════════
                    PACKET STATISTICS                       
═══════════════════════════════════════════════════════════
Duration: 60s
Total Packets: 180000 (3000 pkt/s)
Total Bytes: 450 MB (60 Mbps)

Media Type Breakdown:
  Video (H.264): 120000 packets (66.7%)
  Audio (Opus):  60000 packets (33.3%)

Unique Streams (SSRCs): 1500
Unique Participants: 1500

WebRTC Receiver:

# Single participant
go run ./examples/go/webrtc_receiver.go http://localhost:8080 <test_id>

# Multiple participants (up to 150)
go run ./examples/go/webrtc_receiver.go http://localhost:8080 <test_id> 150

# Example with actual test ID
go run ./examples/go/webrtc_receiver.go http://localhost:8080 chaos_test_1770831684 150

Note: WebRTC requires 1:1 connections. For Kubernetes, use UDP receiver which aggregates all participants automatically.

RTP Packet Format:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Extension ID=1 | Length=4    |    Participant ID (uint32)    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         H.264/Opus Payload                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Payload Types:

  • 96: H.264 video (RFC 6184)
  • 111: Opus audio (RFC 7587)

Participant ID Extraction:

// Extension bit (X, 0x10) set? Requires: import "encoding/binary" and "fmt".
if (packet[0] & 0x10) != 0 {
    offset := 12 + int(packet[0]&0x0F)*4 // skip 12-byte fixed header + CSRC list
    extID := binary.BigEndian.Uint16(packet[offset:])
    if extID == 1 {
        participantID := binary.LittleEndian.Uint32(packet[offset+4:])
        fmt.Printf("participant %d\n", participantID) // unused variables won't compile in Go
    }
}

Participants   Memory   CPU        Bandwidth
100            2 GB     2 cores    250 Mbps
500            6 GB     8 cores    1.2 Gbps
1000           12 GB    16 cores   2.5 Gbps
1500           18 GB    24 cores   3.7 Gbps

  • Auto-scaling: Calculates optimal pod count based on participant count
  • Pod capacity: 150 participants per pod (configurable)
  • Max pods: 10 (StatefulSet limit)
  • Port range: 10,000 ports per partition

Per participant (1280x720@30fps + Opus):

  • Video: ~2.5 Mbps (H.264)
  • Audio: ~128 Kbps (Opus)
  • Total: ~2.6 Mbps
  • Packets: ~90 video + 50 audio = 140 pkt/s
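
For scale: 1500 participants × ~2.5 Mbps of video is about 3.7 Gbps aggregate, which lines up with the bandwidth column in the sizing table above.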

Prometheus Metrics:

# Exposed on /metrics endpoint
av_chaos_monkey_participants_total
av_chaos_monkey_packets_sent_total
av_chaos_monkey_bytes_sent_total
av_chaos_monkey_spikes_active
av_chaos_monkey_packet_loss_percent
av_chaos_monkey_jitter_ms

# Docker Mode: Start monitoring stack
docker-compose --profile monitoring up

# Kubernetes Mode: Deploy monitoring
kubectl apply -f k8s/monitoring/prometheus-rbac.yaml
kubectl apply -f k8s/monitoring/prometheus.yaml
kubectl apply -f k8s/monitoring/grafana.yaml

# Access Grafana
# Docker: http://localhost:3000
# Kubernetes: http://localhost:30030 (NodePort)
# Default credentials: admin/admin

# Access Prometheus
# Docker: http://localhost:9091
# Kubernetes: http://localhost:30090 (NodePort)

Kubernetes Auto-Discovery:

  • Orchestrator pods annotated with prometheus.io/scrape: "true"
  • Prometheus scrapes /metrics from all pods every 5s
  • Grafana pre-configured with Prometheus datasource
  • Dashboard auto-provisioned on startup

Metrics API:

# Get test metrics
curl http://localhost:8080/api/v1/test/{test_id}/metrics | jq

# Output
{
  "aggregate": {
    "total_frames_sent": 45000,
    "total_packets_sent": 180000,
    "total_bitrate_kbps": 250000,
    "avg_jitter_ms": 12.5,
    "avg_packet_loss": 2.3,
    "avg_mos_score": 4.1
  }
}

Troubleshooting

No UDP packets received:

# Check UDP target configuration
kubectl logs orchestrator-0 | grep "UDP transmission enabled"

# Verify UDP relay is running
kubectl get pod udp-relay

# Check port-forward
ps aux | grep "kubectl port-forward"

# Test UDP connectivity
nc -u -z localhost 5002

WebRTC connection fails:

# Check TURN server
kubectl get svc coturn-lb

# Verify ICE candidates
kubectl logs orchestrator-0 | grep "ICE"

# Test TURN connectivity
turnutils_uclient -v -u webrtc -w webrtc123 <turn-server>:3478

High load or out-of-memory:

# Check participant count per pod
kubectl exec orchestrator-0 -- curl -s http://localhost:8080/api/v1/test/{test_id}/metrics | jq '.participants | length'

# Scale down participants or increase pod count
go run tools/k8s-start/main.go -replicas 10 -participants 1000

# Increase Docker memory (Docker Desktop)
# Settings → Resources → Memory → 16GB

Packet Loss in UDP Receiver

A single UDP socket cannot handle 3000+ concurrent streams without kernel buffer overflows. Solutions:

  • Use UDP relay (aggregates before forwarding)
  • Increase socket buffer: setsockopt(SO_RCVBUF, 8MB)
  • Accept baseline loss as measurement artifact
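
A minimal sketch of the socket-buffer mitigation in Go (the effective size is still capped by the kernel's net.core.rmem_max):

package main

import (
    "log"
    "net"
)

func main() {
    // Listen where the relay delivers packets, then enlarge SO_RCVBUF
    // so bursts from 1500 aggregated streams are less likely to drop.
    conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 5002})
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    if err := conn.SetReadBuffer(8 << 20); err != nil { // 8 MB
        log.Fatal(err)
    }
}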

BSD 3-Clause License

Contributions welcome! Key areas:

  • Additional spike types (CPU throttling, memory pressure)
  • More distribution strategies (wave, burst)
  • Enhanced metrics (MOS calculation, RTCP feedback)
  • Client libraries (Python, Rust, TypeScript)