pg_flight_recorder: Continuously sample PostgreSQL system state via pg_cron

原始链接: https://github.com/dventimisupabase/pg_flight_recorder

Sorry.
相关文章

原文

GitHub release Test Suite Lint

Server-side flight recorder for PostgreSQL. Answers "what was happening in my database?"

View the project website

pg_flight_recorder continuously samples PostgreSQL system state in the background via pg_cron -- no external agents, sidecars, or polling required. It captures wait events, active sessions, locks, WAL activity, checkpoints, I/O, table and index stats, query performance, replication state, and configuration changes. When something goes wrong, the data is already there.

Flight Recorder collects two types of data:

System What it captures Frequency Retention
Sampled Activity Wait events, sessions, locks 1 min Ring buffer: 2h, Archives: 7d
Snapshots WAL, checkpoints, I/O, tables, indexes 1 min 30 days

Data flows through ring buffers (hot, low-overhead) into durable archives and aggregates (cold, long-retention). Safety mechanisms -- circuit breaker, load shedding, per-section timeouts, and pg_cron job timeouts -- prevent the recorder from impacting production workloads.

Two extensions, each published as a separate dbdev package:

  • PostgreSQL 15, 16, 17, or 18
  • pg_cron extension
  • Superuser privileges for installation
  • Optional: pg_stat_statements for query-level analysis

Download from GitHub Releases or clone the repo, then:

# Install core + optional analysis extension
psql --single-transaction -f pgfr_record/install.sql
psql --single-transaction -f pgfr_analyze/install.sql
-- Enable collection
SELECT pgfr_record.enable();

-- Check health
SELECT * FROM pgfr_record.health_check();

-- Generate a diagnostic report
SELECT pgfr_analyze.report('1 hour');
SELECT * FROM pgfr_record.health_check();
SELECT pgfr_analyze.report('1 hour');
-- Switch to high-frequency collection
SELECT * FROM pgfr_record.apply_profile('troubleshooting');

-- What was happening at a specific time?
SELECT * FROM pgfr_analyze.what_happened_at('2024-01-15 14:32');

-- Reconstruct an incident timeline
SELECT * FROM pgfr_analyze.incident_timeline(
    '2024-01-15 14:00'::timestamptz,
    '2024-01-15 15:00'::timestamptz
);

-- Return to normal after incident
SELECT * FROM pgfr_record.apply_profile('default');

XID / MultiXID wraparound monitoring

-- Current XID and MultiXID ages at database level (from the latest snapshot)
SELECT datfrozenxid_age, datminmxid_age
FROM pgfr_record.snapshots
ORDER BY captured_at DESC LIMIT 1;

-- Top 10 tables by XID or MultiXID age
SELECT relid::regclass, relfrozenxid_age, relminmxid_age
FROM pgfr_record.table_snapshots
WHERE snapshot_id = (SELECT max(id) FROM pgfr_record.snapshots)
ORDER BY greatest(relfrozenxid_age, relminmxid_age) DESC NULLS LAST
LIMIT 10;

-- Wraparound anomalies (XID + MultiXID, cluster + per-table)
SELECT anomaly_type, severity, metric_value, recommendation
FROM pgfr_analyze.anomaly_report(now() - interval '1 hour', now())
WHERE anomaly_type LIKE '%WRAPAROUND%';

-- Tune thresholds (lower warning ratio to alert earlier on busy clusters)
INSERT INTO pgfr_record.config (key, value) VALUES
    ('xid_warning_ratio',  '0.25'),
    ('mxid_warning_ratio', '0.25')
ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value;

See REFERENCE.md for the full config key list and rationale (postgres-howto #0044 guidance).

-- Find performance regressions
SELECT * FROM pgfr_analyze.detect_regressions('1 day');

-- Find query storms
SELECT * FROM pgfr_analyze.detect_query_storms('1 hour');

-- Table hotspots
SELECT * FROM pgfr_analyze.table_hotspots(now() - '1 day', now());

-- Unused indexes
SELECT * FROM pgfr_analyze.unused_indexes('7 days');
SELECT * FROM pgfr_analyze.capacity_summary('7 days');
SELECT * FROM pgfr_analyze.quarterly_review();
SELECT * FROM pgfr_analyze.capacity_dashboard;

Profiles are pre-configured settings for different environments:

Profile Sample Interval Use Case
default 60s General purpose monitoring
production_safe 300s Production with maximum safety margins
development 60s Staging and development
troubleshooting 60s Active incident response
minimal_overhead 300s Resource-constrained systems
SELECT * FROM pgfr_record.list_profiles();
SELECT * FROM pgfr_record.explain_profile('production_safe');
SELECT * FROM pgfr_record.apply_profile('production_safe');

Flight Recorder includes automatic protections:

Protection Description
Circuit Breaker Skips collection if recent runs averaged > 1s
Load Shedding Skips collection when > 70% connections active
Section Timeouts Per-query timeout (250ms) prevents catalog lock hangs
Job Timeouts Outer statement_timeout on pg_cron collector jobs (500ms–60s)

Collection modes provide manual control: normal, light, emergency.

-- Reduce to minimum collection (300s sampling, locks/progress off)
SELECT pgfr_record.set_mode('emergency');

-- Full stop: unschedule all pg_cron jobs
SELECT pgfr_record.disable();

-- Resume
SELECT pgfr_record.enable();

With default retention: ~2.5GB uncompressed, ~150MB compressed.

# Without compression
pg_dump -d your_database -n pgfr_record --data-only -f pgfr_data.sql

# With compression (PostgreSQL 16+)
pg_dump -d your_database -n pgfr_record --data-only --compress=gzip:9 -f pgfr_data.sql.gz

# With compression (PostgreSQL 15)
pg_dump -d your_database -n pgfr_record --data-only | gzip > pgfr_data.sql.gz

Re-running install scripts is safe -- they use CREATE OR REPLACE and IF NOT EXISTS, updating functions and views while preserving all data.

psql --single-transaction -f pgfr_record/install.sql
psql --single-transaction -f pgfr_analyze/install.sql
# Remove everything (stops jobs, drops all schemas and data)
psql --single-transaction -f pgfr_record/uninstall.sql

# Remove only reporting functions (keeps core + data)
psql --single-transaction -f pgfr_analyze/uninstall.sql
./test.sh           # Test all PostgreSQL versions in parallel (requires Docker)
./test.sh 17        # Test a specific PostgreSQL version (15, 16, 17, or 18)

See REFERENCE.md for the full function reference, table schemas, configuration settings, and detailed documentation.

联系我们 contact @ memedata.com