PostgreSQL生产事故由事务ID环绕引起

PostgreSQL生产事故由事务ID环绕引起
PostgreSQL production incident caused by transaction ID wraparound

原始链接: https://www.sqlservercentral.com/articles/i-too-have-a-production-story-a-downtime-caused-by-postgres-transaction-id-wraparound-problem

## PostgreSQL 事务ID环绕：生产事件总结本文详细描述了PostgreSQL生产环境中由于事务ID（XID）环绕引起的严重且隐蔽的故障——当用于事务ID的有限计数器接近其上限时的情况。与典型的宕机不同，这不是由负载或配置更改引起的，而是由于时间流逝，旧数据没有充分“冻结”造成的。 PostgreSQL使用MVCC，为每个写入事务分配XID。为了安全地重用这些ID，必须冻结较旧的行版本，将其标记为永久可见。如果autovacuum（负责冻结的进程）被禁用或无效，XID会累积，最终达到安全阈值（大约20亿）。此时，PostgreSQL会停止所有写入操作以防止数据损坏，从而有效地使数据库变为只读。该事件发生在稳定的系统上，负载适中（每秒10次写入），并且autovacuum在几年前被禁用作为临时性能修复，但从未重新启用。恢复涉及手动强制冻结，但根本原因揭示了其他系统也存在类似风险。关键要点是，XID环绕不是性能问题，而是基本的安全机制。它可以根据写入速率和时间进行预测，并且很容易在测试中被忽略。定期监控`relfrozenxid`并确保autovacuum已启用对于防止这种隐蔽的、潜在灾难性的故障至关重要。与使用不同版本系统的SQL Server不同，PostgreSQL *需要*主动冻结才能维护数据完整性。

最近PostgreSQL生产环境发生一起事故，在Hacker News上讨论，起因是禁用了`autovacuum`功能。最初，`autovacuum`被怀疑导致了之前的问题，因此在没有充分理解长期后果的情况下被关闭。核心问题是事务ID环绕——一个可以通过定期vacuuming预防的已知问题。评论该帖子的用户指出，这是一个有充分记录的问题，可以通过适当的监控和维护轻松避免，特别是确保`autovacuum`保持启用并为高写入表配置。许多评论者对这篇文章表示沮丧，认为它报道了一个显而易见的问题，甚至质疑其真实性，有人认为它部分由AI生成。评论中反复出现的一个笑话是讽刺SQL Server被认为是优越的。

原文

Introduction

This article describes a real PostgreSQL production incident caused by transaction ID wraparound, a failure mode that is both silent and severe. The incident ultimately resulted in a complete write outage. The failure did not occur immediately after a configuration change, nor was it triggered by high load, traffic growth, or infrastructure problems.

In PostgreSQL, every write transaction is assigned a transaction ID (XID). These transaction IDs are drawn from a finite, global counter that advances continuously as transactions are executed. To safely reuse transaction IDs, PostgreSQL requires that old row versions be periodically frozen. Freezing marks data as permanently visible, preventing transaction IDs from aging indefinitely. If freezing does not occur in time, transaction IDs continue to advance toward a hard safety limit.

What makes transaction ID wraparound particularly dangerous is not merely its existence, but the way it manifests. A database can operate normally for months or even years while silently approaching the limit. CPU usage appears healthy. I/O patterns remain stable. Query performance shows no gradual degradation. There are no warning signs that resemble traditional capacity or performance issues.

When the safety limit is finally reached, PostgreSQL has no safe way to continue accepting write transactions without risking data corruption. At that point, PostgreSQL intentionally blocks all write activity. The database effectively becomes read-only, and recovery shifts from routine tuning to an urgent operational incident. The incident described in this article occurred in an environment with a stable and modest workload. There was no traffic spike, no abnormal query behavior, and no operational change immediately preceding the outage. The failure occurred simply because sufficient time had passed without transaction ID freezing being completed.

It is also important to note that this class of failure cannot be meaningfully reproduced through short-lived simulations or conventional performance testing. Transaction ID wraparound is governed by cumulative transaction counts and long-term aging of data. In most environments, it emerges only after months or years of normal operation, making it easy to overlook during testing, staging validation, or initial production rollout.

The purpose of this article is to explain what transaction ID wraparound is, why it is particularly dangerous in production environments, how configuration decisions made long ago can silently lead to it, and why understanding the underlying transaction ID math is essential for running PostgreSQL safely in production.

How Transaction IDs and the Wraparound Problem Work in PostgreSQL

PostgreSQL uses multi-version concurrency control (MVCC) to manage concurrent access to data. Instead of updating rows in place, each change creates a new version of the row while older versions remain until they are no longer needed. To determine which row versions are visible, PostgreSQL assigns a transaction ID (XID) to every write transaction. That XID is stored with the row version and compared against transaction snapshots during reads. Transaction IDs are global across the entire PostgreSQL cluster and are allocated sequentially from a 32-bit counter, making the total XID space finite.

Although the counter can technically represent about four billion values, PostgreSQL enforces a much earlier safety boundary. When the age of the oldest unfrozen transaction ID approaches approximately 2 billion, PostgreSQL considers the system unsafe.

To prevent data corruption caused by transaction ID reuse, PostgreSQL relies on freezing. During vacuum operations, sufficiently old row versions have their transaction IDs replaced with a special frozen marker, making them permanently visible and safe. If freezing does not happen in time and the 2-billion threshold is reached, PostgreSQL deliberately blocks all write operations. INSERT, UPDATE, DELETE, and many DDL commands begin to fail. The database effectively becomes read-only until the issue is resolved.

This behavior is intentional and protective. The system may appear healthy for months or years, but once the limit is crossed, the failure is sudden and absolute.

A Little of Backstory

The system belonged to a mid-sized B2B SaaS organization. PostgreSQL was the primary OLTP database and had been running in production for several years. Database ownership was largely developer-driven. The development team handled schema changes, performance tuning, and most configuration decisions. This worked reasonably well because the workload was stable and modest. The application workload characteristics were unremarkable:

Approximately 10 write transactions per second
Short-lived transactions with autocommit enabled
No batch processing
No reporting or analytics on the primary

From monitoring dashboards, the system appeared healthy. CPU, memory, and I/O usage were consistently within normal ranges.

A Configuration Decision Made Earlier

Several months before I joined the organization, the team experienced a performance issue related to disk I/O. Autovacuum activity was visible during the incident and was assumed to be contributing to the problem. As a temporary mitigation, autovacuum was disabled on several tables, and in some cases at broader scope. The immediate performance issue was resolved. The system stabilized. Development moved forward.

Autovacuum, however, was never re-enabled. Over time, this configuration became accepted as normal. The databases continued to function, and no immediate problems surfaced. The long-term implications of disabling autovacuum were not actively considered.

When I joined the organization, there was no active database incident. Around that time, I took on core database responsibilities. As expected, my initial focus was on foundational operational work. I validated backups, tested restore procedures, reviewed disaster recovery readiness, and checked replication and failover behavior. These were necessary and correct priorities. The databases had been running for years without visible issues, and there was no immediate signal that something fundamental was wrong.

Transaction ID wraparound health was not reviewed during that initial phase.

The Incident: One Month Later

Approximately one month after I joined and took core responsibility, the incident occurred. The application team reported that write operations were failing. INSERT statements failed. UPDATE statements failed. DDL statements failed.

PostgreSQL had entered transaction ID wraparound protection mode. The database was effectively read-only. At this point, the system was already past the warning phase. PostgreSQL had made a deliberate decision to protect data correctness.

Recovery Under Pressure

The immediate goal was to restore write capability. This was not a tuning exercise; it was a recovery operation. The work involved identifying and terminating long-running or idle transactions that were holding old snapshots, followed by aggressive manual VACUUM FREEZE operations on the most affected tables.

The process was stressful and messy. Freezing had to be pushed hard enough to advance transaction IDs beyond the danger zone while keeping the system stable. Eventually, sufficient freezing was completed, write operations resumed, and the system stabilized.

The Realization: This Was Not an Isolated Case

Once the immediate incident was resolved, it became clear that this could not be treated as a one-off problem.The root cause was not a recent change. It was a configuration decision made years earlier. I started checking other PostgreSQL production systems owned by the same teams. The first query I ran was simple:

SELECT relname, age(relfrozenxid)
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC;

The results were concerning. Multiple production databases had tables with very high transaction ID age. In many cases, autovacuum was explicitly disabled on those tables. This was the moment when the seriousness of the situation became clear. The initial incident was not bad luck. It was an early warning.

How Autovacuum Prevents Transaction ID Wraparound

In normal operation, PostgreSQL relies on autovacuum to prevent transaction ID wraparound. Autovacuum periodically scans tables to remove dead row versions and freeze old rows, ensuring transaction IDs do not age indefinitely.

Each table in PostgreSQL tracks a value, called relfrozenxid, which represents the oldest transaction ID in that table that has not yet been frozen. As transactions continue across the cluster, the age of relfrozenxid increases unless vacuuming successfully freezes older row versions. The current state of transaction ID aging can be observed using the following query:

SELECT relname, age(relfrozenxid)
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC;


PostgreSQL enforces a hard safety threshold controlled by:

To protect against transaction ID reuse, PostgreSQL enforces a hard safety threshold controlled by the parameter:

SHOW autovacuum_freeze_max_age;

By default, this value is 200,000,000. When the age of a table’s relfrozenxid approaches this limit and freezing cannot progress, PostgreSQL deliberately blocks write operations to prevent data corruption.

If autovacuum is disabled or prevented from running effectively, freezing does not occur automatically. Transaction IDs continue to age silently until the safety threshold is reached, at which point the issue surfaces abruptly as a production outage.

The Math: Why the Failure Was not instant, but Inevitable

Transaction ID consumption depends only on write rate and time. In this system:

10 transactions / second
Transactions per minute: 10 × 60 = 600
Transactions per hour: 600 × 60 = 36,000
Transactions per day: 36,000 × 24 = 864,000

So, Transaction IDs consumed per day: 864,000. The PostgreSQL wraparound safety threshold is 200,000,000. The time to reach wraparound risk: 200,000,000 ÷ 864,000 ˜ 231 days. That is approximately 7.5 months.

No spike was required. No growth was required. Time alone was sufficient.

Unused Tables also cause the wraparound problem

One important detail that made this incident worse was the presence of tables that were no longer in active use. In PostgreSQL, transaction IDs increase globally for the entire database, but freezing is handled at the table level. This difference is easy to miss.

In this environment, several tables had been created earlier for testing or for features that were later abandoned. Because those tables were considered “not really in use,” autovacuum had been disabled on them. Once autovacuum was disabled, those tables stopped being frozen. Their transaction ID state remained fixed at the point of their last insert or vacuum.

Meanwhile, the rest of the application continued to run normally. Write transactions kept happening on other tables, and transaction IDs continued to advance every day. After enough time passed, one of these unused tables ended up having the oldest unfrozen transaction ID in the entire database. At that point, PostgreSQL treated it as a wraparound risk for the whole system.

The key point is simple: a table does not need to be actively used to be dangerous. A single forgotten test table with autovacuum disabled is enough to trigger transaction ID wraparound.

Why SQL Server Does Not Face This Problem

SQL Server does not experience this specific failure mode because it does not rely on a finite, globally aging transaction ID in the way PostgreSQL does.

In PostgreSQL, row versions store transaction IDs that come from a finite global counter. Those IDs must eventually be reused, which is why PostgreSQL requires freezing. If freezing does not occur in time, PostgreSQL must stop write operations to avoid data corruption.

SQL Server uses a different internal approach. Transaction ordering and version visibility are based on Log Sequence Numbers (LSNs) rather than reusable transaction IDs. LSNs are monotonically increasing values generated by the transaction log and are not reused in a way that creates wraparound ambiguity.

When SQL Server uses row versioning (such as snapshot isolation or read committed snapshot), older row versions are tracked using LSNs and stored in tempdb. Cleanup depends on active transactions and log truncation, not on reaching a global identifier reuse limit. If SQL Server cannot clean up old versions, the impact is operational rather than absolute. tempdb may grow, performance may degrade, or blocking may increase. However, SQL Server does not need to halt all write operations to preserve correctness.

In short, PostgreSQL must enforce a hard stop when transaction ID reuse becomes unsafe, while SQL Server avoids this specific problem by using an LSN-based design instead of finite, reusable transaction IDs.

Conclusion

This production incident was not caused by load, traffic growth, or inefficient queries. It was caused by transaction ID freezing not occurring over an extended period of time. The databases appeared healthy for years. The workload was stable. Monitoring showed nothing unusual. The risk accumulated quietly as a function of time.

The incident surfaced only after I had taken core responsibility, and the recovery required aggressive and careful intervention. Subsequent checks showed that the same risk existed across other systems as well. PostgreSQL behaved exactly as designed. It protected data integrity by stopping writes before corruption could occur.

Transaction ID wraparound is not an edge case. It is a predictable outcome when freezing is ignored or misunderstood. Understanding the math behind transaction ID consumption and treating autovacuum as a safety mechanism rather than a tuning option is essential for running PostgreSQL in production.