Using PostgreSQL as a Dead Letter Queue for Event-Driven Systems

Original link: https://www.diljitpr.net/blog-post-postgresql-dlq

## Dead Letter Queues in PostgreSQL for Reliable Event Processing

While building Wayfair's daily business reporting system on Kafka and CloudSQL PostgreSQL, the team faced the failures that are inevitable in distributed systems: API outages, consumer crashes, and malformed events. A conventional Kafka dead letter queue (DLQ) proved insufficient because it offered little visibility for inspection and retries.

The solution was to use PostgreSQL itself *as* the DLQ. Failed events are persisted, together with their failure context, directly into a dedicated `dlq_events` table and marked PENDING, giving a queryable, auditable record of failures. A retry scheduler backed by ShedLock periodically scans this table for eligible events while preventing duplicate processing.

The DLQ table is designed for efficient querying, with indexes on status, retry time, and event type. The `retry_count` and `retry_after` fields prevent aggressive retries during outages, and a successful retry updates the event's status to SUCCEEDED.

This approach turns failure handling from disruptive into predictable, letting engineers easily inspect, debug, and reprocess events. Kafka continues to handle high-throughput ingestion, while PostgreSQL excels at durability and observability. Ultimately, treating PostgreSQL as the DLQ produced a resilient, easy-to-operate pipeline in which failures are a manageable part of the system.

## PostgreSQL as a Dead Letter Queue: Summary

This Hacker News discussion centers on using PostgreSQL as a dead letter queue (DLQ) for event-driven systems. The core idea, detailed in the linked blog post, is to leverage PostgreSQL features such as `FOR UPDATE SKIP LOCKED` to build a simple, durable, queryable queue without introducing additional infrastructure.

Many commenters considered this a practical default for business applications processing fewer than a few hundred million events per day, citing benefits such as dynamic priority adjustment and easy queue inspection. Others raised concerns about scalability and about failing processes overwhelming the DLQ, which calls for circuit breakers or rate limiting.

The debate highlights the trade-off between simplicity and robustness. While dedicated queuing systems such as Kafka offer greater scalability, PostgreSQL can be a viable starting point, especially for teams that already operate a PostgreSQL instance. The discussion stressed the importance of monitoring DLQ volume, since a persistently high count indicates an upstream problem rather than an architectural flaw. Ultimately, the choice depends on the scale and needs of the specific application.

Original Article

While I was working on a project with Wayfair, I got the opportunity to work on a system that generated daily business reports aggregated from multiple data sources flowing through event streams across Wayfair. At a high level, Kafka consumers listened to these events, hydrated them with additional data by calling downstream services, and finally persisted the enriched events into a durable datastore—CloudSQL PostgreSQL on GCP.

When everything was healthy, the pipeline worked exactly as expected. Events flowed in, got enriched, and were stored reliably. The real challenge started when things went wrong, which, in distributed systems, is not an exception but a certainty.

There were multiple failure scenarios we had to deal with. Sometimes the APIs we depended on for hydration were down or slow. Sometimes the consumer itself crashed midway through processing. In other cases, events arrived with missing or malformed fields that could not be processed safely. These were all situations outside our direct control, but they still needed to be handled gracefully.

This is where the concept of a Dead Letter Queue came into the picture. Whenever we knew an event could not be processed successfully, instead of dropping it or blocking the entire consumer, we redirected it to a DLQ so it could be inspected and potentially reprocessed later.

Our first instinct was to use Kafka itself as a DLQ. While this is a common pattern, it quickly became clear that it wasn't a great fit for our needs. Kafka is excellent for moving data, but once messages land in a DLQ topic, they are not particularly easy to inspect. Querying by failure reason, retrying a specific subset of events, or even answering simple questions like "what failed yesterday and why?" required extra tooling and custom consumers. For a system that powered business-critical daily reports, this lack of visibility was a serious drawback.

That's when we decided to treat PostgreSQL itself as the Dead Letter Queue.

Instead of publishing failed events to another Kafka topic, we persisted them directly into a DLQ table in PostgreSQL. We were already using CloudSQL as our durable store, so operationally this added very little complexity. Conceptually, it also made failures first-class citizens in the system rather than opaque messages lost in a stream.

Whenever an event failed processing—due to an API failure, consumer crash, schema mismatch, or validation error—we stored the raw event payload along with contextual information about the failure. Each record carried a simple status field. When the event first landed in the DLQ, it was marked as PENDING. Once it was successfully reprocessed, the status was updated to SUCCEEDED. Keeping the state model intentionally minimal made it easy to reason about the lifecycle of a failed event.
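To make that error path concrete, here is a minimal sketch of what it can look like in a Spring Kafka consumer, assuming a JdbcTemplate and Jackson for serializing the payload. The listener and event classes, the topic name, and the one-hour initial retry delay are illustrative assumptions, not the original code; only the dlq_events columns come from the schema below.

import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class ReportEventConsumer {

    private final JdbcTemplate jdbc;
    private final ObjectMapper mapper = new ObjectMapper();

    public ReportEventConsumer(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @KafkaListener(topics = "report-events") // topic name is illustrative
    public void onEvent(ReportEvent event) {
        try {
            process(event); // hydrate via downstream services, then persist the enriched event
        } catch (Exception ex) {
            // Park the failed event in the DLQ table as PENDING instead of dropping it.
            jdbc.update(
                "INSERT INTO dlq_events "
                    + "(event_type, payload, error_reason, error_stacktrace, status, retry_after) "
                    + "VALUES (?, ?::jsonb, ?, ?, 'PENDING', NOW() + INTERVAL '1 hour')",
                event.type(), toJson(event), ex.toString(), stackTraceOf(ex));
        }
    }

    private void process(ReportEvent event) { /* enrichment and persistence */ }

    private String toJson(ReportEvent event) {
        try {
            return mapper.writeValueAsString(event);
        } catch (Exception e) {
            return "{}"; // fall back to an empty JSON object so the insert still succeeds
        }
    }

    private String stackTraceOf(Exception ex) {
        var writer = new java.io.StringWriter();
        ex.printStackTrace(new java.io.PrintWriter(writer));
        return writer.toString();
    }

    // Illustrative event shape; real events carried more fields.
    public record ReportEvent(String type, String body) {}
}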

DLQ Table Schema and Indexing Strategy

To support inspection, retries, and long-term operability, the DLQ table was designed to be simple, query-friendly, and retry-aware.

Table Schema

CREATE TABLE dlq_events (
    id BIGSERIAL PRIMARY KEY,
    event_type VARCHAR(255) NOT NULL,
    payload JSONB NOT NULL,
    error_reason TEXT NOT NULL,
    error_stacktrace TEXT,
    status VARCHAR(20) NOT NULL, -- PENDING / SUCCEEDED
    retry_count INT NOT NULL DEFAULT 0,
    retry_after TIMESTAMP WITH TIME ZONE NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);

Key Design Considerations

  • payload is stored as JSONB to preserve the raw event without enforcing a rigid schema.
  • status keeps the lifecycle simple and explicit.
  • retry_after prevents aggressive retries when downstream systems are unstable.
  • retry_count allows retry limits to be enforced without external state.
  • Timestamps make auditing and operational analysis straightforward.

Indexes

CREATE INDEX idx_dlq_status
ON dlq_events (status);

CREATE INDEX idx_dlq_status_retry_after
ON dlq_events (status, retry_after);

CREATE INDEX idx_dlq_event_type
ON dlq_events (event_type);

CREATE INDEX idx_dlq_created_at
ON dlq_events (created_at);

These indexes allow the retry scheduler to efficiently locate eligible events while still supporting fast debugging and time-based analysis without full table scans.

DLQ Retry Mechanism with ShedLock

Persisting failed events solved the visibility problem, but we still needed a safe and reliable way to retry them.

For this, we introduced a DLQ retry scheduler backed by ShedLock. The scheduler periodically scans the DLQ table for PENDING events that are eligible for retry and attempts to process them again. Since the service runs on multiple instances, ShedLock ensures that only one instance executes the retry job at any given time. This eliminates duplicate retries without requiring custom leader-election logic.
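A minimal sketch of how such a job can be wired up with Spring scheduling and ShedLock is shown below, using the dlq.retry.fixed-rate property from the configuration in the next section. The lock name and lock durations are illustrative, and a ShedLock LockProvider bean (for example, a JdbcTemplateLockProvider backed by a shedlock table) must be configured separately.

import net.javacrumbs.shedlock.spring.annotation.EnableSchedulerLock;
import net.javacrumbs.shedlock.spring.annotation.SchedulerLock;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Configuration
@EnableScheduling
@EnableSchedulerLock(defaultLockAtMostFor = "PT1H")
class SchedulingConfig {
    // A LockProvider bean (e.g. JdbcTemplateLockProvider) goes here; omitted for brevity.
}

@Component
class DlqRetryScheduler {

    // Every instance triggers the schedule, but ShedLock grants the lock to only one of them.
    @Scheduled(fixedRateString = "${dlq.retry.fixed-rate}")
    @SchedulerLock(name = "dlqRetryJob", lockAtLeastFor = "PT5M", lockAtMostFor = "PT1H")
    public void retryPendingEvents() {
        // Select up to batch-size PENDING events whose retry_after has passed and
        // reprocess them; the selection query and per-event handling are shown below.
    }
}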

Retry Configuration

dlq:
  retry:
    enabled: true
    max-retries: 240
    batch-size: 50
    fixed-rate: 21600000 # 6 hours in milliseconds
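For reference, one way to expose these settings to the scheduler is a typed properties class. The sketch below is an assumption rather than the original code: it relies on Spring Boot's relaxed binding to map max-retries, batch-size, and fixed-rate to camelCase components, and assumes @ConfigurationPropertiesScan (or @EnableConfigurationProperties) is enabled elsewhere.

import org.springframework.boot.context.properties.ConfigurationProperties;

// Binds the dlq.retry block above; kebab-case keys map to these components via relaxed binding.
@ConfigurationProperties(prefix = "dlq.retry")
public record DlqRetryProperties(boolean enabled, int maxRetries, int batchSize, long fixedRate) {
}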

How Retries Work

  • The scheduler runs every six hours.
  • Up to fifty eligible events are picked up per run.
  • Events exceeding the maximum retry count are skipped.
  • Successful retries immediately transition the event status to SUCCEEDED.
  • Failures remain in PENDING and are retried in subsequent runs (see the sketch after this list).
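A minimal sketch of that per-event handling follows, assuming a repository that exposes the FOR UPDATE SKIP LOCKED query shown in the next section. The service, processor, and event types are illustrative, and whether the original system pushed retry_after out on each failed attempt (here by another six hours) is an assumption.

import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

interface EventProcessor { void process(String payloadJson); }

record DlqEvent(long id, String eventType, String payload) {}

interface DlqEventRepository {
    List<DlqEvent> findRetryableEvents(String eventType, int maxRetries, int batchSize);
}

@Service
class DlqRetryService {

    private final DlqEventRepository repository;
    private final EventProcessor processor; // same hydration/persistence logic used by the consumer
    private final JdbcTemplate jdbc;

    DlqRetryService(DlqEventRepository repository, EventProcessor processor, JdbcTemplate jdbc) {
        this.repository = repository;
        this.processor = processor;
        this.jdbc = jdbc;
    }

    @Transactional // row locks from SKIP LOCKED are held until the batch completes
    public void retryBatch(String eventType, int maxRetries, int batchSize) {
        for (DlqEvent event : repository.findRetryableEvents(eventType, maxRetries, batchSize)) {
            try {
                processor.process(event.payload());
                jdbc.update("UPDATE dlq_events SET status = 'SUCCEEDED', updated_at = NOW() WHERE id = ?",
                        event.id());
            } catch (Exception ex) {
                // Leave the event PENDING; count the attempt and push the next one out.
                jdbc.update("UPDATE dlq_events SET retry_count = retry_count + 1, "
                        + "retry_after = NOW() + INTERVAL '6 hours', updated_at = NOW() WHERE id = ?",
                        event.id());
            }
        }
    }
}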

Query Implementation

The retry scheduler uses a SQL query with FOR UPDATE SKIP LOCKED to safely select eligible events across multiple instances. This PostgreSQL feature ensures that even if multiple scheduler instances run simultaneously, each will pick up different rows without blocking each other:

// Entity, method, and parameter names are illustrative; table and column names follow the schema above.
@QueryHints(@QueryHint(name = "jakarta.persistence.lock.timeout", value = "-2"))
@Query(value = "SELECT * FROM dlq_events "
        + "WHERE event_type = :eventType AND status = 'PENDING' "
        + "AND retry_count < :maxRetries AND retry_after <= NOW() "
        + "ORDER BY retry_after LIMIT :batchSize FOR UPDATE SKIP LOCKED",
    nativeQuery = true)
List<DlqEvent> findRetryableEvents(@Param("eventType") String eventType,
        @Param("maxRetries") int maxRetries, @Param("batchSize") int batchSize);

The FOR UPDATE SKIP LOCKED clause is crucial here. It allows each instance to lock and process different rows concurrently, preventing duplicate processing while maintaining high throughput. The query hint sets the lock timeout to -2, which Hibernate maps to its SKIP_LOCKED lock option: rows that are already locked by another transaction are simply skipped rather than waited on.

This setup allowed the system to tolerate long downstream outages while avoiding retry storms and unnecessary load on dependent services.

Operational Benefits

With this approach, failures became predictable and observable rather than disruptive. Engineers could inspect failures using plain SQL, identify patterns, and reprocess only the events that mattered. If a downstream dependency was unavailable for hours or even days, events safely accumulated in the DLQ and were retried later without human intervention. If an event was fundamentally bad, it stayed visible instead of being silently dropped.
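As one concrete example of that kind of inspection, a grouped query over the dlq_events columns defined above summarizes what failed in the last day and why; the query itself and the helper class are illustrative, whether run from psql or, as sketched here, from application code.

import java.util.List;
import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;

public class DlqInspection {

    // Summarizes pending failures from the last 24 hours by event type and error reason.
    public static List<Map<String, Object>> recentFailureSummary(JdbcTemplate jdbc) {
        return jdbc.queryForList(
            "SELECT event_type, error_reason, COUNT(*) AS failures, MAX(created_at) AS last_seen "
                + "FROM dlq_events "
                + "WHERE status = 'PENDING' AND created_at > NOW() - INTERVAL '1 day' "
                + "GROUP BY event_type, error_reason "
                + "ORDER BY failures DESC");
    }
}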

Most importantly, this design reduced operational stress. Failures were no longer something to fear; they were an expected part of the system with a clear, auditable recovery path.

My Thoughts

The goal was never to replace Kafka with PostgreSQL. Kafka remained the backbone for high-throughput event ingestion, while PostgreSQL handled what it does best—durability, querying, and observability around failures. By letting each system play to its strengths, we ended up with a pipeline that was resilient, debuggable, and easy to operate.

In the end, using PostgreSQL as a Dead Letter Queue turned failure handling into something boring and predictable. And in production systems, boring is exactly what you want.
