The challenges of soft delete

Original link: https://atlas9.dev/blog/soft-delete.html

## Soft delete strategies: avoiding the `archived_at` trap

Many software projects use "soft delete" (marking records as deleted rather than permanently removing them) for data recovery and compliance. However, simply adding an `archived_at` column can create significant complexity: the database bloats with rarely accessed "dead data", and queries, migrations, and even backups get more complicated.

The author argues against the `archived_at` approach, citing data bloat caused by Terraform's delete-and-recreate behavior and the difficulty of restoring archived records. Instead, they explore alternatives:

* **Application events:** A delete emits an event (e.g. to SQS) that asynchronously archives the data to object storage (such as S3). This keeps the primary database simple but adds infrastructure complexity and some risk of data loss.
* **Triggers:** Copying deleted rows into a separate archive table (typically as JSON) keeps live tables clean and makes cleanup simple. This is the author's preferred starting point.
* **Change data capture (CDC):** Use PostgreSQL's write-ahead log (WAL) to stream delete events to external systems (via tools like Debezium and Kafka). This avoids impact on the primary database but introduces substantial operational overhead.

Ultimately, the best approach depends on the project's needs. The author suggests starting with triggers to keep things simple, while CDC suits teams that already run Kafka or need to stream changes to multiple destinations. With CDC, careful monitoring of WAL retention is essential to avoid crashing the database.

## The challenges of soft delete: Hacker News summary

A Hacker News discussion weighs the pros and cons of "soft delete": marking records as deleted rather than permanently removing them from the database. Although it looks simple, the practice brings complexity.

Many commenters, especially those in finance, favor soft delete because it improves schema clarity, gives analytics and admin tooling consistent access to data, and makes recovery straightforward. They also argue that performance is not a major concern, particularly with immutable-data approaches and database triggers for auditing.

Others caution against adopting it broadly. With high delete rates (50-70%), performance *can* degrade, and soft delete alone does not capture every historical change (updates, for example). Solutions discussed include database partitioning, databases with built-in time travel (such as Datomic), and writing change data capture (CDC) into a separate audit table.

A key takeaway is that soft delete is not a one-size-fits-all solution. Before implementing it, carefully analyze data volume, access patterns, legal requirements (such as GDPR), and the need for a full audit trail. Backups and point-in-time recovery were also raised as viable alternatives.

## Original article

Software projects often implement "soft delete", maybe with a deleted boolean or an archived_at timestamp column. If customers accidentally delete their data, they can recover it, which makes work easier for customer support teams. Perhaps archived records are even required for compliance or audit reasons.

I've run into some trouble with soft delete designs. I'll cover those, and ponder ideas for how I'd build this in the future.

Adding an archived_at column seems to ooze complexity out into queries, operations, and applications. Recovering deleted records does happen, but 99% of archived records are never going to be read.

So, the database tables will have a lot of dead data. Depending on access patterns, that might even be a significant amount of data. I've seen APIs that didn't work well with Terraform, so Terraform would delete + recreate records on every run, and over time that led to millions of dead rows. Your database can probably handle the extra bytes, and storage is fairly cheap, so it's not necessarily a problem, at first.

Hopefully, the project decided on a retention period in the beginning, and set up a periodic job to clean up those rows. Unfortunately, I'd bet that a significant percentage of projects did neither – it's really easy to ignore the archived data for a long time.

At some point, someone might want to restore a database backup. Hopefully that's for fun and profit and not because you lost the production database at 11 am. If your project is popular, you might have a giant database full of dead data that takes a long time to recreate from a dump file.

archived_at columns also complicate queries, operations, and application code. Applications need to make sure they always avoid the archived data that's sitting right next to the live data. Indexes need to be careful to avoid archived rows. Manual queries run for debugging or analytics are longer and more complicated. There's always a risk that archived data accidentally leaks in when it's not wanted. The complexity grows when there are mapping tables involved.
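
For illustration, here's roughly what that looks like with a hypothetical users table and team_members mapping table (the column and table names here are examples, not from any real project):

-- Indexes want to exclude archived rows, so they become partial indexes.
CREATE INDEX idx_users_email_live
    ON users (email)
    WHERE archived_at IS NULL;

-- Every query has to remember the filter, on every table involved.
-- Forgetting one "archived_at IS NULL" silently leaks archived data.
SELECT u.id, u.email
FROM users u
JOIN team_members tm ON tm.user_id = u.id
WHERE tm.team_id = 42
  AND u.archived_at IS NULL
  AND tm.archived_at IS NULL;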

Migrations have to deal with archived data too. Migrations may involve more than just schema changes – perhaps you need to fix a mistake with default values, or add a new column and backfill values. Is that going to work on records from 2 years ago? I've done migrations where these questions were not trivial to answer.
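
As a sketch of that problem (the plan column and its values are hypothetical), a backfill has to decide what archived rows should get:

-- Add a column and backfill it for live rows.
ALTER TABLE users ADD COLUMN plan TEXT;

UPDATE users
SET plan = 'free'
WHERE plan IS NULL
  AND archived_at IS NULL;

-- But what about rows archived two years ago? They may predate the concept
-- of a plan entirely, yet they will resurface if they are ever restored.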

Restoring an archived record is not always as simple as running SET archived_at = null – creating a record may involve calls to external systems as well. I've seen complex restoration code that was always a buggy, partial implementation of the "create" API endpoint. In the end, we removed the specialized restoration code and required all restoration to go through the standard APIs. That simplified the server implementation and ensured that old data that had since become invalid could not be restored incorrectly – it needs to pass the new validation rules.

I'm not a fan of the archived_at column approach. It's simple at first, but in my experience, it's full of pitfalls down the line.

Let's look at some alternatives (in PostgreSQL): application events, triggers, and logical replication.

All these approaches store archived data separately from live data – that may be a separate database table, a separate database, object storage, etc.

One team I worked with took the approach of emitting an event at the application layer when a record was deleted. The event was sent to SQS, and another service would archive that object to S3 (among other things).

This had a few big benefits:

  • The primary database and application code were substantially simpler.
  • Deleting a resource involved cleaning up resources in various external systems. Handling this in an async background system improved performance and reliability.
  • The record and all its related records could be serialized to JSON in an application-friendly layout, rather than a serialized database table layout, which made them easier to work with.

The tradeoffs:

  • Bugs in the application code are more likely, and indeed happened more than once, which meant archived records were lost and manual cleanup of external resources was necessary.
  • It's more infrastructure to understand and operate: multiple services, a message queue, etc.
  • Archived objects in S3 were not easy to query – finding records to restore required extra tooling from the customer support teams.

A trigger can copy a row to an archive table before it's deleted. The archive table can be a single, generic table that stores JSON blobs:

CREATE TABLE archive (
    id UUID PRIMARY KEY,
    table_name TEXT NOT NULL,
    record_id TEXT NOT NULL,
    data JSONB NOT NULL,
    archived_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    caused_by_table TEXT,
    caused_by_id TEXT
);

CREATE INDEX idx_archive_table_record ON archive(table_name, record_id);
CREATE INDEX idx_archive_archived_at ON archive(archived_at);

The trigger function converts the deleted row to JSON:

CREATE OR REPLACE FUNCTION archive_on_delete()
RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO archive (id, table_name, record_id, data)
    VALUES (
        gen_random_uuid(),
        TG_TABLE_NAME,
        OLD.id::TEXT,
        to_jsonb(OLD)
    );
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

Attach this trigger to any table you want to archive:

CREATE TRIGGER archive_users
    BEFORE DELETE ON users
    FOR EACH ROW EXECUTE FUNCTION archive_on_delete();

CREATE TRIGGER archive_documents
    BEFORE DELETE ON documents
    FOR EACH ROW EXECUTE FUNCTION archive_on_delete();

Handling foreign key cascades

When a parent record is deleted, foreign keys declared with ON DELETE CASCADE make PostgreSQL delete the child records too. These child deletes also fire triggers, but in the context of a cascade, you often want to know why a record was deleted.

One approach is to use a session variable to track the root cause:

CREATE OR REPLACE FUNCTION archive_on_delete()
RETURNS TRIGGER AS $$
DECLARE
    cause_table TEXT;
    cause_id TEXT;
BEGIN
    -- Check if we're in a cascade context
    cause_table := current_setting('archive.cause_table', true);
    cause_id := current_setting('archive.cause_id', true);

    -- If this is a top-level delete, set ourselves as the cause.
    -- current_setting(..., true) returns NULL the first time, but can return
    -- '' once the setting has existed in the session, so check for both.
    IF cause_table IS NULL OR cause_table = '' THEN
        PERFORM set_config('archive.cause_table', TG_TABLE_NAME, true);
        PERFORM set_config('archive.cause_id', OLD.id::TEXT, true);
        cause_table := TG_TABLE_NAME;
        cause_id := OLD.id::TEXT;
    END IF;

    INSERT INTO archive (id, table_name, record_id, data, caused_by_table, caused_by_id)
    VALUES (
        gen_random_uuid(),
        TG_TABLE_NAME,
        OLD.id::TEXT,
        to_jsonb(OLD),
        cause_table,
        cause_id
    );
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

Now when you delete a user, you can see which archived documents were deleted because of that user:

SELECT * FROM archive
WHERE caused_by_table = 'users'
AND caused_by_id = '123e4567-e89b-12d3-a456-426614174000';
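
If you need to inspect the archived data itself, jsonb_populate_record can turn the stored JSON back into a typed row. This is a sketch; it assumes the archived JSON still matches the current users schema:

SELECT (jsonb_populate_record(NULL::users, data)).*
FROM archive
WHERE table_name = 'users'
  AND record_id = '123e4567-e89b-12d3-a456-426614174000'
ORDER BY archived_at DESC
LIMIT 1;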

Tradeoffs

Triggers add some overhead to deletes, and the archive table will grow. But:

  • Live tables stay clean – no archived_at columns, no dead rows
  • Cleaning up the archive table is trivial with WHERE archived_at < NOW() - INTERVAL '90 days'.
  • Queries don't need to filter out archived records
  • Indexes stay efficient
  • Applications and migrations only deal with live data
  • Backups of the main tables are smaller

The archive table can even live in a separate tablespace or be partitioned by time if it grows large.
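
To make the cleanup and partitioning points concrete, here's a rough sketch with an illustrative 90-day retention period and example partition bounds:

-- Periodic retention job (e.g. run from cron or pg_cron).
DELETE FROM archive
WHERE archived_at < NOW() - INTERVAL '90 days';

-- Alternatively, declare the archive table as time-partitioned from the start
-- (same columns as above; the partition key must be part of the primary key).
CREATE TABLE archive (
    id UUID NOT NULL,
    table_name TEXT NOT NULL,
    record_id TEXT NOT NULL,
    data JSONB NOT NULL,
    archived_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    caused_by_table TEXT,
    caused_by_id TEXT,
    PRIMARY KEY (id, archived_at)
) PARTITION BY RANGE (archived_at);

CREATE TABLE archive_2025_01 PARTITION OF archive
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Retention then becomes dropping whole partitions instead of deleting rows.
DROP TABLE archive_2025_01;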

PostgreSQL's write-ahead log (WAL) records every change to the database. Change data capture (CDC) tools can read the WAL and stream those changes to external systems. For archiving, you'd filter for DELETE events and write the deleted records to another datastore.

Debezium is the most well-known tool for this. It connects to PostgreSQL's logical replication slot, reads changes, and publishes them to Kafka. From there, a consumer writes the data wherever you want – S3, Elasticsearch, another database, etc.

PostgreSQL → Debezium → Kafka → Consumer → Archive Storage

For simpler setups, there are lighter-weight alternatives:

  • pgstream – streams WAL changes directly to webhooks or message queues without Kafka
  • wal2json – a PostgreSQL plugin that outputs WAL changes as JSON, which you can consume with a custom script
  • pg_recvlogical – PostgreSQL's built-in tool for reading logical replication streams
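
As a minimal sketch of the lighter-weight end of that list (assuming wal_level = logical and the wal2json plugin is installed), a slot can be created and read with plain SQL:

-- Create a logical replication slot that decodes WAL changes into JSON.
SELECT pg_create_logical_replication_slot('archive_slot', 'wal2json');

-- Peek at pending changes without consuming them. DELETE events only include
-- the full old row if the table has REPLICA IDENTITY FULL; by default they
-- carry just the replica identity (usually the primary key).
SELECT data
FROM pg_logical_slot_peek_changes('archive_slot', NULL, NULL);

-- A real consumer would call pg_logical_slot_get_changes() instead, which
-- advances the slot, and filter the JSON for delete actions before writing
-- them to the archive datastore.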

Operational complexity

The main downside is operational overhead. You're running additional services that need to be monitored, maintained, and made fault-tolerant. Debezium with Kafka is a significant infrastructure investment – Kafka alone requires careful tuning and monitoring.

The lighter-weight alternatives reduce this burden but shift reliability concerns to your custom code. If your consumer crashes or falls behind, you need to handle that gracefully.

WAL retention and max_wal_size

A critical concern is WAL retention. PostgreSQL keeps WAL segments until every replication slot has consumed them, and that retention is not capped by max_wal_size. If your CDC consumer stops processing – due to a bug, network issue, or downstream failure – WAL segments accumulate on the primary.

If this continues unchecked, the primary database can run out of disk space and crash.

PostgreSQL 13+ has max_slot_wal_keep_size to limit how much WAL a slot can retain:

ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';

If a slot falls too far behind, PostgreSQL invalidates it rather than filling the disk. This protects the primary but means your CDC pipeline loses data and needs to be re-synced from a snapshot.

You need monitoring and alerting on replication slot lag. If a slot starts falling behind, you want to know before it becomes a crisis.
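
A simple starting point (PostgreSQL 13+; the view and columns are built-in) is to watch how much WAL each slot is holding back:

SELECT slot_name,
       active,
       wal_status,      -- 'reserved', 'extended', 'unreserved', or 'lost'
       safe_wal_size,   -- bytes that can still be written before invalidation
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;

Feed that into whatever alerting you already have; by the time wal_status reports 'lost', the slot has already been invalidated.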

Tradeoffs

WAL-based CDC has real strengths:

  • It captures all changes without modifying application code or adding triggers
  • It can stream to any destination (object storage, data warehouses, search indexes)
  • The primary database takes on no additional query load – it just writes WAL as normal

But:

  • Significant operational complexity, especially with Kafka-based setups
  • Risk to primary database stability if consumers fall behind
  • Schema changes require careful coordination between source and consumers
  • More infrastructure to understand, deploy, and debug

This approach makes the most sense when you already have Kafka or similar infrastructure, or when you need to stream changes to multiple destinations beyond just archiving.

This is an idea I had never considered until I wrote this post. I haven't tested it; it's just an idea.

What if you kept a PostgreSQL replica (e.g. using logical replication) that just didn't process DELETE queries? Would it effectively accumulate records and updates without conflict over time?

One potential benefit of this is that the archive can be easily queried, so finding old data is simple.

Would the replica have any information about deletes? Could it separate live from deleted data? Would you be able to find a record that was "deleted 2 hours ago in account 123" for a customer? Perhaps instead of ignoring DELETE queries entirely, you could have a specialized replica that transforms DELETE events into an archived_at column.

One potential pitfall here could be schema migrations – would the archive run into difficulty applying migrations over time?

Another downside might be cost – running a replica and keeping all that storage isn't free, and it adds operational overhead.

If I were starting a new project today and needed soft delete, I'd reach for the trigger-based approach first. It's simple to set up, keeps live tables clean, and doesn't require extra infrastructure. The archive table is easy to query when you need it, and easy to ignore when you don't.

If you have thoughts, comments, feedback, shoot me an email at [email protected].
