The pitfalls of partitioning Postgres yourself

Original link: https://hatchet.run/blog/postgres-partitioning

Hatchet, a durable queue built on Postgres, began to see performance degrade as task volume grew to hundreds of millions per day. To address index and table bloat and slow deletions, the team implemented time-based partitioning on their core tables (v1_task and v1_task_event). They chose to build a custom partitioning system to avoid a dependency on external extensions. After a successful rollout and load testing, performance unexpectedly degraded: queries involving JOINs on the partitioned tables slowed dramatically because the Postgres query planner's row estimates were inaccurate. The problem persisted despite carefully tuned autovacuum settings. The root cause was a subtlety of Postgres partitioning behavior: `ANALYZE` does not run automatically on parent partitioned tables. Manually running `ANALYZE` on those tables fixed the problem immediately and restored query performance. The team now runs `ANALYZE` on the parent tables on a schedule, and recommends using `DETACH PARTITION...CONCURRENTLY` to drop partitions, `DETACH PARTITION...FINALIZE` for cleanup, and making sure autovacuum runs on the child partitions. The experience underscores how important it is to understand Postgres internals when implementing partitioning.

This Hacker News discussion centers on the complexity of PostgreSQL's built-in partitioning, highlighting its unexpected pitfalls. The poster describes the challenges they hit when partitioning tables, which left them needing to run `ANALYZE` frequently as query patterns kept changing. They also ask whether the intermediate tables in a two-level partitioning setup (e.g., by time series, then by an enum) need to be analyzed. Commenters share similar experiences, noting that the number of partitions growing over time can become a "ticking time bomb," and suggest multi-level partitioning (year -> week) as a possible remedy. A key takeaway is that while PostgreSQL offers native partitioning, avoiding performance problems and keeping query plans accurate requires deep understanding; it is not always straightforward. Several users point out that documentation and discussions (linked in the comments) detail these subtleties and potential gotchas.

Original Article

If you dive deep into Hatchet's codebase these days, you might get the impression that we don't know what we're doing. In particular, for a seasoned Postgres user, this method might stand out to you:

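(The original snippet didn't survive in this copy of the post; a minimal sketch of what such a method executes, assuming the table names introduced later in the article:)

```sql
-- A sketch, not Hatchet's actual code: the method simply issues ANALYZE
-- against a handful of partitioned parent tables, on a schedule.
ANALYZE v1_task;
ANALYZE v1_task_event;
-- ...and a few other high-throughput tables.
```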

This method goes through a handful of our Postgres tables and runs ANALYZE <table_name>; on each, seemingly at random.

Yes, we do have autovacuum enabled on our databases (and we expect users who are self-hosting to have it enabled as well), which means that tables are autoanalyzed after they get autovacuumed. Why are we manually running analyze on our tables all the time?

While there are a bunch of ways in which we don't know what we're doing, this particular bit of code comes from a hard-won set of production incidents, and some nuances of Postgres's autoanalyze implementation.

Why partition?

For some background, Hatchet is a durable queue, built on top of Postgres, which offers useful features for running complex background jobs, like workflows and durable execution. As a result, we store up to hundreds of millions of tasks per day inside of a set of Postgres tables. In the first version of Hatchet, we used to assign each task a UUID and store runs in a single table. This worked well, until we hit ~200 million rows and started to see our indexes and tables bloat.

Even worse was deleting old tasks to keep storage costs down; we had to run a chunked deletion process almost 24/7, and because of various foreign keys and constraints, this wasn't always fast enough to keep up with the ever-growing size of the Postgres table.

So we started looking into time-based partitioning of our tables. For those unaware, Postgres has built-in support for partitioning, including range-based and hash-based partitioning.
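For illustration (generic DDL, not Hatchet's schema), the two flavors are declared like this:

```sql
-- Range partitioning: rows are routed to a partition by a range of the key.
CREATE TABLE measurements (
    recorded_at timestamptz NOT NULL,
    value       double precision
) PARTITION BY RANGE (recorded_at);

-- Hash partitioning: rows are routed by a hash of the key, spreading them
-- evenly across a fixed number of partitions.
CREATE TABLE accounts (
    id bigint NOT NULL
) PARTITION BY HASH (id);
```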

We also explored a few extensions, like pg_partman and TimescaleDB, but Hatchet is open-source, and we wanted to make it possible to deploy Hatchet to any Postgres provider without requiring users to manage an extension.

We decided to roll our own partitioning system.

Building our partitioning system

It's surprisingly simple to do your own partitioning with Postgres. The most difficult part — or so we thought — is managing the partitions and (in our case) making sure they get properly detached and deleted after a certain retention window.

We partition only a handful of tables that would otherwise have billions of rows. We're going to focus on two tables, v1_task (the table which holds our task data), and v1_task_event, which stores an event history per task (like completion events). Here's a simplified schema:

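(The original schema block was lost in this copy; the following is a hedged reconstruction from the details in the post, daily range partitioning with compound primary keys, with the non-key columns invented for illustration:)

```sql
CREATE TABLE v1_task (
    id          bigint      NOT NULL,
    inserted_at timestamptz NOT NULL,
    -- ...task payload columns elided...
    PRIMARY KEY (id, inserted_at)
) PARTITION BY RANGE (inserted_at);

CREATE TABLE v1_task_event (
    id               bigint      NOT NULL,
    task_id          bigint      NOT NULL,
    task_inserted_at timestamptz NOT NULL,
    -- ...event columns (type, data) elided...
    PRIMARY KEY (id, task_inserted_at)
) PARTITION BY RANGE (task_inserted_at);
```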

Our partitions are daily and rolling, so we can't just declare the partitions in the schema during a migration. We therefore have an application-side process which is responsible for:

  1. Creating new partitions ahead of time (with a 2-day lead time, so we have plenty of time to respond in case a partition doesn't get created).
  2. Dropping old partitions after the retention window. To avoid locking the parent partition table for longer than necessary, we utilize DETACH PARTITION...CONCURRENTLY followed by DROP TABLE.

This last trick of using CONCURRENTLY takes a less restrictive lock on the parent table, but it risks leaving an orphaned detach in progress, which must be finalized if the process fails.
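In SQL, the lifecycle of a single daily partition looks roughly like this (table names and dates are illustrative):

```sql
-- 1. Create the partition ahead of time, with explicit day boundaries.
CREATE TABLE v1_task_20250103 PARTITION OF v1_task
    FOR VALUES FROM ('2025-01-03') TO ('2025-01-04');

-- 2. After the retention window, detach it. CONCURRENTLY takes a weaker
--    lock on the parent, but can't run inside a transaction and leaves the
--    partition half-detached if interrupted.
ALTER TABLE v1_task DETACH PARTITION v1_task_20241201 CONCURRENTLY;

-- 3. The detached partition is now an ordinary standalone table; drop it.
DROP TABLE v1_task_20241201;
```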

The rollout of our new partitioning system initially went without a hitch. We had bundled this into our Hatchet v1 migration (essentially a rewrite of the entire Hatchet backend), and had run a bunch of load tests against the new system to make sure that our partitioning held up under load.

The incident

For a few days, everything was great, until we noticed some queries getting slower. This trend continued for the next few days, with some queries on the partitioned tables taking up to 10x longer than they should, from just a few milliseconds to >20 milliseconds. This resulted in massive spikes of active sessions and CPU when under even moderate amounts of load:

[Image: spike in active sessions and CPU during the incident]

We noticed that the slow queries had two things in common:

  • They all involved a partitioned table in a JOIN statement somewhere.
  • They selected many rows using a compound primary key. Our partitioned tables all have compound primary keys and we do a lot of batched reads and writes.

At first, we suspected that we simply had too many partitions that were being scanned, and there was some type of shared lock which was preventing or slowing index scans on each table. To test this hypothesis, we tried to get the Postgres query planner to prune partitions in the planning phase. But this didn't help.
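(For context, plan-time pruning kicks in when the partition key is constrained by constants; a generic illustration, not the actual queries from the incident:)

```sql
-- With constant bounds on the partition key, EXPLAIN shows only the single
-- matching daily partition in the plan; all others are pruned at plan time.
EXPLAIN
SELECT *
FROM v1_task
WHERE inserted_at >= '2025-01-02' AND inserted_at < '2025-01-03';
```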

We next tried to reproduce the issue locally. This proved rather difficult, because the volume of data that we stored in these tables wasn't easy to deal with locally. But we were fortunately able to come up with a rather trivial example which illustrated the query discrepancy.

After creating 14 days worth of partitions, each with 1 million rows in them, we ran an EXPLAIN ANALYZE on a query quite similar to the following:

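(The original query was lost in this copy; a hedged reconstruction in its spirit, joining the two partitioned tables and filtering on the compound primary key:)

```sql
-- A batched lookup: join the partitioned tables and filter by many
-- compound-primary-key values (values are illustrative).
SELECT t.*, e.*
FROM v1_task t
JOIN v1_task_event e
  ON e.task_id = t.id
 AND e.task_inserted_at = t.inserted_at
WHERE (t.id, t.inserted_at) IN (
    (1, '2025-01-02 00:00:00+00'),
    (2, '2025-01-02 00:00:00+00')
    -- ...thousands more pairs in the real workload...
);
```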

This resulted in the following query plan: https://explain.dalibo.com/plan/fa91509f3bfh84e8. Most notably, the row estimate is off by a factor of 6,100,000!

[Image: the query plan with the bad row estimate]

Why row estimates matter

Postgres has a built-in tool called the query planner to pick the optimal query strategy. This query planner uses statistics on the target table to figure out whether or not to use an index, which type of merge strategy to use on joins, etc.

If the query planner doesn't have correct statistics, it can create wildly suboptimal query plans, such as deciding to do a sequential scan across an entire table instead of an index-only scan. If you expect a query which filters specifically by a primary key to take a few milliseconds, but instead that query starts doing a different type of scan taking tens or hundreds of milliseconds, it can build up database pressure incredibly quickly.

The process responsible for maintaining query statistics is ANALYZE. We tune autovacuum settings quite carefully on our databases, ensuring that all high-throughput tables are autovacuumed very frequently and that we don't have backlogs on autovacuum processes.
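A handy way to check when tables were last analyzed is the standard statistics view (a stock Postgres diagnostic, not a snippet from the post):

```sql
-- Shows the most recent manual and automatic ANALYZE per table.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname LIKE 'v1_task%'
ORDER BY relname;
```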

So why were our query statistics off?

The solution

At this point, we were quite stumped, so we called in an expert: Laurenz Albe. After going through our repro, Laurenz made the suggestion to run ANALYZE on the parent table manually. We were a little skeptical, but we ran it anyway - and this immediately fixed our problems.

It turns out that we missed a little footnote in the Postgres docs:

Partitioned tables do not directly store tuples and consequently are not processed by autovacuum. (Autovacuum does process table partitions just like other tables.) Unfortunately, this means that autovacuum does not run ANALYZE on partitioned tables, and this can cause suboptimal plans for queries that reference partitioned table statistics. You can work around this problem by manually running ANALYZE on partitioned tables when they are first populated, and again whenever the distribution of data in their partitions changes significantly.

We put up a PR, and all of our rollout issues were resolved.

[Image: Slack message]

As an aside: why didn't our load tests catch this? Unfortunately, our "hot-path" from when a task is queued to when it's executed very rarely involves large JOINs on the partitioned tables, and this was the bulk of our focus for the load testing. We also didn't populate many days worth of partitions in our load tests, opting instead to push a single partition to its limit, so perhaps the query planner was able to make better estimates with fewer partitions.

Tips on partitioning

We learned a few things that were rather non-obvious from partitioning tables ourselves - primarily:

  1. Use DETACH PARTITION...CONCURRENTLY to reduce lock contention on the parent partitioned table when removing old partitions.
  2. Use DETACH PARTITION...FINALIZE to finish an orphaned detach, since CONCURRENTLY isn't transactional (see the sketch below).
  3. Run ANALYZE every once in a while on your parent partitioned tables (and ensure that autovacuum is running for the child partitions).
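For the second tip, recovering from an interrupted concurrent detach looks like this (names are illustrative):

```sql
-- If DETACH PARTITION ... CONCURRENTLY is interrupted, the partition is left
-- half-detached; FINALIZE completes the detach so the table can be dropped.
ALTER TABLE v1_task DETACH PARTITION v1_task_20241201 FINALIZE;
DROP TABLE v1_task_20241201;
```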

We are very grateful to Laurenz for helping us debug this issue! For more reading, there's also a post on the Cybertec blog about this.
