Memory Size Matters to PostgreSQL

Original link: https://pgdba.org/post/2025/04/size_matter/

PostgreSQL's shared buffer is the key memory segment through which data flows, and it significantly affects performance. It started with a simple LRU algorithm and, after a few less efficient iterations, evolved into the current clock sweep algorithm. The algorithm keeps prime replacement candidates in a "free list" and manages eviction through a circular pointer (nextVictimBuffer). When the free list is empty, the algorithm scans the shared buffer, decrementing usage counters, to find a buffer to evict. Bulk operations bypass the main buffer pool using ring buffers of various sizes. Although the PostgreSQL documentation suggests setting shared_buffers to 25% of system RAM as a starting point, a larger shared buffer can pay off when the entire data area fits into it. If the data area exceeds the buffer size, however, performance may degrade because of the clock sweep's full scans. On machines with plenty of RAM, a shared_buffers above 64GB may cause a performance regression. Tuning the default shared_buffers size is essential for getting the best performance.


    Nowadays it’s not uncommon to deal with machines with hundreds of GB of RAM.

    Abundant memory can give PostgreSQL a massive performance boost. However, things work slightly differently from what you might expect.

    Let’s find out!

    The shared buffer is a segment allocated by PostgreSQL in the server’s memory for managing the flow of data between the data area and the backends.

    Whether we are reading or changing the data, the database's physical pages are stored in the shared buffer.

    During normal database activity it is quite common for stored buffers to be evicted to make room for data pages not yet present in the shared buffer.

    In the ancient PostgreSQL 7.4 there was a simple least recently used (LRU) algorithm in place. This simplistic approach had many limitations, in particular because it didn't consider whether the candidate buffer was used frequently.

    With PostgreSQL 8.0.0 the memory manager was replaced by the adaptive replacement cache (ARC). However, because the ARC was patented, it was replaced by the less efficient two-queue (2Q) algorithm in PostgreSQL 8.0.1.

    PostgreSQL 8.1 replaced the inefficient 2Q algorithm with a new implementation called the clock sweep, which is still in use in modern PostgreSQL.

    The clock sweep

    In the shared buffer there is a "free list" of buffers that are prime candidates for replacement. Buffers that contain no valid page are always on this list.
    It would be possible to add buffers to this list, for example when their pages are unlikely to be needed soon, but the current algorithm never does that.

    The normal buffer replacement strategy is therefore managed by the clock sweep via a circular pointer named nextVictimBuffer, protected by a separate system-wide spinlock called buffer_strategy_lock.

    This spinlock provides mutual exclusion for operations that access the buffer free list or select buffers for replacement. While buffer_strategy_lock is held, no other locks of any sort may be acquired; keeping this critical section small allows buffer replacement to happen in multiple backends with reasonable concurrency.

    As explained in src/backend/storage/buffer/README, each buffer in memory has a usage counter that is incremented by one, up to a small maximum value, every time the buffer is pinned by a backend.

    When searching for a buffer to evict, the clock sweep algorithm does the following:

    1. Obtain buffer_strategy_lock.
    2. If there are buffers in the free list, remove the head buffer and release buffer_strategy_lock. If the selected buffer is pinned or has a nonzero usage count, ignore it and go back to step 1. Otherwise, pin the buffer and return it.
    3. When the buffer free list is empty, select the buffer pointed to by nextVictimBuffer and circularly advance nextVictimBuffer ready for the next time; then release buffer_strategy_lock.
    4. If the selected buffer is pinned or has a nonzero usage count, it cannot be used: decrement the usage count (if nonzero), reacquire buffer_strategy_lock, and return to step 3 to examine the next buffer.
    5. Pin the selected buffer and return it.

    Note that if the selected buffer is dirty, it must be written out before it can be recycled. If the buffer gets pinned in the meantime, the whole process gives up and another buffer is tried. A simplified sketch of the loop follows.
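
    Here is a minimal, single-threaded Python sketch of the replacement loop described above, written purely for illustration: the class and method names are invented, and locking, the buffer lookup table and dirty-page writeback are deliberately left out.

        BM_MAX_USAGE_COUNT = 5  # PostgreSQL caps the usage counter at a small value

        class Buffer:
            def __init__(self):
                self.usage_count = 0
                self.pinned = False

        class BufferPool:
            def __init__(self, n_buffers):
                self.buffers = [Buffer() for _ in range(n_buffers)]
                # At startup no buffer holds a valid page, so all start on the free list.
                self.free_list = list(range(n_buffers))
                self.next_victim = 0  # the clock hand (nextVictimBuffer)

            def pin(self, buf):
                buf.pinned = True
                buf.usage_count = min(buf.usage_count + 1, BM_MAX_USAGE_COUNT)

            def get_victim(self):
                # Steps 1-2 (minus the locking): try the free list first.
                while self.free_list:
                    buf = self.buffers[self.free_list.pop(0)]
                    if buf.pinned or buf.usage_count > 0:
                        continue  # ignore it and try the next head
                    self.pin(buf)
                    return buf
                # Steps 3-5: free list empty, run the clock sweep.
                while True:
                    buf = self.buffers[self.next_victim]
                    self.next_victim = (self.next_victim + 1) % len(self.buffers)
                    if buf.pinned:
                        continue
                    if buf.usage_count > 0:
                        buf.usage_count -= 1  # grant one more trip around the clock
                        continue
                    self.pin(buf)
                    return buf

        pool = BufferPool(4)
        victim = pool.get_victim()  # at startup this comes straight from the free list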

    The first thing that becomes clear is that the free list is used only for a limited period after the instance starts. Once every buffer holds a valid page, the free list becomes empty and the algorithm never populates it again. From then on, nextVictimBuffer keeps moving through the shared buffer searching for a candidate for eviction.

        flowchart TD
            A(Start Clock Sweep) --> B{Is Current Buffer a Candidate?}
            B -- No --> D[Advance Clock Hand]
            D --> B
            B -- Yes --> C{Is Buffer in Use?}
            C -- Yes --> D
            C -- No --> F{Is Buffer Dirty?}
            F -- Yes --> G[Schedule Write to Disk]
            F -- No --> H[Evict Buffer]
            G --> I[Wait for Write Completion]
            I --> H
            H --> J[Replace with New Buffer]
            J --> K(End Clock Sweep)
            style A fill:#fff,stroke:#333,stroke-width:2px
            style B fill:#ccf,stroke:#333,stroke-width:2px
            style C fill:#9cf,stroke:#333,stroke-width:2px
            style D fill:#fcc,stroke:#333,stroke-width:2px
            style F fill:#f9c,stroke:#333,stroke-width:2px
            style G fill:#f6c,stroke:#333,stroke-width:2px
            style H fill:#c9f,stroke:#333,stroke-width:2px
            style I fill:#9ff,stroke:#333,stroke-width:2px
            style J fill:#9fc,stroke:#333,stroke-width:2px
            style K fill:#eee,stroke:#333,stroke-width:2px

    Since the usage counter is incremented on every pin, it's also clear that by the time the free list becomes empty, every buffer in the shared buffer has a usage count of at least 1. At least one full scan of the shared buffer is therefore needed before a candidate buffer can be found.

    The ring buffer strategy

    The normal buffer replacement strategy is overridden under certain conditions that would otherwise rewrite the entire shared buffer.

    Bulk operations such as VACUUM or a large sequential scan use a small ring buffer that does not disturb the rest of the shared buffer.

    In particular, large sequential scans use a 256KB ring.

    For VACUUM the size of the ring is controlled by the vacuum_buffer_usage_limit GUC.

    For bulk writes (currently only COPY IN and CREATE TABLE AS SELECT) the ring size is 16MB, capped at 1/8th of the shared_buffers size.
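
    A hypothetical helper showing the bulk-write capping rule above; the function name and the use of byte units are my own for illustration, not PostgreSQL's actual code.

        # Ring size for bulk writes: 16MB, but never more than 1/8th of
        # shared_buffers. All sizes are in bytes.
        def bulk_write_ring_size(shared_buffers: int) -> int:
            return min(16 * 1024 * 1024, shared_buffers // 8)

        MB = 1024 * 1024
        print(bulk_write_ring_size(128 * MB) // MB)  # 16 (default shared_buffers)
        print(bulk_write_ring_size(64 * MB) // MB)   # 8  (the 1/8th cap kicks in)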

    The PostgreSQL documentation suggests setting shared_buffers to 25% of the system's RAM as a starting point, and discourages going above 40%.

    A shared buffer set at 25% of RAM works fine when the server's memory sits between 4 GB and 100 GB, which gives a shared_buffers range between 1 GB and 25 GB.

    With less than 4 GB of memory, however, sizing the shared buffer correctly is not an easy task and depends on many factors, including the virtual memory (vm) settings on Linux.

    The interesting aspect of the shared_buffers size emerges when there is plenty of RAM.

    If, for example, we consider a system with 400 GB of RAM, a 25% shared buffer would be 100 GB.

    The performance of a large shared buffer depends on the underlying data area.

    The most favorable scenario is when the data area is smaller than the shared buffer. All the buffers will be cached in the shared buffer using only the free list, and no extra work will be required from the clock sweep to manage the memory. The usage counters will reach their maximum value and never go down.

    However, if the data area doesn't fit in the shared buffer, the free list will eventually become empty and the normal buffer replacement strategy will kick in. When a buffer is loaded into memory it is pinned, so its usage counter starts at one. That means that with a fully packed shared buffer, nextVictimBuffer must scan all buffers at least once before finding a buffer with a usage counter of zero.

    Modern DDR4 memory has a theoretical throughput of 25-30 GB/s; realistically it ranges between 5 and 10 GB/s. With a 100 GB fully packed shared buffer, the time required to perform one single full scan ranges between roughly 3 and 20 seconds.
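
    A back-of-the-envelope check of those numbers, taking the article's assumption that a full sweep has to touch the whole 100 GB:

        # Full-scan time = shared buffer size / memory throughput.
        shared_buffers_gb = 100
        for throughput_gb_s in (30, 10, 5):
            print(f"{throughput_gb_s} GB/s -> {shared_buffers_gb / throughput_gb_s:.1f} s")
        # 30 GB/s -> 3.3 s   (theoretical best case)
        # 10 GB/s -> 10.0 s
        # 5 GB/s -> 20.0 s   (realistic worst case)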

    Obviously the timing may vary greatly depending on the conditions and the workload.

    If, for example, nextVictimBuffer decrements a buffer's usage counter to zero but another backend pins the buffer before it is scanned again, then the buffer can't be evicted on the next pass, increasing the wait time for a new buffer.

    As a rule of thumb, considering 64GB the upper limit for shared_buffers before a regression appears seems reasonable.
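
    A hypothetical helper pulling together the sizing guidance discussed in this post: start from the documentation's 25% of RAM and stay below the 64GB rule of thumb. It illustrates the heuristics; it is not official advice.

        # 25% of RAM as a starting point, capped at the 64GB rule of thumb.
        def suggested_shared_buffers_gb(ram_gb: float) -> float:
            if ram_gb < 4:
                raise ValueError("below 4 GB, sizing depends on too many factors")
            return min(ram_gb * 0.25, 64)

        print(suggested_shared_buffers_gb(16))   # 4.0
        print(suggested_shared_buffers_gb(400))  # 64, not 100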

    Sizing PostgreSQL's shared buffer is not an easy task. It's very important to understand how the buffer manager deals with block eviction in order to find the correct setting for the underlying data area.

    In any case, unless you are running local tests, it's very important to change the default value of shared_buffers set by initdb.

    The parameter is conservatively set to a small amount of memory (currently 128MB), and the fact that PostgreSQL just works even with the default settings is not a good reason to forget about it.
