How DSQL makes sure sequences scale

Original link: https://blog.benjscho.dev/technical/2026/02/13/dsql-sequences.html

## DSQL Sequence Support: Summary

DSQL now supports sequences for generating unique identifiers, offering an alternative to UUIDs, especially when numeric IDs are needed. While UUIDs remain the ideal choice at truly massive scale, sequences are a viable option for many use cases. Sequences work much like they do in Postgres: they increment and cache values to minimize disk I/O. In DSQL’s distributed architecture, however, every `nextval()` call traverses the data’s full “circle of life” to ensure consistency. Scaling sequences is challenging because they are a classic “hot key”, but DSQL leverages horizontal scalability to mitigate this. The `CACHE` value is critical: DSQL recommends `CACHE=1` for densely packed, low-throughput sequences, or `CACHE>=65536` for scalability. Experiments show that a large cache (65536) clearly outperforms `CACHE=1` in bulk-insert and concurrent scenarios, achieving throughput comparable to UUIDs. Choose `CACHE=1` for low-volume inserts; high cache values ensure a sequence never becomes a bottleneck. The cache can be adjusted later with `ALTER SEQUENCE`. For applications that prioritize scalability above all, UUIDs remain the recommended approach.


## Original Article

Sequences are one of those Postgres features that you don’t think much about. You can ask for the next number in the sequence, and you get it. That works pretty well when you have one machine asking for the next number, but what about 10,000?

We’ve just launched Sequence support in DSQL and we’re excited about it. Up until now our recommendation has been to use UUIDs, and for truly massive scale it still is. But we recognize that there are plenty of places where you’d like to use a unique number as an identity on a table.

If you just want to get started with sequences, here’s how you make one in DSQL:

CREATE SEQUENCE my_sequence CACHE 65536;
SELECT nextval('my_sequence'); 

Or if you want to use it as an identity in a column:

CREATE TABLE orders (
  id BIGINT GENERATED BY DEFAULT AS IDENTITY (CACHE 65536) PRIMARY KEY,
  customer_name TEXT
);

This blog post is going to go into more detail about why sequences look the way they do in DSQL, but to get started, that’s all you need to know!

Sequences in DSQL

In single box (or single writer) SQL systems, sequences are very lightweight. They provide unique values without a unique index and without taking heavy locks. In Postgres, sequences are stored like any other data: in a table. That table is then stored for durability on disk, with writes and updates going through a log for crash recovery. When a backend process calls nextval(), it reads the value, increments it, and writes the new value back. To avoid going to disk too much, backends can also cache a number of values, set by the CACHE value of the sequence; we’ll come back to that later.
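
As a quick Postgres-level sketch (the sequence name here is illustrative), the cache size is just part of the DDL:

-- Each backend pre-allocates 3 sequence values at a time
CREATE SEQUENCE demo_seq CACHE 3;

SELECT nextval('demo_seq'); -- reserves 1-3 for this backend, returns 1
SELECT nextval('demo_seq'); -- served from the backend's local cache, returns 2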

In a distributed architecture things are a little less simple, but not by much! Marc’s blog on the circle of life is a good primer on DSQL’s architecture. When you call nextval(), the read goes to storage, checks the latest value, and increments the value before writing the update to the journal. So far so simple? The important thing to remember is that getting the next set of values in a sequence goes through a full circle of life.
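
Spelled out against that architecture, a single call conceptually looks like this (my annotation of the flow described above, not DSQL internals verbatim):

SELECT nextval('my_sequence');
-- 1. the QP reads the sequence's latest value from Storage
-- 2. it increments the value
-- 3. it writes the update to the Journal: one full circle of life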

In distributed systems, creating scalable applications is a mutual responsibility. The big problem with scaling sequences is that they’re a classic hot key. DSQL’s advantage is that we can support endless horizontal scaling at every component, but to scale, DSQL needs to spread your changes out across workers. For typical inserts we do that by partitioning the data. However, you can’t partition a single-row table.

CACHE to the rescue

Remember the CACHE value from sequences in Postgres? This sets how many values a given backend reserves each time it has to fetch from the sequence. So with CACHE=3, a backend fetches 3 values the first time it calls nextval(), and can then serve the following calls from its local cache without performing extra, expensive IO.

                         ┌─────────────┐
                         │    Disk     │
                         │  seq = 10   │
                         └──────┬──────┘
                                │
           ┌────────────────────┼────────────────────┐
           │                    │                    │
           ▼                    ▼                    ▼
    ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
    │  Backend A  │      │  Backend B  │      │  Backend C  │
    │ cache: 1-3  │      │ cache: 4-6  │      │ cache: 7-9  │
    └─────────────┘      └─────────────┘      └─────────────┘
           │                    │                    │
           ▼                    ▼                    ▼
      nextval()=1          nextval()=4          nextval()=7
      nextval()=2          nextval()=5          nextval()=8
      nextval()=3          nextval()=6          nextval()=9

Each backend reserves a chunk of sequence values on its first call. Subsequent nextval() calls return from the local cache without going to disk. If a backend crashes, the unused values in its cache are discarded, so higher cache values can result in gaps in sequences.
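
To make the gap behaviour concrete, here’s a sketch of what two sessions might observe (sequence name and values are illustrative):

-- Session A (CACHE 3) reserves values 1-3:
SELECT nextval('gapped_seq'); -- returns 1
-- Session B (CACHE 3) reserves values 4-6:
SELECT nextval('gapped_seq'); -- returns 4
-- If session A crashes now, 2 and 3 are discarded forever:
-- values continue at 5, 6, 7, ..., leaving a permanent gap.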

DSQL parallelises our compute in the form of the Query Processor. Instead of individual backends, statements are executed by QPs. Here, CACHE functions the same as in Postgres. When a QP calls nextval(), it gets a cached set of values and hands them out. So now to the elephant in the room for DSQL sequence support: we only support CACHE=1 or CACHE>=65536.
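
Concretely, the DDL either asks for dense values or for scale, with nothing in between (the middle statement is my illustration of the constraint; the exact failure mode is an assumption):

CREATE SEQUENCE dense_seq CACHE 1;        -- accepted: dense, low throughput
CREATE SEQUENCE scalable_seq CACHE 65536; -- accepted: built to scale
CREATE SEQUENCE middle_seq CACHE 100;     -- rejected: only CACHE=1 or CACHE>=65536 is supported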

The point of these values is to highlight the decision for the developer. Either you want your sequence to be densely packed and low throughput, or you want it to be able to scale. With large cache values (>=65k), sequences are rarely a bottleneck in DSQL transactions.

What if I don’t need scale though?

That is totally fine too! Not every project is trying to hit 100k TPS. We also know there are plenty of applications where you have a slow rate of inserts and would prefer a dense, increasing sequence. That’s why DSQL supports CACHE=1.
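
If that’s your situation, a dense identity column is just the CACHE 1 spelling of the earlier example (table and column names are illustrative):

CREATE TABLE invoices (
  id BIGINT GENERATED BY DEFAULT AS IDENTITY (CACHE 1) PRIMARY KEY,
  customer_name TEXT
);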

To put some numbers on it, I ran some experiments.

I tested each of these with CACHE 1 and CACHE 65536, and included an example with UUID for a value that doesn’t require coordination. Since a UUID is always locally generated, there’s no way for it to conflict, so it serves as a good baseline.
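
The UUID baseline needs no DDL at all; I assume it’s simply the built-in generator:

SELECT gen_random_uuid(); -- generated locally, no round trip to Storage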

This is what the DDL for my first test looks like:

CREATE SEQUENCE seq_cache_1 CACHE 1;
CREATE SEQUENCE seq_cache_65536 CACHE 65536;

Here in the first test I’m creating two sequences, one with CACHE=1, and one with CACHE=65536. We’re then fetching new values serially, so we’re making one request to get a new value, waiting until we get it back, and then making another. The majority of the time is spent in network time waiting for the request to go from my laptop to DSQL’s QP and back. You’ll notice that the high cache value is faster, because the QP I’m connected to isn’t having to fetch an update from Storage every time, but it’s not faster by much. Comparing with UUID, you can see that’s pretty much the same as our high cache option.

------------------------------------------------------------
Experiment 1: Individual nextval() calls (100 iterations)
------------------------------------------------------------

CACHE=1:
  Mean:  20.26 ms
  Min:   17.94 ms
  Max:   97.10 ms
  Total: 2025.65 ms

CACHE=65536:
  Mean:  13.60 ms
  Min:   11.53 ms
  Max:   26.95 ms
  Total: 1360.24 ms

UUID:
  Mean:  13.31 ms
  Min:   11.81 ms
  Max:   27.11 ms
  Total: 1330.73 ms

Speedup with CACHE=65536 vs CACHE=1: 1.5x faster
Speedup with UUID vs CACHE=1: 1.5x faster

Okay great! But that doesn’t really tell us that much about how it scales. We’re just using a single connection and fetching one value at a time. Let’s look now at the case of a bulk insert. So here, we’re inserting 1000 rows into a table with a sequence:

-- Creating the table
CREATE TABLE bench_table (id BIGINT, data TEXT);

-- My insert statements look like this:
INSERT INTO bench_table (id, data)
SELECT nextval('seq_cache_1'), 'row ' || g
FROM generate_series(1, 1000) g;
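
For the UUID comparison I assume an analogous statement against a UUID-keyed copy of the table (the table name is my invention):

-- Assumed UUID variant of the benchmark table:
CREATE TABLE bench_table_uuid (id UUID, data TEXT);

INSERT INTO bench_table_uuid (id, data)
SELECT gen_random_uuid(), 'row ' || g
FROM generate_series(1, 1000) g;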

We’re still running on one connection, but now we’re running a bulk insert of 1000 rows instead of fetching the nextval a bunch of times. So what does that look like?

------------------------------------------------------------
Experiment 2: Bulk INSERT with 1000 rows
------------------------------------------------------------

CACHE=1:     6229.11 ms total
CACHE=65536: 83.16 ms total
UUID:        77.24 ms total

Sequence CACHE=65536 vs CACHE=1: 74.9x faster
UUID vs CACHE=1: 80.6x faster

Well that’s a big difference! The reason for this is that incrementing a sequence doesn’t follow the transaction semantics that we have for other values. It would be strange if something like:

BEGIN;
select nextval('my_seq'); -- returns 4
ROLLBACK;
select nextval('my_seq'); -- returns 4 again ??

were to happen. To preserve the expectation that sequences always return a unique, incrementing value, under the hood new values are fetched by special internal transactions. This means that every call to fetch a nextval goes through the DSQL circle of life. With that in mind, the results for the bulk insert on a single connection make sense! For CACHE=1, even though we’re only on one connection, the QP has to go through the full loop for each row: fetching a value from storage, writing back to the journal, and waiting for the transaction to finish before the next value can be read. With a large CACHE value, our QP only needs to do that once. This is on a single-region cluster; on a multi-region cluster the difference would be even more marked, because we’d need to wait for the write to be committed to our second region.

Now that was still on a single connection; what about when we actually want to scale? How do sequences behave when we throw more connections at them? This experiment is the same as experiment one, except now it’s contested: instead of just one connection, let’s create 100, and have each of those fetch 100 next values.
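
Each worker simply issues the same statement in a loop over its own connection; a sketch of the per-worker shape (the actual harness isn’t shown in the post):

-- Run on each of the 100 connections, 100 times per connection:
SELECT nextval('seq_cache_1');     -- CACHE=1 run
SELECT nextval('seq_cache_65536'); -- CACHE=65536 run
SELECT gen_random_uuid();          -- UUID baseline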

------------------------------------------------------------
Experiment 3: Conflict test (100 workers x 100 nextvals each)
------------------------------------------------------------

CACHE=1:
  Total time:  51589.87 ms
  Throughput:  193.8 calls/sec
  Mean:        353.03 ms
  Min:         17.16 ms
  Max:         21182.18 ms
  Errors:      0

CACHE=65536:
  Total time:  3020.29 ms
  Throughput:  3310.9 calls/sec
  Mean:        17.46 ms
  Min:         10.57 ms
  Max:         1538.64 ms
  Errors:      0

UUID:
  Total time:  2902.25 ms
  Throughput:  3445.6 calls/sec
  Mean:        15.00 ms
  Min:         10.86 ms
  Max:         1439.94 ms
  Errors:      0

Throughput speedup with CACHE=65536 vs CACHE=1: 17.1x
Throughput speedup with UUID vs CACHE=1: 17.8x

The throughput difference is again just as marked. In the CACHE=1 case, the majority of internal transactions to fetch a cache value are conflicting. DSQL hides the internal details where these would show up as OCC errors; instead, the contention shows up as latency, as conflicts would in regular Postgres. With high cache values we have almost no contention. Comparing to our baseline of UUIDs, we can see the difference is minimal.

So what do I use?

If you want to use a sequence, our recommendation is to use a high cache value. It’s going to keep up with your scale and avoid becoming a bottleneck in your system. If you really want densely packed sequences and you don’t expect your table to ever run higher than a few transactions per second, then CACHE=1 will work just fine. If you change your mind or see it becoming a blocker down the line, you can always go back and fix it with:

ALTER SEQUENCE my_seq CACHE 65536;

But if you truly don’t want to worry about scale, just use UUIDs:

CREATE TABLE orders (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  --...
);
