Lessons learned operating petabyte-scale ClickHouse clusters: Part II

Original link: https://www.tinybird.co/blog-posts/what-i-learned-operating-clickhouse-part-ii

This article digs into handling read-heavy workloads in ClickHouse, stressing that read performance is tightly coupled with write behavior and ingestion practices. It highlights the importance of managing different kinds of traffic (real-time, long-running, and backfill) and recommends strategies such as load balancing and replica allocation to avoid hurting the latency of real-time queries. The key to optimization is understanding that ClickHouse has no built-in query optimizer, so user-driven query design is essential. The article outlines the best practices: filter on sorting key columns first, use PREWHERE for high-selectivity filters, and defer complex operations such as joins and group bys until the end. Backfilling materialized views requires care; avoid POPULATE because it can lead to duplicated data. Monitoring key cluster metrics (query load, ZooKeeper latency, replication lag, and error logs from the system tables, especially system.query_log, system.processes, and system.part_log) is essential to keeping the system stable. Finally, careful handling of table deletion, materialized views, and certain column types prevents unexpected server problems.

A Hacker News thread discusses the lessons learned from operating petabyte-scale ClickHouse clusters. User riku_iki raised concerns about having to monitor ClickHouse for segfault crashes, suggesting this calls the database's production-grade quality into question. javisantana, likely involved in the original article, responded by acknowledging that crashes do happen, though not frequently, and explained that their wide variety of workloads makes them more likely to hit such issues. He also compared this to Postgres, a system widely considered "production-grade" that also crashes occasionally, arguing that isolated crashes do not necessarily negate the overall quality of a system. The discussion centers on ClickHouse's reliability and operational considerations at scale.

Original article

This is the second part of this series. You can read the first one here

Handling load

This section is mostly about reads. I talked about ingestion in the previous post, and while reads and writes could use the same resources, I'm going to focus on reads in isolation, as if you only had reads.

Well, I lied. Because I'm going to start by telling you: you can't decouple reads and writes. If you see any benchmark that only gives read performance, it may look nice in the benchmark, but that's not true in real life.

Reads depend a lot on how many parts a table has, and the number of parts depends on the ingestion. If you are inserting data often, you'll get penalized while reading no matter what schema your table has (more about this in performance). You can reduce parts by running merges more often, but you'll need more CPU.
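A quick way to see whether parts are getting out of hand is to count active parts per table in system.parts (the query below is a generic sketch; what counts as "too many" depends on your setup):

  SELECT
      database,
      table,
      count() AS active_parts,
      sum(rows) AS total_rows
  FROM system.parts
  WHERE active
  GROUP BY database, table
  ORDER BY active_parts DESC
  LIMIT 10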

When it comes to reads, there are many topics I could cover. I'll start here: 

  1. Handling different kinds of traffic 
  2. Hardware resources
  3. Query design

About handling different kinds of traffic and hardware

In a cluster, you usually have:

  • Real-time traffic, aka queries that need to answer in less than X seconds, usually serving a dashboard or some real-time workload. 
  • Long-running queries, which could be from an ad hoc analysis (like someone who decides they need to know the average of something over 7 years of data)
  • Backfills (this requires its own section).
  • Anything else that's not real time, that is, queries where it doesn't matter how long it takes to run.

Solving this is a never-ending design topic: how do I not let my long-running queries affect my latency on real-time queries, especially the ones over p95? There are many ways to approach it, for example, by having different replicas for different workloads. But that's expensive because you need to allocate hardware for both of them, and while real-time traffic is usually predictable, one-off queries just "happen", and you can't plan ahead. You could be smarter and handle that in the application layer (before sending the query), but you probably have 3-4 different sources of queries: the app, people running clickhouse-client, people running a BI, and people running a BI. Yes, I counted it twice because BI queries are usually really bad.

You could create spot replicas to run those long-running queries, but that needs extra handling and adds complexity. At Tinybird, we handle it with a combination of app logic and a load balancer. We know the status of the replicas, and based on that, we pick the right one to send the query. Sometimes we reject the query to avoid crashing the server.
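One cheap complement to load balancing (and not necessarily how Tinybird does it) is to constrain the non-real-time queries themselves with per-query settings. The settings below are standard ClickHouse settings; the table and values are purely illustrative:

  -- Hypothetical ad hoc query, capped so it can't starve real-time traffic
  SELECT user_id, avg(duration_ms) AS avg_duration
  FROM events
  WHERE created_at >= now() - INTERVAL 7 YEAR
  GROUP BY user_id
  SETTINGS
      max_threads = 2,
      max_execution_time = 600,   -- seconds: kill it if it runs too long
      priority = 10               -- higher number = lower priority than real-time queries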

About query design

Queries have infinite configuration options, but one of the most important is max_threads. It controls how many threads a query can use to read data. In general, real-time queries should use only 1; if you need more than 1-2 and you serve many QPS, you'll need a lot of hardware. You need to understand that, most of the time, your bottleneck here is scan speed. You can keep adding CPUs, but you are still limited by how fast you can scan. A rule of thumb: you can scan about 500 MB/s on a single machine on average. This is, of course, not true for every kind of machine or workload, but it's a good rule of thumb. The other important settings are those that control memory:

  • max_memory_usage
  • max_bytes_before_external_group_by 

Both control how memory is used, and you'll need to control how much memory you give to each query. If you want to run massive GROUP BYs, JOINs, window functions, or anything like that, you want max_memory_usage to be high. If you want those queries to be fast, max_bytes_before_external_group_by should be high too, to avoid ClickHouse dumping partial results to disk and killing performance.
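As a hedged sketch (the table and values are illustrative), this is roughly what tuning those two settings looks like for a heavy aggregation:

  SELECT country, uniqExact(user_id) AS users
  FROM events
  GROUP BY country
  SETTINGS
      max_memory_usage = 20000000000,                  -- ~20 GB allowed for this query
      max_bytes_before_external_group_by = 15000000000 -- spill GROUP BY state to disk only above ~15 GB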

ClickHouse does not have an optimizer; you are the optimizer. And your users don't care about optimizing (that feature or report needs to be done ASAP). Query optimization is an art, but there are some simple rules you have to follow:

  1. Filter on columns in the sorting key first. Sorting key design is super important as well, so pay attention to it. Believe me, you are losing 10-100x on your queries with a bad sorting key design. Spend 2-3 hours understanding the data layout
  2. Run other filters second, and try to use PREWHERE if you have large columns. Move high-selectivity filters on small columns (not Strings, not Arrays) to PREWHERE.
  3. Then run everything that filters data out, mostly IN operations
  4. Save JOINs, GROUP BY, and other complex operations for last.
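Here's a hedged sketch of those rules applied to a hypothetical table sorted by (tenant_id, created_at); the numbers in the comments map to the rules above:

  SELECT event_type, count() AS hits
  FROM events
  PREWHERE status_code = 500                  -- 2) high-selectivity filter on a small column
  WHERE tenant_id = 42                        -- 1) sorting key columns first
    AND created_at >= now() - INTERVAL 1 DAY  -- 1) still the sorting key
    AND event_type IN ('purchase', 'refund')  -- 3) filter more rows out before the heavy work
  GROUP BY event_type                         -- 4) aggregation last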

If you master these rules, I can guarantee you will be in the top p95 of ClickHouse users. Actually, use any LLM and just give it those rules and table schema, and it will do a great job. By the way, we wrote more about these things here.

It's not quite that simple, as you may have guessed; otherwise query planners would be a few lines of code (and they are not). Depending on cardinality and so on, you may need to bend the rules.

So that's Query Design 101, and yes, it is really more about the user than the operator, but you still need to know these things because you are going to be fighting users running bad queries all the time. Even though ClickHouse has a max total memory setting, it can still OOM. OOM is bad; depending on the size of your database, the server could take minutes to come back up (so you also need to design your HA setup properly).

The other setting you want to set is max_concurrent_queries. This can save you when the server is overloaded, so do not remove this value. You may need to change it depending on your workload, but keep it, it's a lifesaver.
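Note that max_concurrent_queries is a server-level setting (it lives in the server config, not in a query). To see how close you are to it, you can watch the number of queries currently executing:

  -- Number of queries executing right now on this replica
  SELECT value
  FROM system.metrics
  WHERE metric = 'Query'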


Backfills

I'm adding this section because it's so painful that if I manage to save even 1 person 1 hour of pain, this article would be worth it.

Let me explain the scenario: you are inserting into a table, that table already has a materialized view, you add another materialized view, and you want to backfill the new one. This sounds simple, but it's not.

You might be tempted to use POPULATE... Don't. It's broken: you can end up with missing or duplicated data. From the official documentation:

We do not recommend using POPULATE, since data inserted in the table during the view creation will not be inserted in it.

If you just run an INSERT INTO ... SELECT * FROM the source table, you risk losing or duplicating data, and you'll need a lot of memory to do it.

So, when backfilling materialized views, here's what to do: use a Null table before the actual table, and increase the rows you push per block to avoid generating thousands of parts.
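A hedged sketch of the Null-table pattern (all names are made up, and the chunking and block-size values depend entirely on your data and version):

  -- 1) A Null table with the same schema as the source: inserts into it store nothing,
  --    but they still trigger any materialized views attached to it
  CREATE TABLE events_null AS events ENGINE = Null;

  -- 2) The backfill MV reads from the Null table and writes into the real target table
  CREATE MATERIALIZED VIEW events_per_day_backfill_mv TO events_per_day AS
  SELECT toDate(created_at) AS day, count() AS events
  FROM events_null
  GROUP BY day;

  -- 3) Backfill in chunks (e.g. one day or one partition at a time), with large blocks
  --    so you don't generate thousands of tiny parts
  INSERT INTO events_null
  SELECT * FROM events
  WHERE created_at >= '2024-01-01' AND created_at < '2024-01-02'
  SETTINGS min_insert_block_size_rows = 10000000;

The equivalent MV on the live table (for new data) is created separately; the Null table only exists to drive the backfill without re-ingesting into the source table.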

I can’t miss the opportunity to explain how to do this in Tinybird: just change the MV SQL and then run tb deploy; we’ll take care of all of this.

Operating the cluster and monitoring

Aside from the regular stuff you monitor (CPU, mem, IO), you also want to track things like:

  • Number of queries running
  • ZooKeeper latencies
  • S3 errors (if you use it)
  • Replication lag
  • DDL queue length (this is the queue that stores what all the replicas need to execute)
  • Merge queue
  • Merges memory
  • Mutations

Things get stuck sometimes, and killing the queries does not always work, so keep an eye on everything.

You should track errors as well. These are the important ones:

  • Read only tables (when you hit this one, you are very likely fucked)
  • ZooKeeper errors
  • Max simultaneous queries.
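Two hedged queries we find useful for exactly those errors:

  -- Replicas that went read-only (usually a ZooKeeper / Keeper problem)
  SELECT database, table
  FROM system.replicas
  WHERE is_readonly;

  -- Server-wide error counters since startup, e.g. TOO_MANY_SIMULTANEOUS_QUERIES
  SELECT name, value, last_error_time
  FROM system.errors
  ORDER BY value DESC
  LIMIT 20;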

You should learn to work with ClickHouse system tables. You can add a lot of metrics and everything, but when things go south, this is what will save you.

  • system.query_log is your bible. This tells you everything about queries. The ProfileEvents column is where you go to understand what a query did. Learn how to use it, master it, and learn the column names.
  • system.processes tells you what's currently running. Useful in fires.
  • system.part_log tells you how parts move, what was merged, and so on.
  • You need to learn about system.tables and system.columns to understand the table shapes.

You may want to enable some others, but those are the critical ones.
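As a starting point (a sketch; the ProfileEvents map syntax assumes a reasonably recent ClickHouse version), here are yesterday's slowest queries and what they actually did:

  SELECT
      query_duration_ms,
      read_rows,
      read_bytes,
      memory_usage,
      ProfileEvents['SelectedMarks'] AS selected_marks,  -- did the sorting key actually prune data?
      query
  FROM system.query_log
  WHERE type = 'QueryFinish'
    AND event_date = yesterday()
  ORDER BY query_duration_ms DESC
  LIMIT 20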

Track segfaults as well. They might happen, and you'll need to identify which query caused them. This does not happen often, and if your workload is stable, I don't think you'll ever see one. But if they do happen and you don't identify the query causing them, your cluster will go down all the time. And you don't want to be on call all the time.

I'm not going to talk too much about tooling: some people use K8s with the Altinity operator, others handle it with scripts, but everybody needs to:

  • Be aware of adding and removing replica nodes to the cluster
  • Be careful dropping table replicas
  • Keep an eye on replication lag when adding a new replica
  • Remain aware of long-running queries when turning off a replica
  • Watch inserts

And so on. At Tinybird, we use a multi-step process to shut a replica down that takes care of all of this (first remove traffic, then wait for in-flight inserts and queries to finish, killing them if needed).

Be careful handling ON CLUSTER operations. As you may know, changing a table in the cluster requires some coordination. For that, ClickHouse uses Zookeeper or ClickHouse Keeper to coordinate replicas. If a replica is down, ON CLUSTER operations will take a lot of time (depending on the config) and may generate issues in your app if timeouts are not properly set (and trust me, they're not).
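For example (hedged; the right timeout depends on your setup), you can cap how long the client waits on a distributed DDL instead of relying on the default:

  -- The ALTER itself is illustrative; the point is the explicit DDL timeout
  ALTER TABLE events ON CLUSTER my_cluster
      ADD COLUMN IF NOT EXISTS country LowCardinality(String)
  SETTINGS distributed_ddl_task_timeout = 60  -- seconds to wait for replicas before returning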

For super-low-latency use cases (under 20ms per query), you need to warm caches (take a look at the settings), because if you start sending traffic to a cold replica, latencies will go up like crazy. Watch the memory available for the mark_cache. There is also an uncompressed cache but, to be honest, we never used it successfully, so let me know if you did.

In general, alert on max_simultaneous_queries, connectivity issues, and queues growing without control. You will add more alerts on other things, but that depends heavily on what your cluster is designed for.

Last note: Altinity's knowledge base (and videos) is probably one of the best repositories to understand how to set up ClickHouse on your own (and in general, all the content that Altinity publishes is really good. A must-read if you handle ClickHouse).

Other stuff

Other random stuff you may find useful:

Table deletion 

By default, ClickHouse refuses to drop any table bigger than 50 GB. This makes total sense, and you may even be tempted to lower the limit to a few MB. There are two kinds of developers/data engineers: 1) the ones who have dropped a table by mistake, and 2) the ones who will. If you drop a table by mistake, you'll need to go to the backups, and that's a pain in the ass (unless you use the engine we developed, of course).

Materialized views

MVs are a killer feature, but they are hard to manage, they lead to memory issues, and they generate a lot of parts. In general, you need to be careful with MVs because people tend to add a lot of things in them. MVs are executed on inserts, so every new MV will make your ingestion slower.

Don't kill your server

Some column types can make your server go south. For example, a uniqExactState over a high-cardinality column (200M+ distinct values) can kill your cluster on merges. That usually happens with aggregating columns, so be careful with them.

index_granularity can also kill your server if it's too low. My recommendation: do not go under 128. Low values are useful for point-like queries.
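index_granularity is set per table at creation time; a minimal sketch with the default value shown:

  CREATE TABLE events
  (
      tenant_id  UInt64,
      created_at DateTime,
      event_type LowCardinality(String)
  )
  ENGINE = MergeTree
  ORDER BY (tenant_id, created_at)
  SETTINGS index_granularity = 8192  -- the default; only go lower for point-lookup-heavy tables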


Final thoughts

I focused a lot on things that can go wrong with ClickHouse, but that doesn't mean ClickHouse is a bad database—quite the opposite. Handling a database at scale is hard; if you have enough experience, you know it.

Anyways, I hope this saves you some headaches. (If you didn't read the first post, here it is again.)
