This is the second part of this series. You can read the first one here.
Handling load
This section is mostly about reads. I talked about ingestion in the previous post, and while reads and writes could use the same resources, I'm going to focus on reads in isolation, as if you only had reads.
Well, I lied, because I'm going to start by telling you: you can't decouple reads and writes. If you see a benchmark that only reports read performance, it may look nice on paper, but it won't hold up in real life.
Reads depend a lot on how many parts a table has, and the number of parts depends on the ingestion. If you are inserting data often, you'll get penalized while reading no matter what schema your table has (more about this in performance). You can reduce parts by running merges more often, but you'll need more CPU.
When it comes to reads, there are many topics I could cover. I'll start here:
- Handling different kinds of traffic
- Hardware resources
- Query design
About handling different kinds of traffic and hardware
In a cluster, you usually have:
- Real-time traffic, aka queries that need to answer in less than X seconds, usually serving a dashboard or some real-time workload.
- Long-running queries, which could be from an ad hoc analysis (like someone who decides they need to know the average of something over 7 years of data).
- Backfills (this requires its own section).
- Anything else that's not real time, that is, queries where it doesn't matter how long they take to run.
Solving this is a never-ending design topic: how do I not let my long-running queries affect my latency on real-time queries, especially the ones over p95? There are many ways to approach it, for example, by having different replicas for different workloads. But that's expensive because you need to allocate hardware for both of them, and while real-time traffic is usually predictable, one-off queries just "happen", and you can't plan ahead. You could be smarter and handle that in the application layer (before sending the query), but you probably have 3-4 different sources of queries: the app, people running clickhouse-client, people running a BI, and people running a BI. Yes, I counted it twice because BI queries are usually really bad.
You could create spot replicas to run those long-running queries, but that needs extra handling and adds complexity. At Tinybird, we handle it with a combination of app logic and a load balancer. We know the status of the replicas, and based on that, we pick the right one to send the query. Sometimes we reject the query to avoid crashing the server.
About query design
Queries have infinite configuration options, but one of the most important ones is `max_threads`. It controls how many threads a query can use to read data. In general, real-time queries should only use 1, and if you need more than 1-2, you'll need a lot of hardware if you have many QPS. You need to understand that, most of the time, your bottleneck here is scan speed. So you can keep adding CPUs, but you are still limited by scan speed. A rule of thumb: you can scan around 500 MB/s on a single machine on average. This is, of course, not true for every kind of machine or workload, but it's a good rule of thumb. The other important settings are those that control memory:
- `max_memory_usage`
- `max_bytes_before_external_group_by`

Both control how memory is used, and you'll need to control how much memory you give to each query. If you want to run massive GROUP BYs, JOINs, window functions, or anything like that, you want `max_memory_usage` to be high. If you want those queries to be fast, `max_bytes_before_external_group_by` should be high to avoid ClickHouse dumping partial results to disk and killing performance.
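A minimal sketch of what tuning these per query looks like (the table and column names are made up, and the values are illustrative, not recommendations):

```sql
SELECT user_id, count() AS events_count
FROM events
GROUP BY user_id
SETTINGS
    max_threads = 1,                                  -- real-time query: keep thread count low
    max_memory_usage = 10000000000,                   -- hard memory limit for this query (~10 GB)
    max_bytes_before_external_group_by = 8000000000;  -- only spill the GROUP BY to disk past ~8 GB
```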
ClickHouse does not have an optimizer; you are the optimizer. And your users don't care about optimizing (that feature or report needs to be done ASAP). Query optimization is an art, but there are some simple rules you have to follow:
- Filter on columns in the sorting key first. Sorting key design is super important as well, so pay attention to it. Believe me, you are losing 10-100x on your queries with a bad sorting key design. Spend 2-3 hours understanding the data layout.
- Run other filters second, and try to use `PREWHERE` if you have large columns. Move high-selectivity filters on small columns (not Strings, not Arrays) to `PREWHERE`.
- Then run everything else that filters data out, mostly `IN` operations.
- Save `JOIN`s, `GROUP BY`, and other complex operations for last.
If you master these rules, I can guarantee you'll be in the top 5% of ClickHouse users. Actually, give any LLM those rules plus your table schema, and it will do a great job. By the way, we wrote more about these things here.
It's not quite that simple, as you may have guessed; otherwise query planners would be a few lines of code (and they are not). Depending on cardinality and other factors, you may need to bend the rules.
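Here's a hypothetical query that follows the rules above, assuming a table sorted by (tenant_id, timestamp); all the names are invented for the example:

```sql
SELECT device, count() AS hits
FROM events
PREWHERE status = 500                      -- selective filter on a small column
WHERE tenant_id = 42                       -- sorting key columns first
  AND timestamp >= now() - INTERVAL 1 DAY  -- also part of the sorting key
  AND device IN ('ios', 'android')         -- cheap IN filter before the heavy work
GROUP BY device                            -- aggregation last
ORDER BY hits DESC;
```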
So that's Query Design 101, and yes, it is really more about the user than the operator, but you still need to know these things because you are going to be fighting users running bad queries all the time. Even though ClickHouse has a max total memory config, it can still OOM. OOM is bad; depending on your database, the server could take minutes to come back up (so you also need to design your HA setup properly).
The other setting you want to set is `max_concurrent_queries`. This can save you when the server is overloaded, so do not remove this value. You may need to change it depending on your workload, but keep it; it's a lifesaver.
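There's no magic formula for the right value, but a quick look at `system.processes` tells you how much headroom a replica has at any given moment:

```sql
-- How many queries are running right now vs. whatever you set
-- max_concurrent_queries to in the server config.
SELECT count() AS running_queries
FROM system.processes;
```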
Backfills
I'm adding this section because backfills are so painful that if I manage to save even one person one hour of pain, this article will have been worth it.
Let me explain the scenario: you are inserting into a table, that table has a materialized view, you add another materialized view, and you want to backfill it. This sounds simple, but it's not.
You might be tempted to use `POPULATE`... Don't. It's broken: you can end up with missing or duplicated data. From the official documentation:
We do not recommend using POPULATE, since data inserted in the table during the view creation will not be inserted in it.
If you run an `INSERT INTO SELECT * FROM table`, you risk losing or duplicating data, and you'll need a lot of memory to do it.
So, when backfilling materialized views, here's what to do: use a Null table before the actual table, and increase the rows you push per block to avoid generating thousands of parts.
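A minimal sketch of that pattern (all the table, view, and column names here are invented for the example, and the target table `events_by_day` is assumed to already exist):

```sql
-- 1. A Null-engine clone of the source table: rows inserted into it are
--    discarded, but materialized views attached to it still fire.
CREATE TABLE events_null AS events ENGINE = Null;

-- 2. Attach a copy of the new MV to the Null table, writing into the real target.
CREATE MATERIALIZED VIEW events_by_day_backfill_mv TO events_by_day AS
SELECT toDate(timestamp) AS day, count() AS c
FROM events_null
GROUP BY day;

-- 3. Backfill by pushing historical data through the Null table in big
--    blocks, so you don't generate thousands of tiny parts.
INSERT INTO events_null
SELECT *
FROM events
WHERE timestamp < '2025-01-01'
SETTINGS min_insert_block_size_rows = 10000000, max_insert_threads = 4;
```

Once the backfill is done, you drop the Null table and the temporary MV, and new data keeps flowing through the MV attached to the real source table.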
I can't miss the opportunity to explain how to do this in Tinybird: just change the MV SQL and then `tb deploy`, and we'll take care of all of this.
Operating the cluster and monitoring
Aside from the regular stuff you monitor (CPU, mem, IO), you also want to track things like:
- Number of queries running
- ZooKeeper latencies
- S3 errors (if you use it)
- Replication lag
- DDL queue length (this is the queue that stores what all the replicas need to execute)
- Merge queue
- Merges memory
- Mutations
Things get stuck sometimes, and killing the queries does not always work, so keep an eye on everything.
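Most of those come straight from system tables. A few queries along these lines (column names are from recent ClickHouse versions, so double-check against yours):

```sql
-- Replication lag and queue size per replicated table
SELECT database, table, absolute_delay, queue_size
FROM system.replicas
ORDER BY absolute_delay DESC;

-- Mutations that haven't finished (or are stuck)
SELECT database, table, mutation_id, create_time, latest_fail_reason
FROM system.mutations
WHERE NOT is_done;

-- Merges in flight and the memory they're using
SELECT database, table, elapsed, memory_usage
FROM system.merges;
```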
You should track errors as well. These are the important ones:
- Read-only tables (when you hit this one, you are very likely fucked)
- ZooKeeper errors
- Max simultaneous queries.
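`system.errors` keeps a counter per error code since the server started, which makes it an easy thing to scrape (the exact error names and columns may vary slightly across versions):

```sql
-- Which errors have fired on this replica, most frequent first.
-- Watch for TABLE_IS_READ_ONLY, KEEPER_EXCEPTION and
-- TOO_MANY_SIMULTANEOUS_QUERIES in particular.
SELECT name, code, value, last_error_time
FROM system.errors
ORDER BY value DESC
LIMIT 20;
```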
You should learn to work with ClickHouse system tables. You can add a lot of metrics and everything, but when things go south, this is what will save you.
- `system.query_log` is your bible. It tells you everything about queries. The `ProfileEvents` column is where you go to understand what a query did. Learn how to use it, master it, and learn the column names.
- `system.processes` tells you what's currently running. Useful in fires.
- `system.part_log` tells you how parts move, what was merged, and so on.
- You need to learn about `system.tables` and `system.columns` to understand the table shapes.
You may want to enable some others, but those are the critical ones.
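For example, something like this against `system.query_log` gives you the heaviest queries of the last hour (assuming a recent version where `ProfileEvents` is a Map column):

```sql
SELECT
    query_id,
    query_duration_ms,
    formatReadableSize(memory_usage) AS mem,
    ProfileEvents['SelectedRows'] AS rows_read,
    ProfileEvents['SelectedBytes'] AS bytes_read,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY memory_usage DESC
LIMIT 10;
```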
Track segfaults as well. They might happen, and you'll need to identify which query caused it. This does not happen often, and if your workload is stable, I don't think you'll ever see one. But, if they happen, and you don't identify the query causing it, your cluster will go down all the time. And you don't want to be on call all the time.
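One way to narrow it down, assuming the query log was flushed before the crash: look for queries that started but never logged a finish or an exception.

```sql
-- Queries with a QueryStart event but no QueryFinish / Exception event
-- around the time the server died are the usual suspects.
SELECT query_id, event_time, user, query
FROM system.query_log
WHERE type = 'QueryStart'
  AND event_time > now() - INTERVAL 1 DAY
  AND query_id NOT IN
  (
      SELECT query_id
      FROM system.query_log
      WHERE type != 'QueryStart'
        AND event_time > now() - INTERVAL 1 DAY
  )
ORDER BY event_time DESC;
```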
I'm not going to talk too much about tooling; some people use K8s with the Altinity operator, others handle it with scripts, but everybody needs to:
- Know how to add and remove replica nodes from the cluster
- Be careful dropping table replicas
- Keep an eye on replication lag when adding a new replica
- Remain aware of long-running queries when turning off a replica
- Watch inserts
And so on. At Tinybird, we have a multi-step process to shut a replica down that takes care of all of this: first remove traffic, then wait for in-flight inserts and queries, killing them if needed.
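The draining part of that, roughly, in SQL (the exact thresholds are up to you):

```sql
-- What's still running on the replica you're about to shut down...
SELECT query_id, user, elapsed, query
FROM system.processes
ORDER BY elapsed DESC;

-- ...and, once traffic has been moved away, kill whatever refuses to finish.
KILL QUERY WHERE elapsed > 60 SYNC;
```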
Be careful handling `ON CLUSTER` operations. As you may know, changing a table across the cluster requires some coordination. For that, ClickHouse uses ZooKeeper or ClickHouse Keeper to coordinate replicas. If a replica is down, `ON CLUSTER` operations will take a lot of time (depending on the config) and may generate issues in your app if timeouts are not properly set (and trust me, they're not).
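If you can't guarantee every replica is up, it helps to set the DDL timeout explicitly and tell ClickHouse not to throw when a replica doesn't answer in time. A hypothetical example (the table, column, and cluster names are made up):

```sql
ALTER TABLE events ON CLUSTER my_cluster
    ADD COLUMN IF NOT EXISTS country LowCardinality(String)
SETTINGS
    distributed_ddl_task_timeout = 60,                      -- don't wait forever for lagging replicas
    distributed_ddl_output_mode = 'null_status_on_timeout'; -- return instead of throwing on timeout
```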
For super-low-latency use cases (under 20ms per query), you need to warm caches (take a look at the settings), because if you start sending traffic to a cold replica, latencies will go up like crazy. Watch the memory available for the `mark_cache`. There is also an uncompressed cache, but, to be honest, we never used it successfully, so let me know if you did.
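One low-tech way to warm a new replica: replay a handful of representative production queries against it before the load balancer sends it real traffic, and check how full the mark cache is getting:

```sql
-- Mark cache occupancy on this replica (metric names from recent versions).
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric IN ('MarkCacheBytes', 'MarkCacheFiles');
```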
In general, alert on `max_simultaneous_queries`, connectivity issues, and queues growing without control. You will add more alerts on other things, but that depends heavily on what your cluster is designed for.
Last note: Altinity's knowledge base (and videos) is probably one of the best repositories to understand how to set up ClickHouse on your own (and in general, all the content that Altinity publishes is really good. A must-read if you handle ClickHouse).
Other stuff
Other random stuff you may find useful:
Table deletion
By default, ClickHouse does not let you drop any table larger than 50 GB. This makes total sense, and you may even be tempted to lower the limit to a few MB. There are two kinds of developers/data engineers: 1) the ones who have dropped a table by mistake, and 2) the ones who will. If you drop a table by mistake, you'll need to go to the backups, and that's a pain in the ass (unless you use the engine we developed, of course).
Materialized views
MVs are a killer feature, but they are hard to manage, they lead to memory issues, and they generate a lot of parts. In general, you need to be careful with MVs because people tend to add a lot of things in them. MVs are executed on inserts, so every new MV will make your ingestion slower.
Don't kill your server
Some column types can make your server go south. For example, a `uniqExactState` over a column with high cardinality (200M+ distinct values) can kill your cluster during merges. That usually happens with aggregating columns, so be careful with them.

`index_granularity` can also kill your server if it's too low. My recommendation: do not go under 128. Low values are only useful for point-like queries.
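A hedged sketch of both points (names invented): `uniqCombinedState` is usually a much cheaper approximation than `uniqExactState` for very high-cardinality columns, and `index_granularity` stays at the 8192 default unless you have a strong reason to lower it.

```sql
CREATE TABLE user_uniques_by_day
(
    day Date,
    uniques AggregateFunction(uniqCombined, UInt64)  -- approximate, but much cheaper to merge
)
ENGINE = AggregatingMergeTree
ORDER BY day
SETTINGS index_granularity = 8192;  -- the default; don't go below ~128
```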
Final thoughts
I focused a lot on things that can go wrong with ClickHouse, but that doesn't mean ClickHouse is a bad database—quite the opposite. Handling a database at scale is hard; if you have enough experience, you know it.
Anyways, I hope this saves you some headaches. (If you didn't read the first post, here it is again.)