Load Test GlassFlow for ClickHouse: Real-Time Dedup at Scale

Original link: https://www.glassflow.dev/blog/load-test-glass-flow-for-click-house-real-time-deduplication-at-scale

GlassFlow is a real-time streaming ETL engine for Kafka and ClickHouse, and we put it through a rigorous performance test simulating a realistic deduplication pipeline. The open-source test environment runs on Docker on a MacBook Pro and processes synthetic user-event data with a 10% duplicate rate, with deduplication enabled over an 8-hour window, to evaluate performance under high load. The results: GlassFlow kept up with up to 55,000 records/sec published into Kafka and processed more than 9,000 records/sec with deduplication enabled, at sub-0.12 ms latency. Even with 20 million records and 12 concurrent publishers, the system stayed stable, with no crashes, message loss, or reordering. Throughput was bounded by local machine resources, but performance stayed consistent: latency remained low, and lag grew in proportion to ingestion rate and data volume, i.e., predictably. The test confirms that GlassFlow handles real-time pipelines, and deduplication in particular, without sacrificing performance, making it suitable for correctness-critical analytics. The test environment is open source, so anyone can reproduce it.

GlassFlow, an open-source streaming ETL tool for deduplicating and joining Kafka streams to ClickHouse, shared load test results on Hacker News. The test, performed on a MacBook Pro (M2 Max), processed 20 million records with Kafka producing 55,000 records/sec and GlassFlow achieving a deduplication rate of 9,000+ records/sec at <0.12ms end-to-end latency. The founders emphasized the reproducibility of the test, providing setup and results documentation. While some users questioned the single-machine setup and relatively low throughput compared to potential workloads, the GlassFlow team clarified their focus on Kafka to ClickHouse pipelines with exactly-once guarantees, a feature not natively supported by Kafka. They aim to provide a more efficient alternative to tools like Flink while offering resilience. They acknowledged the feedback and are developing a Kubernetes-ready, horizontally scalable version for higher throughput, promising to share those results soon.

Original article

By Ashish Bagri, Co-founder & CTO of GlassFlow

TL;DR

  • We tested GlassFlow on a real-world deduplication pipeline with Kafka and ClickHouse.
  • It handled 55,000 records/sec published to Kafka and processed 9,000+ records/sec on a MacBook Pro, with sub-0.12ms latency.
  • No crashes, no message loss, no disordering. Even with 20M records and 12 concurrent publishers, it remained robust.
  • Want to try it yourself? The full test setup is open source: https://github.com/glassflow/clickhouse-etl-loadtest and the docs https://docs.glassflow.dev/load-test/setup

Why this test?

ClickHouse is incredible at fast analytics. But when building real-time pipelines from Kafka to ClickHouse, many teams run into the same issues: analytics results are incorrect or too delayed to support real-time use cases.

The root cause? Data duplications and slow joins. They are often introduced by retries, offset reprocessing, or downstream enrichment. These problems can affect both correctness and performance.

That’s why we built GlassFlow: A real-time streaming ETL engine designed to process Kafka streams before data hits ClickHouse.

After launching the product, we often received the question, “How does it perform at high loads?”

With this post, we want to give a clear and reproducible answer. This article walks through what we tested, how we set it up, and what we found when testing deduplication with GlassFlow.

What is GlassFlow

GlassFlow is an open-source streaming ETL service developed specifically for ClickHouse. It is a real-time stream processing solution designed to simplify data pipeline creation and management between Kafka and ClickHouse. It supports:

  • Real-time deduplication (configurable window, event ID based)
  • Stream joins between topics
  • Exactly-once semantics
  • Native ClickHouse sink with efficient batching and buffering

GlassFlow handles the hard parts: state, ordering, retries and batching.

More about GlassFlow in our previous HN post: https://news.ycombinator.com/item?id=43953722

Before we dive in, here’s what you should know about how we ran the test.

Data Used: Simulating a Real-World Use Case

For this benchmark, we use synthetic data that simulates a real-world use case: logging user events in an application.

Each record represents an event triggered by a user, similar to what you'd see in analytics or activity tracking systems.

Here's the schema:

| Field | Type | Description |
|---|---|---|
| event_id | UUID (v4) | Unique ID for the event |
| user_id | UUID (v4) | Unique ID for the user |
| name | String | Full name of the user |
| email | String | User's email address |
| created_at | Datetime (%Y-%m-%d %H:%M:%S) | Timestamp of when the event occurred |

This structure helps simulate insert-heavy workloads and time-based queries—perfect for testing how GlassFlow performs with ClickHouse in a realistic, high-volume setting.
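To make this concrete, here is a minimal sketch of how such events could be generated, assuming the schema above and the 10% duplicate rate used in the test. The Faker library and the function names here are illustrative choices, not the load-test repo's actual code.

```python
# Illustrative generator for synthetic user events (not the repo's actual code).
import random
import uuid
from datetime import datetime

from faker import Faker  # assumed here, for realistic names/emails

fake = Faker()

def generate_events(total_records: int, duplication_rate: float = 0.1) -> list[dict]:
    events: list[dict] = []
    for _ in range(total_records):
        if events and random.random() < duplication_rate:
            # Re-emit an earlier event verbatim to simulate retries/replays.
            events.append(events[random.randrange(len(events))])
        else:
            events.append({
                "event_id": str(uuid.uuid4()),
                "user_id": str(uuid.uuid4()),
                "name": fake.name(),
                "email": fake.email(),
                "created_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            })
    return events
```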

Infrastructure Setup

[Figure: Infrastructure setup]

For this benchmark, we ran the load test locally using Docker to simulate the entire data pipeline. The setup included:

  • Kafka: Running in a Docker container to handle event streaming.
  • ClickHouse: Also containerized, serving as the storage layer.
  • GlassFlow ETL: Deployed in Docker, responsible for processing messages from Kafka and writing them to ClickHouse.

While the setup supports running against cloud-hosted Kafka and ClickHouse, we chose to keep everything local to maintain control over the environment and ensure consistent test conditions.

Each test run automatically creates the necessary Kafka topics and ClickHouse tables before starting, and cleans them up afterward. This keeps the environment clean between runs and ensures reproducible results.
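For illustration, the per-run topic and table lifecycle could look roughly like the sketch below; the topic/table names, partition count, and table engine are assumptions, not the repo's actual scripts.

```python
# Hypothetical per-run setup/teardown (names and settings are assumptions).
from confluent_kafka.admin import AdminClient, NewTopic
import clickhouse_connect

KAFKA_CONF = {"bootstrap.servers": "localhost:9092"}
TOPIC = "user_events"  # assumed topic name

def setup() -> None:
    admin = AdminClient(KAFKA_CONF)
    # create_topics() returns {topic: future}; block until creation finishes.
    for future in admin.create_topics(
        [NewTopic(TOPIC, num_partitions=12, replication_factor=1)]
    ).values():
        future.result()
    ch = clickhouse_connect.get_client(host="localhost", port=8123)
    ch.command("""
        CREATE TABLE IF NOT EXISTS user_events (
            event_id UUID,
            user_id UUID,
            name String,
            email String,
            created_at DateTime
        ) ENGINE = MergeTree ORDER BY created_at
    """)

def teardown() -> None:
    admin = AdminClient(KAFKA_CONF)
    for future in admin.delete_topics([TOPIC]).values():
        future.result()
    ch = clickhouse_connect.get_client(host="localhost", port=8123)
    ch.command("DROP TABLE IF EXISTS user_events")
```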

Resources Used for Testing

The load tests were conducted on a MacBook Pro with the following specifications:

| Specification | Details |
|---|---|
| Model Name | MacBook Pro |
| Model Identifier | Mac14,5 |
| Model Number | MPHG3D/A |
| Chip | Apple M2 Max |
| Total Number of Cores | 12 (8 performance and 4 efficiency) |
| Memory | 32 GB |

Additional Assumptions

Furthermore, to push our implementation to the limits, we do the following:

  1. We use an example where the incoming data contains some amount of duplication (10%, to be exact) and needs to be deduplicated.
  2. We perform incremental tests with growing data volume at each step (starting from 5 million records and working our way up to 20 million records).
  3. Beyond that, we also vary several parameters and observe how each change impacts overall performance.

So, let’s start with the actual test.

We created a load test repo so you can run this benchmark yourself in minutes (https://github.com/glassflow/clickhouse-etl-loadtest). Using it, we ran a series of local load tests that mimicked a real-time streaming setup. The goal was simple: push a steady stream of user event data through a Kafka → GlassFlow → ClickHouse pipeline and observe how well it performs with meaningful data transformations applied along the way.

Pipeline Configuration

[Figure: Pipeline configuration]

The setup followed a typical streaming architecture:

  • Kafka handled the event stream, fed by synthetic user activity.
  • GlassFlow processed the stream in real time, applying transformations before passing it downstream.
  • ClickHouse served as the destination where all processed data was written and later queried.

Each test run spun up its own Kafka topics and ClickHouse tables automatically. Everything was cleaned up once the run was complete, leaving no leftover state. This kept the environment fresh and the results reliable.

Transformations Applied

[Figure: GlassFlow deduplication between Kafka and ClickHouse]

As discussed in the previous section, to make the test more realistic, we applied a deduplication transformation using the event_id field. The goal was to simulate a scenario where events could be sent more than once due to retries or upstream glitches. The deduplication logic looked for repeated events within an 8-hour window and dropped the duplicates before they hit ClickHouse.
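Conceptually, event_id-based deduplication over a time window boils down to the following. This is a minimal in-memory sketch of the idea, not GlassFlow's actual implementation:

```python
# Minimal in-memory sketch of windowed, event_id-based dedup (not GlassFlow internals).
import time

class WindowedDeduplicator:
    def __init__(self, window_seconds: float = 8 * 3600):  # 8h window, as in the test
        self.window = window_seconds
        self.seen: dict[str, float] = {}  # event_id -> last-seen unix time

    def is_duplicate(self, event_id: str) -> bool:
        now = time.time()
        # Naive O(n) eviction; a production engine would use a TTL'd state store.
        self.seen = {eid: ts for eid, ts in self.seen.items() if now - ts < self.window}
        if event_id in self.seen:
            return True
        self.seen[event_id] = now
        return False

# Usage: forward only events for which is_duplicate(event["event_id"]) is False.
```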

No complex joins or filters were applied in this run, keeping the focus on how well GlassFlow could handle high event volumes and real-time processing with exactly-once semantics.

Monitoring and Observability Setup

Throughout the test, we kept a close eye on key performance metrics:

  • Throughput — Events processed per second, from Kafka to ClickHouse.
  • Latency — Time taken from ingestion to storage.
  • Kafka Lag — How far behind the processor was from the latest Kafka event.
  • CPU & Memory Usage — For each component in the pipeline.

These metrics were visualized using pre-built Grafana dashboards that gave a live view into system behavior, which was especially useful for spotting bottlenecks and confirming whether back pressure or resource constraints were kicking in.
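As one example of the kind of signal behind those dashboards, Kafka consumer lag can be read directly from the broker. A sketch with confluent-kafka follows; the topic name and consumer group id are assumptions:

```python
# Sketch: read consumer lag for one partition (topic/group id are assumptions).
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "glassflow-etl",  # assumed consumer group
    "enable.auto.commit": False,
})

tp = TopicPartition("user_events", 0)
low, high = consumer.get_watermark_offsets(tp, timeout=10)
committed = consumer.committed([tp], timeout=10)[0].offset
# A never-committed group reports an invalid (negative) offset.
lag = high - committed if committed >= 0 else high - low
print(f"partition 0 lag: {lag} messages")
consumer.close()
```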


Test Execution

We ran multiple test iterations, each processing between 5 and 20 million records, with parallelism levels ranging from 2 to 12 workers. Around 10% of the events were duplicates, which exercised the deduplication mechanism effectively. Additionally, we set up various configurable parameters that allowed us to test the limits of GlassFlow:

| Parameter | Required/Optional | Description | Example Range/Values | Default |
|---|---|---|---|---|
| num_processes | Required | Number of parallel processes | 1-N (step: 1) | - |
| total_records | Required | Total number of records to generate | 5,000,000-20,000,000 (step: 500,000) | - |
| duplication_rate | Optional | Rate of duplicate records | 0.1 (10% duplicates) | 0.1 |
| deduplication_window | Optional | Time window for deduplication | ["1h", "4h"] | "8h" |
| max_batch_size | Optional | Max batch size for the sink | [5000] | 5000 |
| max_delay_time | Optional | Max delay time for the sink | ["10s"] | "10s" |

For each parameter, you can either define a fixed value or go a step further and define a range, to run multiple combinations of the test using the configured values. Here is a sample of the configuration you can set up when using our repository:
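(The repository's exact file format may differ; the sketch below expresses the same sweep as a Python dict, with values taken from the parameter table above.)

```python
# Hypothetical sweep definition mirroring the parameter table above;
# the load-test repo's actual config format may differ.
from itertools import product

config = {
    "num_processes": [2, 4, 6, 8, 10, 12],                  # parallel Kafka publishers
    "total_records": [5_000_000, 10_000_000, 15_000_000, 20_000_000],
    "duplication_rate": 0.1,                                # fixed: 10% duplicates
    "deduplication_window": "8h",
    "max_batch_size": 5000,
    "max_delay_time": "10s",
}

# Every (num_processes, total_records) pair becomes one test run,
# matching the 6 x 4 grid of results reported below.
runs = list(product(config["num_processes"], config["total_records"]))
print(len(runs))  # 24 combinations
```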

Each test ran until all records were processed and the pipeline drained completely. By the end, we had a clear picture of how throughput and latency scaled with load, and how stable the system remained under pressure.
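As a sketch of what "drained completely" means in practice, one can poll ClickHouse until the expected number of unique rows has landed; the client setup, table name, and expected-count formula here are assumptions:

```python
# Illustrative drain check (table name and expected-count formula are assumptions).
import time
import clickhouse_connect

def wait_for_drain(total_records: int, duplication_rate: float = 0.1,
                   poll_seconds: int = 5) -> int:
    # With a 10% duplicate rate and dedup enabled, ~90% of records should land.
    expected = int(total_records * (1 - duplication_rate))
    ch = clickhouse_connect.get_client(host="localhost", port=8123)
    while True:
        count = ch.command("SELECT count() FROM user_events")
        if count >= expected:
            return count
        time.sleep(poll_seconds)
```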

With the setup complete, let’s look at the results.

We ran this benchmark using the same GlassFlow pipeline across all test sets, varying the parameters as shown above. Here are the GlassFlow pipeline configurations we used:

| Parameter | Value |
|---|---|
| Duplication Rate | 0.1 |
| Deduplication Window | 8h |
| Max Delay Time | 10s |
| Max Batch Size (GlassFlow Sink - ClickHouse) | 5000 |

Now, as discussed above, we looked at the performance metrics to gauge how GlassFlow performs. Across all our tests, both CPU and memory usage on our Mac remained stable and efficient, even during extended test runs.

So, here are the results that we obtained:

| Variant ID | #Records (millions) | #Kafka Publishers (num_processes) | Source RPS in Kafka (records/s) | GlassFlow RPS (records/s) | Average Latency (ms) | Lag (sec) |
|---|---|---|---|---|---|---|
| load_9fb6b2c9 | 5.0 | 2 | 8705 | 8547 | 0.117 | 10.1 |
| load_0b8b8a70 | 10.0 | 2 | 8773 | 8653 | 0.1156 | 15.04 |
| load_a7e0c0df | 15.0 | 2 | 8804 | 8748 | 0.1143 | 10.04 |
| load_bd0fdf39 | 20.0 | 2 | 8737 | 8556 | 0.1169 | 47.74 |
| load_1542aa3b | 5.0 | 4 | 17679 | 9189 | 0.1088 | 260.55 |
| load_a85a4c42 | 10.0 | 4 | 17738 | 9429 | 0.1061 | 495.97 |
| load_5efd111b | 15.0 | 4 | 17679 | 9341 | 0.1071 | 756.49 |
| load_23da167d | 20.0 | 4 | 17534 | 9377 | 0.1066 | 991.77 |
| load_883b39a0 | 5.0 | 6 | 25995 | 8869 | 0.1128 | 370.57 |
| load_b083f89f | 10.0 | 6 | 26226 | 9148 | 0.1093 | 710.97 |
| load_462558f4 | 15.0 | 6 | 26328 | 9191 | 0.1088 | 1061.44 |
| load_254adf29 | 20.0 | 6 | 26010 | 8391 | 0.1192 | 1613.62 |
| load_0c3fdefc | 5.0 | 8 | 34384 | 8895 | 0.1124 | 415.78 |
| load_3942530b | 10.0 | 8 | 33779 | 8747 | 0.1143 | 846.26 |
| load_d2c1783c | 15.0 | 8 | 34409 | 9067 | 0.1103 | 1217.37 |
| load_febf151f | 20.0 | 8 | 35135 | 9121 | 0.1096 | 1622.75 |
| load_993c0bc5 | 5.0 | 10 | 40256 | 8757 | 0.1142 | 445.76 |
| load_022e44e5 | 10.0 | 10 | 38715 | 8687 | 0.1151 | 891.8 |
| load_0adbae83 | 15.0 | 10 | 39820 | 8694 | 0.115 | 1347.66 |
| load_77d67ac7 | 20.0 | 10 | 40458 | 8401 | 0.119 | 1885.24 |
| load_af120520 | 5.0 | 12 | 37691 | 8068 | 0.124 | 485.95 |
| load_c9424931 | 10.0 | 12 | 45743 | 8610 | 0.1161 | 941.66 |
| load_ee837ca6 | 15.0 | 12 | 45539 | 8605 | 0.1162 | 1412.48 |
| load_ac40b143 | 20.0 | 12 | 49005 | 8878 | 0.1126 | 1843.61 |
| load_675d04f3 | 5.0 | 12 | 40382 | 8467 | 0.1181 | 465.66 |
| load_28956d50 | 10.0 | 12 | 55829 | 8018 | 0.1247 | 1066.62 |
💡 Note: The last two tests (load_675d04f3 and load_28956d50) use a higher records-per-second value to see how it would impact performance.

Before we analyze these results, let's take a look at a few visualizations we created to get a better idea of how GlassFlow actually performed:

[Figures: visualizations of the load test results]

After running a series of sustained load tests, the results gave a clear picture of how GlassFlow behaves under pressure, and the performance was impressive across the board. Here's what stood out:

  1. Throughout the test, the system remained rock-solid, even when pushing up to 55,000 records per second into Kafka. There were no crashes, memory leaks, or failures. GlassFlow handled deduplication flawlessly, consistently filtering out repeated events without missing a beat. No message loss or disordering was observed, which speaks volumes about the reliability of the pipeline.
  2. GlassFlow's processing rate remained stable under varying loads. In the current setup (running inside a Docker container on a local machine), the system consistently processed over 9,000 records per second.

However, this peak appears to be more a reflection of available system resources (CPU and memory) than a limitation of GlassFlow itself. With more powerful hardware or a scaled-out deployment (a cloud deployment, for instance), this ceiling could likely be pushed higher.

  3. Lag in the pipeline, measured as the time difference between event ingestion into Kafka and its appearance in ClickHouse, was closely tied to two factors:

  • Ingestion Rate: Higher Kafka ingestion RPS naturally led to higher lag, especially when it exceeded the 9,000 RPS GlassFlow could sustain.
  • Volume of Data: For a fixed RPS, increasing the total number of events extended the lag over time, which was expected as the buffer filled up.

In other words, once Kafka was producing faster than GlassFlow could consume, the lag started to climb. This is normal in streaming systems and highlights where autoscaling or distributed processing would come into play in a production setup.
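The results table is consistent with a simple back-of-the-envelope model: publishing N records takes roughly N/source_rps seconds, draining them takes N/glassflow_rps seconds, and the final lag is approximately the difference. Checking this against run load_23da167d from the table above:

```python
# Back-of-the-envelope lag model, checked against run load_23da167d above.
def expected_lag(total_records: int, source_rps: float, glassflow_rps: float) -> float:
    publish_time = total_records / source_rps    # seconds for publishers to finish
    drain_time = total_records / glassflow_rps   # seconds for GlassFlow to process all
    return drain_time - publish_time             # how far the sink trails the source

print(expected_lag(20_000_000, 17_534, 9_377))  # ~992 s, vs. 991.77 s measured
```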

So, to summarize the above interpretations, here are my final takeaways:

  • GlassFlow remained stable and consistent under high event rates.
  • Processing throughput maxed out at ~9K RPS, limited by local machine resources.
  • Processing latency remained extremely low (<0.12ms). Even at peak load and max event volume (20M records), latency didn’t spike.
  • Lag increased proportionally with ingestion rate and event volume: no surprises, but a clear signal for where scaling would help.

Hence, it’s fair to say that these results give us a lot of confidence in using GlassFlow for real-time event pipelines, especially when paired with a scalable backend like ClickHouse.

The above test shows that GlassFlow is indeed a great tool for real-time stream processing with ClickHouse, and that it integrates seamlessly with Kafka. Deduplication does not compromise performance, making GlassFlow suitable for correctness-critical analytics use cases.

Now, it’s time for you to get your hands dirty and create your own tests using our load test repository. Here is the link to the repo again for your reference: https://github.com/glassflow/clickhouse-etl-loadtest.
