Show HN: DuckDB for Kafka Stream Processing

Original link: https://sql-flow.com/docs/tutorials/intro/

## SQLFlow: Stream Processing on Kafka

This guide walks through setting up SQLFlow, a stream processor that runs SQL queries against Kafka streams. First, install the dependencies (`pip install -r requirements.txt`) and pull the Docker image (`docker pull turbolytics/sql-flow:latest`). Before processing live data, you can test a configuration against the provided fixtures via `docker run` and the `invoke` command to confirm the SQL logic is correct. To run SQLFlow as a stream processor, start a Kafka instance (`docker-compose -f dev/kafka-single.yml up -d`) and publish test data to a Kafka topic (`python3 cmd/publish-test-data.py`). Then start SQLFlow with `docker run`, pointing it at your configuration file and Kafka brokers. Finally, consume the output topic with the Kafka console consumer (`docker exec -it kafka1 kafka-console-consumer...`) to view the processed output. The goal is a working stream processor that reads and processes Kafka data in under 5 minutes.

## SQLFlow: Lightweight Stream Processing on DuckDB

SQLFlow is a new stream processing engine built around DuckDB that aims to simplify work which has traditionally required complex JVM systems. It targets tens of thousands of messages per second on minimal resources (about 250 MiB of memory) and leverages DuckDB's rich connector ecosystem. The core idea is to use SQL to define pipelines that process "batches" of data from Kafka. Although this looks like batch processing, SQLFlow achieves streaming by flowing data through continuously with a configurable batch size (as small as one message at a time), providing at-least-once delivery. Developers are excited about its potential, especially for use cases where a full solution like Flink is too heavy. Compared with projects such as Tributary, SQLFlow focuses on being a complete service for production environments, including DevOps tooling (metrics, testing, pipelines-as-code). Current limitations include the lack of stream-to-stream join support, a feature the developer is considering prioritizing. For more information, see [https://sql-flow.com/](https://sql-flow.com/) and the project's GitHub repository: [https://github.com/turbolytics/sql-flow](https://github.com/turbolytics/sql-flow).
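To make the batch-over-stream idea concrete, here is a minimal sketch of such a processing loop, assuming confluent-kafka, a local broker, and the tutorial's topic names and `city` field. This illustrates the technique only; it is not SQLFlow's actual implementation:

```python
# Minimal batch-over-stream sketch (NOT SQLFlow's actual code): consume a
# small batch from Kafka, run SQL over it with DuckDB, emit the results,
# then commit offsets. Topic names, broker address, and the "city" field
# are assumptions taken from the tutorial's example.
import json

import duckdb
import pandas as pd
from confluent_kafka import Consumer, Producer

BATCH_SIZE = 1000  # configurable; can be as small as one message per batch

consumer = Consumer({
    "bootstrap.servers": "localhost:29092",
    "group.id": "sqlflow-sketch",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,  # commit manually, after output is produced
})
consumer.subscribe(["input-simple-agg-mem"])
producer = Producer({"bootstrap.servers": "localhost:29092"})

while True:
    msgs = consumer.consume(num_messages=BATCH_SIZE, timeout=1.0)
    records = [json.loads(m.value()) for m in msgs if m.error() is None]
    if not records:
        continue

    batch = pd.DataFrame(records)  # DuckDB can query this DataFrame by name
    result = duckdb.sql(
        "SELECT city, count(*) AS city_count FROM batch GROUP BY city"
    ).fetchall()

    for city, city_count in result:
        producer.produce(
            "output-simple-agg-mem",
            value=json.dumps({"city": city, "city_count": city_count}),
        )
    producer.flush()
    consumer.commit()  # offsets committed only after output is flushed
```

Committing offsets only after the output has been flushed is what makes delivery at-least-once: a crash mid-batch causes the batch to be re-read and re-emitted rather than silently dropped.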

Original article

Create a stream processor that reads data from Kafka in less than 5 minutes.

Getting Started

Get started by running a stream processor that executes SQL against a Kafka stream and writes the output to the console.

What you'll need

  • The sql-flow repository cloned locally, with its Python dependencies installed
cd path/to/turbolytics/sql-flow/github/repo && pip install -r requirements.txt
  • The turbolytics/sql-flow docker image
docker pull turbolytics/sql-flow:latest
  • Kafka running on your local machine
cd path/to/turbolytics/sql-flow/github/repo && docker-compose -f dev/kafka-single.yml up -d

Test the SQLFlow configuration file

SQLFlow ships with CLI support for testing a stream configuration against any fixture file of test data. The goal is to support testing and linting of a configuration file before executing it in a streaming environment.

Run the invoke command to test the configuration file against a set of test data:

docker run -v $(pwd)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest dev invoke /tmp/conf/config/examples/basic.agg.mem.yml /tmp/conf/fixtures/simple.json

The following output should appear:

[{'city': 'New York', 'city_count': 28672}, {'city': 'Baltimore', 'city_count': 28672}]
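For intuition about what `invoke` is checking: assuming the fixture holds JSON records with a `city` field (the real test data is in `dev/fixtures/simple.json`, and the shipped config may differ), the example pipeline's aggregation is essentially a GROUP BY count, which you can reproduce with DuckDB's Python API directly:

```python
# Hypothetical illustration of the basic.agg.mem example's aggregation;
# the records and SQL here are assumptions, not the shipped config.
import duckdb
import pandas as pd

events = pd.DataFrame(
    [{"city": "New York"}, {"city": "New York"}, {"city": "Baltimore"}]
)
print(
    duckdb.sql(
        "SELECT city, count(*) AS city_count FROM events GROUP BY city"
    ).fetchall()
)
# [('New York', 2), ('Baltimore', 1)]
```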

Run SQLFlow against a Kafka stream

This section runs SQLFlow as a stream processor that reads data from a Kafka topic and writes the results to an output topic. SQLFlow runs as a daemon and will continuously read data from Kafka, execute the SQL, and write the output.

  • Publish test messages to the Kafka topic (a sketch of what this script does appears after this list)
python3 cmd/publish-test-data.py --num-messages=10000 --topic="input-simple-agg-mem"
  • Start the Kafka console consumer to view the SQLFlow output
docker exec -it kafka1 kafka-console-consumer --bootstrap-server=kafka1:9092 --topic=output-simple-agg-mem
  • Start SQLFlow, pointing it at the configuration file and the Kafka brokers
docker run -v $(pwd)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow -e SQLFLOW_KAFKA_BROKERS=host.docker.internal:29092 turbolytics/sql-flow:latest run /tmp/conf/config/examples/basic.agg.mem.yml --max-msgs-to-process=10000
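As referenced above, here is a rough picture of what the publish step does. The real script is `cmd/publish-test-data.py`; the event shape and broker address in this sketch are assumptions based on the tutorial's sample output:

```python
# Hypothetical stand-in for cmd/publish-test-data.py: write JSON test
# events to the input topic. Field name and broker address are assumptions.
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:29092"})
for i in range(10000):
    event = {"city": f"San Francisco{i % 1000}"}  # mirrors the sample output
    producer.produce("input-simple-agg-mem", value=json.dumps(event))
producer.flush()  # block until all messages are delivered
```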

The following output should begin to appear in the Kafka console consumer:

...
...
{"city":"San Francisco504","city_count":1}
{"city":"San Francisco735","city_count":1}
{"city":"San Francisco533","city_count":1}
{"city":"San Francisco556","city_count":1}
...
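If you prefer to verify the output programmatically rather than through the console consumer, a small confluent-kafka consumer along these lines (broker address assumed) does the same job:

```python
# Hypothetical programmatic equivalent of the kafka-console-consumer step.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:29092",
    "group.id": "sqlflow-output-check",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["output-simple-agg-mem"])
try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is not None and msg.error() is None:
            # e.g. {"city":"San Francisco504","city_count":1}
            print(msg.value().decode())
finally:
    consumer.close()
```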