Show HN: DuckDB for Kafka Stream Processing

Original link: https://sql-flow.com/docs/tutorials/intro/

## SQLFlow: Stream Processing on Kafka

This guide walks through setting up SQLFlow, a stream processor that runs SQL queries against Kafka streams. First, install the dependencies (`pip install -r requirements.txt`) and pull the Docker image (`docker pull turbolytics/sql-flow:latest`). Before processing live data, you can test a configuration against the provided fixtures via `docker run` and the `invoke` command to confirm the SQL logic is correct. To run SQLFlow as a stream processor, start a Kafka instance (`docker-compose -f dev/kafka-single.yml up -d`) and publish test data to a Kafka topic (`python3 cmd/publish-test-data.py`). Then start SQLFlow with `docker run`, pointing it at your configuration file and Kafka brokers. Finally, consume the output topic with the Kafka console consumer (`docker exec -it kafka1 kafka-console-consumer...`) to view the processed output. The goal is a working stream processor that reads and processes Kafka data in under 5 minutes.

## SQLFlow: Lightweight Stream Processing on DuckDB

SQLFlow is a new stream processing engine built around DuckDB that aims to simplify work which has traditionally required complex JVM systems. It targets tens of thousands of messages per second on minimal resources (about 250 MiB of memory) and leverages DuckDB's rich connector ecosystem. The core idea is to use SQL to define pipelines that process "batches" of data from Kafka. Although this looks like batch processing, SQLFlow achieves streaming by flowing data through continuously with a configurable batch size (as small as one message at a time), providing at-least-once delivery. Developers are excited about its potential, especially for use cases where a full solution like Flink is too heavy. Compared with projects such as Tributary, SQLFlow focuses on being a complete service for production environments, including DevOps tooling (metrics, testing, pipelines-as-code). Current limitations include the lack of stream-to-stream join support, a feature the developer is considering prioritizing. For more information, see [https://sql-flow.com/](https://sql-flow.com/) and the project's GitHub repository: [https://github.com/turbolytics/sql-flow](https://github.com/turbolytics/sql-flow).
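To make the batch-over-stream idea concrete, here is a minimal sketch of such a processing loop, assuming confluent-kafka, a local broker, and the tutorial's topic names and `city` field. This illustrates the technique only; it is not SQLFlow's actual implementation:

```python
# Minimal batch-over-stream sketch (NOT SQLFlow's actual code): consume a
# small batch from Kafka, run SQL over it with DuckDB, emit the results,
# then commit offsets. Topic names, broker address, and the "city" field
# are assumptions taken from the tutorial's example.
import json

import duckdb
import pandas as pd
from confluent_kafka import Consumer, Producer

BATCH_SIZE = 1000  # configurable; can be as small as one message per batch

consumer = Consumer({
    "bootstrap.servers": "localhost:29092",
    "group.id": "sqlflow-sketch",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,  # commit manually, after output is produced
})
consumer.subscribe(["input-simple-agg-mem"])
producer = Producer({"bootstrap.servers": "localhost:29092"})

while True:
    msgs = consumer.consume(num_messages=BATCH_SIZE, timeout=1.0)
    records = [json.loads(m.value()) for m in msgs if m.error() is None]
    if not records:
        continue

    batch = pd.DataFrame(records)  # DuckDB can query this DataFrame by name
    result = duckdb.sql(
        "SELECT city, count(*) AS city_count FROM batch GROUP BY city"
    ).fetchall()

    for city, city_count in result:
        producer.produce(
            "output-simple-agg-mem",
            value=json.dumps({"city": city, "city_count": city_count}),
        )
    producer.flush()
    consumer.commit()  # offsets committed only after output is flushed
```

Committing offsets only after the output has been flushed is what makes delivery at-least-once: a crash mid-batch causes the batch to be re-read and re-emitted rather than silently dropped.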

Original article

Create a stream processor that reads data from Kafka in less than 5 minutes.

Getting Started

Get started by running a stream processor that executes SQL against a Kafka stream and writes the output to the console.

What you'll need

  • The sql-flow repository cloned locally, with its Python dependencies installed
cd path/to/turbolytics/sql-flow/github/repo && pip install -r requirements.txt
  • The turbolytics/sql-flow docker image
docker pull turbolytics/sql-flow:latest
  • Kafka running on your local machine
cd path/to/turbolytics/sql-flow/github/repo && docker-compose -f dev/kafka-single.yml up -d

Test the SQLFlow configuration file

SQLFlow ships with CLI support for testing a stream configuration against any fixture file of test data. The goal is to support testing and linting of a configuration file before executing it in a streaming environment.

Run the invoke command to test the configuration file against a set of test data:

docker run -v $(pwd)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest dev invoke /tmp/conf/config/examples/basic.agg.mem.yml /tmp/conf/fixtures/simple.json

The following output should appear:

[{'city': 'New York', 'city_count': 28672}, {'city': 'Baltimore', 'city_count': 28672}]
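For intuition about what `invoke` is checking: assuming the fixture holds JSON records with a `city` field (the real test data is in `dev/fixtures/simple.json`, and the shipped config may differ), the example pipeline's aggregation is essentially a GROUP BY count, which you can reproduce with DuckDB's Python API directly:

```python
# Hypothetical illustration of the basic.agg.mem example's aggregation;
# the records and SQL here are assumptions, not the shipped config.
import duckdb
import pandas as pd

events = pd.DataFrame(
    [{"city": "New York"}, {"city": "New York"}, {"city": "Baltimore"}]
)
print(
    duckdb.sql(
        "SELECT city, count(*) AS city_count FROM events GROUP BY city"
    ).fetchall()
)
# [('New York', 2), ('Baltimore', 1)]
```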

Run SQLFlow against a Kafka stream

This section runs SQLFlow as a stream processor that reads data from a Kafka topic and writes the results to an output topic. SQLFlow runs as a daemon and will continuously read data from Kafka, execute the SQL, and write the output.

  • Publish test messages to the Kafka topic (a sketch of what this script does appears after this list)
python3 cmd/publish-test-data.py --num-messages=10000 --topic="input-simple-agg-mem"
  • Start the Kafka console consumer to view the SQLFlow output
docker exec -it kafka1 kafka-console-consumer --bootstrap-server=kafka1:9092 --topic=output-simple-agg-mem
  • Start SQLFlow, pointing it at the configuration file and the Kafka brokers
docker run -v $(pwd)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow -e SQLFLOW_KAFKA_BROKERS=host.docker.internal:29092 turbolytics/sql-flow:latest run /tmp/conf/config/examples/basic.agg.mem.yml --max-msgs-to-process=10000
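As referenced above, here is a rough picture of what the publish step does. The real script is `cmd/publish-test-data.py`; the event shape and broker address in this sketch are assumptions based on the tutorial's sample output:

```python
# Hypothetical stand-in for cmd/publish-test-data.py: write JSON test
# events to the input topic. Field name and broker address are assumptions.
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:29092"})
for i in range(10000):
    event = {"city": f"San Francisco{i % 1000}"}  # mirrors the sample output
    producer.produce("input-simple-agg-mem", value=json.dumps(event))
producer.flush()  # block until all messages are delivered
```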

The following output should begin to appear in the Kafka console consumer:

...
...
{"city":"San Francisco504","city_count":1}
{"city":"San Francisco735","city_count":1}
{"city":"San Francisco533","city_count":1}
{"city":"San Francisco556","city_count":1}
...
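If you prefer to verify the output programmatically rather than through the console consumer, a small confluent-kafka consumer along these lines (broker address assumed) does the same job:

```python
# Hypothetical programmatic equivalent of the kafka-console-consumer step.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:29092",
    "group.id": "sqlflow-output-check",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["output-simple-agg-mem"])
try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is not None and msg.error() is None:
            # e.g. {"city":"San Francisco504","city_count":1}
            print(msg.value().decode())
finally:
    consumer.close()
```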