Quack-Cluster: A Serverless Distributed SQL Query Engine with DuckDB and Ray

Original link: https://github.com/kristianaryanto/Quack-Cluster

## Quack-Cluster: Serverless Distributed SQL for Big Data

Quack-Cluster is a high-performance, serverless SQL query engine for large-scale data analysis. It lets users run complex SQL queries directly on data stored in object storage (such as S3 or Google Cloud Storage), *without* heavyweight big data infrastructure.

Built on Python, Ray, and DuckDB, Quack-Cluster distributes queries across a Ray cluster, where each worker uses DuckDB for fast in-memory processing. It reads data files (Parquet, CSV, etc.) natively with no ETL required and integrates seamlessly with existing data science workflows.

**Key features include:** serverless operation, high-speed SQL processing, direct data access, Python integration, and a fully open-source stack. Queries are parsed into a distributed execution plan, enabling massively parallel processing (MPP) that turns a collection of files into a powerful distributed database.

Getting started takes only Docker and `make`, allowing users to deploy quickly and test queries through the API endpoint or tools like Postman. Quack-Cluster supports a rich subset of the DuckDB SQL dialect, including joins, aggregations, and window functions.


Original Article

Quack-Cluster is a high-performance, serverless distributed SQL query engine designed for large-scale data analysis. It allows you to run complex SQL queries directly on data in object storage (like AWS S3 or Google Cloud Storage) by leveraging the combined power of Python, the Ray distributed computing framework, and the hyper-fast DuckDB analytical database.

It's an ideal, lightweight alternative to complex big data systems for all your analytical needs.


✨ Core Features: A Modern Distributed Database

  • Serverless & Distributed: Effortlessly run SQL queries on a scalable Ray cluster. Forget about managing complex server infrastructure for your database needs.
  • High-Speed SQL Processing: Utilizes the incredible speed of DuckDB's in-memory, columnar-vectorized query engine and the efficiency of the Apache Arrow data format for blazing-fast analytics.
  • Query Data Where It Lives: Natively reads data files (Parquet, CSV, etc.) directly from object storage like AWS S3, Google Cloud Storage, and local filesystems. No ETL required.
  • Python-Native Integration: Built with Python, Quack-Cluster integrates seamlessly into your existing data science, data engineering, and machine learning workflows.
  • Open Source Stack: Built with a powerful, modern stack of open-source technologies, including FastAPI, Ray, and DuckDB.

🏛️ Architecture: How Quack-Cluster Executes Distributed SQL

The Quack-Cluster system is designed for simplicity and scale. It distributes SQL queries across a Ray cluster, where each worker node uses an embedded DuckDB instance to process a portion of the data in parallel.

  1. A User sends a standard SQL query to the Coordinator's API endpoint.
  2. The Coordinator (FastAPI + SQLGlot) parses the SQL, identifies the target files (e.g., using wildcards like s3://my-bucket/data/*.parquet), and generates a distributed execution plan.
  3. The Ray Cluster orchestrates the execution by sending tasks to multiple Worker nodes.
  4. Each Worker (a Ray Actor) runs an embedded DuckDB instance to execute its assigned query fragment on a subset of the data.
  5. Partial results are efficiently aggregated by the Coordinator and returned to the user.

This architecture enables massively parallel processing (MPP) for your SQL queries, turning a collection of files into a powerful distributed database. A minimal code sketch of this flow follows the diagram below.

graph TD
    subgraph User
        A[Client/SDK]
    end

    subgraph "Quack-Cluster (Distributed SQL Engine)"
        B[Coordinator API<br>FastAPI + SQLGlot]

        subgraph "Ray Cluster (Distributed Computing)"
            direction LR
            W1[Worker 1<br>Ray Actor + DuckDB]
            W2[Worker 2<br>Ray Actor + DuckDB]
            W3[...]
        end
    end

    subgraph "Data Source"
        D[Object Storage<br>e.g., AWS S3, GCS, Local Files]
    end

    A -- "SELECT * FROM 's3://bucket/*.parquet'" --> B
    B -- "Distributed Execution Plan" --> W1
    B -- "Distributed Execution Plan" --> W2
    B -- "Distributed Execution Plan" --> W3
    W1 -- "Read data partition" --> D
    W2 -- "Read data partition" --> D
    W3 -- "Read data partition" --> D
    W1 -- "Partial SQL Result (Arrow)" --> B
    W2 -- "Partial SQL Result (Arrow)" --> B
    W3 -- "Partial SQL Result (Arrow)" --> B
    B -- "Final Aggregated Result" --> A

🚀 Getting Started: Deploy Your Own Distributed SQL Cluster

You only need Docker and make to get a local Quack-Cluster running.

  • Docker
  • make (pre-installed on Linux/macOS; available on Windows via WSL).
# 1. Clone this repository
git clone https://github.com/kristianaryanto/Quack-Cluster.git
cd Quack-Cluster

# 2. Generate sample data (creates Parquet files in ./data)
make data

# 3. Build and launch your distributed cluster
# This command starts a Ray head node and 2 worker nodes.
make up scale=2

Your cluster is now running! You can monitor the Ray cluster status at the Ray Dashboard: http://localhost:8265.
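
If you prefer to verify this from Python, a minimal reachability probe like the one below works, assuming the Ray dashboard port above (8265) and the coordinator API port used in the tutorial that follows (8000):

```python
# Minimal reachability check for a local cluster; assumes the default ports
# (Ray dashboard on 8265, coordinator API on 8000).
import requests

endpoints = {
    "Ray dashboard": "http://localhost:8265",
    "Coordinator API docs": "http://localhost:8000/docs",  # FastAPI serves /docs by default
}

for name, url in endpoints.items():
    try:
        code = requests.get(url, timeout=5).status_code
        print(f"{name}: HTTP {code}")
    except requests.exceptions.ConnectionError:
        print(f"{name}: not reachable at {url}")
```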

👨‍🏫 Tutorial: Running Distributed SQL Queries

Use any HTTP client like curl or Postman to send SQL queries to the API. The engine automatically handles file discovery with wildcards.

Example: Aggregate Sales Data from Multiple Parquet Files

This query calculates the total sales for each product across all data_part_*.parquet files.

curl -X 'POST' \
  'http://localhost:8000/query' \
  -H 'Content-Type: application/json' \
  -d '{
    "sql": "SELECT product, SUM(sales) as total_sales FROM \"data_part_*.parquet\" GROUP BY product ORDER BY product"
  }'

Expected Output:

{
  "result": [
    {"product": "A", "total_sales": 420.0},
    {"product": "B", "total_sales": 400.0},
    {"product": "C", "total_sales": 300.0}
  ]
}
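
Because the engine is Python-native, the same request can also be sent from Python. The snippet below is a direct translation of the curl call above using the third-party `requests` library:

```python
# Same query as the curl example above, sent from Python with `requests`.
import requests

resp = requests.post(
    "http://localhost:8000/query",
    json={
        "sql": (
            'SELECT product, SUM(sales) AS total_sales '
            'FROM "data_part_*.parquet" '
            'GROUP BY product ORDER BY product'
        )
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
# e.g. {"result": [{"product": "A", "total_sales": 420.0}, ...]}
```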

🚀 Testing with Postman

You can easily test all API features using the provided Postman collection.

  1. Import the Collection and Environment:

    • In Postman, click Import and select the following files:
    • Collection: documentation/postman_collection/QuackCluster_API_Tests.json
    • Environment: documentation/postman_collection/QuackCluster_postman_environment.json
  2. Activate the Environment:

    • In the top-right corner of Postman, select "Quack Cluster Environment" from the environment dropdown list.
  3. Send a Request:

    • The environment pre-configures the baseUrl variable to http://127.0.0.1:8000. You can now run any of the pre-built requests in the collection.

💡 Powerful Distributed SQL Capabilities with DuckDB

Quack-Cluster supports a rich subset of the DuckDB SQL dialect, enabling complex analytical queries across multiple files and directories; a worked example follows this list.

  • Basic Queries: SELECT, FROM, WHERE, GROUP BY, ORDER BY, LIMIT.
  • Aggregate Functions: COUNT(), SUM(), AVG(), MIN(), MAX().
  • Distributed Joins: INNER JOIN, LEFT JOIN, FULL OUTER JOIN across different file sets.
  • Advanced SQL:
    • Subqueries (e.g., WHERE IN (...))
    • Common Table Expressions (CTEs) using the WITH clause.
    • Window Functions (e.g., SUM(...) OVER (PARTITION BY ...)).
    • Advanced SELECT syntax like SELECT * EXCLUDE (...) and SELECT COLUMNS('<regex>').
  • File System Functions: Query collections of Parquet or CSV files using glob patterns (e.g., "s3://my-data/2025/**/*.parquet").
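
To illustrate the advanced items above, the sketch below submits a CTE combined with a window function through the same /query endpoint used in the tutorial. The column names (product, sales) are assumed from the earlier sample data and may need adjusting for your own files.

```python
# Illustrative advanced query: a CTE plus a window function, sent to the
# same /query endpoint as in the tutorial. Column names (product, sales)
# are assumed from the earlier sample data.
import requests

sql = """
WITH totals AS (
    SELECT product, SUM(sales) AS total_sales
    FROM "data_part_*.parquet"
    GROUP BY product
)
SELECT product, total_sales,
       RANK() OVER (ORDER BY total_sales DESC) AS sales_rank
FROM totals
ORDER BY sales_rank
"""

resp = requests.post("http://localhost:8000/query", json={"sql": sql}, timeout=60)
resp.raise_for_status()
print(resp.json())
```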

🛠️ Development & Management Commands

Use these make commands to manage your development lifecycle.

Command            Description
make up scale=N    Starts the cluster with N worker nodes.
make down          Stops and removes running containers safely.
make logs          Tails the logs from all services.
make build         Rebuilds the Docker images after a code change.
make test          Runs the pytest suite inside the ray-head container.
make clean         DANGER: Stops containers and deletes all data volumes.


Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Please feel free to open an issue or submit a pull request.

🤖 AI-Powered Development

This project leverages AI tools to accelerate development and improve documentation.

All core architectural decisions, debugging, and final testing are human-powered to ensure quality and correctness.


This project is licensed under the MIT License. See the LICENSE file for details.
