Quack-Cluster is a high-performance, serverless distributed SQL query engine designed for large-scale data analysis. It allows you to run complex SQL queries directly on data in object storage (like AWS S3 or Google Cloud Storage) by leveraging the combined power of Python, the Ray distributed computing framework, and the hyper-fast DuckDB analytical database.
It's a lightweight alternative to complex big data systems for analytical workloads.
- Serverless & Distributed: Effortlessly run SQL queries on a scalable Ray cluster. Forget about managing complex server infrastructure for your database needs.
- High-Speed SQL Processing: Utilizes the incredible speed of DuckDB's in-memory, columnar-vectorized query engine and the efficiency of the Apache Arrow data format for blazing-fast analytics.
- Query Data Where It Lives: Natively reads data files (Parquet, CSV, etc.) directly from object storage like AWS S3, Google Cloud Storage, and local filesystems. No ETL required.
- Python-Native Integration: Built with Python, Quack-Cluster integrates seamlessly into your existing data science, data engineering, and machine learning workflows.
- Open Source Stack: Built with a powerful, modern stack of open-source technologies, including FastAPI, Ray, and DuckDB.
The Quack-Cluster system is designed for simplicity and scale. It distributes SQL queries across a Ray cluster, where each worker node uses an embedded DuckDB instance to process a portion of the data in parallel.
- A User sends a standard SQL query to the Coordinator's API endpoint.
- The Coordinator (FastAPI + SQLGlot) parses the SQL, identifies the target files (e.g., using wildcards like `s3://my-bucket/data/*.parquet`), and generates a distributed execution plan.
- The Ray Cluster orchestrates the execution by sending tasks to multiple Worker nodes.
- Each Worker (a Ray Actor) runs an embedded DuckDB instance to execute its assigned query fragment on a subset of the data.
- Partial results are efficiently aggregated by the Coordinator and returned to the user.
This architecture enables massively parallel processing (MPP) for your SQL queries, turning a collection of files into a powerful distributed database. The diagram and the minimal code sketch below illustrate the flow.
```mermaid
graph TD
subgraph User
A[Client/SDK]
end
subgraph "Quack-Cluster (Distributed SQL Engine)"
B[Coordinator API<br>FastAPI + SQLGlot]
subgraph "Ray Cluster (Distributed Computing)"
direction LR
W1[Worker 1<br>Ray Actor + DuckDB]
W2[Worker 2<br>Ray Actor + DuckDB]
W3[...]
end
end
subgraph "Data Source"
D[Object Storage<br>e.g., AWS S3, GCS, Local Files]
end
A -- "SELECT * FROM 's3://bucket/*.parquet'" --> B
B -- "Distributed Execution Plan" --> W1
B -- "Distributed Execution Plan" --> W2
B -- "Distributed Execution Plan" --> W3
W1 -- "Read data partition" --> D
W2 -- "Read data partition" --> D
W3 -- "Read data partition" --> D
W1 -- "Partial SQL Result (Arrow)" --> B
W2 -- "Partial SQL Result (Arrow)" --> B
W3 -- "Partial SQL Result (Arrow)" --> B
B -- "Final Aggregated Result" --> A
You only need Docker and make to get a local Quack-Cluster running.
- Docker
- `make` (pre-installed on Linux/macOS; available on Windows via WSL).
```bash
# 1. Clone this repository
git clone https://github.com/your-username/quack-cluster.git
cd quack-cluster
# 2. Generate sample data (creates Parquet files in ./data)
make data
# 3. Build and launch your distributed cluster
# This command starts a Ray head node and 2 worker nodes.
make up scale=2
```

Your cluster is now running! You can monitor the Ray cluster status at the Ray Dashboard: http://localhost:8265.
Use any HTTP client like curl or Postman to send SQL queries to the API. The engine automatically handles file discovery with wildcards.
This query calculates the total sales for each product across all `data_part_*.parquet` files.
```bash
curl -X 'POST' \
'http://localhost:8000/query' \
-H 'Content-Type: application/json' \
-d '{
"sql": "SELECT product, SUM(sales) as total_sales FROM \"data_part_*.parquet\" GROUP BY product ORDER BY product"
}'
```

Expected Output:

```json
{
"result": [
{"product": "A", "total_sales": 420.0},
{"product": "B", "total_sales": 400.0},
{"product": "C", "total_sales": 300.0}
]
}
```

You can easily test all API features using the provided Postman collection.
1. Import the Collection and Environment:
   - In Postman, click Import and select the following files:
     - Collection: `documentation/postman_collection/QuackCluster_API_Tests.json`
     - Environment: `documentation/postman_collection/QuackCluster_postman_environment.json`
2. Activate the Environment:
   - In the top-right corner of Postman, select "Quack Cluster Environment" from the environment dropdown list.
3. Send a Request:
   - The environment pre-configures the `baseUrl` variable to `http://127.0.0.1:8000`. You can now run any of the pre-built requests in the collection.
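If you'd rather script queries than use curl or Postman, a minimal Python sketch can post the same payload to the `/query` endpoint shown above. It assumes only the `requests` library and a cluster running on the default port:

```python
import requests

BASE_URL = "http://127.0.0.1:8000"  # same default as the Postman environment

payload = {
    "sql": (
        'SELECT product, SUM(sales) AS total_sales '
        'FROM "data_part_*.parquet" '
        "GROUP BY product ORDER BY product"
    )
}

# POST the query to the Coordinator and print the aggregated rows.
response = requests.post(f"{BASE_URL}/query", json=payload, timeout=60)
response.raise_for_status()
for row in response.json()["result"]:
    print(row)
```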
Quack-Cluster supports a rich subset of the DuckDB SQL dialect, enabling complex analytical queries across multiple files and directories. A combined example follows the list.
- Basic Queries: `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `ORDER BY`, `LIMIT`.
- Aggregate Functions: `COUNT()`, `SUM()`, `AVG()`, `MIN()`, `MAX()`.
- Distributed Joins: `INNER JOIN`, `LEFT JOIN`, `FULL OUTER JOIN` across different file sets.
- Advanced SQL:
  - Subqueries (e.g., `WHERE IN (...)`).
  - Common Table Expressions (CTEs) using the `WITH` clause.
  - Window Functions (e.g., `SUM(...) OVER (PARTITION BY ...)`).
  - Advanced `SELECT` syntax like `SELECT * EXCLUDE (...)` and `SELECT COLUMNS('<regex>')`.
- File System Functions: Query collections of Parquet or CSV files using glob patterns (e.g., `"s3://my-data/2025/**/*.parquet"`).
Use these `make` commands to manage your development lifecycle.

| Command | Description |
|---|---|
| `make up scale=N` | Starts the cluster with N worker nodes. |
| `make down` | Stops and removes running containers safely. |
| `make logs` | Tails the logs from all services. |
| `make build` | Rebuilds the Docker images after a code change. |
| `make test` | Runs the pytest suite inside the `ray-head` container. |
| `make clean` | DANGER: Stops containers and deletes all data volumes. |
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Please feel free to open an issue or submit a pull request.
This project leverages AI tools to accelerate development and improve documentation.
All core architectural decisions, debugging, and final testing are human-powered to ensure quality and correctness.
This project is licensed under the MIT License. See the LICENSE file for details.