Distributed DuckDB Instance

Original link: https://github.com/citguru/openduck

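## OpenDuck: Open-Source Cloud Capabilities for DuckDB

OpenDuck brings MotherDuck's cloud architecture (differential storage, hybrid execution, and transparent remote databases) to the open-source world. It lets users query remotely stored data with a simple `ATTACH` statement (for example, `ATTACH 'openduck:mydb' AS cloud;`), as if the data lived in a local DuckDB instance.

Key features include snapshot-based layered storage built on object storage and PostgreSQL metadata, and a hybrid query engine that splits execution between local and remote workers. Both rest on a minimal open protocol based on gRPC and Arrow IPC, which keeps the backend flexible: any service that returns Arrow can be used.

OpenDuck is implemented as a DuckDB extension and includes a Rust gateway for authentication, routing, and plan splitting. It offers direct connections from Python and aims to integrate seamlessly with DuckDB's catalog, treating remote tables as first-class citizens. Although inspired by MotherDuck, OpenDuck is not wire-compatible with it and offers a fully open-source alternative.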

Posted by citguru, who commented: "This is an attempt to replicate MotherDuck's differential storage and implement hybrid query execution on DuckDB."
Original text

An open-source implementation of the ideas pioneered by MotherDuck — differential storage, hybrid (dual) execution, and transparent remote databases for DuckDB — available for anyone to run, extend, and build on.

MotherDuck showed that DuckDB can work beautifully in the cloud: ATTACH 'md:mydb', and remote tables appear local. Queries split transparently across your laptop and the cloud. Storage is layered and snapshot-based. OpenDuck takes those architectural ideas — differential storage, dual execution, the attach-based UX — and makes them open. Open protocol, open backend, open extension.

import duckdb

con = duckdb.connect()
con.execute("LOAD 'openduck';")
con.execute("ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token=xxx' AS cloud;")

con.sql("SELECT * FROM cloud.users").show()                    # remote, transparent
con.sql("SELECT * FROM local.t JOIN cloud.t2 ON ...").show()   # hybrid, one query

# direct connect using the openduck Python library
import openduck

con = openduck.connect("od:mydb")
con = openduck.connect("openduck:mydb")

# direct connect using duckdb (TODO: needs duckdb to autoload openduck the same way motherduck works today)

con = duckdb.connect("od:mydb")
con = duckdb.connect("openduck:mydb")
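The attach string above packs a scheme, a database name, and connection options into a single value. As a rough illustration only, here is how such a string could be decomposed. This is not OpenDuck's parser: `parse_attach_uri` and `AttachTarget` are hypothetical names, and falling back to the `OPENDUCK_TOKEN` environment variable (the one set in the quick start commands) is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs

# Schemes mentioned in this README; treating them as interchangeable is an assumption.
SCHEMES = ("openduck:", "od:")


@dataclass
class AttachTarget:
    database: str
    endpoint: Optional[str]
    token: Optional[str]


def parse_attach_uri(uri: str) -> AttachTarget:
    """Split 'openduck:mydb?endpoint=...&token=...' into parts (hypothetical helper)."""
    for scheme in SCHEMES:
        if uri.startswith(scheme):
            rest = uri[len(scheme):]
            break
    else:
        raise ValueError(f"unsupported scheme in {uri!r}")

    database, _, query = rest.partition("?")
    if not database:
        raise ValueError("missing database name")

    params = parse_qs(query)
    endpoint = params.get("endpoint", [None])[0]
    # Assumed fallback: use OPENDUCK_TOKEN when the URI carries no token.
    token = params.get("token", [os.environ.get("OPENDUCK_TOKEN")])[0]
    return AttachTarget(database, endpoint, token)


target = parse_attach_uri("openduck:mydb?endpoint=http://localhost:7878&token=xxx")
print(target)
```

Keeping the options in the query string means the whole connection can travel through a single `ATTACH` argument, which is what makes the one-line attach UX possible.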

Append-only layers with PostgreSQL metadata. DuckDB sees a normal file; OpenDuck persists data as immutable sealed layers addressable from object storage. Snapshots give you consistent reads. One serialized write path, many concurrent readers.
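To make the layering idea concrete, here is a toy in-memory model, assuming page-granular layers. It is a sketch, not OpenDuck's storage code: `Layer` and `LayerStore` are hypothetical names, a Python list stands in for object storage, and the layer order that the real system tracks in PostgreSQL metadata is simply the list index.

```python
import threading
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Layer:
    """An immutable, sealed set of page writes (page number, bytes)."""
    layer_id: int
    pages: Tuple[Tuple[int, bytes], ...]


class LayerStore:
    """Toy append-only store: one serialized writer, many snapshot readers."""

    def __init__(self) -> None:
        self._layers: List[Layer] = []        # stand-in for object storage
        self._write_lock = threading.Lock()   # the single write path

    def seal(self, pages: Dict[int, bytes]) -> int:
        """Seal a new layer and return the snapshot id that includes it."""
        with self._write_lock:
            layer = Layer(len(self._layers) + 1, tuple(sorted(pages.items())))
            self._layers.append(layer)        # append only; old layers never change
            return layer.layer_id

    def read(self, snapshot: int, page: int) -> Optional[bytes]:
        """Read a page as of a snapshot: the newest layer at or below it wins."""
        for layer in reversed(self._layers[:snapshot]):
            for number, data in layer.pages:
                if number == page:
                    return data
        return None


store = LayerStore()
s1 = store.seal({0: b"v1", 1: b"a"})
s2 = store.seal({0: b"v2"})
print(store.read(s1, 0), store.read(s2, 0), store.read(s2, 1))
```

Because sealed layers are never modified, a reader pinned to snapshot `s1` keeps seeing `b"v1"` for page 0 even after the second layer is written.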

A single query can run partly on your machine and partly on a remote worker. The gateway splits the plan, labels each operator LOCAL or REMOTE, and inserts bridge operators at the boundaries. Only intermediate results cross the wire.

[LOCAL]  HashJoin(l.id = r.id)
  [LOCAL]  Scan(products)          ← your laptop
  [LOCAL]  Bridge(R→L)
    [REMOTE] Scan(sales)           ← remote worker
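The labelled plan above can be reproduced with a small placement pass. This is a minimal sketch, not the gateway's planner: `Op`, `place`, and `render` are hypothetical names, and the placement rule (an operator runs remotely only when all of its inputs do) is an assumption chosen to match the example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Op:
    name: str
    children: List["Op"] = field(default_factory=list)
    site: str = ""  # filled in by place(): "LOCAL" or "REMOTE"


def place(op: Op, remote_tables: Set[str]) -> Op:
    """Label operators bottom-up and insert Bridge nodes where sites differ."""
    op.children = [place(c, remote_tables) for c in op.children]
    if op.name.startswith("Scan("):
        table = op.name[len("Scan("):-1]
        op.site = "REMOTE" if table in remote_tables else "LOCAL"
        return op
    # Assumed rule: an operator is REMOTE only if every input is REMOTE.
    sites = {c.site for c in op.children}
    op.site = "REMOTE" if sites == {"REMOTE"} else "LOCAL"
    # A LOCAL operator with a REMOTE input gets a bridge at the boundary.
    op.children = [
        c if c.site == op.site else Op("Bridge(R→L)", [c], "LOCAL")
        for c in op.children
    ]
    return op


def render(op: Op, depth: int = 0) -> List[str]:
    """Pretty-print the plan with one indented line per operator."""
    lines = ["  " * depth + f"[{op.site}] {op.name}"]
    for child in op.children:
        lines.extend(render(child, depth + 1))
    return lines


plan = Op("HashJoin(l.id = r.id)", [Op("Scan(products)"), Op("Scan(sales)")])
lines = render(place(plan, remote_tables={"sales"}))
print("\n".join(lines))
```

Only the subtree under the bridge runs on the remote worker, so only the output of `Scan(sales)` has to cross the wire.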

The extension implements DuckDB's StorageExtension and Catalog interfaces. Remote tables are first-class catalog entries: they participate in JOINs, CTEs, and the optimizer like local tables.

OpenDuck's protocol is intentionally minimal: two RPCs defined in execution.proto. One executes a query; the other streams the results back as Arrow IPC batches.

Because the protocol is open and simple, you're not locked into a single backend. Any service that speaks gRPC and returns Arrow can serve as an OpenDuck-compatible backend. Run the included Rust gateway, replace it with your own implementation, or plug in an entirely different execution engine — the client and extension don't care what's on the other side.
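The backend-swapping claim can be illustrated with a plain Python interface. This is a sketch under stated assumptions: the method names `execute` and `fetch` are hypothetical (the actual RPC names live in execution.proto and are not given here), and the in-memory backend yields opaque byte strings where a real backend would stream Arrow IPC batches over gRPC.

```python
import uuid
from typing import Dict, Iterator, List, Protocol


class ExecutionBackend(Protocol):
    """Hypothetical Python mirror of a two-call execution service."""

    def execute(self, sql: str) -> str:
        """Start a query and return an opaque execution id."""
        ...

    def fetch(self, execution_id: str) -> Iterator[bytes]:
        """Stream the result as serialized batches."""
        ...


class InMemoryBackend:
    """Stand-in backend that replays canned batches for known queries."""

    def __init__(self, canned: Dict[str, List[bytes]]) -> None:
        self._canned = canned
        self._running: Dict[str, List[bytes]] = {}

    def execute(self, sql: str) -> str:
        execution_id = uuid.uuid4().hex
        self._running[execution_id] = self._canned[sql]
        return execution_id

    def fetch(self, execution_id: str) -> Iterator[bytes]:
        # Each execution id can be drained once.
        yield from self._running.pop(execution_id)


def run(backend: ExecutionBackend, sql: str) -> List[bytes]:
    """Client side: relies only on the two calls, not on the backend type."""
    return list(backend.fetch(backend.execute(sql)))


batches = run(InMemoryBackend({"SELECT 1": [b"batch-0", b"batch-1"]}), "SELECT 1")
print(batches)
```

`run` never inspects the concrete class, so any object offering the same two calls could be passed in its place.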

┌─────────────────────────────────────────────┐
│  DuckDB process (client)                    │
│                                             │
│  LOAD openduck                              │
│  ATTACH 'openduck:mydb' AS cloud            │
│                                             │
│  ┌─────────────────────────────────────┐    │
│  │ OpenDuckCatalog                     │    │
│  │  └─ OpenDuckSchemaEntry             │    │
│  │      └─ OpenDuckTableEntry (users)  │    │
│  │      └─ OpenDuckTableEntry (events) │    │
│  └──────────────┬──────────────────────┘    │
│                 │ gRPC + Arrow IPC          │
└─────────────────┼───────────────────────────┘
                  │
      ┌───────────▼───────────┐
      │  Gateway (Rust)       │
      │  - auth, routing      │
      │  - plan splitting     │     ┌──────────────┐
      │  - backpressure       │────▶│  Worker 1    │
      │                       │     │  (DuckDB)    │
      │                       │     └──────────────┘
      │                       │     ┌──────────────┐
      │                       │────▶│  Worker N    │
      │                       │     │  (DuckDB)    │
      └───────────────────────┘     └──────────────┘
              │
    ┌─────────┴─────────┐
    ▼                   ▼
┌──────────┐    ┌──────────────┐
│ Postgres │    │ Object store │
│ metadata │    │ sealed layers│
└──────────┘    └──────────────┘

# Backend
cargo build --workspace
cargo run -p openduck -- serve -d mydb -t your-token

# Extension (requires vcpkg — see extensions/openduck/README.md)
cd extensions/openduck && make

# Python client
pip install -e clients/python
export OPENDUCK_TOKEN=your-token
python -c "
import openduck
con = openduck.connect('mydb')
con.sql('SELECT 1 AS x').show()
"

crates/
  exec-gateway/     Gateway — auth, routing, hybrid plan splitting
  exec-worker/      Worker — embedded DuckDB, Arrow IPC streaming
  exec-proto/       Protobuf/tonic codegen
  openduck-cli/     Unified CLI (openduck serve|gateway|worker)
  diff-*/           Differential storage pipeline (layers, metadata, FUSE)

extensions/
  openduck/         DuckDB C++ extension (StorageExtension + Catalog)

clients/
  python/           openduck Python package (pip install -e clients/python)

proto/
  openduck/v1/      Protocol definition (execution.proto)

MotherDuck is a commercial cloud service. OpenDuck is an open-source project inspired by its architecture.

| | MotherDuck | OpenDuck |
|---|---|---|
| What | Managed cloud service | Self-hosted open-source |
| Attach scheme | `md:` | `openduck:` / `od:` |
| Auth | `motherduck_token` | `OPENDUCK_TOKEN` |
| Differential storage | Proprietary | Open (Postgres metadata + object store) |
| Hybrid execution | Proprietary planner | Open (gateway + plan splitting) |
| Protocol | Private wire format | Open gRPC + Arrow IPC |
| Backend | MotherDuck's cloud | Anything implementing `ExecutionService` |
| Extension | Bundled in DuckDB | Separate loadable extension |

OpenDuck is not wire-compatible with MotherDuck. It reimplements the same architectural ideas as an open protocol.

OpenDuck vs Arrow Flight SQL

Arrow Flight SQL is a generic database protocol — "JDBC/ODBC over Arrow." OpenDuck is a DuckDB-specific system with a narrower scope but deeper integration.

| | Arrow Flight SQL | OpenDuck |
|---|---|---|
| Scope | Any SQL database | DuckDB-specific |
| Integration | Separate client driver | DuckDB StorageExtension + Catalog |
| Catalog | Server-side (GetTables, etc.) | Extension-side (DuckDB catalog entries) |
| Execution | Full query on server | Hybrid: split across local and remote |
| Protocol surface | ~15 RPCs | 2 RPCs |
| Plan format | SQL only | SQL (M2), structured plan IR (M3) |
| Optimizer | Client-side, unaware | DuckDB optimizer sees remote tables natively |

OpenDuck's architecture draws heavily from MotherDuck's published work on differential storage, dual execution, and cloud-native DuckDB. Credit to the MotherDuck team for pioneering these ideas.

License: MIT
