Joining databases across teams without copying data or running servers

Original link: https://datahike.io/notes/collaborate-without-infrastructure/

## Datahike: A New Approach to Data Sharing

Traditional data integration relies on heavy infrastructure such as ETL pipelines and APIs, which introduce latency and maintenance overhead. Datahike offers a simpler solution: treat the database as an immutable value. Anyone with read access to the storage can query the data directly, without *moving* it.

Datahike achieves this by storing data as immutable B-trees in storage (such as S3 or a filesystem), using structural sharing - similar to Git - to represent changes efficiently. Each read fetches a "branch head" pointing to the current database snapshot, then lazily loads nodes on demand. This "distributed index space" lets multiple processes read independently, with no coordination.

Because databases are values, Datahike's Datalog query language can seamlessly join data from different teams, different storage backends, and even different points in time - all in a single query. This extends to the browser via IndexedDB, enabling fast local queries and differential sync. In essence, Datahike shifts the complexity away from data movement and server management toward the efficient storage and querying of immutable data values.


Original article

March 2026

When two teams need to combine data, the usual answer is infrastructure: an ETL pipeline, an API, a message bus. Each adds latency, maintenance burden, and a new failure mode. The data moves because the systems can’t share it in place.

There’s a simpler model. If your database is an immutable value in storage, then anyone who can read the storage can query it. No server to run, no API to negotiate, no data to copy. And if your query language supports multiple inputs, you can join databases from different teams in a single expression.

This is how Datahike works. It isn’t a feature we bolted on - it falls out naturally from two properties fundamental to the architecture.

Databases are values

In a traditional database, you query through a connection to a running server. The data may change between queries. The database is a service, not something you hold.

Datahike inverts this. Dereference a connection (@conn) and you get an immutable database value - a snapshot frozen at a specific transaction. It won’t change. Pass it to a function, hold it in a variable, hand it to another thread. Two concurrent readers holding the same snapshot always agree, without locks or coordination.

This is an idea Rich Hickey introduced with Datomic in 2012: separate process (writes, managed by a single writer) from perception (reads, which are just values). The insight was that a correct implementation of perception does not require coordination.

Datomic’s indices live in storage, but its transactor holds an in-memory overlay of recent index segments that haven’t been flushed yet. Readers typically need to coordinate with the transactor to get a complete, current view. The storage alone isn’t enough.

Datahike removes that dependency. The writer flushes to storage on every transaction, so storage is always authoritative. Any process that can read the store sees the full, current database - no overlay, no transactor connection needed. To understand why this works, you need to see how the data is structured.

Trees in storage

Datahike keeps its indices in a persistent sorted set - a B-tree variant where nodes are immutable. Every node is stored as a key-value pair in konserve, which abstracts over storage backends: S3, filesystem, JDBC, IndexedDB.

When a transaction adds data, Datahike doesn’t modify existing nodes. It creates new nodes for the changed path from leaf to root, while the unchanged subtrees are shared with the previous version. This is structural sharing - the same technique behind Clojure’s persistent vectors and Git’s object store.

A concrete example: a database with a million datoms might have a B-tree with thousands of nodes. A transaction that adds ten datoms rewrites perhaps a dozen nodes along the affected paths. The new tree root points to these new nodes and to the thousands of unchanged nodes from before. Both the old and new snapshots are valid, complete trees. They just share most of their structure.
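Clojure's built-in persistent data structures behave the same way, which makes the idea easy to check at a REPL. This toy example (plain Clojure, no Datahike) shows that an updated map shares its unchanged substructure with the original instead of copying it:

```clojure
;; A "snapshot" with a nested value.
(def snapshot-1 {:products {"W001" {:name "Widget" :price 9.99}}
                 :tx 41})

;; "Transacting" produces a new snapshot; the old one is untouched.
(def snapshot-2 (-> snapshot-1
                    (assoc-in [:products "G002"] {:name "Gadget" :price 24.5})
                    (assoc :tx 42)))

;; The unchanged entry is the *same* object in both snapshots -
;; structural sharing, not a copy.
(identical? (get-in snapshot-1 [:products "W001"])
            (get-in snapshot-2 [:products "W001"]))
;; => true
```

The B-tree nodes in Datahike's indices are shared between snapshots in exactly this sense, just persisted to storage rather than held on the heap.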

The crucial property: every node is written once and never modified. Its key can be content-addressed. This means nodes can be cached aggressively, replicated independently, and read by any process that has access to the storage - without coordinating with the process that wrote them. (For more on how structural sharing, branching, and the tradeoffs work, see The Git Model for Databases.)
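Content addressing can be sketched in a few lines: derive a node's storage key from a hash of its serialized content. This is only an illustration of the idea - `content-key` is a hypothetical helper, not konserve's actual key scheme:

```clojure
(import 'java.security.MessageDigest)

(defn content-key
  "Hypothetical content address: SHA-256 over the node's printed form."
  [node]
  (let [digest (.digest (MessageDigest/getInstance "SHA-256")
                        (.getBytes (pr-str node) "UTF-8"))]
    (apply str (map #(format "%02x" %) digest))))

;; Identical content always maps to the same key, so a node written once
;; can be cached or replicated by key without coordination.
(= (content-key {:datoms [[1 :product/sku "W001"]]})
   (content-key {:datoms [[1 :product/sku "W001"]]}))
;; => true
```

Because a key determines its content forever, a cache never needs invalidation and a replica never needs to re-check freshness.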

The distributed index space

This is where it comes together.

When you call @conn, Datahike fetches one key from the konserve store: the branch head (e.g. :db). This returns a small map containing root pointers for each index, schema metadata, and the current transaction ID. Nothing else is loaded - the database value you receive is a lazy handle into the tree.
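For intuition, the branch head can be pictured as a small map along these lines. The field names and values here are illustrative, not Datahike's exact internal layout:

```clojure
;; Hypothetical shape of a branch head - one small value in the store.
{:branch   :db
 :tx       536870914                ; current transaction ID
 :schema   {:product/sku {:db/unique :db.unique/identity}}
 ;; one root pointer per index; everything below them loads lazily
 :eavt-key #uuid "5c28e2e0-0000-4000-8000-000000000001"
 :aevt-key #uuid "5c28e2e0-0000-4000-8000-000000000002"
 :avet-key #uuid "5c28e2e0-0000-4000-8000-000000000003"}
```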

When a query traverses the index, each node is fetched on demand from storage and cached in a local LRU. Subsequent queries hitting the same nodes pay no I/O.
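That read path can be sketched as a fetch-through cache. This is a simplified stand-in for Datahike's actual node loader - the real cache is a bounded LRU, while this sketch just memoizes in an atom:

```clojure
;; Toy fetch-through cache: look up a node locally, fall back to storage.
(def node-cache (atom {}))

(defn load-node
  "Return the node at key k, fetching from storage only on a cache miss.
  `fetch-from-storage` stands in for a konserve read."
  [fetch-from-storage k]
  (or (get @node-cache k)
      (let [node (fetch-from-storage k)]
        (swap! node-cache assoc k node)
        node)))

;; Usage: the second read of the same key does no "I/O".
(def io-count (atom 0))
(defn slow-fetch [k] (swap! io-count inc) {:node k})

(load-node slow-fetch :root)   ;; fetches from "storage"
(load-node slow-fetch :root)   ;; cache hit
@io-count
;; => 1
```

Because nodes are immutable, a cached node can never be stale, so the cache needs no invalidation logic at all.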

That’s the entire read path. No server process mediating access, no connection protocol, no port to expose. The indices live in storage, and any process that can read the storage can load the branch head, traverse the tree, and run queries. We call this the distributed index space.

Two processes reading the same database fetch the same immutable nodes independently. They don’t know about each other. A writer publishes new snapshots by writing new tree nodes, then atomically updating the branch head. Readers that dereference afterward see the new snapshot. Readers holding an earlier snapshot continue undisturbed - their nodes are immutable and won’t be garbage collected while reachable.
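The publish step maps directly onto Clojure's own value semantics. Here an atom stands in for the branch-head key in storage (an analogy, not the real write path):

```clojure
;; The "store": immutable nodes plus one mutable branch-head pointer.
(def branch-head (atom {:root :node-a :tx 1}))

;; A reader dereferences once and keeps a stable snapshot.
(def reader-snapshot @branch-head)

;; The writer publishes new nodes, then updates the head atomically.
(reset! branch-head {:root :node-b :tx 2})

;; The earlier reader is undisturbed; new readers see the new snapshot.
reader-snapshot    ;; => {:root :node-a, :tx 1}
@branch-head       ;; => {:root :node-b, :tx 2}
```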

Joining across databases

Because databases are values and Datalog natively supports multiple input sources, the next step is natural: join databases from different teams, different storage backends, or different points in time - in a single query.

Team A maintains a product catalog on S3. Team B maintains inventory on a separate bucket. A third team joins them without either team doing anything:

(def catalog   (d/connect {:store {:backend :s3 :bucket "team-a"}}))
(def inventory (d/connect {:store {:backend :s3 :bucket "team-b"}}))

(d/q '[:find ?name ?price ?stock
       :in $cat $inv
       :where [$cat ?p :product/sku   ?sku]
              [$cat ?p :product/name  ?name]
              [$cat ?p :product/price ?price]
              [$inv ?i :stock/sku     ?sku]
              [$inv ?i :stock/count   ?stock]
              [(> ?stock 0)]]
  @catalog @inventory)

Each @ dereference fetches a branch head from its respective S3 bucket and returns an immutable database value. The query engine joins them locally. There is no server coordinating between the two, no data copied.

And because both are values, you can mix snapshots from different points in time:

;; Last quarter's catalog crossed with current inventory
(def old-catalog (d/as-of @catalog #inst "2025-11-01"))

(d/q '[:find ?name ?stock
       :in $cat $inv
       :where [$cat ?p :product/sku  ?sku]
              [$cat ?p :product/name ?name]
              [$inv ?i :stock/sku    ?sku]
              [$inv ?i :stock/count  ?stock]]
  old-catalog @inventory)

The old snapshot and the current one are both just values. The query engine doesn’t care when they’re from. This is useful for audits, regulatory reproducibility, and debugging: “what would this report have shown against last quarter’s data?”

From storage to browsers

So far, “storage” has meant S3 or a filesystem. But konserve also has an IndexedDB backend, which means the same model works in a browser. Using Kabel WebSocket sync and konserve-sync, a browser client replicates a database locally into IndexedDB. Queries run against the local replica with zero network round-trips. Updates sync differentially - only changed tree nodes are transmitted; the same structural sharing that makes snapshots cheap on the server makes sync cheap over the wire.

Try it

A complete cross-database join, runnable in a Clojure REPL:

(require '[datahike.api :as d])

;; Two independent databases
(def catalog-cfg  {:store {:backend :memory
                           :id (java.util.UUID/randomUUID)}
                   :schema-flexibility :read})
(def inventory-cfg {:store {:backend :memory
                            :id (java.util.UUID/randomUUID)}
                    :schema-flexibility :read})

(d/create-database catalog-cfg)
(d/create-database inventory-cfg)

(def catalog  (d/connect catalog-cfg))
(def inventory (d/connect inventory-cfg))

;; Team A: products
(d/transact catalog
  [{:product/sku "W001" :product/name "Widget"      :product/price 9.99}
   {:product/sku "G002" :product/name "Gadget"      :product/price 24.50}
   {:product/sku "T003" :product/name "Thingamajig" :product/price 3.75}])

;; Team B: stock levels
(d/transact inventory
  [{:stock/sku "W001" :stock/count 140}
   {:stock/sku "G002" :stock/count 0}
   {:stock/sku "T003" :stock/count 58}])

;; Join: in-stock products with price
(d/q '[:find ?name ?price ?stock
       :in $cat $inv
       :where [$cat ?p :product/sku   ?sku]
              [$cat ?p :product/name  ?name]
              [$cat ?p :product/price ?price]
              [$inv ?i :stock/sku     ?sku]
              [$inv ?i :stock/count   ?stock]
              [(> ?stock 0)]]
  @catalog @inventory)
;; => #{["Widget" 9.99 140] ["Thingamajig" 3.75 58]}

Replace :memory with :s3, :file, or :jdbc and the same code works across storage backends. The databases don’t need to share a backend - join an S3 database against a local file store in the same query.
