March 2026
When two teams need to combine data, the usual answer is infrastructure: an ETL pipeline, an API, a message bus. Each adds latency, maintenance burden, and a new failure mode. The data moves because the systems can’t share it in place.
There’s a simpler model. If your database is an immutable value in storage, then anyone who can read the storage can query it. No server to run, no API to negotiate, no data to copy. And if your query language supports multiple inputs, you can join databases from different teams in a single expression.
This is how Datahike works. It isn’t a bolted-on feature - it falls out of two properties fundamental to the architecture.
Databases are values
In a traditional database, you query through a connection to a running server. The data may change between queries. The database is a service, not something you hold.
Datahike inverts this. Dereference a connection (@conn) and you get an immutable database value - a snapshot frozen at a specific transaction. It won’t change. Pass it to a function, hold it in a variable, hand it to another thread. Two concurrent readers holding the same snapshot always agree, without locks or coordination.
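A quick REPL sketch of this property, using Datahike’s in-memory backend and schema-on-read (the same configuration as the full example at the end of this post; the `:user/name` attribute is just for illustration):

```clojure
(require '[datahike.api :as d])

(def cfg {:store {:backend :memory :id (java.util.UUID/randomUUID)}
          :schema-flexibility :read})
(d/create-database cfg)
(def conn (d/connect cfg))

(def before @conn)                      ; immutable snapshot, frozen now
(d/transact conn [{:user/name "Ada"}])  ; the writer moves on

;; the old snapshot still answers from its frozen point in time
(d/q '[:find ?n :where [_ :user/name ?n]] before)  ;; => #{}
(d/q '[:find ?n :where [_ :user/name ?n]] @conn)   ;; => #{["Ada"]}
```

`before` can be passed around, cached, or queried from another thread; nothing will ever change it.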
This is an idea Rich Hickey introduced with Datomic in 2012: separate process (writes, managed by a single writer) from perception (reads, which are just values). The insight was that a correct implementation of perception does not require coordination.
Datomic’s indices live in storage, but its transactor holds an in-memory overlay of recent index segments that haven’t been flushed yet. Readers typically need to coordinate with the transactor to get a complete, current view. The storage alone isn’t enough.
Datahike removes that dependency. The writer flushes to storage on every transaction, so storage is always authoritative. Any process that can read the store sees the full, current database - no overlay, no transactor connection needed. To understand why this works, you need to see how the data is structured.
Trees in storage
Datahike keeps its indices in a persistent sorted set - a B-tree variant where nodes are immutable. Every node is stored as a key-value pair in konserve, which abstracts over storage backends: S3, filesystem, JDBC, IndexedDB.
When a transaction adds data, Datahike doesn’t modify existing nodes. It creates new nodes for the changed path from leaf to root, while the unchanged subtrees are shared with the previous version. This is structural sharing - the same technique behind Clojure’s persistent vectors and Git’s object store.
A concrete example: a database with a million datoms might have a B-tree with thousands of nodes. A transaction that adds ten datoms rewrites perhaps a dozen nodes along the affected paths. The new tree root points to these new nodes and to the thousands of unchanged nodes from before. Both the old and new snapshots are valid, complete trees. They just share most of their structure.
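Clojure’s persistent vectors, mentioned above, demonstrate the same path-copying at the REPL:

```clojure
;; Structural sharing in Clojure's persistent vectors - the same
;; path-copying technique Datahike's persistent sorted set uses:
(def v1 (vec (range 1000000)))
(def v2 (assoc v1 0 :changed))   ; rewrites only the path to index 0

;; both versions are complete, valid values...
(first v1)  ;; => 0
(first v2)  ;; => :changed
;; ...and the untouched tail is literally the same structure in memory
(identical? (peek v1) (peek v2))  ;; => true
```

The `identical?` check shows the two versions sharing an untouched leaf, not a copy of it.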
The crucial property: every node is written once and never modified. Its key can be content-addressed. This means nodes can be cached aggressively, replicated independently, and read by any process with access to the storage - without coordinating with the process that wrote them. (For more on structural sharing, branching, and the tradeoffs involved, see The Git Model for Databases.)
The distributed index space
This is where it comes together.
When you call @conn, Datahike fetches one key from the konserve store: the branch head (e.g. :db). This returns a small map containing root pointers for each index, schema metadata, and the current transaction ID. Nothing else is loaded - the database value you receive is a lazy handle into the tree.
When a query traverses the index, each node is fetched on demand from storage and cached in a local LRU. Subsequent queries hitting the same nodes pay no I/O.
That’s the entire read path. No server process mediating access, no connection protocol, no port to expose. The indices live in storage, and any process that can read the storage can load the branch head, traverse the tree, and run queries. We call this the distributed index space.
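A minimal sketch of why this caching is safe (hypothetical names, not Datahike’s actual internals): because nodes are immutable, a cache hit is always valid - there is nothing to invalidate.

```clojure
;; Illustrative sketch (hypothetical names, not Datahike internals):
;; immutable nodes make the cache trivially correct - entries never go stale.
(def store {:db      {:eavt-root "node-7"}        ; branch head: roots + metadata
            "node-7" {:datoms [[1 :name "Ada"]]}}) ; tree nodes, keyed by id

(def cache (atom {}))                              ; stand-in for the LRU

(defn fetch-node [k]
  (or (get @cache k)                               ; hit: no I/O, always valid
      (let [node (get store k)]                    ; miss: one read from storage
        (swap! cache assoc k node)
        node)))

;; load the branch head, then follow a root pointer on demand
(fetch-node (:eavt-root (fetch-node :db)))  ;; => {:datoms [[1 :name "Ada"]]}
```

In the real system the `store` lookup is konserve I/O against S3, a file, or JDBC; the shape of the logic is the same.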
Two processes reading the same database fetch the same immutable nodes independently. They don’t know about each other. A writer publishes new snapshots by writing new tree nodes, then atomically updating the branch head. Readers that dereference afterward see the new snapshot. Readers holding an earlier snapshot continue undisturbed - their nodes are immutable and won’t be garbage collected while reachable.
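The publication step can be sketched with an atom standing in for the branch head (illustrative only, not Datahike internals): new tree nodes are persisted first, then the head flips atomically, and snapshots captured earlier are untouched.

```clojure
;; Illustrative sketch of snapshot publication (not Datahike internals):
(def branch-head (atom {:root "node-a" :tx 41}))

(def reader-view @branch-head)                ; a reader captures a snapshot

(reset! branch-head {:root "node-b" :tx 42})  ; writer publishes atomically

reader-view   ;; => {:root "node-a", :tx 41} - undisturbed
@branch-head  ;; => {:root "node-b", :tx 42} - new readers see this
```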
Joining across databases
Because databases are values and Datalog natively supports multiple input sources, the next step is natural: join databases from different teams, different storage backends, or different points in time - in a single query.
Team A maintains a product catalog on S3. Team B maintains inventory on a separate bucket. A third team joins them without either team doing anything:
(def catalog (d/connect {:store {:backend :s3 :bucket "team-a"}}))
(def inventory (d/connect {:store {:backend :s3 :bucket "team-b"}}))

(d/q '[:find ?name ?price ?stock
       :in $cat $inv
       :where [$cat ?p :product/sku ?sku]
              [$cat ?p :product/name ?name]
              [$cat ?p :product/price ?price]
              [$inv ?i :stock/sku ?sku]
              [$inv ?i :stock/count ?stock]
              [(> ?stock 0)]]
     @catalog @inventory)
Each @ dereference fetches a branch head from its respective S3 bucket and returns an immutable database value. The query engine joins them locally. There is no server coordinating between the two, no data copied.
And because both are values, you can mix snapshots from different points in time:
;; Last quarter's catalog crossed with current inventory
(def old-catalog (d/as-of @catalog #inst "2025-11-01"))

(d/q '[:find ?name ?stock
       :in $cat $inv
       :where [$cat ?p :product/sku ?sku]
              [$cat ?p :product/name ?name]
              [$inv ?i :stock/sku ?sku]
              [$inv ?i :stock/count ?stock]]
     old-catalog @inventory)
The old snapshot and the current one are both just values. The query engine doesn’t care when they’re from. This is useful for audits, regulatory reproducibility, and debugging: “what would this report have shown against last quarter’s data?”
From storage to browsers
So far, “storage” has meant S3 or a filesystem. But konserve also has an IndexedDB backend, which means the same model works in a browser. Using Kabel WebSocket sync and konserve-sync, a browser client replicates a database locally into IndexedDB. Queries run against the local replica with zero network round-trips. Updates sync differentially: only changed tree nodes are transmitted, so the same structural sharing that makes snapshots cheap on the server makes sync cheap over the wire.
Try it
A complete cross-database join, runnable in a Clojure REPL:
(require '[datahike.api :as d])

;; Two independent databases
(def catalog-cfg {:store {:backend :memory
                          :id (java.util.UUID/randomUUID)}
                  :schema-flexibility :read})
(def inventory-cfg {:store {:backend :memory
                            :id (java.util.UUID/randomUUID)}
                    :schema-flexibility :read})

(d/create-database catalog-cfg)
(d/create-database inventory-cfg)

(def catalog (d/connect catalog-cfg))
(def inventory (d/connect inventory-cfg))

;; Team A: products
(d/transact catalog
            [{:product/sku "W001" :product/name "Widget" :product/price 9.99}
             {:product/sku "G002" :product/name "Gadget" :product/price 24.50}
             {:product/sku "T003" :product/name "Thingamajig" :product/price 3.75}])

;; Team B: stock levels
(d/transact inventory
            [{:stock/sku "W001" :stock/count 140}
             {:stock/sku "G002" :stock/count 0}
             {:stock/sku "T003" :stock/count 58}])

;; Join: in-stock products with price
(d/q '[:find ?name ?price ?stock
       :in $cat $inv
       :where [$cat ?p :product/sku ?sku]
              [$cat ?p :product/name ?name]
              [$cat ?p :product/price ?price]
              [$inv ?i :stock/sku ?sku]
              [$inv ?i :stock/count ?stock]
              [(> ?stock 0)]]
     @catalog @inventory)
;; => #{["Widget" 9.99 140] ["Thingamajig" 3.75 58]}
Replace :memory with :s3, :file, or :jdbc and the same code works across storage backends. The databases don’t need to share a backend - join an S3 database against a local file store in the same query.
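For instance, pointing the catalog at a local file store is only a config change (the path here is illustrative):

```clojure
;; Same query code, different backend - only the store map changes:
(def catalog-cfg {:store {:backend :file :path "/tmp/catalog-db"}
                  :schema-flexibility :read})
```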