DuckDB Doesn't Need Data to Be a Database

Original link: https://www.nikolasgoebel.com/2024/05/28/duckdb-doesnt-need-data.html

Simply put, a database like DuckDB lets users create a virtual database by storing instructions for how to access remote data, rather than storing the data itself. For example, imagine a robotaxi service managing a large dataset of daily ride patterns stored in Amazon S3. Sharing such a dataset by emailing it to analysts, or by sending many links to individual files, is impractical: it is too big, it goes stale immediately, and there is no single thing to link to. Instead, you create a lightweight local database file, such as a DuckDB file, that connects to the remote S3 data and extracts specific information with a given SQL command. The virtual database gives recipients a single link through which to connect to and query the data, reducing complexity and improving efficiency. Queries against it deliver exactly the data the user needs, without downloading the entire remote dataset. Moreover, the approach stays compatible even as the structure or organization of the underlying data changes. In essence, with DuckDB as a gateway, relational datasets in the data cloud are just a hyperlink away.

Consider a team building a form-builder SaaS. For smaller forms, users browse responses via infinite scrolling, loading data on demand. Larger forms, however, can contain more than 50,000 responses, causing heavy network usage and memory pressure. By storing these large response sets as Parquet files in S3, the team can use DuckDB and DuckDB-WASM to load the dataset locally; users then interact with a local database instead of fetching rows from the server as they scroll. This simplifies the process, improves efficiency, and keeps performance consistent even with large data volumes. For services handling moderately sized data, this approach is worth trying to save network bandwidth and reduce memory requirements.

Original text

One of the many enjoyable things about databases is that they generally try to separate how data is represented internally (say on disk) from how it is used. To the point that it has become the norm to not even store the data on the same hardware that is running the queries.

Databases have gotten so good at this that the term is almost misleading now. “Base” suggests something rigid, without which the data would slip away. But the data is always there, just bits on a nameless hard disk. The structure and the accessibility that a modern database provides exist completely independently from that hard disk. That’s right – most databases no longer have any data in them.

DuckDB is a database built for this age, and a particularly lovely one at that.

Say you run a robotaxi service. And you’re maintaining a growing dataset of interesting ride patterns in blob storage that you’d like to understand better. The data is split into a separate Parquet file for each day. How do you share that dataset with an analyst?

You could just mail them everything, but the dataset is getting way too big for that. Plus it’d be out of date instantly. If this was a blogpost you would just send a link, but there isn’t really a single thing to link to here. And sending someone a hundred links to raw blobs in S3 can sour any working relationship.

Instead you whip up a little database, just for them:

# Send
import duckdb
db = duckdb.connect("weird_rides.db")
db.sql("""
    CREATE VIEW weird_rides
    AS SELECT pickup_at, dropoff_at, trip_distance, total_amount
    FROM 's3://robotaxi-inc/daily-ride-data/*.parquet'
    WHERE fare_amount > 100 AND trip_distance < 10.0
""")
db.close()

This produces a tiny file called weird_rides.db. It contains none of the actual data. What it does contain are the above instructions for how to work with this pile of blobs, as if it were a single table of relevant data points.

You publish this database file to blob storage as well, right next to your data. Now you have a single thing to link to: s3://robotaxi-inc/virtual-datasets/weird_rides.db.

Instead of opening a web browser, the recipient of the link starts their own local DuckDB session and attaches to the referenced database.

# Receive
import duckdb
conn = duckdb.connect()
conn.sql("""
ATTACH 's3://robotaxi-inc/virtual-datasets/weird_rides.db'
AS rides_db (READ_ONLY)
""")
conn.sql("SELECT * FROM rides_db.weird_rides LIMIT 5")
# ┌─────────────────────┬─────────────────────┬───────────────┬──────────────┐
# │      pickup_at      │     dropoff_at      │ trip_distance │ total_amount │
# │      timestamp      │      timestamp      │     float     │    float     │
# ├─────────────────────┼─────────────────────┼───────────────┼──────────────┤
# │ 2019-04-01 00:03:20 │ 2019-04-01 00:03:54 │           0.0 │       240.35 │
# │ 2019-04-01 00:16:16 │ 2019-04-01 00:16:21 │           0.0 │       138.36 │
# │ 2019-04-01 02:01:22 │ 2019-04-01 02:01:28 │           0.0 │       192.96 │
# │ 2019-04-01 06:26:44 │ 2019-04-01 06:27:14 │           0.0 │        115.3 │
# │ 2019-04-01 07:28:12 │ 2019-04-01 07:28:12 │           0.0 │        127.2 │
# └─────────────────────┴─────────────────────┴───────────────┴──────────────┘

This query is the first time that any data has to be downloaded from S3. DuckDB supports partial reading, which means that only the columns used in the definition of the weird_rides view have to be fetched, and that filters like fare_amount > 100 can be used to discard even more irrelevant data during the download.

From the perspective of the receiver, this way of accessing the data will remain working unchanged, almost no matter what happens to the underlying data. Changes in format, different partitioning strategies, schema changes – through all of it the receiver’s view remains the same.

With DuckDB as a browser for the data cloud, relational datasets are always just a hyperlink away.
