Why DuckDB is my first choice for data processing

Original link: https://www.robinlinacre.com/recommend_duckdb/

## DuckDB: A Powerful Engine for Modern Data Processing

DuckDB is rapidly becoming the author's data processing tool of choice, especially from Python, reflecting the shift of most tabular data analysis towards single-machine analytics. This open-source, in-process SQL engine excels at analytical queries, joins and aggregations, and is often 100-1,000 times faster than alternatives such as SQLite or PostgreSQL.

Its main strength is simplicity: it is easy to install (a single binary, pip-installable), and its ergonomic SQL dialect offers innovative features such as `EXCLUDE` and function chaining, with near-zero startup time that makes it well suited to CI/CD and rapid testing. DuckDB can query data directly in many formats (CSV, Parquet, JSON, even web URLs) and supports complex pipelines built from common table expressions (CTEs), which are easy to debug.

Beyond speed, DuckDB offers full ACID compliance for bulk operations, potentially rivalling lakehouse formats, and allows high-performance custom functions via C++ extensions. Its growing ecosystem, including PostgreSQL integration, makes it a versatile and increasingly capable solution for modern data workflows. The author's experience with Splink shows that DuckDB has increased adoption, reduced user problems, and sped up development.

A Hacker News discussion highlighted DuckDB as an excellent choice for data processing, particularly for developers. The original article (from robinlinacre.com) resonated with commenters, many of whom praised its versatility. The benefits mentioned include being able to query CSV and JSON files easily with SQL, going beyond the simplicity of tools like `awk` for more complex tasks such as joining and aggregating data. DuckDB's embeddability, including running in web browsers via WebAssembly, opens up possibilities for powerful in-app analytics and for tools such as Marimo notebooks.

One user suggested DuckDB could be a simpler alternative to Citus for columnar support in PostgreSQL. Another mentioned an in-progress Android build, raising the possibility of a Java API similar to SQLite's. Overall, the discussion reflects DuckDB's growing popularity as a powerful and flexible data processing solution.

Original article

Over the past few years, I've found myself using DuckDB more and more for data processing, to the point where I now use it almost exclusively, usually from within Python.

We're moving towards a simpler world where most tabular data can be processed on a single large machine, and the era of clusters is coming to an end for all but the largest datasets.

This post sets out some of my favourite features of DuckDB that set it apart from other SQL-based tools. In a nutshell, it's simple to install, ergonomic, fast, and more fully featured.

An earlier post explains why I favour SQL over other APIs such as Polars, pandas or dplyr.

DuckDB is an open source in-process SQL engine that is optimised for analytics queries.

  • 'In-process' means it's similar to SQLite in that it runs within your application. You don't need to start a separate service such as Postgres to run it.
  • 'Optimised for analytics queries' means that it's designed for operations like joins and aggregations involving large numbers of rows, as opposed to atomic transactions.

The performance difference of analytics-optimised engines (OLAP) vs. transactions-optimised engines (OLTP) should not be underestimated. A query running in DuckDB can be 100 or even 1,000 times faster than exactly the same query running in (say) SQLite or Postgres.

A core use-case of DuckDB is where you have one or more large datasets on disk in formats like csv, parquet or json which you want to batch process. You may want to perform cleaning, joins, aggregation, derivation of new columns - that sort of thing.

But you can also use DuckDB for many other simpler tasks like viewing a csv file from the command line.
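
As a rough sketch of that batch-processing use case, assuming an invented set of order files and column names, a cleaning-and-conversion job might look like this in Python:

```python
import duckdb

# Read a folder of csv files, apply some light cleaning, and write the
# result out as parquet. File and column names are made up for illustration.
duckdb.sql("""
    COPY (
        SELECT
            trim(customer_name) AS customer_name,
            CAST(order_date AS DATE) AS order_date,
            amount
        FROM read_csv('orders_*.csv')
        WHERE amount IS NOT NULL
    )
    TO 'orders_clean.parquet' (FORMAT parquet)
""")
```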

DuckDB consistently benchmarks as one of the fastest data processing engines. The benchmarks I've seen show there's not much in it between the leading open source engines - which at the moment seem to be Polars, DuckDB, DataFusion, Spark and Dask. Spark and Dask can be competitive on large data, but slower on small data.

DuckDB itself is a single precompiled binary. In Python, it can be pip installed with no dependencies. This makes it a joy to install compared to other more heavyweight options like Spark. Combined with uv, you can stand up a fresh DuckDB Python environment from nothing in less than a second - see here.

With its speed and almost-zero startup time, DuckDB is ideally suited for CI and testing of data engineering pipelines.

Historically this has been fiddly and running a large suite of tests in e.g. Apache Spark has been time consuming and frustrating. Now it's much simpler to set up the test environment, and there's less scope for differences between it and your production pipelines.
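
For illustration, a test of a single transformation step might look something like the sketch below. The dedupe_emails function, the table and the data are all hypothetical; the point is that the whole thing runs in memory in milliseconds.

```python
import duckdb


def dedupe_emails(table_name: str) -> duckdb.DuckDBPyRelation:
    # Normalise and deduplicate an email column in the named table.
    return duckdb.sql(f"""
        SELECT lower(trim(email)) AS email
        FROM {table_name}
        GROUP BY 1
    """)


def test_dedupe_emails():
    # Build a tiny in-memory input table, then check the output.
    duckdb.sql("""
        CREATE OR REPLACE TABLE raw_emails AS
        SELECT * FROM (VALUES (' A@B.COM '), ('a@b.com')) AS t(email)
    """)
    assert dedupe_emails("raw_emails").fetchall() == [("a@b.com",)]
```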

This simplicity and speed also applies to writing new SQL, and getting syntax right before running it on a large dataset. Historically I have found this annoying in engines like Spark (where it takes a few seconds to start Spark in local mode), or even worse when you're forced to run queries in a proprietary tool like AWS Athena.

There's even a DuckDB UI with autocomplete - see here.

The DuckDB team has implemented a wide range of innovations in its SQL dialect that make it a joy to use. See the series of posts on the DuckDB blog covering these features.

Some of my favourites are the EXCLUDE keyword, and the COLUMNS keyword which allows you to select and regex-replace a subset of columns. I also like QUALIFY and the aggregate modifiers on window functions, see here.

Another is the ability to function chain, like first_name.lower().trim().
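
To make these concrete, here is a small sketch using an invented customers table; the SQL is mine rather than anything from the post, but it exercises EXCLUDE, QUALIFY, COLUMNS and function chaining:

```python
import duckdb

# An invented table with a duplicate customer to illustrate the features.
duckdb.sql("""
    CREATE OR REPLACE TABLE customers AS
    SELECT * FROM (VALUES
        (1, '  Alice ', '2024-01-01'::DATE),
        (1, ' alice  ', '2024-02-01'::DATE),
        (2, 'Bob',      '2024-01-15'::DATE)
    ) AS t(customer_id, first_name, updated_at)
""")

duckdb.sql("""
    SELECT
        * EXCLUDE (updated_at),                  -- every column except one
        first_name.trim().lower() AS name_clean  -- function chaining
    FROM customers
    QUALIFY row_number() OVER (                  -- keep the latest row per customer
        PARTITION BY customer_id ORDER BY updated_at DESC
    ) = 1
""").show()

# COLUMNS applies an expression to a regex-selected subset of columns.
duckdb.sql("SELECT max(COLUMNS('.*_at')) FROM customers").show()
```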

You can query data directly from files, including on S3 or on the web.

For example to query a folder of parquet files:

```sql
select *
from read_parquet('path/to/*.parquet')
```

or even (on CORS-enabled files) you can run SQL directly:

```sql
select *
from read_parquet('https://raw.githubusercontent.com/plotly/datasets/master/2015_flights.parquet')
limit 2;
```

Click here to try this query yourself in the DuckDB web shell.

One of the easiest ways to cause problems in your data pipelines is to fail to be strict about incoming data types from untyped formats such as csv. DuckDB provides lots of options for this - see here.
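
For example, a sketch of pinning the schema of an incoming csv rather than trusting auto-detection (the file and column names are invented):

```python
import duckdb

# Explicit column types: a value that cannot be cast fails loudly at read
# time instead of being silently loaded with an auto-detected type.
orders = duckdb.sql("""
    SELECT *
    FROM read_csv(
        'orders.csv',
        header = true,
        columns = {
            'order_id':   'BIGINT',
            'order_date': 'DATE',
            'amount':     'DOUBLE'
        }
    )
""")
```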

Many data pipelines effectively boil down to a long sequence of CTEs:

```sql
WITH
input_data AS (
    SELECT * FROM read_parquet('...')
),
step_1 AS (
    SELECT ... FROM input_data JOIN ...
),
step_2 AS (
    SELECT ... FROM step_1
)
SELECT ... FROM step_2;
```

When developing a pipeline like this, we often want to inspect what's happened at each step.

In Python, we can write

```python
import duckdb

input_data = duckdb.sql("SELECT * FROM read_parquet('...')")
step_1 = duckdb.sql("SELECT ... FROM input_data JOIN ...")
step_2 = duckdb.sql("SELECT ... FROM step_1")
final = duckdb.sql("SELECT ... FROM step_2;")
```

This makes it easy to inspect what the data looks like at step_2 with no performance loss, since these steps will be executed lazily when they're run all at once.

This also facilitates easier testing of SQL in CI, since each step can be an independently-tested function.
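
One way to package that pattern, sketched here with placeholder logic, is to make each step a plain Python function that takes and returns a DuckDB relation; the relation passed in is visible to the SQL by its variable name via DuckDB's replacement scans:

```python
import duckdb


def step_1(input_data: duckdb.DuckDBPyRelation) -> duckdb.DuckDBPyRelation:
    # 'input_data' below refers to the function argument: DuckDB resolves
    # unknown table names against Python variables in scope.
    return duckdb.sql("SELECT * FROM input_data WHERE amount > 0")


input_data = duckdb.sql("SELECT 1 AS id, 10.0 AS amount")
step_1(input_data).show()  # inspect the intermediate result directly
```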

DuckDB offers full ACID compliance for bulk data operations, which sets it apart from other analytical data systems - see here. You can hear more about this in this podcast, transcribed here.

This is a very interesting new development, making DuckDB potentially a suitable replacement for lakehouse formats such as Iceberg or Delta Lake for medium-scale data.
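
I won't attempt to reproduce the details here, but as a rough sketch, a transactional bulk write into a persistent DuckDB database file looks like this (the database file, table and parquet path are placeholders):

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")  # persistent database file
con.execute("BEGIN TRANSACTION")
con.execute("""
    CREATE OR REPLACE TABLE daily_orders AS
    SELECT * FROM read_parquet('orders_2024-01-01.parquet')
""")
con.execute("COMMIT")  # the whole load becomes visible atomically, or not at all
con.close()
```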

A longstanding problem with data processing engines has been the difficulty of writing high-performance user-defined functions (UDFs).

For example, in PySpark, you will generally get the best performance by writing custom Scala, compiling it to a JAR, and registering it with Spark. But this is cumbersome and, in practice, you will encounter a lot of issues around Spark version compatibility and security restrictions in environments such as Databricks.

In DuckDB, high-performance custom UDFs can be written in C++. Whilst writing these functions is certainly not trivial, DuckDB community extensions offer a low-friction way of distributing the code. Community extensions can be installed almost instantly with a single command such as INSTALL h3 FROM community to install hierarchical hexagonal indexing for geospatial data.
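
For illustration, installing and calling a community extension from Python might look like the sketch below; I'm assuming the h3 extension exposes an h3_latlng_to_cell function, so treat the function name and arguments as an assumption rather than a reference:

```python
import duckdb

duckdb.sql("INSTALL h3 FROM community")
duckdb.sql("LOAD h3")

# Assumed API: convert a lat/lng to an H3 cell at resolution 9.
duckdb.sql("SELECT h3_latlng_to_cell(51.5074, -0.1278, 9) AS cell").show()
```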

The team publishes the documentation as a single markdown file so it can easily be passed to an LLM.

My top tip: if you load this file in your code editor, and use code folding, it's easy to copy the parts of the documentation you need into context.

Much of this blog post is based on my experience supporting multiple SQL dialects in Splink, an open source library for record linkage at scale. We've found that transitioning towards recommending DuckDB as the default backend choice has increased adoption of the library and significantly reduced the number of problems faced by users, even for large linkage tasks, whilst speeding up workloads very substantially.

We've also found it's hugely increased the simplicity and speed of developing and testing new features.

  • The PostgreSQL extension allows you to attach a Postgres database and query it directly from DuckDB (a rough sketch follows this list).
  • pg_duckdb allows you to embed the DuckDB computation engine within Postgres.
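
A rough sketch of the first of these, attaching a Postgres database from DuckDB (the connection string and table name are placeholders):

```python
import duckdb

duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")

# Attach a Postgres database under the alias 'pg', then query it with DuckDB.
duckdb.sql("ATTACH 'dbname=analytics host=localhost user=me' AS pg (TYPE postgres)")
duckdb.sql("SELECT count(*) FROM pg.public.orders").show()
```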

The latter in particular seems potentially extremely powerful, enabling Postgres to be simultaneously optimised for analytics and transactional processing. I think it's likely to see widespread adoption, especially after they iron out a few of the current shortcomings around enabling and optimising the use of Postgres indexes and pushing filters up to Postgres.
