DuckDB as the New jq

原始链接: https://www.pgrs.net/2024/03/21/duckdb-as-the-new-jq/

This is a personal blog post by software engineer Paul Gross. He discusses his interest in the DuckDB project, whose built-in features let him process JSON data efficiently and simply instead of reaching for a complex tool like jq. Given his extensive experience with JSON, he finds the convenience of SQL queries through DuckDB, which he already knows well, preferable to learning jq's powerful but complicated syntax. He walks through extracting statistics about open-source licenses from a JSON file fetched from the GitHub API, using both DuckDB and jq, and compares their usability. He also mentions that DuckDB supports importing various data formats besides JSON, such as CSV and Parquet, without requiring persistent storage. Finally, he points out that the approach also works directly on URLs.

I have found ETL tools such as Talend, Trifacta, Alteryx, and Easy Data Transform very helpful for handling a wide range of data sources and formats. These tools let you manipulate and clean data visually, perform complex transformations, and load the results into databases, spreadsheets, or other formats without writing a lot of custom scripts. While the licensing cost of the commercial products may deter some users, open-source alternatives exist and offer a reasonable entry point for exploring their capabilities; popular options include Talend Open Studio, Trifacta Community Edition, and Easy Data Transform. These solutions also integrate well with cloud platforms such as AWS Glue, Microsoft Azure Databricks, and Google Cloud Data Engineering. Using them lets you draw on built-in libraries, advanced transformation features, and drag-and-drop interfaces for efficient data-engineering work, and many support extensions and plugins that broaden their functionality further. With these tools you can reduce complexity, save development time, and focus on business logic and analysis rather than getting lost in intricate data-wrangling tasks.

Original

Recently, I’ve been interested in the DuckDB project (like a SQLite geared towards data applications). And one of the amazing features is that it has many data importers included without requiring extra dependencies. This means it can natively read and parse JSON as a database table, among many other formats.
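
As a minimal sketch of what that looks like (events.json is a hypothetical file name, not one from this post):

% duckdb -c "select * from read_json_auto('events.json') limit 5"

read_json_auto infers the schema from the document itself, so each top-level object becomes a row and each key becomes a column.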

I work extensively with JSON day to day, and I often reach for jq when exploring documents. I love jq, but I find it hard to use. The syntax is super powerful, but I have to study the docs anytime I want to do anything beyond just selecting fields.

Once I learned DuckDB could read JSON files directly into memory, I realized that I could use it for many of the things where I’m currently using jq. In contrast to the complicated and custom jq syntax, I’m very familiar with SQL and use it almost daily.

Here’s an example:

First, we fetch some sample JSON to play around with. I used the GitHub API to grab the repository information from the golang org:

% curl 'https://api.github.com/orgs/golang/repos' > repos.json

Now, as a sample question to answer, let’s get some stats on the types of open source licenses used.

The JSON structure looks like this:

[
  {
    "id": 1914329,
    "name": "gddo",
    "license": {
      "key": "bsd-3-clause",
      "name": "BSD 3-Clause \"New\" or \"Revised\" License",
      ...
    },
    ...
  },
  {
    "id": 11440704,
    "name": "glog",
    "license": {
      "key": "apache-2.0",
      "name": "Apache License 2.0",
      ...
    },
    ...
  },
  ...
]

This might not be the best way, but here is what I cobbled together after searching and reading some docs for how to do this in jq:

% cat repos.json | jq \
  'group_by(.license.key)
  | map({license: .[0].license.key, count: length})
  | sort_by(.count)
  | reverse'
[
  {
    "license": "bsd-3-clause",
    "count": 23
  },
  {
    "license": "apache-2.0",
    "count": 5
  },
  {
    "license": null,
    "count": 2
  }
]

And here is what it looks like in DuckDB using SQL:

% duckdb -c \
  "select license->>'key' as license, count(*) as count \
  from 'repos.json' \
  group by 1 \
  order by count desc"
┌──────────────┬───────┐
│   license    │ count │
│   varchar    │ int64 │
├──────────────┼───────┤
│ bsd-3-clause │    23 │
│ apache-2.0   │     5 │
│              │     2 │
└──────────────┴───────┘

For me, this SQL is much simpler and I was able to write it without looking at any docs. The only tricky part is querying nested JSON with the ->> operator. The syntax is the same as the PostgreSQL JSON Functions, however, so I was familiar with it.
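
As a quick illustration (a contrived one-liner, not from the original post): -> extracts a value as JSON, while ->> extracts it as text, and the two can be chained to walk nested documents:

% duckdb -c "select '{\"license\": {\"key\": \"mit\"}}'::json->'license'->>'key' as key"

This returns the string mit; ending the chain with -> instead would return the JSON value "mit", quotes included.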

And if we do need the output in JSON, there’s a DuckDB flag for that:

% duckdb -json -c \
  "select license->>'key' as license, count(*) as count \
  from 'repos.json' \
  group by 1 \
  order by count desc"
[{"license":"bsd-3-clause","count":23},
{"license":"apache-2.0","count":5},
{"license":null,"count":2}]

We can even still pretty-print with jq at the end, after using DuckDB to do the heavy lifting:

% duckdb -json -c \
  "select license->>'key' as license, count(*) as count \
  from 'repos.json' \
  group by 1 \
  order by count desc" \
  | jq
[
  {
    "license": "bsd-3-clause",
    "count": 23
  },
  {
    "license": "apache-2.0",
    "count": 5
  },
  {
    "license": null,
    "count": 2
  }
]

JSON is just one of the many ways of importing data into DuckDB. This same approach would work for CSVs, Parquet, Excel files, etc.
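
For instance (data.csv and data.parquet are hypothetical file names), the same bare-filename shorthand works for other formats, with DuckDB picking the reader based on the file extension:

% duckdb -c "select count(*) from 'data.csv'"
% duckdb -c "select count(*) from 'data.parquet'"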

And I could choose to create tables and persist locally, but often I’m just interrogating data and don’t need the persistence.
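
If persistence were wanted, a sketch might look like this (repos.db is a hypothetical database file; DuckDB creates it if it doesn't exist):

% duckdb repos.db -c "create table repos as select * from 'repos.json'"
% duckdb repos.db -c "select count(*) from repos"

The first command materializes the JSON into a table stored in repos.db; the second reopens that file and queries the stored table.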

Read more about DuckDB’s great JSON support in this blog post: Shredding Deeply Nested JSON, One Vector at a Time

Update:

I also learned that DuckDB can read the JSON directly from a URL, not just a local file:

% duckdb -c \
  "select license->>'key' as license, count(*) as count \
  from read_json('https://api.github.com/orgs/golang/repos') \
  group by 1 \
  order by count desc"