Preview: Amazon S3 Tables and Lakehouse in DuckDB

Original link: https://duckdb.org/2025/03/14/preview-amazon-s3-tables.html

On March 14, 2025, DuckDB announced a preview feature that lets users connect to Apache Iceberg REST catalogs, including Amazon S3 Tables and Amazon SageMaker Lakehouse. The extension builds on AWS's newly added support for Iceberg tables in S3 Tables and SageMaker Lakehouse, providing an end-to-end solution for reading Iceberg data. Using the feature requires DuckDB version 1.2.1 together with the "bleeding edge" versions of the `aws`, `httpfs`, and `iceberg` extensions from the `core_nightly` repository. To connect, set up AWS credentials through the Secrets Manager or provide them manually, then use the ATTACH command with an S3 Tables ARN or the SageMaker Lakehouse endpoint. The integration supports schema evolution, reflecting changes made to a table's structure. Although currently experimental and under active development, a stable release is planned for later this year. The feature simplifies querying Iceberg tables in AWS and provides a unified view of tabular data across different services.

Hacker News is discussing the integration between Amazon S3 Tables and DuckDB. An article about the combination sparked interest, with commenters praising the pairing of Parquet files in S3 with the DuckDB engine. One user had not yet tried S3 Tables but was intrigued. However, they also raised concerns about practicality, highlighting potential integration issues with other AWS services since the services are managed by different teams, and citing past integration problems as examples. They questioned how it differs from existing table offerings in the AWS ecosystem. Overall, the discussion reflects excitement about the potential of S3 Tables and DuckDB, tempered by concerns about possible challenges within the AWS environment.

Original article
Sam Ansmink, Tom Ebergen, Gabor Szarnyas

Published on 2025-03-14

TL;DR: We are happy to announce a new preview feature that adds support for Apache Iceberg REST Catalogs, enabling DuckDB users to connect to Amazon S3 Tables and Amazon SageMaker Lakehouse with ease.

In recent years, the Iceberg open table format has become increasingly popular. Major data warehouse platforms such as Databricks, Snowflake, Google BigQuery and AWS have all announced or already implemented support for Iceberg tables. These platforms also support Iceberg catalogs, which are responsible for tracking current metadata for a collection of Iceberg tables grouped by namespaces.

DuckDB has supported reading Iceberg tables since September 2023 via the iceberg extension. Today, we are happy to introduce a new preview feature in this extension, which allows attaching to Iceberg REST catalogs. This preview release coincides with two AWS announcements yesterday: support for Iceberg tables in Amazon S3 Tables and the GA release of the integration between S3 Tables and SageMaker Lakehouse (AWS Glue Data Catalog). In practice, these developments mean that DuckDB now provides an end-to-end solution for reading Iceberg tables in S3 Tables and SageMaker Lakehouse.

DuckDB's support for Iceberg REST Catalog endpoints in Amazon S3 Tables is the result of a collaboration between AWS and DuckDB Labs.

Steps for Installing

To connect to Apache Iceberg REST Catalogs in DuckDB, make sure you are running the latest stable DuckDB release (version 1.2.1). For our example steps, we'll use the DuckDB CLI client. You can obtain this client from the installation page and start it with:
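
duckdb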

Next, we need to install the “bleeding edge” versions of the required extensions from the core_nightly repository.

FORCE INSTALL aws FROM core_nightly;
FORCE INSTALL httpfs FROM core_nightly;
FORCE INSTALL iceberg FROM core_nightly;

For more information on using the core_nightly repository, please see the notes at the end of the post.
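
One way to double-check the installed versions is to query the duckdb_extensions() table function (the exact set of columns may differ slightly between DuckDB versions):

SELECT extension_name, installed, extension_version
FROM duckdb_extensions()
WHERE extension_name IN ('aws', 'httpfs', 'iceberg');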

With these extensions installed, your DuckDB is now capable of using Apache Iceberg REST Catalogs. Let's find some data.

Setting up an Amazon S3 Table Bucket

(If you already have Iceberg tables in Amazon S3 Tables, you can skip to the “Reading Iceberg Catalogs with DuckDB” section.)

In this post, we demonstrate how to read data from Amazon S3 Tables. To follow along, make sure that your account has s3tables permissions and create a new S3 table bucket. Note that Amazon S3 Tables is currently only supported in selected AWS regions.

Populating an Amazon S3 Table Bucket

If you don't already have an S3 table bucket with tables, we found that the easiest way to get going is to create a table using Amazon Athena. See their instructions. For our example, we created a simple table with three columns using the Athena query editor:

CREATE TABLE duck_species (
    id INT,
    english_name STRING,
    latin_name STRING
) TBLPROPERTIES ('table_type' = 'ICEBERG');

Let's insert some data into the table:

INSERT INTO duck_species VALUES
    (0, 'Snow duck', 'Anas nivis');

Reading Amazon S3 Tables with DuckDB

Querying S3 Tables with DuckDB is really easy. The first step is to get your AWS credentials into DuckDB. You can achieve this in two ways. First, you can let DuckDB detect your AWS credentials and configuration based on the default profile in your ~/.aws directory by creating the following secret using the Secrets Manager:

CREATE SECRET (
    TYPE s3,
    PROVIDER credential_chain
);

Alternatively, you can set the AWS key, secret, and region values manually. For example:

CREATE SECRET (
    TYPE s3,
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    REGION 'us-east-1'
);

Tip: To see the secrets in your session, run FROM duckdb_secrets();

Next, point DuckDB to your S3 table bucket. You can do so by copy-pasting the S3 Tables ARN value directly from the AWS Management Console and using it in the ATTACH command:

ATTACH 'arn:aws:s3tables:us-east-1:111122223333:bucket/bucket_name'
    AS s3_tables_db (
        TYPE iceberg,
        ENDPOINT_TYPE s3_tables
    );

And that's all! Now, DuckDB is connected to Amazon S3 Tables. To show the available tables, run:
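
SHOW ALL TABLES;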

┌──────────────┬─────────┬──────────────┬────────────────────────────────┬─────────────────────────────┬───────────┐
│   database   │ schema  │     name     │          column_names          │        column_types         │ temporary │
│   varchar    │ varchar │   varchar    │           varchar[]            │          varchar[]          │  boolean  │
├──────────────┼─────────┼──────────────┼────────────────────────────────┼─────────────────────────────┼───────────┤
│ s3_tables_db │ ducks   │ duck_species │ [id, english_name, latin_name] │ [INTEGER, VARCHAR, VARCHAR] │ false     │
└──────────────┴─────────┴──────────────┴────────────────────────────────┴─────────────────────────────┴───────────┘

You can query tables as if they were ordinary DuckDB tables:

FROM s3_tables_db.ducks.duck_species;
┌───────┬──────────────┬────────────┐
│  id   │ english_name │ latin_name │
│ int32 │   varchar    │  varchar   │
├───────┼──────────────┼────────────┤
│     0 │ Snow duck    │ Anas nivis │
└───────┴──────────────┴────────────┘
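
Since the attached tables behave like ordinary DuckDB tables, regular SQL works on them too; for example, a simple aggregation:

SELECT count(*) AS species_count
FROM s3_tables_db.ducks.duck_species;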

Alternatively, you can connect to S3 Tables through the Amazon SageMaker Lakehouse (AWS Glue Data Catalog) Iceberg REST Catalog endpoint. To do so, run:

ATTACH 'account_id:s3tablescatalog/namespace_name'
    AS glue_db (
        TYPE iceberg,
        ENDPOINT_TYPE glue
    );

Tip: If you need basic read access to tabular data in a single S3 table bucket, use the s3_tables endpoint type. If you want a unified view across all of your tabular data in AWS, use the glue endpoint type.

Schema Evolution

A key feature of the Iceberg format is schema evolution, i.e., the ability to follow changes in the table's schema. To demonstrate this, we go back to the Athena query editor and add a new column to the duck_species table:

ALTER TABLE duck_species
    ADD COLUMNS (conservation_status STRING);

Then, we insert a few more duck species:

INSERT INTO duck_species VALUES
    (1, 'Eaton''s pintail', 'Anas eatoni', 'Vulnerable'),
    (2, 'Harlequin duck', 'Histrionicus histrionicus', 'Least concern');

Let's run the query again from DuckDB:

FROM s3_tables_db.ducks.duck_species;

The query now returns a table with the additional fourth column, which has a NULL value in the row inserted before the change in the schema – as expected.

┌───────┬─────────────────┬────────────────────────────┬─────────────────────┐
│  id   │  english_name   │         latin_name         │ conservation_status │
│ int32 │     varchar     │          varchar           │       varchar       │
├───────┼─────────────────┼────────────────────────────┼─────────────────────┤
│     1 │ Eaton's pintail │ Anas eatoni                │ Vulnerable          │
│     2 │ Harlequin duck  │ Histrionicus histrionicus │ Least concern       │
│     0 │ Snow duck       │ Anas nivis                 │ NULL                │
└───────┴─────────────────┴────────────────────────────┴─────────────────────┘

The latest preview release of the DuckDB iceberg extension enables directly reading tables using Iceberg REST endpoints. This allows you to query Amazon S3 Tables and Amazon SageMaker Lakehouse (AWS Glue Data Catalog) with ease. As of today, the extension is in an experimental state and is under active development. We will publish a stable release later this year.

Cleaning Up

If you created a new S3 table bucket to follow the examples, don't forget to clean up by deleting your S3 table bucket.

Using the core_nightly Repository

The extensions used in this blog post are currently experimental, and hence they are distributed through the core_nightly repository. If you want to switch back to using extensions from the core repository, follow the extension documentation.
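
For example, you can force-install the extensions used in this post from the core repository again:

FORCE INSTALL aws FROM core;
FORCE INSTALL httpfs FROM core;
FORCE INSTALL iceberg FROM core;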

Note that DuckDB does not support reloading extensions. Therefore, if you experience any issues, try restarting DuckDB after updating the extensions.
