Show HN: Spice Cayenne – SQL acceleration built on Vortex

Original link: https://spice.ai/blog/introducing-spice-cayenne-data-accelerator

## Spice Cayenne: Next-Generation Data Acceleration for Scale

Spice Cayenne is the latest data accelerator from Spice.ai, built for low-latency, multi-terabyte data lake workloads. Designed to overcome the scaling limits of existing accelerators such as DuckDB and SQLite, Cayenne combines the high-performance **Vortex columnar format** (a Linux Foundation project) with a lean **embedded metadata engine**. This separation optimizes both storage and metadata management, delivering **faster queries and significantly lower memory usage**.

Spice accelerates data by materializing datasets in a local compute engine, reducing network I/O and achieving sub-second query times without additional infrastructure. By leveraging Vortex's efficient random access and zero-copy compatibility with Apache Arrow, Cayenne addresses the key challenges of large datasets: concurrency bottlenecks, high memory consumption, and complex index management. Benchmarks show Cayenne querying **1.4x faster** than DuckDB while using nearly **3x less memory**.

Currently in beta, Spice Cayenne is enabled by setting `engine: cayenne` in Spicepod.yml, with further improvements promised, such as index support and additional metadata backends. It aims to become the leading accelerator for terabyte- and petabyte-scale analytics and AI workloads.

Spice.ai has released "Cayenne," a new SQL acceleration tool built on the Vortex columnar data format. Spice.ai is a lightweight, portable data and AI engine leveraging Apache DataFusion & Ballista, designed for large-scale data processing and AI applications. Cayenne uses "data accelerators" to materialize data from various sources into embedded databases. Inspired by Ducklake, this new release leverages Vortex for significantly faster performance: up to 100x faster random access compared to Apache Parquet and even DuckDB, with lower memory usage and improved scalability for petabyte-scale datasets.

The developers are seeking feedback on this initial release, highlighting Spice's native capabilities including data acceleration, federation, hybrid search, and LLM inference. Unlike projects such as CedarDB, Spice focuses on seamless integration of data-intensive applications and AI features.

More information: [https://spice.ai/blog](https://spice.ai/blog) and the GitHub repository: [https://github.com/spiceai/spiceai](https://github.com/spiceai/spiceai).

Original article

Introducing Spice Cayenne: The Next-Generation Data Accelerator Built on Vortex for Performance and Scale

Luke Kim

December 17, 2025

TLDR

Spice Cayenne is the next-generation Spice.ai data accelerator built for high-scale and low latency data lake workloads. It combines the Vortex columnar format with an embedded metadata engine to deliver faster queries and significantly lower memory usage than existing Spice data accelerators, including DuckDB and SQLite. Watch the demo for an overview of Spice Cayenne and Vortex.

Introduction

Spice.ai is a modern, open-source SQL query engine that enables development teams to federate, accelerate, search, and integrate AI across distributed data sources. It’s designed for enterprises building data-intensive applications and AI agents across disparate, tiered data infrastructure. Data acceleration of disparate and disaggregated data sources is foundational across many vertical use cases the Spice platform enables.

Spice leans into the industry shift to object storage as the primary source of truth for applications. These object store workloads are often multi-terabyte datasets using open data lake formats like Parquet, Iceberg, or Delta that must serve data and search queries for customer-facing applications with sub-second performance. Spice data acceleration, which transparently materializes working sets of data in embedded databases like DuckDB and SQLite, is the core technology that makes these applications built on object storage functional. Embedded data accelerators are fast and simple for datasets up to 1TB, however for multi-terabyte workloads, a new class of accelerator is required.

So we built Spice Cayenne, the next-generation data accelerator for high-volume and latency-sensitive applications.

Spice Cayenne combines Vortex, the next-generation columnar file format from the Linux Foundation, with a simple, embedded metadata layer. This separation of concerns ensures that both the storage and metadata layers are fully optimized for what each does best. Cayenne delivers better performance and lower memory consumption than the existing DuckDB, Arrow, SQLite, and PostgreSQL data accelerators. 

This post explains why we built Spice Cayenne, how it works, when it makes sense to use instead of existing acceleration options, and how to get started.

How data acceleration works in Spice

Spice accelerates datasets by materializing them in local compute engines, which can be Apache DataFusion + Apache Arrow, SQLite, or DuckDB, in-memory or on-disk. This provides applications with high-performance, low-latency queries and dynamic compute flexibility beyond static materialization. It also reduces network I/O and avoids repeated round-trips to downstream data sources, accommodating applications that need to access disparate data, join it, and query it with very low latency. By bringing frequently accessed working sets of data closer to the application, Spice delivers sub-second, often single-digit-millisecond queries without requiring additional clusters, ingestion pipelines, or ETL.
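The acceleration pattern described above can be reduced to a simple idea: fetch a working set once from the slow source of truth, materialize it in an embedded local database, and serve every subsequent query locally with no network round-trip. A minimal sketch using Python's stdlib `sqlite3` (the "object store" here is simulated, and all names are hypothetical, not Spice's internals):

```python
import sqlite3

def fetch_from_object_store():
    """Simulated slow source of truth (e.g. Parquet rows in object storage)."""
    return [(1, "alpha", 10.0), (2, "beta", 20.0), (3, "gamma", 30.0)]

# Materialize the working set into an embedded, local database once...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_dataset (id INTEGER, name TEXT, value REAL)")
conn.executemany("INSERT INTO my_dataset VALUES (?, ?, ?)",
                 fetch_from_object_store())

# ...then subsequent queries run locally, avoiding repeated round-trips.
total = conn.execute(
    "SELECT SUM(value) FROM my_dataset WHERE id > 1").fetchone()[0]
print(total)  # 50.0
```

Spice does this transparently and keeps the materialized copy refreshed; the sketch only illustrates why local materialization eliminates per-query network I/O.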

To support the wide range of enterprise workloads run on Spice, the platform includes multiple acceleration engines suited to different data shapes, query patterns, and performance needs. The Spice ethos is to offer optionality: development teams can choose the engine that best fits their requirements. The current acceleration engines are:

  • PostgreSQL: PostgreSQL is great for row-oriented workloads, but is not optimized for high-volume columnar analytics. 
  • Arrow (in-memory): Arrow is ideal for workloads that need very fast in-memory access and low-latency scans. The tradeoff is that data isn’t persisted to disk and more sophisticated operations like indexes aren’t supported. 
  • DuckDB: DuckDB offers excellent all-around performance for medium-sized datasets and analytical queries. Single file limits and memory usage, however, can become a constraint as data volume grows beyond a terabyte. 
  • SQLite: SQLite is a lightweight option that excels for smaller tables and row-based lookups. SQLite’s single-writer model, single-file limits, and limited parallelism make it less ideal for larger or analytical workflows.
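The tradeoffs above can be summarized as a rough selection heuristic. The thresholds and rules below are illustrative assumptions, not official Spice guidance:

```python
def pick_engine(size_gb: float, row_oriented: bool, needs_persistence: bool) -> str:
    """Toy heuristic mirroring the tradeoffs described in the text.
    All thresholds are assumptions for illustration only."""
    if size_gb > 1000:           # beyond ~1 TB, embedded engines strain
        return "cayenne"
    if row_oriented:             # row-based lookups and transactional shapes
        return "postgres" if size_gb > 10 else "sqlite"
    if not needs_persistence:    # fastest in-memory scans, no disk persistence
        return "arrow"
    return "duckdb"              # solid all-around analytics under ~1 TB

print(pick_engine(5000, False, True))   # cayenne
print(pick_engine(0.5, True, True))     # sqlite
print(pick_engine(50, False, False))    # arrow
print(pick_engine(200, False, True))    # duckdb
```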

Why we built Spice Cayenne

Enterprise workloads on multi-terabyte datasets stored in object storage share a common set of pressure points; the volume of data continues to increase, more applications and services are querying the same accelerated tables at once, and teams need consistently fast performance without having to manage extra infrastructure.

Existing accelerators perform well at smaller scale but run into challenges at different inflection points:

  • Single-file architectures create bottlenecks for concurrency and updates.
  • Memory usage of embedded databases like DuckDB can be prohibitive.
  • Database and search index creation and storage can be prohibitive.

These constraints inspired us to develop a next-generation accelerator for petabyte scale that keeps metadata operations lightweight and maintains low-latency, high-performance queries even as dataset sizes and concurrency increase. It was also critically important that the underlying technologies aligned with the Spice philosophy of open source with strong community support and governance.

Spice Cayenne addresses these requirements by separating metadata and data storage into two complementary layers: the Vortex columnar format and an embedded metadata engine.

Spice Cayenne architecture

Cayenne is built with two core concepts:

1. Data: Vortex Columnar Format

Data is stored in Vortex, the next-generation open-source, Apache-licensed format under the Linux Foundation. 

Compared with Apache Parquet, Vortex provides:

  • 100x faster random access
  • 10–20x faster full scans
  • 5x faster writes
  • Zero-copy compatibility with Apache Arrow
  • Pluggable compression, encoding, and layout strategies

Source: Vortex GitHub

Vortex has a clean separation of logical schema and physical layout, which Cayenne leverages to support efficient segment-level access, minimize memory pressure, and extend functionality without breaking compatibility. It draws on years of academic and systems research including innovations from projects like YouTube's Procella, FSST, FastLanes, ALP/G-ALP, and MonetDB/X100 to push the boundaries of what’s possible in open-source analytics.
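The separation of logical schema from physical layout is what makes segment-level access cheap: a small footer index records where each segment lives in the file, so a reader can seek directly to one segment instead of scanning everything. A toy sketch of that idea (this is not the actual Vortex layout; the format below is invented for illustration):

```python
import io
import json
import struct

def write_segmented(buf, segments):
    """Write segments back-to-back, then a footer index of (offset, length)
    pairs, then the footer's own length as the last 4 bytes."""
    index = []
    for seg in segments:
        data = json.dumps(seg).encode()
        index.append((buf.tell(), len(data)))
        buf.write(data)
    footer = json.dumps(index).encode()
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))

def read_segment(buf, i):
    """Seek straight to segment i via the footer index; no full scan."""
    buf.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", buf.read(4))
    buf.seek(-4 - footer_len, io.SEEK_END)
    index = json.loads(buf.read(footer_len))
    offset, length = index[i]
    buf.seek(offset)
    return json.loads(buf.read(length))

buf = io.BytesIO()
write_segmented(buf, [[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(read_segment(buf, 1))  # [4, 5, 6]
```

Reading one segment touches only its byte range plus the footer, which is the property that keeps random access fast and memory pressure low as files grow.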

Extensible and community-driven, Vortex is already integrated with tools like Apache Arrow, DataFusion, and DuckDB, and is designed to support Apache Iceberg in future releases. It’s also the foundation of commercial offerings from SpiralDB and PolarSignals. Since version 0.36.0, Vortex guarantees backward compatibility of the file format.

2. Metadata Layer

Cayenne stores metadata in an embedded database. SQLite is supported today but, in line with the Spice philosophy of optionality, the design is extensible to pluggable metadata backends in the future. Cayenne’s metadata layer was intentionally designed to be as simple as possible, optimizing for fast, ACID-compliant metadata operations.

The metadata layer includes:

  • Schemas
  • Snapshots
  • File tracking
  • Statistics
  • Refreshes 

All metadata access is done through standard SQL transactions. This provides:

  • A single, local source of truth
  • Fast metadata reads
  • Consistent ACID semantics
  • No external catalog servers
  • No scattered metadata files

A single SQL query retrieves all metadata needed for query planning. This eliminates round-trip calls to object storage, supports file-pruning, and reduces sensitivity to storage throttling.
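A minimal sketch of what such an embedded metadata catalog might look like, using Python's stdlib `sqlite3` (the table and column names are illustrative, not Cayenne's actual schema). One local SQL query uses per-file statistics to prune files that cannot match a predicate, so those files are never fetched from object storage:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical file-tracking table with per-file min/max statistics.
conn.execute("""CREATE TABLE files (
    path TEXT, snapshot_id INTEGER, min_ts INTEGER, max_ts INTEGER)""")
conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", [
    ("data/part-0.vortex", 7, 0,   99),
    ("data/part-1.vortex", 7, 100, 199),
    ("data/part-2.vortex", 7, 200, 299),
])

# Query planning: one local SQL query prunes files whose [min_ts, max_ts]
# range cannot overlap the predicate ts >= 150. No object-store round-trips.
rows = conn.execute(
    "SELECT path FROM files WHERE snapshot_id = ? AND max_ts >= ?",
    (7, 150),
).fetchall()
print([r[0] for r in rows])  # ['data/part-1.vortex', 'data/part-2.vortex']
```

Because the catalog is a single embedded database, the planner gets schemas, snapshots, file lists, and statistics in one ACID transaction rather than stitching together scattered metadata files.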

Together, the metadata engine and Vortex format enable Cayenne to scale beyond the limits of single-file engines while keeping acceleration operationally simple.

Benchmarks

So, how does Spice Cayenne stack up to the other accelerators?

We benchmarked Cayenne against DuckDB v1.4.2 using industry standard benchmarks (TPC-H SF100 and ClickBench), comparing both query performance and memory efficiency. All tests ran on a 16 vCPU / 64 GiB RAM instance (AWS c6i.8xlarge equivalent) with local NVMe storage. Cayenne was tested with Spice v1.9.0.

On TPC-H SF100, Cayenne accelerated queries 1.4x faster than DuckDB (file mode) and used nearly 3x less memory.
On ClickBench, Cayenne was 14% faster than DuckDB file mode and used 3.4x less memory.

Spice Cayenne achieves faster query times and drastically lower memory usage by pairing a purpose-built execution engine with the Vortex columnar format. Unlike DuckDB, Cayenne avoids monolithic file dependencies and high memory spikes, making it ideal for production-grade acceleration at scale.

Getting started with Spice Cayenne

Use Cayenne by specifying `engine: cayenne` in the Spicepod.yml (dataset configuration).

Following are a few example configurations.

Basic:

datasets:
  - from: spice.ai:path.to.my_dataset
    name: my_dataset
    acceleration:
      engine: cayenne
      mode: file

Full configuration:

version: v1
kind: Spicepod
name: cayenne-example

datasets:
  - from: s3://my-bucket/data/
    name: analytics_data
    params:
      file_format: parquet
    acceleration:
      engine: cayenne
      enabled: true
      refresh_mode: full
      refresh_check_interval: 1h

Memory

Memory usage depends on dataset size, query patterns, and caching configuration. Vortex’s design reduces memory overhead by using selective segment reads and zero-copy access.

Storage

Disk space is required for:

  • Vortex columnar data
  • Temporary files during query execution
  • Metadata tables

Provision storage according to dataset size and refresh patterns.

Roadmap

Spice Cayenne is in beta and still evolving. We encourage users to test Cayenne in development environments before deploying to production.

Upcoming improvements include:

  • Index support
  • Improved snapshot bootstrapping
  • Additional metadata backends
  • Advanced compression and encoding strategies
  • Expanded data type coverage

The goal for Spice Cayenne stable is to be the fastest, most efficient accelerator across the full range of analytical and operational data and AI workloads at terabyte and petabyte scale.

Conclusion

Spice Cayenne represents a step function improvement in Spice data acceleration, designed to serve multi-terabyte, high concurrency, and low-latency workflows with predictable operations. By pairing an embedded metadata engine with Vortex’s high-performance format, Cayenne offers a scalable alternative to single-file accelerators while keeping configuration simple.

Spice Cayenne is available in beta. We welcome feedback on the road to its stable release.
