Lance table format explained simply, stupid (Animated)

Original link: https://tontinton.com/posts/lance/

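## Lance: a potential new contender in big data

2025 saw significant developments in big data, including updates to Iceberg, new real-time stream processing tools such as Fluss, and acquisitions by Datadog and Databricks. Meanwhile, a potentially disruptive technology, **Lance**, emerged quietly.

Lance is at once a file format (like Parquet), a table format (like Iceberg), and a catalog spec. It aims to offer better random-read performance than Parquet while keeping sequential-read speed. A key advantage is its ability to **add columns to a table without fully copying the data**, which is a limitation of Iceberg.

Lance also supports several kinds of indexes (BTree, inverted, and vector), enabling efficient queries. It addresses the growing demand for flexible, multi-modal data lakes driven by the rise of AI, and faces competition from similar projects such as Vortex. Lance represents an important step toward more adaptable, higher-performance data storage.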

Original article

TLDR (but stay for the animations!): Lance is a successor to Iceberg / Delta Lake, more optimized for random reads, and supports adding ad-hoc columns without needing to copy all the data.


Some big things happened in the big data over object storage world in 2025:

  • Iceberg V3 spec got released and added cool stuff like VARIANT.
  • turbopuffer announced vector search over object storage (similar to Quickwit).
  • Apache Fluss lets Flink manage real-time streams with tiering to object storage.
  • Datadog bought Quickwit.
  • Databricks bought Neon.

But something way bigger flew completely under my radar, most likely as I was pretty busy building at $DAY_JOB (some pretty cool stuff, I must say).

This thing is called Lance. It's a file format (like Apache Parquet), a table format (like Apache Iceberg), and a catalog spec (like Iceberg's REST catalog spec).

The Lance file format is similar to Parquet, but more optimized for random reads (WHERE id = 123), while still preserving Parquet's performance when doing sequential reads.

Official docs here.

Something interesting to test is how Parquet would behave if we configured it to store each page as 64 KB instead of the default 1 MB 🤔.
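As a rough way to reason about that experiment before running it, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model in which a point lookup has to fetch one whole page per requested column, and it ignores footers, compression, and per-request overhead. The function name and the column count of 10 are made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB


def bytes_read_for_point_lookup(page_size: int, num_columns: int) -> int:
    """Bytes fetched to read a single row.

    Simplified model (an assumption, not a measurement): one full page
    must be read for every requested column.
    """
    return page_size * num_columns


# Hypothetical table with 10 columns, all of them requested.
default_pages = bytes_read_for_point_lookup(1 * MIB, num_columns=10)
small_pages = bytes_read_for_point_lookup(64 * KIB, num_columns=10)

print(f"1 MiB pages:  {default_pages} bytes per lookup")
print(f"64 KiB pages: {small_pages} bytes per lookup")
print(f"reduction:    {default_pages // small_pages}x")
```

Under this model the smaller pages cut the bytes read per lookup by the ratio of the page sizes. Whether that holds up in practice, and what it costs in metadata size and sequential scan speed, is exactly what the experiment would need to measure.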

The Lance table format is similar to Iceberg, but allows adding columns ad hoc without copying all the data just to give every row a value for the new column, while still preserving Iceberg's MVCC.
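A toy model can make the "add a column without copying" idea more concrete. The sketch below assumes that a table version is just an immutable list of fragments, and that each fragment can point at several data files holding different columns. Every class name and file path here is hypothetical; this is not Lance's actual on-disk format or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    path: str
    columns: tuple[str, ...]


@dataclass(frozen=True)
class Fragment:
    id: int
    files: tuple[DataFile, ...]


@dataclass
class Table:
    # Each entry is one immutable version: a tuple of fragments.
    versions: list[tuple[Fragment, ...]] = field(default_factory=list)

    def commit(self, fragments) -> int:
        """Append a new version and return its 1-based version number."""
        self.versions.append(tuple(fragments))
        return len(self.versions)

    def add_column(self, name: str) -> int:
        """Write one new file per fragment; existing files are reused as-is."""
        latest = self.versions[-1]
        new_fragments = [
            Fragment(f.id, f.files + (DataFile(f"frag{f.id}-{name}.lance", (name,)),))
            for f in latest
        ]
        return self.commit(new_fragments)

    def columns(self, version: int) -> list[str]:
        """Sorted column names visible in the given version."""
        frags = self.versions[version - 1]
        return sorted({c for f in frags for df in f.files for c in df.columns})


table = Table()
v1 = table.commit([
    Fragment(0, (DataFile("frag0-base.lance", ("id", "text")),)),
    Fragment(1, (DataFile("frag1-base.lance", ("id", "text")),)),
])
v2 = table.add_column("embedding")

print(table.columns(v1))  # readers pinned to v1 never see the new column
print(table.columns(v2))
```

The point of the model is that the new version references the old base files instead of rewriting them, and the old version stays readable, which is the MVCC part.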

Another great feature of Lance tables is that they also support indexes, such as BTree, inverted (full-text search), and vector indexes (e.g. HNSW).
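To illustrate what a scalar index buys for a query like WHERE id = 123, here is a minimal sketch. It uses a sorted list with binary search as a stand-in for a BTree, which is a simplification chosen for brevity and not how Lance implements its indexes. The function names are hypothetical.

```python
import bisect


def build_sorted_index(values):
    """Return (value, row_id) pairs sorted by value."""
    return sorted((v, row_id) for row_id, v in enumerate(values))


def lookup(index, value):
    """Row ids whose indexed value equals `value`, found by binary search."""
    # Row ids are >= 0, so (value, -1) sorts before every match
    # and (value, inf) sorts after every match.
    lo = bisect.bisect_left(index, (value, -1))
    hi = bisect.bisect_right(index, (value, float("inf")))
    return [row_id for _, row_id in index[lo:hi]]


ids = [42, 7, 123, 7, 99]  # the "id" column, in row order
index = build_sorted_index(ids)

print(lookup(index, 123))
print(lookup(index, 7))
```

The lookup returns row ids instead of rows, so it only pays off when the file format can then fetch those rows cheaply, which is where the random-read focus of the file format comes in.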

Official docs here.

Apparently there's another open-source file format competing with Parquet, called Vortex, created by SpiralDB, which seems like a direct competitor to LanceDB.

These technologies only came about because of a need for multi-modal data lakes now that AI is so prevalent.

I wonder what other technologies will come from this AI software era.
