F3

Original link: https://github.com/future-file-format/f3

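**F3 (Future-proof File Format)** is a next-generation open-source columnar data format designed to address the layout and extensibility limitations of older formats such as Parquet. F3 prioritizes efficiency and interoperability by embedding WebAssembly (Wasm) decoders directly in its self-describing data files. This architecture lets developers implement new encoding schemes without a format-wide update, so the format can keep adapting as hardware and workloads evolve.

The project is currently a **research prototype** intended to validate the ideas proposed in the 2025 ACM research paper "F3: The Open-Source Data File Format for the Future". The repository contains the core format implementation, benchmarking tools, and experiment scripts, along with documentation for reproducing the paper's results.

**Important:** F3 is for research use only and is not currently suitable for production. The project is licensed under the MIT License.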

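The F3 file format is a proposed open-source data format intended to replace existing standards such as Parquet. A key feature of F3 is that it embeds WebAssembly (Wasm) decoders directly in each file, ensuring cross-platform compatibility without language-specific SDKs.

The project sparked wide discussion on Hacker News, centered on three points:

* **Security concerns:** Skeptics questioned the safety of embedding executable code in data files. Supporters countered that Wasm's built-in sandboxing reduces these risks, because the code's access can be tightly restricted.
* **Lack of clarity:** Many users criticized the project's documentation for relying on vague "marketing speak". Critics noted that the README does not clearly define the format's concrete use cases, provide performance figures, or explain why it is better than established formats such as Parquet or ORC.
* **Practicality:** Potential adopters remain on the fence, pointing to the lack of broad tooling support, the lack of recent development activity, and the difficulty of grasping the format's value proposition from a quick read.

Ultimately, although the idea of portable Wasm-based decoders was seen as innovative, the project still faces significant hurdles in developer communication and ecosystem adoption.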

Original text


F3 is a data file format designed with efficiency, interoperability, and extensibility in mind. Its data organization rectifies the layout shortcomings of last-generation formats like Parquet, while maintaining good interoperability and extensibility (a.k.a. being future-proof) via embedded Wasm decoders.
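To make the embedded-decoder idea concrete, here is a minimal Rust sketch of how a reader might pick a decoding path: use a native decoder when it has one, and otherwise fall back to the Wasm binary stored in the file. Every name here (`Reader`, `FileSketch`, `DecodePath`, `choose_decoder`, the encoding names) is hypothetical and is not taken from the F3 codebase. The sketch only models the dispatch decision and does not execute any Wasm.

```rust
use std::collections::HashMap;

/// Hypothetical signature for a natively compiled decoder.
type NativeDecoder = fn(&[u8]) -> Vec<u32>;

/// Hypothetical reader: the native decoders this build happens to ship with.
struct Reader {
    native: HashMap<String, NativeDecoder>,
}

/// Hypothetical view of a file: Wasm decoder binaries keyed by encoding name.
struct FileSketch {
    wasm_decoders: HashMap<String, Vec<u8>>,
}

/// The decoding path chosen for one encoding.
enum DecodePath<'a> {
    /// The reader has a native decoder for this encoding.
    Native(NativeDecoder),
    /// No native decoder: use the Wasm bytes embedded in the file.
    EmbeddedWasm(&'a [u8]),
}

/// Prefer a native decoder; otherwise fall back to the embedded Wasm binary.
/// Returns `None` when neither is available.
fn choose_decoder<'a>(
    reader: &Reader,
    file: &'a FileSketch,
    encoding: &str,
) -> Option<DecodePath<'a>> {
    if let Some(&decode) = reader.native.get(encoding) {
        return Some(DecodePath::Native(decode));
    }
    file.wasm_decoders
        .get(encoding)
        .map(|bytes| DecodePath::EmbeddedWasm(bytes.as_slice()))
}

/// Illustrative native decoder: little-endian u32 values, 4 bytes each.
fn decode_plain_u32(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    let mut reader = Reader { native: HashMap::new() };
    reader
        .native
        .insert("plain_u32".to_string(), decode_plain_u32 as NativeDecoder);

    let mut file = FileSketch { wasm_decoders: HashMap::new() };
    // The 8-byte Wasm header (magic "\0asm" + version 1) stands in for a real decoder.
    file.wasm_decoders
        .insert("custom_delta".to_string(), b"\0asm\x01\0\0\0".to_vec());

    for encoding in ["plain_u32", "custom_delta", "unknown"] {
        match choose_decoder(&reader, &file, encoding) {
            Some(DecodePath::Native(decode)) => {
                println!("{encoding}: native -> {:?}", decode(&[7, 0, 0, 0]))
            }
            Some(DecodePath::EmbeddedWasm(wasm)) => {
                println!("{encoding}: embedded Wasm ({} bytes)", wasm.len())
            }
            None => println!("{encoding}: no decoder available"),
        }
    }
}
```

In a real reader the `EmbeddedWasm` branch would hand the bytes to a sandboxed Wasm runtime. That step is left out here because the runtime and the decoder interface F3 uses are not described in this README.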

⚠️ This project is a research prototype verifying the ideas in the paper. You should not use it in production.

We only tested on an Intel machine with Debian 12.

git submodule update --init --recursive
./scripts/setup_debian.sh
# build the PoC package of F3
cargo build -p fff-poc
# run unit test for F3
cargo test -p fff-poc

format: FlatBuffer definition of the file format.

fff-poc: The main code of the F3 format. It references other subdirs like fff-core, fff-encoding, fff-format, and fff-ude-wasm.

fff-bench: Benchmarks and experiments that appear in the paper. Specifically, fff-bench/examples should contain most experiments, both micro and end-to-end (e2e).

fff-ude*: UDE stands for User-Defined Encoding; the code in these directories relates to the Wasm decoding implementation.

scripts and exp_scripts: Scripts for running the experiments.

Reproduction steps for the experiment results in the paper

Please refer to doc/paper_reproduction.md for the detailed steps.

This project is licensed under the MIT License. See LICENSE for details.

If you find this project useful, please consider citing:

@article{zeng2025f3,
author = {Zeng, Xinyu and Meng, Ruijun and Prammer, Martin and McKinney, Wes and Patel, Jignesh M. and Pavlo, Andrew and Zhang, Huanchen},
title = {F3: The Open-Source Data File Format for the Future},
year = {2025},
issue_date = {September 2025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {3},
number = {4},
url = {https://doi.org/10.1145/3749163},
doi = {10.1145/3749163},
abstract = {Columnar storage formats are the foundation for modern data analytics systems. The proliferation of open-source file formats (i.e., Parquet, ORC) allows seamless data sharing across disparate platforms. However, these formats were created over a decade ago for hardware and workload environments that are much different from today. Although these formats have incorporated some updates to their specification to adapt to these changes, not all deployments support those modifications, and too often systems cannot overcome the formats' deficiencies and limitations without a rewrite. In this paper, we present the Future-proof File Format (F3) project. It is a next-generation open-source file format with interoperability, extensibility, and efficiency as its core design principles. F3 obviates the need to create a new format every time a shift occurs in data processing and computing by providing a data organization structure and a general-purpose API to allow developers to add new encoding schemes easily. Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable. To evaluate F3, we compared it against legacy and state-of-the-art open-source file formats. Our evaluations demonstrate the efficacy of F3's storage layout and the benefits of Wasm-driven decoding.},
journal = {Proc. ACM Manag. Data},
month = sep,
articleno = {245},
numpages = {27},
keywords = {columnar storage, compression, extensibility, file format}
}