Hardwood: A New Parser for Apache Parquet

原始链接: https://www.morling.dev/blog/hardwood-new-parser-for-apache-parquet/

## Hardwood: High-Performance Parquet Parsing

Hardwood is a new parser built for high-performance processing of Apache Parquet files, drawing on lessons learned from the One Billion Row Challenge (1BRC). Its main focus is maximizing CPU utilization through parallelization, achieving high throughput despite the complexity of the Parquet format. Hardwood employs several techniques: **page-level parallelism** (decoding data pages on multiple worker threads), **adaptive page prefetching** (prioritizing columns that are slower to decode), and **cross-file prefetching** (overlapping the decoding of consecutive files). Together with optimizations such as reducing allocations, these yield significant performance gains. On a MacBook Pro M3 Max, Hardwood can sum three columns of the ~9.2 GB NYC taxi data set in ~1.2 s (column reader API) and parse a nested 900 MB Overture Maps data file in ~1.3 s. The project uses the JDK Flight Recorder to identify bottlenecks, includes automated performance tests, and plans an automated regression-detection pipeline based on Apache Otava to ensure continued performance improvements.

Hacker News: Hardwood: A New Parser for Apache Parquet (morling.dev), 10 points, submitted by rmoff 3 hours ago

Original Article

Hardwood is built with high performance in mind. It applies many of the lessons learned from 1BRC, such as memory-mapping files or multi-threading. I am planning to share more details in a future blog post, so I’m going to focus just on one specific performance-related aspect here: Parallelizing the work of parsing Parquet files, so as to utilize the available CPU resources as much as possible and achieve high throughput.
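Memory-mapping is one of the 1BRC lessons mentioned above. As a minimal sketch (not Hardwood's actual code), Java's `FileChannel.map` lets the OS page file contents in on demand, avoiding the copy overhead of regular reads; a Parquet reader would map the file and then navigate its pages via the resulting buffer:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MemoryMapDemo {
    public static void main(String[] args) throws IOException {
        // Create a small sample file standing in for a Parquet file;
        // real Parquet files begin with the 4-byte magic "PAR1".
        Path file = Files.createTempFile("sample", ".bin");
        Files.write(file, "PAR1-sample-payload".getBytes(StandardCharsets.UTF_8));

        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map the entire file; no data is copied into the Java heap up front,
            // the OS faults pages in as the buffer is accessed.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] magic = new byte[4];
            buffer.get(magic);
            System.out.println(new String(magic, StandardCharsets.UTF_8));
        }
        Files.delete(file);
    }
}
```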

This task is surprisingly complex due to the subtleties of the format, so Hardwood pulls a few tricks to take advantage of all the available cores:

  • Page-level parallelism, fanning out the work of decoding individual data pages to multiple worker threads. This allows for a much higher CPU utilization (and lower memory consumption) than when solely processing different column chunks, row groups, or even files in parallel.

  • Adaptive page prefetching, ensuring that columns which are slower to decode than others (e.g. depending on their data type) receive more resources, so that all columns of a file can be read at the same pace.

  • Cross-file prefetching, starting to map and decode the pages of file N+1 when approaching the end of file N of a multi-file dataset, avoiding any slowdown at file transitions.
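The first technique, page-level parallelism, can be sketched as fanning out page-decoding tasks to a worker pool and then merging the results. The `decodePage` method below is a hypothetical placeholder, not Hardwood's API; the point is the unit of work being an individual data page rather than a whole column chunk or row group:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;

public class PageParallelismSketch {
    // Hypothetical stand-in for decoding one Parquet data page into a
    // partial aggregate; real decoding would decompress and decode values.
    static long decodePage(int pageIndex) {
        return pageIndex * 10L; // placeholder work
    }

    public static void main(String[] args) throws Exception {
        int pages = 8;
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            // Fan out individual pages to worker threads; finer-grained tasks
            // keep all cores busy even for files with few column chunks.
            List<Future<Long>> results = IntStream.range(0, pages)
                    .mapToObj(i -> pool.submit(() -> decodePage(i)))
                    .toList();
            long sum = 0;
            for (Future<Long> f : results) {
                sum += f.get();
            }
            System.out.println(sum); // 0 + 10 + ... + 70 = 280
        } finally {
            pool.shutdown();
        }
    }
}
```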

By employing these techniques and some others, such as minimizing allocations and avoiding auto-boxing of primitive values, Hardwood’s performance has come quite a long way since starting the project at the end of last year. As an example, the values of three out of 20 columns of the NYC taxi ride data set (a subset of 119 files overall, ~9.2 GB total, ~650M rows) can be summed up in ~2.7 sec using the row reader API with indexed access on my MacBook Pro M3 Max with 16 CPU cores. With the column reader API, the same task takes ~1.2 sec.
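The auto-boxing point is worth illustrating: summing hundreds of millions of values via boxed `Double`/`Long` objects allocates one object per value, while a primitive accumulator allocates nothing. This is a generic sketch with made-up data, not Hardwood's column reader API:

```java
public class PrimitiveSumSketch {
    public static void main(String[] args) {
        // Hypothetical decoded column chunk: primitive values, no boxing.
        double[] fareAmounts = {12.5, 8.0, 23.25};

        // Accumulating into a primitive double avoids allocating a Double
        // per value, which matters across ~650M rows.
        double total = 0;
        for (double v : fareAmounts) {
            total += v;
        }
        System.out.println(total);
    }
}
```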

The taxi ride data set has a completely flat schema, i.e. it doesn’t contain any structs, lists, or maps. Most Parquet-based data sets fall into this category, and thus the focus for optimizing Hardwood has primarily been on these kinds of files so far. While less commonly found, the Parquet format also supports nested schemas. An example of this category are the Parquet files of the Overture Maps project. On the same machine as above, Hardwood can completely parse all the columns of a file with points of interest (~900 MB, ~9M records) in ~2.1 sec using the row reader API and in ~1.3 sec with the column reader API.

In order to identify bottlenecks, Hardwood comes with support for the JDK Flight Recorder, tracking key performance metrics and events such as prefetch misses, page decoding times, etc.
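Custom JFR events like the ones described are defined by subclassing `jdk.jfr.Event`. The event name and fields below are illustrative, not Hardwood's actual event types; the `begin()`/`end()`/`commit()` calls are the standard JFR API for timing an operation and recording it only when a recording is active:

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

public class JfrEventSketch {
    // Hypothetical event mirroring the kind of metric a parser might record,
    // e.g. the decoding time of a single data page.
    @Name("demo.PageDecode")
    @Label("Page Decode")
    static class PageDecodeEvent extends Event {
        @Label("Column")
        String column;
        @Label("Value Count")
        int valueCount;
    }

    public static void main(String[] args) {
        PageDecodeEvent event = new PageDecodeEvent();
        event.begin();
        // ... decode a data page here ...
        event.column = "fare_amount";
        event.valueCount = 20_000;
        event.end();
        event.commit(); // captured only while a JFR recording is running
        System.out.println("event committed");
    }
}
```

Events like this show up in JDK Mission Control or `jfr print` output alongside the built-in JVM events, which makes correlating parser metrics with GC and allocation behavior straightforward.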

Further improving performance remains a key objective for the project going forward. To that end, there are initial automated performance tests for flat and nested schemas, and we are planning to set up an automated change detection pipeline using Apache Otava, allowing us to detect any potential regressions early on.
