Sparse File LRU Cache

Original link: http://ternarysearch.blogspot.com/2026/01/sparse-file-lru-cache.html

## Sparse Files for Efficient Data Caching

Sparse files offer a clever solution for caching large datasets, as demonstrated at Amplitude. They allow creating logically large files without immediately allocating physical disk space: blocks are only written when data is actually stored. This is ideal for analytics workloads where data resides in cost-effective cold storage (such as S3) but needs fast local access.

Amplitude's system caches data from S3 onto expensive NVMe SSDs. The traditional caching approaches, caching entire files or caching individual columns, proved inefficient, either wasting space or incurring excessive metadata overhead. Sparse files provide a middle ground: the file *looks* complete, but only the columns actually in use are physically stored, optimizing SSD usage and reducing metadata.

A local RocksDB instance tracks which logical blocks of each sparse file are cached and enforces an LRU eviction policy. This approach reduces S3 requests, minimizes file system overhead, and simplifies I/O, demonstrating how a low-level file system feature can significantly improve overall system performance.


Original Article

An interesting file system feature that I came across a few years ago is sparse files. In short, many file systems allow you to create a logical file with "empty" (fully zeroed) blocks that are not physically backed until they get written to. Partially copying from ArchWiki, the behavior looks like this:
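(In place of the ArchWiki excerpt, here is a minimal Python sketch of the same behavior, assuming a Linux file system with 4KB blocks; `st_blocks` is reported in 512-byte units:)

```python
import os

path = "sparse_demo.img"

# Create a logically 512MB file; no physical blocks are allocated yet.
with open(path, "wb") as f:
    f.truncate(512 * 1024 * 1024)

st = os.stat(path)
print(st.st_size)          # 536870912 -- logical size is 512MB
print(st.st_blocks * 512)  # 0 -- physical bytes actually on disk

# Write a few non-zero bytes at a 16MB offset.
with open(path, "r+b") as f:
    f.seek(16 * 1024 * 1024)
    f.write(b"\x01\x02\x03\x04")

st = os.stat(path)
print(st.st_blocks * 512)  # 4096 -- one 4KB block now backs the write
```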

The file starts at 0 physical bytes on disk despite being logically 512MB. Then, after writing some non-zero bytes at a 16MB offset, it physically allocates a single block (4KB). The file system maintains metadata on which blocks of the file are physically represented on disk and which ones are not. To normal readers of the file, it's transparent -- the sparsity is managed completely by the file system.

At Amplitude, we found a cool use case for sparse files. All of the data is stored durably in cold storage (Amazon S3) in a columnar data format used for analytics queries. But it's inefficient and costly to fetch it from S3 every time, so the data gets cached on local NVMe SSDs (e.g. from the r7gd instance class). These local SSDs are more than ten times as expensive as cold storage, though, so you need a good strategy for deciding how and what to cache. To understand why sparse files are a good option for this, let's revisit columnar data formats briefly.

One of the observations about analytics queries that makes columnar data formats so effective is that usually only a very small subset of columns (5-10 out of potentially thousands) is used by any particular query (and, to some extent, even across many queries). Storing the data in a columnar fashion means that each of these columns is a contiguous range inside the file, making it much faster to read. The contiguous ranges are also important in the context of our local caching -- we're not trying to pick out small pieces scattered across the file.
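To make the contiguous-range point concrete, here is a hypothetical sketch (the footer layout and column names are invented, not any particular format): given a footer mapping each column to its byte range, a query touching two columns issues just two sequential reads:

```python
def read_columns(path, footer, names):
    """Read only the requested columns, each as one contiguous range.

    `footer` maps column name -> (offset, length); in a real columnar
    file this mapping comes from a metadata header that is parsed first.
    """
    columns = {}
    with open(path, "rb") as f:
        for name in names:
            offset, length = footer[name]
            f.seek(offset)
            columns[name] = f.read(length)  # one contiguous read per column
    return columns

# Hypothetical layout: thousands of columns, but the query needs two.
footer = {"user_id": (8192, 200_000), "event_time": (208_192, 150_000)}
# cols = read_columns("events.col", footer, ["user_id", "event_time"])
```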

Setting sparse files aside for a moment, we originally had two different approaches for doing this caching:

  • The most naive strategy is caching entire files from S3. This is simple and requires minimal metadata to manage, but has the obvious downside of wasting a lot of disk space on the SSDs by storing the columns that are rarely or never used.
  • Another option is caching different columns as individual files on disk. This somewhat solves the wasted disk space issue, but now explodes the number of files, which requires a substantial amount of file system metadata. It also struggles with small columns, which are rounded up to the file system block size. With hundreds of thousands of customers of varying sizes, it's inevitable that the vast majority of files/columns are small.

At this point, it's pretty clear how sparse files give you an option in between these two. We imagine that we're caching entire files, except that the files are sparse: only the columns that are used (more specifically, the logical blocks that contain those columns) are physically present. This is simpler for a consumer to read and utilizes the disk better -- less file system metadata, and small columns are consolidated into file system blocks (as a bonus, this reduces S3 GETs as well). Similar to the latter approach above, consumers must declare the columns they need to the cache before reading them.
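A minimal sketch of what that write path could look like (the function names and structure here are hypothetical, not Amplitude's implementation): fetched column bytes are written into a local sparse file at their original offsets, so every untouched region remains an unallocated hole:

```python
import os

def ensure_sparse_file(local_path, logical_size):
    """Create the cache file at its full logical size; untouched
    regions remain holes and consume no disk space."""
    if not os.path.exists(local_path):
        with open(local_path, "wb") as f:
            f.truncate(logical_size)

def cache_range(local_path, offset, data):
    """Write bytes fetched from S3 at their original file offset.
    Only the file system blocks covering [offset, offset + len(data))
    become physically allocated."""
    with open(local_path, "r+b") as f:
        f.seek(offset)
        f.write(data)
```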

This system requires us to manage metadata about which columns are cached, for which we use a local RocksDB instance. More specifically, we track metadata on the logical blocks of these sparse files: which ones are present locally, and when they were last read. Using this, we approximate an LRU policy for invalidating data when the disk fills up. As an important implementation detail, the logical blocks are variable-sized, with a few smaller blocks at the head (still larger than file system blocks) and larger blocks throughout the rest. This takes advantage of the fact that the file format has a metadata header (similar to Parquet) that always needs to be read to know how the columns are laid out.
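As an illustration of both details, here is a hedged sketch: the head/tail block sizes are invented, and an in-memory OrderedDict stands in for the RocksDB instance:

```python
import time
from collections import OrderedDict

# Hypothetical variable-sized block layout: smaller blocks at the head
# (covering the always-read metadata header), larger blocks for the
# rest of the file. The exact sizes are invented for illustration.
HEAD_BLOCKS, HEAD_SIZE, TAIL_SIZE = 4, 64 * 1024, 1024 * 1024

def logical_block(offset):
    """Map a byte offset to its logical block index."""
    head_bytes = HEAD_BLOCKS * HEAD_SIZE
    if offset < head_bytes:
        return offset // HEAD_SIZE
    return HEAD_BLOCKS + (offset - head_bytes) // TAIL_SIZE

class BlockLRU:
    """In-memory stand-in for the RocksDB metadata: tracks which
    (file_id, block_index) pairs are cached and when they were last
    read, evicting the least recently used block when the disk fills."""

    def __init__(self):
        self._blocks = OrderedDict()  # (file_id, block) -> last-read time

    def touch(self, file_id, block):
        key = (file_id, block)
        self._blocks[key] = time.time()
        self._blocks.move_to_end(key)  # mark most recently used

    def evict_one(self):
        # Oldest entry first; the caller would then deallocate the
        # corresponding range of the sparse file (e.g. hole punching
        # via fallocate(FALLOC_FL_PUNCH_HOLE) on Linux).
        return self._blocks.popitem(last=False)
```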

This sparse file LRU cache improved many aspects of the query system simultaneously: fewer S3 GETs, less file system metadata, less file system block overhead, and fewer IOPS to manage the cache. It's rare for a feature as low-level as the file system to have such a prominent impact on system design, so when it happens, it's pretty neat.
