Piccolo: Large-Scale Graph Processing with Fine-Grained In-Memory Scatter-Gather

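This paper presents Piccolo, a graph processing accelerator designed to relieve the memory bandwidth bottleneck that irregular data access patterns impose on graph processing. Existing solutions, such as graph tiling and processing-in-memory (PIM), have limitations: they use DDR's coarse access granularity inefficiently, and tiling is difficult to combine with PIM. Piccolo addresses these issues with fine-grained in-memory random scatter-gather, using non-arithmetic in-memory functions to reduce off-chip traffic. Rather than placing expensive arithmetic units in memory, Piccolo optimizes data movement. It redesigns the accelerator's cache and MHA so that it benefits from both tiling and in-memory operations, which makes effective use of memory bandwidth and cache capacity. In the reported experiments, Piccolo achieves a maximum speedup of 3.28× and a geometric mean speedup of 1.62× across the benchmarks. The paper positions Piccolo's architecture as a promising alternative to conventional approaches because it prioritizes efficient data movement and management.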
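To make the access-granularity problem concrete, here is a minimal Python sketch of an idealized traffic model. It is not Piccolo's mechanism: the 64-byte line size, the 4-byte vertex property, and the function names `conventional_bytes` and `gathered_bytes` are all assumptions chosen for illustration. The first function counts bytes moved when every touched line is fetched whole; the second counts bytes moved if the memory side could pack only the requested properties into full lines before sending them.

```python
import math

LINE_BYTES = 64  # assumption: bytes moved per conventional off-chip access
PROP_BYTES = 4   # assumption: size of one vertex property


def conventional_bytes(vertex_ids):
    """Off-chip bytes when each distinct line touched is fetched whole."""
    props_per_line = LINE_BYTES // PROP_BYTES
    lines_touched = {v // props_per_line for v in vertex_ids}
    return len(lines_touched) * LINE_BYTES


def gathered_bytes(vertex_ids):
    """Off-chip bytes when only the requested properties are packed into lines."""
    distinct = set(vertex_ids)
    return math.ceil(len(distinct) * PROP_BYTES / LINE_BYTES) * LINE_BYTES


# Hypothetical access pattern: 16 vertex IDs, each in a different 64-byte line.
ids = list(range(0, 256, 16))
print(conventional_bytes(ids), gathered_bytes(ids))
```

With this hypothetical pattern, the conventional model moves 1024 bytes to deliver 64 useful bytes, while the packed model moves 64. Real workloads fall between the two extremes, depending on how much locality the vertex IDs have.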

Hacker News: Piccolo: Large-Scale Graph Processing with Fine-Grained In-Memory Scatter-Gather (arxiv.org), 7 points, submitted by PaulHoule 31 minutes ago


Original

[Submitted on 7 Mar 2025 (v1), last revised 10 Mar 2025 (this version, v2)]

Authors: Changmin Shin and 9 other authors

Abstract: Graph processing requires irregular, fine-grained random access patterns that are incompatible with contemporary off-chip memory architectures, leading to inefficient data access. This inefficiency makes graph processing an extremely memory-bound application. Because of this, existing graph processing accelerators typically employ a graph tiling-based or processing-in-memory (PIM) approach to relieve the memory bottleneck. In the tiling-based approach, a graph is split into chunks that fit within the on-chip cache to maximize data reuse. In the PIM approach, arithmetic units are placed within memory to perform operations such as reduction or atomic addition. However, both approaches have several limitations, especially when implemented on current memory standards (i.e., DDR). Because the access granularity provided by DDR is much larger than that of the graph vertex property data, much of the bandwidth and cache capacity is wasted. PIM is meant to alleviate such issues, but it is difficult to use in conjunction with the tiling-based approach, resulting in a significant disadvantage. Furthermore, placing arithmetic units inside a memory chip is expensive, so supporting multiple types of operations is thought to be impractical. To address the above limitations, we present Piccolo, an end-to-end efficient graph processing accelerator with fine-grained in-memory random scatter-gather. Instead of placing expensive arithmetic units in off-chip memory, Piccolo focuses on reducing the off-chip traffic with non-arithmetic function-in-memory of random scatter-gather. To fully benefit from in-memory scatter-gather, Piccolo redesigns the cache and MHA of the accelerator such that it can enjoy the advantages of both tiling and in-memory operations. Piccolo achieves a maximum speedup of 3.28× and a geometric mean speedup of 1.62× across various and extensive benchmarks.

From: Jinho Lee
[v1] Fri, 7 Mar 2025 03:27:33 UTC (1,813 KB)
[v2] Mon, 10 Mar 2025 02:41:21 UTC (1,813 KB)
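
The tiling-based approach mentioned in the abstract can be sketched generically as follows. This is a textbook-style destination-range tiling written for illustration, not the scheme used by Piccolo or any specific accelerator. The names `tile_edges` and `process_tiled`, the `tile_vertices` parameter, and the sum reduction are all assumptions. Edges are grouped by the range of destination vertices they update, so each group's destination properties fit in a small local buffer that stands in for the on-chip cache.

```python
from collections import defaultdict


def tile_edges(edges, tile_vertices):
    """Group (src, dst) edges by the destination-vertex range they update."""
    tiles = defaultdict(list)
    for src, dst in edges:
        tiles[dst // tile_vertices].append((src, dst))
    return dict(tiles)


def process_tiled(edges, num_vertices, tile_vertices, src_prop):
    """Sum source properties into destination vertices, one tile at a time."""
    result = [0] * num_vertices
    for tile_id, tile in sorted(tile_edges(edges, tile_vertices).items()):
        base = tile_id * tile_vertices
        # Stand-in for the on-chip cache: holds only this tile's destinations.
        local = [0] * min(tile_vertices, num_vertices - base)
        for src, dst in tile:
            # The source property read is still an irregular access.
            local[dst - base] += src_prop[src]
        result[base:base + len(local)] = local
    return result


# Hypothetical 4-vertex graph with made-up property values.
edges = [(0, 1), (2, 1), (1, 3), (3, 0)]
print(process_tiled(edges, num_vertices=4, tile_vertices=2, src_prop=[1, 2, 3, 4]))
```

Tiling confines the destination updates to a cache-sized range, but the source reads remain scattered. That leftover irregular traffic is the part the abstract says is served poorly by coarse DDR access granularity.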
