Optimizing Datalog for the GPU

Original link: https://danglingpointers.substack.com/p/optimizing-datalog-for-the-gpu

This article, "Optimizing Datalog for the GPU," introduces GPULog, a system designed to execute Datalog programs efficiently on GPUs. Datalog programs define relations through explicit tuples and implicit rules, and evaluation proceeds until a fixed point is reached: essentially repeated SQL-like join operations. The authors optimize with the semi-naïve evaluation algorithm, which avoids redundant computation by partitioning tuples into "new," "delta," and "full" buckets.

GPULog's core innovation is a novel data structure, the hash-indexed sorted array. It combines a densely packed data array, a lexicographically sorted index array for efficient lookup, and a hash table for quickly locating starting points within the index. This enables coherent memory access patterns, which are critical for GPU performance. The results show that GPULog significantly outperforms a state-of-the-art CPU Datalog implementation (Soufflé), and that it remains fast even when ported via HIP and run on the same GPU hardware. The post's author guesses that memory bandwidth, not compute, is the main performance bottleneck, opening a path for future research into bandwidth-reducing Datalog algorithms.

## Datalog and GPU Optimization: Hacker News Discussion Summary

A recent Hacker News thread on optimizing Datalog for the GPU sparked discussion of a variety of tools and viewpoints within the community. The original post prompted users to share the Datalog implementations they use, with **Datomic** highlighted as a popular choice.

Several other projects were also mentioned, including **Clingo**, **CozoDB** (a hybrid relational/vector/graph database written in Rust, though development appears stalled), **Datascript/Datahike** (for local-first ClojureScript applications), and **XTDB** (which has moved from a Datalog API to SQL). Debate touched on the usability of Datalog syntax versus SQL (with recursive CTEs), and on the strengths of tools such as **Ascent** and **Soufflé** among Datalog implementations.

Further discussion explored Datalog's applications in areas such as the Rust borrow checker (**datafrog**) and static analysis (**CodeQL**, which the thread says uses Soufflé). The conversation also covered the challenges and benefits of CUDA/HIP versus SPIR-V for GPU compute, and the value of building on existing optimized libraries (such as **cudf**) for graph operations. Overall, the thread shows a vibrant community actively exploring and implementing Datalog across a wide range of applications and platforms.

Original article

Optimizing Datalog for the GPU
Yihao Sun, Ahmedur Rahman Shovon, Thomas Gilray, Sidharth Kumar, and Kristopher Micinski. ASPLOS'25.

Datalog source code comprises a set of relations and a set of rules.

A relation can be explicitly defined with a set of tuples. A running example in the paper is to define a graph with a relation named edge:

Edge(0, 1)
Edge(1, 3)
Edge(0, 2)

A relation can also be implicitly defined with a set of rules. The paper uses the Same Generation (SG) relation as an example:

1: SG(x, y) <- Edge(p, x), Edge(p, y), x != y
2: SG(x, y) <- Edge(a, x), SG(a, b), Edge(b, y), x != y

Rule 1 states that two vertices (x and y) are part of the same generation if they both share a common ancestor (p), and they are not actually the same vertex (x != y).

Rule 2 states that two vertices (x and y) are part of the same generation if they have ancestors (a and b) from the same generation.

“Running a Datalog program” entails evaluating all rules until a fixed point is reached (no more tuples are added).

One key idea to internalize is that evaluating a Datalog rule is equivalent to performing a SQL join. For example, rule 1 is equivalent to joining the Edge relation with itself, using p as the join key, and (x != y) as a filter.
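To make the rule-to-join correspondence concrete, here is a minimal Python sketch (using the example Edge facts above) that evaluates rule 1 as a self-join of Edge on the shared ancestor p:

```python
# Datalog rule 1 expressed as a relational self-join.
# Edge is a set of (src, dst) tuples; the join key is the shared ancestor p.
Edge = {(0, 1), (1, 3), (0, 2)}

# SG(x, y) <- Edge(p, x), Edge(p, y), x != y
SG = {(x, y)
      for (p1, x) in Edge
      for (p2, y) in Edge
      if p1 == p2 and x != y}   # p1 == p2 is the join; x != y is the filter

print(sorted(SG))  # prints [(1, 2), (2, 1)]
```

Only vertices 1 and 2 share a parent (vertex 0), so they are the only same-generation pair the base rule produces.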

Semi-naïve Evaluation is an algorithm for performing these joins until convergence, while not wasting too much effort on redundant work. The tuples in a relation are put into three buckets:

  1. new: holds tuples that were discovered on the current iteration

  2. delta: holds tuples which were added in the previous iteration

  3. full: holds all tuples that have been found in any iteration

For a join involving two relations (A and B), new is computed as the union of the results of three joins:

  1. delta(A) joined with full(B)

  2. full(A) joined with delta(B)

  3. delta(A) joined with delta(B)

The key fact for performance is that full(A) is never joined with full(B).
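A minimal Python sketch of the semi-naïve loop for the SG program (the Edge facts here are my own example). Since Edge is extensional (it never changes, so its delta is empty), rule 2 only needs to join the delta of SG each iteration; full(SG) is never re-joined:

```python
# Semi-naïve evaluation of the SG program, sketched with Python sets.
Edge = {(0, 1), (1, 3), (0, 2), (2, 4), (1, 5)}

# Rule 1 seeds SG: SG(x, y) <- Edge(p, x), Edge(p, y), x != y
full = {(x, y) for (p1, x) in Edge for (p2, y) in Edge
        if p1 == p2 and x != y}
delta = set(full)

while delta:
    # Rule 2: SG(x, y) <- Edge(a, x), SG(a, b), Edge(b, y), x != y
    # Only delta(SG) is joined; tuples already in full were handled earlier.
    new = {(x, y)
           for (a, b) in delta
           for (a2, x) in Edge if a2 == a
           for (b2, y) in Edge if b2 == b
           if x != y}
    delta = new - full          # keep only genuinely new tuples
    full |= delta               # fixed point when delta is empty

print(sorted(full))
```

The loop terminates when an iteration discovers no tuples it has not already seen, which is exactly the fixed point described above.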

More details on Semi-naïve Evaluation can be found in these notes.

This paper introduces the hash-indexed sorted array for storing relations while executing Semi-naïve Evaluation on a GPU. It seems to me like this data structure would work well on other chips too. Fig. 2 illustrates the data structure (join keys are in red):

The data array holds the actual tuple data. It is densely packed in row-major order.

The sorted index array holds pointers into the data array (one pointer per tuple). These pointers are lexicographically sorted (join keys take higher priority in the sort).

The hash table is an open-addressed hash table which maps a hash of the join keys to the first element in the sorted index array that contains those join keys.

A join of relations A and B can be implemented with the following pseudo-code:

for each tuple 'a' in the sorted index array of A:

  look up (via the hash table) the first tuple in B whose join keys match 'a'

  iterate over all tuples in the sorted index array of B with matching keys

Memory accesses when probing through the sorted index array are coherent. Memory accesses when accessing the data array are coherent up to the number of elements in a tuple.
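A hedged Python sketch of the structure and the probe loop above (all names are mine, and a Python dict stands in for the paper's open-addressed GPU hash table):

```python
# Sketch of a hash-indexed sorted array: a dense data array, an index array
# sorted lexicographically (join keys first), and a hash table mapping each
# join key to its first slot in the sorted index array.

def build(tuples, key_arity=1):
    data = list(tuples)                           # dense data array
    index = sorted(range(len(data)),              # sorted index array
                   key=lambda i: data[i])
    keys = [data[i][:key_arity] for i in index]   # join key at each slot
    ht = {}
    for pos, k in enumerate(keys):
        ht.setdefault(k, pos)                     # first slot per join key
    return data, index, ht, keys

def probe(rel, key):
    """Yield every tuple of rel whose join key matches `key`."""
    data, index, ht, keys = rel
    pos = ht.get(key)
    if pos is None:
        return
    while pos < len(keys) and keys[pos] == key:   # coherent forward scan
        yield data[index[pos]]
        pos += 1

# Join Edge with itself on the first column, as in rule 1.
Edge = [(0, 1), (1, 3), (0, 2)]
rel = build(Edge, key_arity=1)
SG = {(x, y) for (p, x) in Edge for (_, y) in probe(rel, (p,)) if x != y}
print(sorted(SG))  # prints [(1, 2), (2, 1)]
```

Because tuples with equal join keys occupy consecutive slots of the sorted index array, the probe is a linear scan, which is what makes the access pattern coherent on a GPU.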

Table 3 compares the results from this paper (GPULog) against a state-of-the-art CPU implementation (Soufflé). HIP represents GPULog ported to AMD’s HIP runtime and then run on the same Nvidia GPU.

The data structure and algorithms described by this paper seem generic; it would be interesting to see them run on other chips (FPGA, DPU, CPU, HPC cluster).

I would guess most of GPULog is bound by memory bandwidth, not compute. I wonder if there are Datalog-specific algorithms to reduce the bandwidth/compute ratio.
