Tailslayer: Library for reducing tail latency in RAM reads

Original link: https://github.com/LaurieWired/tailslayer

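## Tailslayer: Reducing RAM read latency

Tailslayer is a C++ library that aims to minimize the tail latency of RAM reads by mitigating DRAM refresh stalls. It does this by replicating data across multiple independent DRAM channels with different refresh schedules, using undocumented channel scrambling offsets that work on AMD, Intel, and Graviton processors. The library uses *hedged reads*: it requests the data from all replicas at once and uses whichever responds first. To integrate Tailslayer, include `hedged_reader.hpp` and provide two functions: a `Signal` function that decides *when* to read (and returns the index), and a `Work` function that processes the value that was read. The library currently supports two replicas (N-way operation is available in the benchmark). Tailslayer handles address calculation and core pinning automatically and exposes a logical-index interface. Each insert copies the data to every replica. The `discovery` directory contains benchmark tools for characterizing DRAM refresh behavior and evaluating performance.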

Original text

Tailslayer is a C++ library that reduces tail latency in RAM reads caused by DRAM refresh stalls.

It replicates data across multiple independent DRAM channels with uncorrelated refresh schedules, using (undocumented!) channel scrambling offsets that work on AMD, Intel, and Graviton. When a request comes in, Tailslayer issues hedged reads across all replicas, and the work is performed on whichever read returns first.
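
The "first responder wins" idea can be illustrated without any DRAM-specific machinery. The following is a minimal sketch, assuming ordinary in-memory copies and plain `std::thread` workers. `hedged_read` is a hypothetical name and not part of Tailslayer's API, and the sketch does not place replicas on separate channels or pin threads to cores.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: reads the same logical index from every replica in
// parallel and returns the value delivered by whichever thread finishes first.
template <typename T, std::size_t N>
T hedged_read(const std::array<std::vector<T>, N>& replicas, std::size_t index) {
    std::atomic<bool> claimed{false};  // set by the first thread to finish its read
    T result{};

    std::vector<std::thread> workers;
    workers.reserve(N);
    for (std::size_t r = 0; r < N; ++r) {
        workers.emplace_back([&, r] {
            T value = replicas[r][index];  // the hedged read itself
            // Only the first finisher publishes its value; later ones are dropped.
            if (!claimed.exchange(true, std::memory_order_acq_rel)) {
                result = value;
            }
        });
    }
    for (auto& w : workers) w.join();  // join makes `result` visible to the caller
    return result;
}
```

Because every replica holds the same data, the caller gets the same value regardless of which thread wins; only the latency differs.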

The library code is available in hedged_reader.cpp and the example using the library can be found in tailslayer_example.cpp. To use it, copy include/tailslayer into your project and #include <tailslayer/hedged_reader.hpp>. The library currently works with two channels (updates to come!), but full N-way usage is available in the benchmark.

You provide the value type and two functions as template parameters:

Signal function: Add the loop that waits for the external signal. This determines when to read. Return the desired index to read, and the read immediately fires.
Final work function: This receives the value immediately after it is read. Add the desired value processing code here.

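#include <cstddef>
#include <cstdint>

#include <tailslayer/hedged_reader.hpp>

[[gnu::always_inline]] inline std::size_t my_signal() {
    // Wait for your event, then return the index to read
    std::size_t index_to_read = 0;  // placeholder: replace with the index your event selects
    return index_to_read;
}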

template <typename T>
[[gnu::always_inline]] inline void my_work(T val) {
    // Use the value
}

int main() {
    using T = uint8_t;
    tailslayer::pin_to_core(tailslayer::CORE_MAIN);

    tailslayer::HedgedReader<T, my_signal, my_work<T>> reader{};
    reader.insert(0x43);
    reader.insert(0x44);
    reader.start_workers();
}
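
The signal function in the example above is left as a stub. As a rough sketch of what "wait for your event" could look like, the following spins on an atomic mailbox until a producer posts an index. `g_requested_index`, `kNoRequest`, and `post_request` are hypothetical names introduced for illustration; nothing here is provided by Tailslayer itself.

```cpp
#include <atomic>
#include <cstddef>
#include <limits>

// Hypothetical mailbox: a producer stores the index it wants read;
// kNoRequest means "nothing pending".
inline constexpr std::size_t kNoRequest = std::numeric_limits<std::size_t>::max();
inline std::atomic<std::size_t> g_requested_index{kNoRequest};

// Spins until a request is posted, then returns it. The slot is not cleared
// here, so every replica worker that polls it observes the same request.
[[gnu::always_inline]] inline std::size_t my_signal() {
    std::size_t index;
    while ((index = g_requested_index.load(std::memory_order_acquire)) == kNoRequest) {
        // busy-wait so the read can fire as soon as the request is visible
    }
    return index;
}

// Producer side: post the index that the workers should read.
inline void post_request(std::size_t index) {
    g_requested_index.store(index, std::memory_order_release);
}
```

A busy-wait is used here because the README describes each replica as spinning on its own core until the read happens; a blocking wait would add wake-up latency.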

Arguments can be passed to either function via ArgList:

tailslayer::HedgedReader<T, my_signal, my_work<T>,
    tailslayer::ArgList<1, 2>,   // args to signal function
    tailslayer::ArgList<2>       // args to final work function
> reader{};
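
To show how a compile-time argument list of this shape can forward values to a function, here is a small standalone sketch. It defines its own `ArgList` stand-in and a hypothetical `invoke_with` helper; it is an assumption about the general technique, not a description of how Tailslayer implements it.

```cpp
#include <cstddef>

// Hypothetical stand-in for a compile-time argument list.
template <auto... Values>
struct ArgList {};

// Calls Fn with the values carried by the ArgList type.
template <auto Fn, auto... Values>
constexpr auto invoke_with(ArgList<Values...>) {
    return Fn(Values...);
}

// Example signal-style function that takes two compile-time arguments.
constexpr std::size_t pick_index(int base, int offset) {
    return static_cast<std::size_t>(base + offset);
}
```

With these definitions, `invoke_with<pick_index>(ArgList<1, 2>{})` calls `pick_index(1, 2)`. Since the values are part of the type, the call can be fully inlined.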

You can also optionally pass a different channel offset, channel bit, and number of replicas to the constructor. Note: each insert copies the element N times, where N is the number of replicas. Tailslayer handles the address calculation internally, so it behaves like a hedged vector addressed by logical indices. Additionally, each replica is pinned to a separate core and spins on that core in the signal function until the read happens.
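
One way a logical index could map to per-replica addresses is sketched below. It assumes replicas are interleaved in chunks of `channel_offset` bytes, so that copies of the same element differ only by a multiple of that offset. The default of 256 bytes corresponds to channel bit 8 (the value passed to the benchmark in this README) and is used only for illustration. `replica_offset` is a hypothetical function, and the library's actual layout may differ.

```cpp
#include <cstddef>

// Byte offset of replica `replica` of logical element `index`, assuming
// replicas are interleaved in chunks of `channel_offset` bytes:
//   [chunk 0, replica 0][chunk 0, replica 1] ... [chunk 1, replica 0] ...
constexpr std::size_t replica_offset(std::size_t index, std::size_t replica,
                                     std::size_t elem_size,
                                     std::size_t num_replicas = 2,
                                     std::size_t channel_offset = std::size_t{1} << 8) {
    const std::size_t elems_per_chunk = channel_offset / elem_size;
    const std::size_t chunk = index / elems_per_chunk;  // which chunk group
    const std::size_t slot = index % elems_per_chunk;   // position inside the chunk
    return chunk * num_replicas * channel_offset  // skip earlier chunk groups
         + replica * channel_offset               // select the replica
         + slot * elem_size;                      // select the element
}
```

Under these assumptions, a 1-byte element at logical index 0 lands at offsets 0 and 256, which differ only in bit 8.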

make
./tailslayer_example

Benchmarks and spike timing

The discovery/ directory contains supporting code used to characterize DRAM refresh behavior:

discovery/benchmark/: Channel-hedged read benchmark
discovery/trefi_probe.c: Spike timing probe for measuring the refresh cycle

cd discovery/benchmark
make
sudo chrt -f 99 ./hedged_read_cpp --all --channel-bit 8
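
Once a probe has recorded per-read latencies, the refresh cycle can be estimated from the spacing of the latency spikes. The sketch below shows one possible post-processing step and is not taken from `trefi_probe.c`. The `Sample` struct, the threshold parameter, and `estimate_spike_period` are hypothetical, and the samples are assumed to be sorted by timestamp.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Sample {
    std::uint64_t timestamp_ns;  // when the read was issued
    std::uint64_t latency_ns;    // how long the read took
};

// Returns the median gap between consecutive latency spikes, or 0 if fewer
// than two samples exceed `spike_threshold_ns`.
inline std::uint64_t estimate_spike_period(const std::vector<Sample>& samples,
                                           std::uint64_t spike_threshold_ns) {
    std::vector<std::uint64_t> gaps;
    bool have_prev = false;
    std::uint64_t prev = 0;
    for (const Sample& s : samples) {
        if (s.latency_ns <= spike_threshold_ns) continue;  // not a spike
        if (have_prev) gaps.push_back(s.timestamp_ns - prev);
        prev = s.timestamp_ns;
        have_prev = true;
    }
    if (gaps.empty()) return 0;
    auto mid = gaps.begin() + static_cast<std::ptrdiff_t>(gaps.size() / 2);
    std::nth_element(gaps.begin(), mid, gaps.end());
    return *mid;  // upper median of the gaps
}
```

The median is used rather than the mean so that a missed spike, which doubles one gap, does not skew the estimate.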
