Show HN: Deterministic PCIe Diagnostics for GPUs on Linux

Original link: https://github.com/parallelArchitect/gpu-pcie-diagnostic

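This command-line tool provides a deterministic way to diagnose GPU PCIe link health without relying on system modifications or assumptions. It directly measures PCIe link state (generation, width), copy bandwidth (host-to-device and device-to-host), and sustained utilization via NVML hardware counters.

The tool gives a clear "OK", "DEGRADED", or "UNDERPERFORMING" verdict based *only* on observable data, identifying problems such as unexpected link negotiation (for example, x8 instead of x16) or reduced bandwidth. It does not attempt to fix problems; it only reports them objectively.

Key features include detailed reporting of theoretical and measured bandwidth, efficiency calculations, and optional integrity checks through PCIe Advanced Error Reporting. Logging in CSV and JSON formats, keyed by unique GPU UUIDs, enables reproducible baselines and time-series analysis.

The tool runs on Linux (tested on Ubuntu) and requires the NVIDIA driver and the CUDA Toolkit. It is designed to isolate PCIe link performance from kernel and workload effects, providing a reliable way to identify and prove PCIe-related bottlenecks.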

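A new Linux tool, shared on Hacker News, aims to diagnose GPU PCIe link health and bandwidth, problems that typical software usually hides. Developed by gpu_systems (github.com/parallelarchitect), it uses NVML and sysfs to report key metrics such as PCIe generation, width, and sustained transfer rate.

The tool issues a "verdict" on link quality based purely on hardware data, covering problems such as generation downgrades or reduced lane width caused by risers or bifurcation, which kernel tuning cannot resolve.

Currently the tool is **NVIDIA only**, as it depends on NVIDIA's management library. Users suggested adding features such as memory block checks, but the developer clarified its current focus. It relies on Linux-specific features that are unavailable on Windows.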

Original text

A deterministic command-line tool for validating GPU PCIe link health, bandwidth, and real-world PCIe utilization using only observable hardware data.

This tool answers one question reliably:

Is my GPU’s PCIe link behaving as it should, and can I prove it?

No registry hacks. No BIOS assumptions. No “magic” optimizations.

Only measurable link state, copy throughput, and hardware counters.

This tool performs hardware-observable PCIe diagnostics and reports factual results with deterministic verdicts.

It measures and reports directly from GPU hardware:

  • PCIe current and maximum link generation and width (via NVML)
  • Peak Host→Device and Device→Host copy bandwidth using CUDA memcpy timing
  • Sustained PCIe utilization under load using NVML TX/RX counters
  • Efficiency relative to theoretical PCIe payload bandwidth
  • Clear VERDICT from observable conditions only
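
The efficiency figure is measured throughput divided by the link's theoretical payload bandwidth. The README does not state the constants the tool uses, so the following is only a minimal Python sketch under a common assumption: payload bandwidth is the per-lane signalling rate times the line-encoding efficiency (8b/10b for Gen1 and Gen2, 128b/130b for Gen3 to Gen5) times the lane count. The names `PCIE_GENERATIONS`, `theoretical_gbps`, and `efficiency_pct` are illustrative and do not come from the tool.

```python
# Assumption: (raw rate per lane in GT/s, line-encoding efficiency) per generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}


def theoretical_gbps(gen: int, width: int) -> float:
    """Payload bandwidth in GB/s for one direction of a GenN xW link."""
    rate_gt_s, encoding = PCIE_GENERATIONS[gen]
    # One transfer carries one bit per lane; divide by 8 to get bytes.
    return rate_gt_s * encoding * width / 8


def efficiency_pct(measured_gbps: float, gen: int, width: int) -> float:
    """Measured throughput as a percentage of the theoretical payload rate."""
    return 100.0 * measured_gbps / theoretical_gbps(gen, width)


if __name__ == "__main__":
    print(f"Gen3 x16: {theoretical_gbps(3, 16):.2f} GB/s")
    print(f"12.5 GB/s on Gen3 x16: {efficiency_pct(12.5, 3, 16):.1f}%")
```

Under these assumptions a Gen3 x16 link comes out at roughly 15.75 GB/s, close to the 15.76 GB/s shown in the sample output; the small difference suggests the tool rounds its per-lane constant differently. Packet (TLP/DLLP) overhead is not modelled here, which is one reason real copies never reach 100%.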

The tool does not attempt to tune, fix, or modify system configuration. It reports one of three verdicts:

  • OK — The negotiated PCIe link and measured throughput are consistent with expected behavior.
  • DEGRADED — The GPU is operating below its maximum supported PCIe generation or width.
  • UNDERPERFORMING — The full link is negotiated, but sustained bandwidth is significantly lower than expected.

Verdicts are rule-based and derived only from measured data.
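
The three verdicts can be read as an ordered rule check: link negotiation first, throughput second. The sketch below is a hypothetical reconstruction in Python, not the tool's actual logic. The `LinkSample` type, the `verdict` function, and especially the 70% efficiency threshold are assumptions, since the README only says "significantly lower than expected".

```python
from dataclasses import dataclass


@dataclass
class LinkSample:
    cur_gen: int           # negotiated PCIe generation
    cur_width: int         # negotiated lane count
    max_gen: int           # maximum generation the GPU supports
    max_width: int         # maximum lane count the GPU supports
    efficiency_pct: float  # measured / theoretical payload bandwidth


def verdict(sample: LinkSample, min_efficiency_pct: float = 70.0) -> str:
    """Return OK, DEGRADED, or UNDERPERFORMING from measured values only.

    min_efficiency_pct is an assumed threshold, not a documented one.
    """
    # Rule 1: the link trained below what the GPU supports.
    if sample.cur_gen < sample.max_gen or sample.cur_width < sample.max_width:
        return "DEGRADED"
    # Rule 2: full link, but throughput is well below the theoretical rate.
    if sample.efficiency_pct < min_efficiency_pct:
        return "UNDERPERFORMING"
    return "OK"


if __name__ == "__main__":
    print(verdict(LinkSample(3, 16, 3, 16, 93.5)))  # full link, good throughput
    print(verdict(LinkSample(3, 8, 3, 16, 93.5)))   # x8 instead of x16
```

Checking negotiation before throughput matters: a link running at x8 will also show low throughput relative to x16, and reporting it as DEGRADED points at the cause rather than the symptom.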

Modern systems frequently exhibit PCIe issues that are difficult to diagnose:

  • GPUs negotiating x8 / x4 / x1 instead of x16
  • PCIe generation downgrades after BIOS or firmware updates
  • Slot bifurcation, riser cable, or motherboard lane-sharing issues
  • Reduced PCIe bandwidth occurring while system status is reported as normal
  • Confusion between PCIe transport limits and workload bottlenecks
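
Several of these conditions, such as a card training at x8 instead of x16, are visible in the negotiated link state that the Linux kernel exposes under sysfs. The tool itself reads link state via NVML; the Python sketch below is only an independent cross-check, assuming the standard `current_link_speed`, `current_link_width`, `max_link_speed`, and `max_link_width` attributes of a PCI device directory. The function names are illustrative.

```python
from pathlib import Path

# Per-lane signalling rate in GT/s mapped to PCIe generation.
SPEED_TO_GEN = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}


def parse_speed(text: str):
    """Turn a sysfs speed string such as '8.0 GT/s PCIe' into a generation.

    Returns None for values like 'Unknown' or rates not in the table.
    """
    try:
        rate = float(text.split()[0])
    except (ValueError, IndexError):
        return None
    return SPEED_TO_GEN.get(rate)


def read_link(device_dir: str) -> dict:
    """Read current and maximum link state from one PCI device directory,
    for example /sys/bus/pci/devices/0000:01:00.0 (path is an example)."""
    dev = Path(device_dir)

    def attr(name: str) -> str:
        return (dev / name).read_text().strip()

    return {
        "cur_gen": parse_speed(attr("current_link_speed")),
        "cur_width": int(attr("current_link_width")),
        "max_gen": parse_speed(attr("max_link_speed")),
        "max_width": int(attr("max_link_width")),
    }
```

A `cur_width` below `max_width` in this output is the x8/x4/x1 case from the list above. Note that the current generation can legitimately read lower when the GPU is idle, so a comparison is most meaningful while the device is under load.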

This tool exists to provide:

  1. A reproducible PCIe diagnostic baseline
  2. Hardware-level proof of PCIe behavior
  3. Isolation of link negotiation from kernel/workload effects

Example output:

GPU PCIe Diagnostic & Bandwidth Analysis v2.7.4
GPU:   NVIDIA GeForce GTX 1080
BDF:   00000000:01:00.0
UUID:  GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (redacted)

PCIe Link
  Current: Gen3 x16
  Max Cap: Gen3 x16
  Theoretical (payload): 15.76 GB/s
  Transfer Size: 1024 MiB

Peak Copy Bandwidth
  Host → Device: 12.5 GB/s
  Device → Host: 12.7 GB/s

Telemetry (NVML)
  Window:   5.0 s (50 samples @ 100 ms)
  TX avg:   7.6 GB/s
  RX avg:   7.1 GB/s
  Combined: 14.7 GB/s

Verdict
  State:      OK
  Reason:     Throughput and link state are consistent with a healthy PCIe path
  Efficiency: 93.5%

System Signals (informational)
  MaxReadReq: 512 bytes
  Persistence Mode: Disabled
  ASPM Policy (sysfs string): [default] performance powersave powersupersave
  IOMMU: Platform default (no explicit flags)

Requirements

  • NVIDIA GPU with a supported driver
  • CUDA Toolkit (for nvcc)
  • NVML development library (-lnvidia-ml)

Platform Compatibility Note

  • Linux operating system
  • Tested on Ubuntu 24.04.3 LTS

Permissions & Logging Notes

On some Linux systems, PCIe and NVML diagnostics require elevated privileges due to kernel and driver access controls. If log files were previously created using sudo, the results directory may become root-owned. In that case, subsequent runs may prompt for a password when appending logs.

To restore normal user access to the results directory:

sudo chown -R $USER:$USER results/

To build:

make

or manually:

nvcc -O3 pcie_diagnostic_pro.cu -lnvidia-ml -Xcompiler -pthread -o pcie_diag

To run with a 1024 MiB transfer size:

./pcie_diag 1024

To enable logging:

./pcie_diag 1024 --log --csv
./pcie_diag 1024 --log --json
./pcie_diag 1024 --log --csv --json

Logs are written to:

  • results/csv/pcie_log.csv
  • results/json/pcie_sessions.json
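
Because the CSV log is appended to on each run, a baseline comparison amounts to grouping rows by device. The README does not list the log's column names, so the Python sketch below takes the grouping column as a parameter; the `uuid` and `h2d_gbps` headers in the sample data are hypothetical placeholders, not the tool's real schema.

```python
import csv
import io
from collections import defaultdict


def rows_by_column(csv_file, key: str) -> dict:
    """Group CSV rows (as dicts keyed by header name) by the given column."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[key]].append(row)
    return dict(groups)


# Hypothetical sample; the real header names must be read from pcie_log.csv.
SAMPLE = """uuid,h2d_gbps
GPU-aaaa,12.5
GPU-aaaa,12.4
GPU-bbbb,6.1
"""

groups = rows_by_column(io.StringIO(SAMPLE), "uuid")
for device, rows in groups.items():
    print(device, len(rows), "runs")
```

For the real log, pass an open file instead, for example `open("results/csv/pcie_log.csv", newline="")`, and use whichever column holds the device UUID.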

Extended Telemetry Window

./pcie_diag 1024 --duration-ms 8000

  • A longer telemetry window improves measurement stability

Optional Integrity Counters

./pcie_diag 1024 --integrity

  • Enables read-only inspection of PCIe Advanced Error Reporting (AER) counters via Linux sysfs, if exposed by the platform.
  • If counters are unavailable on the platform, integrity checks are automatically skipped with clear reporting.
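
On kernels that expose AER statistics, each PCI device directory in sysfs may contain `aer_dev_correctable`, `aer_dev_nonfatal`, and `aer_dev_fatal` files, each a list of "name count" lines. Whether the tool reads exactly these files is an assumption; the Python sketch below only illustrates a read-only inspection with a graceful skip. It also converts the 8-digit PCI domain that NVML prints (as in the BDF `00000000:01:00.0`) to the 4-digit form used in sysfs paths. All function names are illustrative.

```python
from pathlib import Path

AER_FILES = ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal")


def sysfs_bdf(nvml_bdf: str) -> str:
    """Convert '00000000:01:00.0' (NVML style) to '0000:01:00.0' (sysfs style)."""
    domain, rest = nvml_bdf.split(":", 1)
    return f"{int(domain, 16):04x}:{rest.lower()}"


def parse_aer(text: str) -> dict:
    """Parse 'name count' lines; names may contain spaces on some kernels."""
    counters = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 1)
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_aer(bdf: str, root: Path = Path("/sys/bus/pci/devices")) -> dict:
    """Return counters per AER file, or None where the file is unavailable."""
    dev = root / sysfs_bdf(bdf)
    result = {}
    for name in AER_FILES:
        try:
            result[name] = parse_aer((dev / name).read_text())
        except OSError:
            result[name] = None  # not exposed on this platform: skip, do not fail
    return result
```

A nonzero and growing correctable-error total while throughput is low would point toward a signal-integrity problem (riser, slot, cable) rather than a software bottleneck.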

Multi-GPU Logging Behavior

When running in multi-GPU mode (--all-gpus), each detected GPU is evaluated independently.

  • One result row (CSV) or object (JSON) is emitted per GPU per run.
  • Each entry includes device UUID and PCIe BDF for unambiguous attribution.
  • Multi-GPU configurations have not been exhaustively validated on all platforms.
  • Users are encouraged to verify results on their specific hardware.

Example:

./pcie_diag 1024 --all-gpus --log --csv
./pcie_diag 1024 --all-gpus --log --json 
./pcie_diag 1024 --gpu-index 1     # Target single GPU by index

Logging & Reproducibility

  • CSV and JSON logs include stable device identifiers
  • Device UUIDs are reported at runtime via NVML for consistent identification across runs
  • UUIDs shown in documentation are intentionally redacted
  • Logs are append-friendly for time-series analysis and automated monitoring

Scope & Limitations

  • This tool evaluates PCIe transport behavior only
  • It does not measure kernel performance or application-level efficiency
  • It does not modify BIOS, firmware, registry, or PCIe configuration
  • It reports observable facts only and never infers beyond available data

Validation

  • Memcpy timing and PCIe behavior were cross-validated during development using Nsight Systems.
  • Nsight is not required to use this tool and is referenced only as an external correctness check.

Author: Joe McLaren (Human–AI collaborative engineering) https://github.com/parallelArchitect

MIT License

Copyright (c) 2025 Joe McLaren

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
