Zml-smi: universal monitoring tool for GPUs, TPUs and NPUs

Original link: https://zml.ai/posts/zml-smi/

## zml-smi: universal hardware monitoring

zml-smi is a comprehensive diagnostic and monitoring tool for GPUs, TPUs, and NPUs, and a versatile alternative to tools such as nvidia-smi and nvtop. It provides real-time insight into the hardware performance and health of NVIDIA, AMD, Google TPU, and AWS Trainium devices, with support for more platforms planned as ZML expands.

Key features include real-time device utilization, temperature, and memory usage via the `--top` flag; host-level metrics such as CPU utilization and memory; and details on the processes using each device, including their resource consumption.

zml-smi is designed for portability: it requires only the device driver and GLIBC, and runs in a fully sandboxed environment. It collects detailed metrics through existing libraries (NVML for NVIDIA, AMD SMI for AMD) and APIs (gRPC for TPU, libnrt for Trainium), mirroring the data from tools such as tpu-info and neuron-top, and can even keep AMD GPU identification up to date via a downloaded IDs file.

Hacker News discussion (submitted by steeve, 3 comments):

rdyro: Looks great! nvtop can also support TPUs via https://github.com/rdyro/libtpuinfo/ and https://github.com/Syllo/nvtop/blob/76890233d759199f50ad3bdb...

mrflop: Renaming fopen64 to intercept library calls feels like a fragile hack dressed up as a "sandbox". Why not upstream this hardware support into nvtop instead of fragmenting the ecosystem?

steeve (reply): Unfortunately, the sandboxing can't be upstreamed. This way the sandboxing lives in zml instead of patching mesa. As for nvtop: it's a great program, but it is missing some features we need (sandboxing, for example).

Original article

zml-smi is a universal diagnostic and monitoring tool for GPUs, TPUs and NPUs. It provides real-time insights into the performance and health of your hardware.

It is a mix of nvidia-smi and nvtop.

It transparently supports every platform ZML supports: NVIDIA, AMD, Google TPU, and AWS Trainium devices. It will be extended to more platforms in the future as ZML continues to expand its hardware support.

Getting started

You can download zml-smi from the official mirror.

$ curl -LO 'https://mirror.zml.ai/zml-smi/zml-smi-v0.2.tar.zst'
$ tar -xf zml-smi-v0.2.tar.zst
$ ./zml-smi/zml-smi

Listing devices

$ zml-smi

Monitoring devices

The --top flag provides real-time monitoring of device performance, including utilization, temperature, and memory usage.

$ zml-smi --top

Completely sandboxed

zml-smi doesn’t require any software on the target machine besides the device driver and GLIBC (needed mostly because some vendor shared objects are loaded).

Host

zml-smi displays host-level metrics such as CPU model and utilization, memory usage, and temperature.

Available metrics

Hostname, Kernel, CPU Model, CPU Core Count, Memory Used / Total, Uptime, Load Average (1m / 5m / 15m), Device Count

Processes

zml-smi also provides insights into the processes utilizing the devices, including their resource usage and command lines. This is available for all platforms.

Available metrics

PID, Device Index, Device Utilization, Device Memory, Process Command Line

NVIDIA

Metrics are provided through the NVML library, which ships with the driver and is therefore expected to be present on the system.

Available metrics

GPU Utilization, Temperature, Fan Speed, Power Draw, Power Limit, Encoder Utilization, Decoder Utilization, VRAM Used, VRAM Total, Memory Bus Width, Graphics Clock, SM Clock, Memory Clock, Max Graphics Clock, Max Memory Clock, PCIe Link Generation, PCIe Link Width, PCIe TX Throughput, PCIe RX Throughput

AMD

Metrics are provided through the AMD SMI library. zml-smi ships with it in its sandbox.

In order to support the latest AMD GPUs, zml-smi downloads the amdgpu.ids file at build time from both Mesa and ROCm (7.2.1 at the time of this article) and merges the two. This allows zml-smi to recognize and report on the latest AMD GPU models even if they are not yet included in the official ROCm release, as is the case for the Ryzen AI Max+ 395 (Strix Halo), for instance.

Sandboxing that file turned out to be somewhat tricky. Because libdrm-amdgpu expects to find it at /opt/amdgpu/share/libdrm/amdgpu.ids, we had to get a bit creative: we didn’t want to install anything outside the binary sandbox, nor did we want to patch that string inside libdrm.

So we created a shared object named zmlxrocm.so that is added to the DT_NEEDED section of libdrm_amdgpu.so.1. Then, fopen64 is renamed to zmlxrocm_fopen64, which is then provided by zmlxrocm.so. Since we now sit between libdrm and fopen64, we can intercept the call to fopen64, compare the path against /opt/amdgpu/share/libdrm/amdgpu.ids and redirect it to the sandboxed copy of the file.

Available metrics

GPU Utilization, Memory Usage, Temperature, Fan Speed, Power Draw, Power Limit, VRAM Used, VRAM Total, Graphics Clock, SoC Clock, Memory Clock, Max Graphics Clock, Max Memory Clock, PCIe Bandwidth, PCIe Link Generation, PCIe Link Width

TPU

Metrics are provided via the local gRPC endpoint exposed by the TPU runtime. Those are the same metrics exposed to the tpu-info tool from Google.

Available metrics

TensorCore Duty Cycle, HBM Used, HBM Total

AWS Trainium

Metrics are provided through a private API found in libnrt.so, which zml-smi embeds in its sandbox. Those are the same metrics provided by the neuron-top utility.

Available metrics

Core Utilization, HBM Used, HBM Total, Tensor Memory, Constant Memory, Model Code, Shared Scratchpad, Nonshared Scratchpad, Runtime Memory, Driver Memory, DMA Rings, Collectives, Notifications
