nvidia-smi hangs indefinitely after ~66 days

Original link: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/971

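This bug report describes a problem with the NVIDIA open GPU kernel modules on OpenEuler 2.0 (LTS-SP2) with kernel 6.6.0-100. With driver 570.133.20 (OpenRM) on a B200 GPU, `nvidia-smi` hangs indefinitely after roughly 66 days 12 hours of uptime. The `dmesg` output shows repeated `knvlink` errors: the Rx Detect link mask fails to update, and fetching the post-Rx-detect link mask fails for peer0 and peer1. The problem appeared after a long period of uptime and has been observed once. The issue template asks reporters to confirm that the problem does not occur with the proprietary driver of the same version and that a stable, non-RC kernel is in use; the report as captured here gives no explicit answer to either. The `nvidia-bug-report.log.gz` field is answered "no", and no further information is provided.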


Original text

NVIDIA Open GPU Kernel Modules Version

[root@A11-R42-I61-42-5504045 ~]# cat /proc/driver/nvidia/params
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
RmNvlinkBandwidthLinkCount: 0
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 1
DmaRemapPeerMmio: 1
ImexChannelCount: 2048
CreateImexChannel0: 0
GrdmaPciTopoCheckOverride: 0
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

[root@A11-R42-I61-42-5504045 ~]# cat /etc/openeuler-release
openeuler release 2.0 (LTS-SP2)
[root@A11-R42-I61-42-5504045 ~]#

Kernel Release

[root@A11-R42-I61-42-5504045 ~]# uname -a
Linux A11-R42-I61-42-5504045. 6.6.0-100. SMP Fri Aug 22 10:50:04 CST 2025 x86_64 x86_64 x86_64 GNU/Linux
[root@A11-R42-I61-42-5504045 ~]# uname -r
6.6.0-100

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

B200

Describe the bug

nvidia-smi hangs indefinitely after ~66 days 12 hours uptime with driver 570.133.20 OpenRM on B200

[root@A11-R42-I61-42-5504045 ~]# dmesg -T | grep -i nvrm | head -n 10
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:58 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:58 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:09:02 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:09:02 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer0's postRxDetLinkMask failed!
[Sat Nov 22 05:09:06 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:09:06 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[root@A11-R42-I61-42-5504045 ~]#
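
The excerpt above can be summarized mechanically. What follows is a minimal Python sketch, not part of the original report, that counts the NVRM messages per function and per peer and measures the spacing between repeats. The function name `summarize` and both regular expressions are illustrative assumptions based only on the `dmesg -T` line layout shown here; other driver versions may format messages differently.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout: "[<timestamp>] NVRM: <function>: <message>"
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+NVRM:\s+(?P<func>\w+):\s+(?P<msg>.*)$")
PEER_RE = re.compile(r"peer(\d+)")

# The ten lines quoted in the report.
SAMPLE = """\
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:58 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:58 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:09:02 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:09:02 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer0's postRxDetLinkMask failed!
[Sat Nov 22 05:09:06 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:09:06 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
"""


def summarize(log_text):
    """Return (messages per function, messages per peer, gaps in seconds)."""
    funcs, peers, stamps = Counter(), Counter(), []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip prompts and non-NVRM lines
        funcs[m["func"]] += 1
        peer = PEER_RE.search(m["msg"])
        if peer:
            peers[int(peer.group(1))] += 1
        # %a and %b are locale-dependent; this assumes an English/C locale.
        stamps.append(datetime.strptime(m["ts"], "%a %b %d %H:%M:%S %Y"))
    unique = sorted(set(stamps))
    gaps = [(b - a).total_seconds() for a, b in zip(unique, unique[1:])]
    return funcs, peers, gaps


funcs, peers, gaps = summarize(SAMPLE)
print(dict(funcs))
print(dict(peers))
print(gaps)
```

On the quoted lines this shows the two messages arriving as a pair every 4 seconds, mostly for peer1. The excerpt is only the first ten matches (`head -n 10`), so the counts say nothing about the full log.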

[root@A11-R42-I61-42-5504045 ~]# uptime
22:50:02 up 67 days, 6:11, 2 users, load average: 17.40, 16.73, 18.67
[root@A11-R42-I61-42-5504045 ~]# last reboot
reboot system boot 6.6.0-100. Tue Sep 16 16:38 still running
reboot system boot 6.6.0-100 Tue Sep 9 17:02 - 16:34 (6+23:32)
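
As a cross-check of the "~66 days 12 hours" figure, the sketch below (an addition, not from the original report) subtracts the boot time reported by `last reboot` from the timestamp of the first NVRM line in the `dmesg` excerpt. It assumes the boot entry refers to 2025 (`last` does not print the year), that both timestamps are in the same time zone, and that the first quoted NVRM line marks the onset of the problem, which the report does not state explicitly. The helper name `elapsed_since_boot` is illustrative.

```python
from datetime import datetime


def elapsed_since_boot(boot, event):
    """Return the time between boot and a later event as a timedelta."""
    return event - boot


# "reboot system boot 6.6.0-100. Tue Sep 16 16:38 still running"
# The year 2025 is an assumption; `last` omits it.
boot = datetime(2025, 9, 16, 16, 38)

# "[Sat Nov 22 05:08:50 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: ..."
first_error = datetime(2025, 11, 22, 5, 8, 50)

elapsed = elapsed_since_boot(boot, first_error)
print(elapsed)                        # 66 days, 12:30:50
print(int(elapsed.total_seconds()))   # 5747450
```

Under those assumptions the first logged error falls about 66 days 12.5 hours after boot, which matches the figure in the report. Note that `dmesg -T` timestamps can drift from wall-clock time, so the result is approximate.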

To Reproduce

nvidia-smi hangs indefinitely after ~66 days 12 hours uptime with driver 570.133.20 OpenRM on B200 and kernel 6.6.0
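
Since reproduction requires waiting for weeks of uptime, one way to notice the hang when it happens is to run the command periodically under a timeout. The following is a minimal Python sketch, not part of the original report; the function name `probe`, the status strings, and the timeout values are all illustrative assumptions. It deliberately avoids blocking forever on cleanup, because a process stuck inside the kernel may not respond to a kill signal.

```python
import subprocess
import sys


def probe(cmd, timeout_s):
    """Run cmd once and return 'ok', 'failed', 'hung', or 'missing'."""
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except FileNotFoundError:
        return "missing"  # command is not installed or not on PATH
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # best effort; a process blocked in the kernel may ignore it
        try:
            proc.wait(timeout=1)  # bounded wait so the watchdog itself cannot hang
        except subprocess.TimeoutExpired:
            pass
        return "hung"
    return "ok" if rc == 0 else "failed"


# Self-check with a command that sleeps longer than the timeout.
print(probe([sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=1))

# Hypothetical use on a GPU host (the 30 s threshold is an arbitrary choice):
#   status = probe(["nvidia-smi"], timeout_s=30)
```

A "hung" result only says the command did not finish within the chosen timeout; it does not by itself confirm the condition described in this report.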

Bug Incidence

Once

nvidia-bug-report.log.gz

no

More Info

No response
