Show HN: Autonomous recovery for distributed training jobs

原始链接: https://docs.tensorpool.dev/features/agent

## TensorPool Agent: Autonomous Training Job Recovery (Beta)

The TensorPool Agent is a beta system that automatically monitors and recovers long-running distributed training jobs (days to weeks) on Kubernetes, Slurm, or TensorPool Jobs. It focuses on runtime errors that occur *after* the first checkpoint, such as GPU faults, communication failures, and infrastructure/storage problems, with the goal of saving GPU hours and iteration cycles.

The agent works by analyzing logs and attempting to restart the job from the latest checkpoint, but **only with permissions you have explicitly whitelisted**. If a recovery is attempted, you are notified by SMS/email. If it succeeds, training resumes; otherwise, you receive a root cause analysis and suggested actions.

**It currently cannot fix early errors**, such as dependency issues. Setup requires providing credentials (job ID, kubeconfig, or Slurm login details) through the TensorPool console.

The agent cycles through the following states: pending, enabled, credential error, recovering, and completed. It is currently in beta, and feedback is welcome!

## TensorPool Agent: Autonomous Recovery for Distributed Training

TensorPool, a large-scale compute company focused on foundation model training, is releasing a public beta of its **TensorPool Agent**. The tool aims to eliminate the frustrating and costly 3 a.m. job crashes that plague long training runs (days to weeks).

By analyzing more than 100,000 hours of multi-node GPU runtime, TensorPool identified common failure points such as GPU errors (Xid errors), storage problems (S3 timeouts), and communication failures. The agent autonomously monitors jobs running on Kubernetes, Slurm, or TensorPool Jobs, diagnoses these problems, and recovers the job from its last checkpoint.

If automatic recovery fails, the agent provides a root cause analysis and the fixes it attempted, to speed up debugging. It currently focuses on runtime errors, and work is underway to detect "silent" failures, where a job appears to be running but makes no progress. TensorPool is seeking user feedback on its current recovery approach and on common failure modes.
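As an aside on the "silent" failure case mentioned above, one simple heuristic (a sketch under assumed conventions, not TensorPool's detector) is to treat a job as stalled when its newest checkpoint stops getting newer while the scheduler still reports it as running:

```python
# Sketch of one way to flag a "silent" failure (assumption, not TensorPool's detector):
# if the newest checkpoint file stops advancing for too long while the job still
# reports itself as running, treat the job as stalled.
import glob
import os
import time


def is_stalled(ckpt_dir: str, max_idle_seconds: float = 2 * 3600) -> bool:
    ckpts = glob.glob(os.path.join(ckpt_dir, "*.pt"))
    if not ckpts:
        return False                                   # nothing written yet; too early to judge
    newest = max(os.path.getmtime(p) for p in ckpts)   # most recent checkpoint write
    return (time.time() - newest) > max_idle_seconds
```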

Original text

The TensorPool Agent is currently in beta. We’d love your feedback!

The TensorPool Agent is an autonomous monitoring and recovery system for long-running distributed training jobs on Kubernetes, Slurm, or TensorPool Jobs. It’s designed for large multi-node training jobs that run for days to weeks. When the TensorPool Agent detects a runtime error, it attempts to autonomously recover your training job from its last checkpoint. You explicitly whitelist the actions the TensorPool Agent can take on your behalf.

Best case: The TensorPool Agent recovers your training job when you are AFK, letting you get more iteration cycles and avoid burning GPU hours. Worst case: The TensorPool Agent delivers a preliminary root cause analysis and the actions it would have taken.
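For recovery from the last checkpoint to be possible at all, the training script itself has to checkpoint and resume. The snippet below is a minimal sketch of that pattern (the paths, interval, and PyTorch-style loop are illustrative assumptions, not part of TensorPool's API); an external agent can then simply re-launch the same script after a failure and training continues from the newest checkpoint instead of step 0.

```python
# Minimal checkpoint/resume pattern (illustrative only; not TensorPool code).
import glob
import os

import torch

CKPT_DIR = "/shared/checkpoints"   # assumed shared storage visible to every node
CKPT_EVERY = 500                   # steps between checkpoints (illustrative)


def latest_checkpoint():
    ckpts = glob.glob(os.path.join(CKPT_DIR, "step_*.pt"))
    return max(ckpts, key=os.path.getmtime) if ckpts else None


def train(model, optimizer, batches, total_steps):
    # Resume from the newest checkpoint if one exists.
    start_step = 0
    ckpt = latest_checkpoint()
    if ckpt is not None:
        state = torch.load(ckpt, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_step = state["step"] + 1   # continue where the previous run stopped

    # Data-order/shuffling state is elided in this sketch.
    for step in range(start_step, total_steps):
        loss = model(next(batches)).mean()   # placeholder forward pass / loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % CKPT_EVERY == 0:
            os.makedirs(CKPT_DIR, exist_ok=True)
            torch.save(
                {"model": model.state_dict(),
                 "optimizer": optimizer.state_dict(),
                 "step": step},
                os.path.join(CKPT_DIR, f"step_{step}.pt"),
            )
```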

Target Failures

The TensorPool Agent is designed to address runtime errors that occur deep into training:
  • GPU hardware faults: Xid errors (79, 63, 48, etc.)
  • Distributed communication failures, NCCL errors
  • Infrastructure problems: hardware failures, kernel panics
  • Storage problems: I/O errors, checkpoint corruption, S3 timeouts
  • Network problems: mounted object storage bucket issues
  • GPU memory problems: CUDA out of memory, memory leaks, gradient explosion

The TensorPool Agent is not intended to fix errors that occur early in training, such as dependency issues or distributed communication initialization failures. It’s designed to solve issues that occur after the first checkpoint.
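As a rough illustration of the kind of log analysis involved (an assumption about the general approach, not TensorPool's implementation), a recovery system can match a failed job's log tail against known signatures of the failure classes listed above before deciding how to restart:

```python
# Sketch: classify a failed job's logs against known runtime-failure signatures.
# The patterns and categories below are illustrative, not TensorPool's rule set.
import re

FAILURE_PATTERNS = {
    "gpu_xid":  re.compile(r"NVRM: Xid .*?: (\d+)"),                                  # GPU hardware faults
    "nccl":     re.compile(r"NCCL (?:WARN|error)", re.IGNORECASE),                    # communication failures
    "cuda_oom": re.compile(r"CUDA out of memory"),                                    # GPU memory problems
    "storage":  re.compile(r"(Read timeout|I/O error|checkpoint.*corrupt)", re.IGNORECASE),
}


def classify_failure(log_text: str) -> list[str]:
    """Return the failure categories whose signatures appear in the log."""
    return [name for name, pat in FAILURE_PATTERNS.items() if pat.search(log_text)]


# Example: an Xid 79 ("GPU has fallen off the bus") is classified as a GPU
# hardware fault, which suggests swapping the node before resuming.
print(classify_failure("kernel: NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus."))
# -> ['gpu_xid']
```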

How It Works

  1. Registration: Provide credentials to access your job scheduler of choice (Slurm, K8s, or TensorPool Jobs) on the TensorPool Agent dashboard. Whitelist the actions you allow the agent to take on your behalf.
  2. Monitoring: The training job is continuously monitored for failure.
  3. Recovery (if job fails): The TensorPool Agent analyzes logs and attempts to diagnose and fix the issue. The job enters a recovering state.
  4. Resolution: If recovery succeeds, monitoring resumes. You’re alerted about the failure, actions taken, and recovery status. If the TensorPool Agent lacks permissions, it provides a list of actions it attempted and would have tried.
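
Putting the steps above together, the sketch below shows a conceptual monitor, diagnose, recover loop. Everything here is an assumption for illustration rather than TensorPool's implementation: scheduler access is injected as plain callables because the real calls differ between Slurm, Kubernetes, and TensorPool Jobs, and `classify_failure` refers to the log-matching sketch earlier on this page.

```python
# Conceptual monitor -> diagnose -> recover loop (illustrative, not TensorPool code).
import time
from typing import Callable


def monitor_and_recover(
    get_status: Callable[[], str],               # returns "running" | "completed" | "failed"
    fetch_logs: Callable[[], str],               # returns the job's recent log tail
    restart_from_checkpoint: Callable[[], None], # re-launches the job from its last checkpoint
    notify: Callable[[str], None],               # SMS/email hook
    allowed_actions: set[str],                   # the whitelist configured at registration
    poll_seconds: int = 60,
) -> None:
    while True:
        status = get_status()
        if status == "running":
            time.sleep(poll_seconds)
        elif status == "completed":
            notify("Training finished.")
            return
        else:
            # Job failed: diagnose from logs, then act only within the whitelist.
            categories = classify_failure(fetch_logs())   # see the earlier log sketch
            if "restart_from_last_checkpoint" in allowed_actions:
                restart_from_checkpoint()
                notify(f"Detected {categories}; restarted from last checkpoint.")
            else:
                notify(f"Detected {categories}; no permitted action. See root cause analysis.")
                return
```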