Show HN: Virtual SLURM HPC cluster in a Docker Compose

Original link: https://github.com/exactlab/vhpc

## Docker-based SLURM HPC Cluster

This project provides a lean, production-ready multi-container High Performance Computing (HPC) environment built on Docker and Rocky Linux 9. Developed by eXact lab S.r.l., it virtualizes an HPC system running the SLURM workload manager with OpenMPI support and, optionally, full job accounting via MariaDB.

The default setup includes one head node, two worker nodes (4 vCPUs / 2 GB RAM each), and an optional database node. SSH access is provided (using generated keys, *not suitable for production*) for interacting with the cluster. Users are synchronized across nodes through a mounted home volume, and the SLURM configuration is shared.

Key features include intra-node and inter-node MPI job execution, runtime customization via package installation (using `packages.yml`), and shared storage emulating scratch/work areas. The SLURM configuration can be customized through volume mounts, and a caching mechanism speeds up package installation.

Images are available on the GitHub Container Registry and can also be built locally. The project is intended for educational and testing purposes, offering a convenient way to explore and experiment with HPC concepts.

## vHPC: A Virtual SLURM HPC Cluster

A developer has open-sourced vHPC, a virtual High Performance Computing (HPC) cluster built with Docker Compose and SLURM. vHPC was created to ease development targeting a large production HPC system (Cineca Leonardo), addressing a limitation of existing containerized solutions, which often lack key features such as accounting and MPI support.

The project aims for simplicity and generality, providing a local prototyping environment without the complexity of traditional deployment tools such as Ansible, Chef, or Puppet. While several HPC deployment systems exist (including options from Compute Canada and NVIDIA), vHPC offers a modern and accessible alternative.

Users were curious about its capabilities, including SSH access and how it compares with tools like OpenOnDemand (a web interface for existing clusters). vHPC is intended mainly for development and testing, simulating a multi-node cluster on a single machine rather than deploying a full-scale production environment. The developer is available for questions.

Original Text

A Docker-based virtualization of a High Performance Computing (HPC) system running SLURM workload manager with OpenMPI support and optional full job accounting on Rocky Linux 9. This project creates a lean, production-ready multi-container environment.

This project is open-sourced by eXact lab S.r.l., a consultancy specializing in scientific and high-performance computing solutions. We help organizations optimize their computational workflows, implement scalable HPC infrastructure, and accelerate scientific research through tailored technology solutions.

Need HPC expertise? Contact us for consulting services in scientific computing, cluster optimization, and performance engineering.

  • MPI Ready: Full OpenMPI support with intra-node and inter-node job execution
  • User Management: User synchronization from head node to workers via mounted home volume
  • Shared Configuration: All SLURM configs shared via mounted volume
  • Full Job Accounting (Optional): Complete sacct functionality with MariaDB backend. You can opt not to start the database container and the system will still work, just without accounting
  • Runtime customisation: Install additional software onto the cluster at container startup without the need to rebuild the image, see Install Extra Packages below

Simply bring the stack up with Docker Compose (a startup sketch follows the list below).

The virtual cluster will start up in its default configuration:

  • One login node
  • Two worker nodes, each with 4 vCPUs and 2048 MB of RAM
  • Full accounting via a database node (MariaDB)
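
To bring it up, the standard compose workflow applies (a sketch, assuming you run the commands from the directory containing the compose file):

# Start every service defined in the compose file in the background
docker-compose up -d

# Follow startup progress (useful when runtime package installation is enabled)
docker-compose logs -f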

Although it is of course possible to use docker exec to access the containers, we provide SSH access to emulate common cluster setups.

SSH Key Authentication (Recommended):

Upon container startup, an ssh-keys/ directory appears on the host. The keys inside are enabled for both the root and the non-privileged user.

You can reach the different nodes via

  • Head Node: ssh -i ./ssh-keys/id_ed25519 -p 2222 root@localhost
  • Worker1: ssh -i ./ssh-keys/id_ed25519 -p 2223 root@localhost
  • Worker2: ssh -i ./ssh-keys/id_ed25519 -p 2224 root@localhost

Swap root for user in the commands above to log in as the non-privileged user.
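
For example, to reach the head node as the non-privileged user:

ssh -i ./ssh-keys/id_ed25519 -p 2222 user@localhost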

Warning: Each image version generates its own SSH keys. When upgrading to a new image version, remove the old keys with rm -r ssh-keys/ before starting the new infrastructure to avoid authentication failures.

Note for SSH Agent Users: If you have multiple keys loaded in your SSH agent, SSH may exhaust the server's allowed authentication attempts before trying the specified key file. If you encounter "Too many authentication failures", use the -o IdentitiesOnly=yes option to bypass agent keys:

ssh -i ./ssh-keys/id_ed25519 -o IdentitiesOnly=yes -p 2222 root@localhost

Password Authentication (Fallback):

  • Head Node SSH: ssh -p 2222 root@localhost (password: rootpass)
  • Worker1 SSH: ssh -p 2223 root@localhost (password: rootpass)
  • Worker2 SSH: ssh -p 2224 root@localhost (password: rootpass)
  • Non-privileged User: user (password: password) - recommended for job submission

⚠️ Security Note: SSH keys are automatically generated during container build for testing and educational purposes only. Do not use these keys in production environments.

In the example compose file, SSH is bound to the host's localhost only, preventing remote access from different hosts.
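
In compose terms, such a localhost-only binding for the head node's SSH port looks like this (a sketch matching the port numbers used above):

...
    ports:
      # Bind container port 22 to port 2222 on the host's loopback interface only
      - "127.0.0.1:2222:22"
...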

# Check cluster status
sinfo

# Submit a test job as user
su - user
srun hostname

# Submit MPI job (intra-node)
module load mpi/openmpi-x86_64
sbatch -N 1 -n 4 --wrap="mpirun -n 4 hostname"

# Submit MPI job (inter-node)
sbatch -N 2 -n 4 --wrap="mpirun -n 4 hostname"

# View job queue
squeue

# View job accounting (NEW!)
sacct                    # Show recent jobs
sacct -a                 # Show all jobs
sacct -j 1 --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode

Working with Shared Storage

Besides the users' home directories, we provide a second shared storage area to emulate the scratch or work areas found on many production clusters.

# Shared storage is mounted at /shared on all nodes
# Create user directory for job files
mkdir -p /shared/user
chown user:user /shared/user

# MPI programs can be placed in shared storage
# Example: /shared/user/mpi_hello.c
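
As a worked example, the following sketch writes a minimal MPI hello-world into the shared area, compiles it with the OpenMPI module, and submits it across both workers (the file name and program contents are illustrative, not part of the project):

# Write a minimal MPI program into shared storage
cat > /shared/user/mpi_hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}
EOF

# Compile with the OpenMPI toolchain and submit as an inter-node job
module load mpi/openmpi-x86_64
mpicc -o /shared/user/mpi_hello /shared/user/mpi_hello.c
sbatch -N 2 -n 4 --wrap="mpirun -n 4 /shared/user/mpi_hello"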

(Optional) Install Extra Packages on the Virtual Cluster

You may provide a packages.yml file listing extra packages to be installed by pip and/or dnf, and optionally arbitrary commands to be executed at container startup. To use this feature, you need to bind mount the packages.yml file to both headnode and worker nodes:

...
    volumes:
      ...
      # Optional: Mount packages.yml for runtime package installation
      - ./packages.yml:/packages.yml:ro
...

A packages.yml.example file is provided as a starting point. The file is structured into three main lists:

  • rpm_packages: for system packages (e.g., htop, git, vim)
  • python_packages: for Python libraries (e.g., pandas, requests)
  • extra_commands: for arbitrary shell commands executed as root during startup
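
For illustration, a minimal packages.yml following that structure could look like the following (the specific packages and command are only examples):

# packages.yml (illustrative example)
rpm_packages:
  - htop
  - git
python_packages:
  - pandas
extra_commands:
  - echo "extra setup finished" > /var/log/vhpc-extra-setup.log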

Package installation and extra commands are handled directly in the shell entrypoint script, making installation progress visible via docker logs -f. Packages are persistent across container restarts and installation is idempotent.

Note: The entrypoint only adds packages, never removes them. If you need to remove packages or make deeper changes, enter the container manually with docker exec and use dnf remove or pip uninstall as needed.
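
For instance, to remove a previously installed package by hand (assuming the head node container is named slurm-headnode0, as in the troubleshooting commands further below):

# Remove a runtime-installed package directly inside the head node container
docker exec -it slurm-headnode0 dnf remove -y htop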

Be mindful that:

  • installing large packages can increase the startup time of your containers; this is however a one-off price to pay the first time the cluster is started
  • if any installation fails, it will cause the container startup to fail
  • if an extra command fails, it will cause the container startup to fail
  • packages and extra commands are executed at container startup, before core services (like SLURM) are initialized

Caching: RPM package cache is persisted and shared as a volume (rpm-cache) across the cluster. This avoids re-downloading the same packages when starting multiple containers or restarting them. The first container downloads and caches packages; subsequent containers reuse the cached files.

Concurrency Control: To prevent DNF lock contention when multiple containers share the same cache, a file-based locking mechanism coordinates package installation operations. The lock file (/var/cache/dnf/.container_lock) ensures only one container can run DNF operations at a time. The system includes stale lock detection and automatic cleanup for containers that exit unexpectedly.
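
Conceptually, this kind of coordination can be expressed with flock over the shared lock file; the following is only an illustrative sketch of the idea, not the project's actual entrypoint code:

# Serialize DNF operations across containers sharing the rpm-cache volume (sketch)
exec 9>/var/cache/dnf/.container_lock   # open the shared lock file on fd 9
flock 9                                 # block until no other container holds the lock
dnf install -y htop                     # example package installation
flock -u 9                              # release the lock for the next container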

(Optional) Custom SLURM Configuration

Default Configuration: The cluster comes with a pre-configured SLURM setup that works out of the box. The default configuration is baked into the Docker images, so you can start the cluster immediately without any additional setup.

Custom Configuration (Optional): To override the default SLURM configuration:

  1. Uncomment the volume mount in docker-compose.yml:

    # - ./slurm-config:/var/slurm_config:ro # Host-provided config override (mounted to staging area)
  2. Modify the configuration files in ./slurm-config/ as needed

  3. Restart the cluster: docker-compose down && docker-compose up -d
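
A typical way to seed ./slurm-config/ is to copy the defaults out of a running head node and edit them (a sketch; the container name matches the troubleshooting commands further below):

# Copy the active SLURM configuration out of the head node as a starting point
mkdir -p slurm-config
docker cp slurm-headnode0:/etc/slurm/. ./slurm-config/

# Edit the files in ./slurm-config/, then recreate the cluster
docker-compose down && docker-compose up -d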

How it works: The system uses a double mount strategy:

  • Default config is shipped with the images at /var/slurm_config/
  • Optional host override mounts to /var/slurm_config/ (staging area)
  • Headnode entrypoint copies config from staging to /etc/slurm/
  • /etc/slurm/ is shared via volume across all cluster nodes

You can add more SLURM workers to the compose file, using the existing ones as a template. Remember to also edit the NodeName line in slurm-config/slurm.conf accordingly (see the example below).
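
For example, extending the default two workers to three could involve a line like this in slurm-config/slurm.conf (the node names, CPU count, and memory are assumptions based on the default sizing):

# Three workers, each matching the default 4 vCPU / 2048 MB sizing (illustrative)
NodeName=worker[1-3] CPUs=4 RealMemory=2048 State=UNKNOWN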

The images are available on the GitHub Container Registry. You can also build the images locally.
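
Assuming the compose file defines build contexts for the images, a local build would typically be:

# Build the images locally instead of pulling them from the registry (assumed standard compose commands)
docker-compose build

# Or rebuild and start in one step
docker-compose up -d --build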

With the compose file included in this project you get:

  • Head Node: Runs slurmctld daemon, manages cluster, provides user synchronization, and optionally runs slurmdbd
  • Worker Nodes: Two by default. They run slurmd daemon, execute submitted jobs, and sync users from head node
  • Database Node (Optional): MariaDB 10.9 for SLURM job accounting
  • MPI Support: OpenMPI 4.1.1 with container-optimized transport configuration
  • Shared Storage: Persistent volumes for job data, user sync and home directories, and SLURM configuration
  • munge-key: Shared Munge authentication key across all nodes
  • shared-storage: Persistent storage for job files and MPI binaries
  • user-sync: User account synchronization from head node to workers
  • slurm-db-data: MariaDB persistent storage for job accounting
  • slurm-config: shared SLURM configuration files to override the configuration, see Custom SLURM Configuration
  • venv: shared Python virtual environment
  • rpm-cache: shared DNF package cache to avoid re-downloading packages across containers
  • MPI Transport: OMPI_MCA_btl=tcp,self (disables problematic fabric transports)
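
The transport restriction is just an OpenMPI environment variable; reproducing or adjusting it manually in a shell on a node would look like this:

# Restrict OpenMPI's byte-transfer layers to TCP and in-process (self) transport
export OMPI_MCA_btl=tcp,self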

WARNING This project is for educational and testing purposes. Do not use in production!

  • Base OS: Rocky Linux 9
  • SLURM Version: 22.05.9 (from EPEL packages)
  • OpenMPI Version: 4.1.1 with container-optimized configuration
  • Database: MariaDB 10.9 with 64MB buffer pool for container optimization
  • If sacct shows "Slurm accounting storage is disabled": Database connection failed during startup
  • Check database logs: docker logs slurm-db
  • Restart headnode to retry database connection: docker restart slurm-headnode0
  • Verify database connectivity: docker exec slurm-db mysql -u slurm -pslurmpass -e "SELECT 1;"
  • Ensure MPI programs are in shared storage (/shared)
  • Use module load mpi/openmpi-x86_64 (or simply module load mpi) before compilation
  • Submit jobs as non-root user (user)
  • Job output files are created where the job runs (usually on worker nodes)
  • Use shared storage or home directories for consistent output location
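
For example, directing job output to shared storage keeps it in one predictable place regardless of which worker ran the job (reusing the /shared/user layout and the illustrative mpi_hello binary from the shared-storage section above):

# Write stdout/stderr to shared storage instead of the worker-local working directory
sbatch -N 2 -n 4 --output=/shared/user/hello_%j.out --wrap="mpirun -n 4 /shared/user/mpi_hello"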

This project is licensed under the MIT License - see the LICENSE file for details.

Copyright (c) 2025 eXact lab S.r.l.
