Published: 2025-11-15, Revised: 2025-11-15

TL;DR Running apt upgrade on hosts with rootless Docker services can break them. A version mismatch occurs between the running user daemon and the newly upgraded system binaries, causing containers to fail on restart. This post provides an Ansible playbook that detects critical package changes and automatically restarts only the affected rootless user daemons, preventing downtime and manual intervention.

Info

This playbook is the result of a deep-dive into a specific failure mode of the rootless Docker architecture. For context on the initial setup, please see my previous posts on setting up rootless Docker for services like Mastodon.

Motivation

As detailed in my rootless Docker setup guide, this architecture provides strong security isolation. However, it has an operational weak spot: when the Docker system packages (containerd.io with its shims, docker-ce, docker-ce-rootless-extras, and friends) are upgraded via apt, the running user-level Docker daemons keep executing the old code and become outdated.

This leads to a version mismatch. When a container is restarted, the old daemon tries to use the new on-disk shim, which causes a fatal error (e.g. unsupported shim version (3): not implemented). The fix is to restart each affected user's Docker daemon after every critical update, which is a perfect task to automate with Ansible.
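
Before automating it, the manual remediation looks roughly like this. This is only a sketch: 'mastodon' stands in for whichever account runs the rootless daemon, and machinectl (from systemd-container) is used to get a proper user session with the right XDG_RUNTIME_DIR:

# versions of the freshly upgraded on-disk binaries
dockerd --version && containerd --version

# restart the user's rootless Docker daemon so it matches those binaries again
sudo machinectl shell mastodon@ /usr/bin/systemctl --user restart docker
sudo machinectl shell mastodon@ /usr/bin/systemctl --user --no-pager status docker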

Ansible Playbook

This playbook automates the entire update and remediation process. It's designed to be run daily via a cron job, staying silent unless it needs to take action.
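
If you prefer a fully unattended schedule, a cron entry on the management host could look roughly like this. Treat it as a sketch: the paths follow the layout described in the workflow section further down, the log file is just an example, and it assumes the management host can reach the targets non-interactively (e.g. with its own dedicated key) rather than via the agent forwarding I use below.

# /etc/cron.d/ansible-apt (illustrative): run the update play every morning at 05:30
30 5 * * * root cd /srv/ansible && /root/.local/bin/ansible-playbook apt.yaml >> /var/log/ansible-apt.log 2>&1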

The Playbook: apt.yaml

---
- hosts: ubuntu, debian
  become: yes
  become_method: sudo

  vars:
    ansible_pipelining: true
    ansible_ssh_common_args: '-o ControlMaster=auto -o ControlPersist=60s'
    # This forces Ansible to use /tmp for its temporary files, avoiding
    # any permission issues when becoming a non-root user.
    ansible_remote_tmp: /tmp
    # critical packages that require a Docker restart
    critical_docker_packages:
      - docker-ce
      - docker-ce-cli
      - containerd.io
      - docker-ce-rootless-extras
      - docker-buildx-plugin
      - docker-compose-plugin
      - systemd

  tasks:
    - name: Ensure en_US.UTF-8 locale is present on target hosts
      ansible.builtin.locale_gen:
        name: en_US.UTF-8
        state: present
      # We only run this once per host, not for every user
      run_once: true

    - name: "Update cache & Full system update"
      ansible.builtin.apt:
        update_cache: true
        upgrade: dist
        cache_valid_time: 3600
        force_apt_get: true
        autoremove: true
        autoclean: true
      environment:
        # 'a' = let needrestart restart services automatically (non-interactive)
        NEEDRESTART_MODE: a
      register: apt_result
      changed_when: "'0 upgraded, 0 newly installed, 0 to remove' not in apt_result.stdout"
      no_log: true

    - name: Report on any packages that were kept back
      ansible.builtin.debug:
        msg: "WARNING: Packages were kept back on {{ inventory_hostname }}. Manual review may be needed. Output: {{ apt_result.stdout }}"
      when:
        - apt_result.changed
        - "'packages have been kept back' in apt_result.stdout"

    - name: Find all users with lingering enabled
      ansible.builtin.find:
        paths: /var/lib/systemd/linger
        file_type: file
      register: lingered_users_find
      when:
        - apt_result.changed
        # This checks if any of the critical package names appear in the apt output
        - critical_docker_packages | select('in', apt_result.stdout) | list | length > 0
      no_log: true

    - name: Create a list of lingered usernames
      ansible.builtin.set_fact:
        lingered_usernames: "{{ lingered_users_find.files | map(attribute='path') | map('basename') | list }}"
      when: lingered_users_find.matched is defined and lingered_users_find.matched > 0
      no_log: true

    - name: Check for existence of rootless Docker service for each user
      ansible.builtin.systemd:
        name: docker
        scope: user
      become: true
      become_user: "{{ item }}"
      loop: "{{ lingered_usernames }}"
      when: lingered_usernames is defined and lingered_usernames | length > 0
      register: service_checks
      ignore_errors: true
      no_log: true

    - name: Identify which services were actually found
      ansible.builtin.set_fact:
        restart_list: "{{ service_checks.results | selectattr('status.LoadState', 'defined') | selectattr('status.LoadState', '!=', 'not-found') | map(attribute='item') | list }}"
      when: lingered_usernames is defined and lingered_usernames | length > 0
      no_log: true

    - name: Restart existing rootless Docker daemons
      ansible.builtin.systemd:
        name: docker
        state: restarted
        scope: user
      become: true
      become_user: "{{ item }}"
      loop: "{{ restart_list }}"
      when: restart_list is defined and restart_list | length > 0
      register: restart_results
      no_log: true

    - name: Report on restarted services
      ansible.builtin.debug:
        msg: "Successfully restarted rootless Docker daemon for user '{{ item.item }}'."
      loop: "{{ restart_results.results }}"
      when:
        - restart_list is defined and restart_list | length > 0
        - item.changed

    - name: Check if reboot required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required_file

    - name: Reboot if required
      ansible.builtin.reboot:
      when: reboot_required_file.stat.exists
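
To see by hand what the play will discover, you can inspect the linger directory and a given user's unit directly (again a sketch, with 'mastodon' as the example user):

# one file per user with lingering enabled; these are the playbook's candidates
ls /var/lib/systemd/linger

# does this user actually have a rootless docker.service?
sudo machinectl shell mastodon@ /usr/bin/systemctl --user list-unit-files docker.service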

Configuration: ansible.cfg

For the clean output shown in the results below, update your ansible.cfg:

[defaults]
stdout_callback = yaml
display_skipped_hosts = no
display_ok_hosts = no
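
If you would rather not touch the global config, the same settings can be passed per run through Ansible's standard environment variables:

ANSIBLE_STDOUT_CALLBACK=yaml \
ANSIBLE_DISPLAY_SKIPPED_HOSTS=false \
ANSIBLE_DISPLAY_OK_HOSTS=false \
ansible-playbook apt.yaml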

Summary of tasks

The playbook is designed to be relatively intelligent and robust:

  1. Update packages: Runs a full apt dist-upgrade.
  2. Check for changes: Determines whether any packages actually changed.
  3. Check for criticality: If changes occurred, checks whether any of them are in the critical_docker_packages list (see the sketch after this list).
  4. Find targets: Only if a critical package was updated, finds the users who have rootless services enabled via systemd-linger.
  5. Act selectively: Checks which of those users actually run a docker.service and restarts only those specific daemons.
  6. Short report: The play is silent by default. It only prints a one-line report for each daemon it restarted, or a warning if apt has kept critical packages back.
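
The criticality check in step 3 is a plain substring match against apt's output. Conceptually it is the same as grepping a simulated upgrade for the package names (deliberately loose: 'docker-ce' also covers docker-ce-cli and docker-ce-rootless-extras, and 'systemd' matches related packages such as systemd-sysv):

# dry-run the upgrade and look for any of the critical package names
apt-get -s dist-upgrade | grep -E 'docker-ce|containerd\.io|docker-buildx-plugin|docker-compose-plugin|systemd'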

Result

On a day with no relevant changes, the output is minimal:

PLAY RECAP *********************************************************************
docker_services            : ok=3    changed=0    unreachable=0    failed=0    skipped=8    rescued=0    ignored=0
hass_iotdocker             : ok=4    changed=0    unreachable=0    failed=0    skipped=8    rescued=0    ignored=0
...

On another day, when a Docker package update happens, it reports on the Docker daemon restarts:

TASK [Report on restarted services] ********************************************
ok: [docker_services] => (item=...) => {
    "msg": "Successfully restarted rootless Docker daemon for user 'mastodon'."
}
...

This is a small piece of automation, but it prevents a critical failure mode (one that hit me recently).


Ansible workflow

For security and ease of management, I run Ansible from a dedicated management VM in my local network. This host contains the Ansible installation, the playbook files, and the inventory of servers to manage.

This setup allows me to trigger multi-host updates with a single command from anywhere.

Inventory

The heart of the setup is the inventory file (inventories/hosts), which tells Ansible which servers to target.

# /srv/ansible/inventories/hosts

[ubuntu]
# Local network VMs
hass_iotdocker ansible_host=192.168.60.15
nextcloud ansible_host=192.168.40.22
node_gitlab ansible_host=192.168.70.11
docker_services ansible_host=192.168.40.81

# A public cloud VM that requires a specific user to connect
aws_vm ansible_host=130.61.20.105 ansible_user=ubuntu

[debian]
iot_influx ansible_host=192.168.60.34
# The management host targets itself to stay updated
management ansible_host=localhost

[all:vars]
# Explicitly set the python interpreter for compatibility
ansible_python_interpreter=/usr/bin/python3
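
Before the first real run, a quick connectivity check from /srv/ansible confirms that the inventory resolves and every host is reachable:

ansible -i inventories/hosts all -m ping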

Authentication via SSH Agent Forwarding

One important part of this workflow is authentication. I use SSH Agent Forwarding instead of storing my private keys on the management host.

By connecting to the management host with ssh -A, I allow Ansible to securely use my local machine's SSH agent keys for the duration of the connection. Of course, this means the management host itself must be properly protected and isolated, since anyone with root access on it could use the forwarded agent while the connection is open.
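
You can verify that the agent really is forwarded before kicking off a run (the angle-bracket host is a placeholder matching the alias below):

# locally: which keys does my agent hold?
ssh-add -l

# the same keys should be visible from a forwarded session on the management host
ssh -A alex@<management-host> 'ssh-add -l'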

Running the Playbook

To make this a daily one-liner, I use an alias in my local machine's shell configuration (~/.bashrc or ~/.zshrc):

alias daily='ssh -A alex@<management-host> "sh /srv/ansible/update.sh"'

The update.sh script on the management host is a simple wrapper that ensures the playbook runs from the correct directory:

#!/bin/sh

# Purpose: CD into local directory of script
# and run Ansible playbook with local configuration

SCRIPT=$(readlink -f "$0")
SCRIPTPATH=$(dirname "$SCRIPT")

cd "$SCRIPTPATH" || exit
/root/.local/bin/ansible-playbook -vv "apt.yaml"
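
For ad-hoc runs against a single machine, the same playbook can also be limited to one inventory host directly on the management host:

cd /srv/ansible && /root/.local/bin/ansible-playbook -vv apt.yaml --limit docker_services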

This setup means I can type daily in my local terminal, and my entire fleet of VMs (local and public) will be updated.

Here is a simple diagram of the flow:

+--------------+      +--------------------+      +-----------------+
|              |      |                    |      |                 |
|    Laptop    |----->|  Management Host   |----->|   Target VMs    |
| (SSH Agent)  |      | (Ansible + Playb.) |      |(nextcloud, etc.)|
+--------------+      +--------------------+      +-----------------+
      |                     |
      | ssh -A alex@...     | ansible-playbook ...
      +---------------------+

Changelog

2025-11-15