Published: 2025-11-15, Revised: 2025-11-15

TL;DR Running apt upgrade on hosts with rootless Docker services can break them. A version mismatch occurs between the running user daemon and the newly upgraded system binaries, causing containers to fail on restart. This post provides an Ansible playbook that detects critical package changes and automatically restarts only the affected rootless user daemons, preventing downtime and manual intervention.

Info

This playbook is the result of a deep-dive into a specific failure mode of the rootless Docker architecture. For context on the initial setup, please see my previous posts on setting up rootless Docker for services like Mastodon.

Motivation

As detailed in my rootless Docker setup guide, this architecture provides strong security isolation. However, it has an operational weak spot: when the Docker system packages (containerd.io with its shims, docker-ce, docker-ce-rootless-extras, and friends) are upgraded via apt, the running user-level Docker daemons keep executing the old code and become outdated.

This leads to a version mismatch. When a container is restarted, the old daemon tries to use the new on-disk shim, which causes a fatal error (e.g. unsupported shim version (3): not implemented). The fix is to restart each affected user's Docker daemon after every critical update, which is a perfect task to automate with Ansible.
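
Before automating it, the manual remediation looks roughly like this. This is only a sketch: 'mastodon' stands in for whichever account runs the rootless daemon, and machinectl (from systemd-container) is used to get a proper user session with the right XDG_RUNTIME_DIR:

# versions of the freshly upgraded on-disk binaries
dockerd --version && containerd --version

# restart the user's rootless Docker daemon so it matches those binaries again
sudo machinectl shell mastodon@ /usr/bin/systemctl --user restart docker
sudo machinectl shell mastodon@ /usr/bin/systemctl --user --no-pager status docker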

Ansible Playbook

This playbook automates the entire update and remediation process. It's designed to be run daily via a cron job, staying silent unless it needs to take action.
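
If you prefer a fully unattended schedule, a cron entry on the management host could look roughly like this. Treat it as a sketch: the paths follow the layout described in the workflow section further down, the log file is just an example, and it assumes the management host can reach the targets non-interactively (e.g. with its own dedicated key) rather than via the agent forwarding I use below.

# /etc/cron.d/ansible-apt (illustrative): run the update play every morning at 05:30
30 5 * * * root cd /srv/ansible && /root/.local/bin/ansible-playbook apt.yaml >> /var/log/ansible-apt.log 2>&1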

The Playbook: apt.yaml

---
- hosts: ubuntu, debian
  become: yes
  become_method: sudo

  vars:
    ansible_pipelining: true
    ansible_ssh_common_args: '-o ControlMaster=auto -o ControlPersist=60s'
    # This forces Ansible to use /tmp for its temporary files, avoiding
    # any permission issues when becoming a non-root user.
    ansible_remote_tmp: /tmp
    # critical packages that require a Docker restart
    critical_docker_packages:
      - docker-ce
      - docker-ce-cli
      - containerd.io
      - docker-ce-rootless-extras
      - docker-buildx-plugin
      - docker-compose-plugin
      - systemd

  tasks:
    - name: Ensure en_US.UTF-8 locale is present on target hosts
      ansible.builtin.locale_gen:
        name: en_US.UTF-8
        state: present
      # We only run this once per host, not for every user
      run_once: true

    - name: "Update cache & Full system update"
      ansible.builtin.apt:
        update_cache: true
        upgrade: dist
        cache_valid_time: 3600
        force_apt_get: true
        autoremove: true
        autoclean: true
      environment:
        # 'a' = let needrestart restart services automatically (non-interactive)
        NEEDRESTART_MODE: a
      register: apt_result
      changed_when: "'0 upgraded, 0 newly installed, 0 to remove' not in apt_result.stdout"
      no_log: true

    - name: Report on any packages that were kept back
      ansible.builtin.debug:
        msg: "WARNING: Packages were kept back on {{ inventory_hostname }}. Manual review may be needed. Output: {{ apt_result.stdout }}"
      when:
        - apt_result.changed
        - "'packages have been kept back' in apt_result.stdout"

    - name: Find all users with lingering enabled
      ansible.builtin.find:
        paths: /var/lib/systemd/linger
        file_type: file
      register: lingered_users_find
      when:
        - apt_result.changed
        # This checks if any of the critical package names appear in the apt output
        - critical_docker_packages | select('in', apt_result.stdout) | list | length > 0
      no_log: true

    - name: Create a list of lingered usernames
      ansible.builtin.set_fact:
        lingered_usernames: "{{ lingered_users_find.files | map(attribute='path') | map('basename') | list }}"
      when: lingered_users_find.matched is defined and lingered_users_find.matched > 0
      no_log: true

    - name: Check for existence of rootless Docker service for each user
      ansible.builtin.systemd:
        name: docker
        scope: user
      become: true
      become_user: "{{ item }}"
      loop: "{{ lingered_usernames }}"
      when: lingered_usernames is defined and lingered_usernames | length > 0
      register: service_checks
      ignore_errors: true
      no_log: true

    - name: Identify which services were actually found
      ansible.builtin.set_fact:
        restart_list: "{{ service_checks.results | selectattr('status.LoadState', 'defined') | selectattr('status.LoadState', '!=', 'not-found') | map(attribute='item') | list }}"
      when: lingered_usernames is defined and lingered_usernames | length > 0
      no_log: true

    - name: Restart existing rootless Docker daemons
      ansible.builtin.systemd:
        name: docker
        state: restarted
        scope: user
      become: true
      become_user: "{{ item }}"
      loop: "{{ restart_list }}"
      when: restart_list is defined and restart_list | length > 0
      register: restart_results
      no_log: true

    - name: Report on restarted services
      ansible.builtin.debug:
        msg: "Successfully restarted rootless Docker daemon for user '{{ item.item }}'."
      loop: "{{ restart_results.results }}"
      when:
        - restart_list is defined and restart_list | length > 0
        - item.changed

    - name: Check if reboot required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required_file

    - name: Reboot if required
      ansible.builtin.reboot:
      when: reboot_required_file.stat.exists
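
To see by hand what the play will discover, you can inspect the linger directory and a given user's unit directly (again a sketch, with 'mastodon' as the example user):

# one file per user with lingering enabled; these are the playbook's candidates
ls /var/lib/systemd/linger

# does this user actually have a rootless docker.service?
sudo machinectl shell mastodon@ /usr/bin/systemctl --user list-unit-files docker.service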

Configuration: ansible.cfg

For the clean output shown in the results below, update your ansible.cfg:

[defaults]
stdout_callback = yaml
display_skipped_hosts = no
display_ok_hosts = no
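
If you would rather not touch the global config, the same settings can be passed per run through Ansible's standard environment variables:

ANSIBLE_STDOUT_CALLBACK=yaml \
ANSIBLE_DISPLAY_SKIPPED_HOSTS=false \
ANSIBLE_DISPLAY_OK_HOSTS=false \
ansible-playbook apt.yaml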

Summary of tasks

The playbook is designed to be relatively intelligent and robust:

  1. Update packages: Runs a full apt dist-upgrade.
  2. Check for changes: Determines whether any packages actually changed.
  3. Check for criticality: If changes occurred, checks whether any of them are in the critical_docker_packages list (see the sketch after this list).
  4. Find targets: Only if a critical package was updated, finds the users who have rootless services enabled via systemd-linger.
  5. Act selectively: Checks which of those users actually run a docker.service and restarts only those specific daemons.
  6. Short report: The play is silent by default. It only prints a one-line report for each daemon it restarted, or a warning if apt has kept critical packages back.
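
The criticality check in step 3 is a plain substring match against apt's output. Conceptually it is the same as grepping a simulated upgrade for the package names (deliberately loose: 'docker-ce' also covers docker-ce-cli and docker-ce-rootless-extras, and 'systemd' matches related packages such as systemd-sysv):

# dry-run the upgrade and look for any of the critical package names
apt-get -s dist-upgrade | grep -E 'docker-ce|containerd\.io|docker-buildx-plugin|docker-compose-plugin|systemd'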

Result

On a day with no relevant changes, the output is minimal:

PLAY RECAP *********************************************************************
docker_services            : ok=3    changed=0    unreachable=0    failed=0    skipped=8    rescued=0    ignored=0
hass_iotdocker             : ok=4    changed=0    unreachable=0    failed=0    skipped=8    rescued=0    ignored=0
...

On another day, when a Docker package update happens, it reports on the Docker daemon restarts:

TASK [Report on restarted services] ********************************************
ok: [docker_services] => (item=...) => {
    "msg": "Successfully restarted rootless Docker daemon for user 'mastodon'."
}
...

This is a small piece of automation, but it prevents a critical failure mode (one that hit me recently).


Ansible workflow

For security and ease of management, I run Ansible from a dedicated management VM in my local network. This host contains the Ansible installation, the playbook files, and the inventory of servers to manage.

This setup allows me to trigger multi-host updates with a single command from anywhere.

Inventory

The heart of the setup is the inventory file (inventories/hosts), which tells Ansible which servers to target.

# /srv/ansible/inventories/hosts

[ubuntu]
# Local network VMs
hass_iotdocker ansible_host=192.168.60.15
nextcloud ansible_host=192.168.40.22
node_gitlab ansible_host=192.168.70.11
docker_services ansible_host=192.168.40.81

# A public cloud VM that requires a specific user to connect
aws_vm ansible_host=130.61.20.105 ansible_user=ubuntu

[debian]
iot_influx ansible_host=192.168.60.34
# The management host targets itself to stay updated
management ansible_host=localhost

[all:vars]
# Explicitly set the python interpreter for compatibility
ansible_python_interpreter=/usr/bin/python3
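
Before the first real run, a quick connectivity check from /srv/ansible confirms that the inventory resolves and every host is reachable:

ansible -i inventories/hosts all -m ping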

Authentication via SSH Agent Forwarding

One important part of this workflow is authentication. I use SSH Agent Forwarding instead of storing my private keys on the management host.

By connecting to the management host with ssh -A, I allow Ansible to securely use my local machine's SSH agent keys for the duration of the connection. Of course, this means the management host itself must be properly protected and isolated, since anyone with root access on it could use the forwarded agent while the connection is open.
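
You can verify that the agent really is forwarded before kicking off a run (the angle-bracket host is a placeholder matching the alias below):

# locally: which keys does my agent hold?
ssh-add -l

# the same keys should be visible from a forwarded session on the management host
ssh -A alex@<management-host> 'ssh-add -l'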

Running the Playbook

To make this a daily one-liner, I use an alias in my local machine's shell configuration (~/.bashrc or ~/.zshrc):

alias daily='ssh -A alex@<management-host> "sh /srv/ansible/update.sh"'

The update.sh script on the management host is a simple wrapper that ensures the playbook runs from the correct directory:

#!/bin/sh

# Purpose: CD into local directory of script
# and run Ansible playbook with local configuration

SCRIPT=$(readlink -f "$0")
SCRIPTPATH=$(dirname "$SCRIPT")

cd "$SCRIPTPATH" || exit
/root/.local/bin/ansible-playbook -vv "apt.yaml"
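
For ad-hoc runs against a single machine, the same playbook can also be limited to one inventory host directly on the management host:

cd /srv/ansible && /root/.local/bin/ansible-playbook -vv apt.yaml --limit docker_services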

This setup means I can type daily in my local terminal, and my entire fleet of VMs (local and public) will be updated.

Here is a simple diagram of the flow:

+--------------+      +--------------------+      +-----------------+
|              |      |                    |      |                 |
|    Laptop    |----->|  Management Host   |----->|   Target VMs    |
| (SSH Agent)  |      | (Ansible + Playb.) |      |(nextcloud, etc.)|
+--------------+      +--------------------+      +-----------------+
      |                     |
      | ssh -A alex@...     | ansible-playbook ...
      +---------------------+

Changelog

2025-11-15