Adventures in upgrading Proxmox

Original link: https://blog.vasi.li/adventures-in-upgrading-proxmox/

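## Homelab troubles: Docker, Proxmox, and a failed upgrade

While trying to deploy Coolify/Dokploy in a Proxmox 8 homelab, the author traced the problems to running Docker inside an LXC container. The root cause was a recent `runc` issue, fixed only in a newer `pve-lxc` package that is only available in Proxmox 9.

The upgrade undertaken to fix this triggered a series of problems on the node that runs an NVR and a Coral TPU. During the upgrade, a DKMS module (the Apex driver) failed to rebuild, the system did not come back after the reboot, and recovery required a physical KVM connection. The failed upgrade also cost the Proxmox cluster its quorum, which blocked container operations.

To make matters worse, a new Zigbee adapter conflicted with the old one, leaving smart home devices unusable. The previous kernel still booted, but the new kernel failed with a boot error ("unable to mount rootfs"), which was resolved by using `proxmox-boot-tool` and specific mount commands to regenerate the initrd image.

Finally, rebuilding the Apex DKMS module after the upgrade required patching the source code because of kernel API changes; the fix came from a Reddit user. The system is now restored, but the experience highlights the complexity of nested containers and the cascading failures that can occur during an upgrade.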


Original text

Running Docker inside LXC is weird. It's containers on top of other containers, and a fairly recent AppArmor issue prevented some functionality from running inside a Docker container, failing with a very cryptic error. I was trying to deploy Coolify and/or Dokploy in my homelab and kept hitting all sorts of weird issues. Eventually I found this GitHub issue for runc, and apparently it was fixed in a new version of the pve-lxc package. But I'm still on Proxmox 8, and the new version is seemingly only available in Proxmox 9.

I upgraded one node without much hassle, but the second node, the one that runs my NVR and has the Coral TPU, gave me some grief. The Apex drivers are installed as a DKMS module, and that module failed to rebuild, which interrupted the system upgrade. I'm not sure how exactly, but after the reboot the system did not come back online. The machine is in the basement, which meant I had to take my USB KVM and make a trip downstairs...

💡

As an aside... Because one node didn't start, and my Proxmox cluster has only two nodes, it can't reach quorum, meaning I can't really make any changes to my other node, and I can't start any containers that are stopped.
I recently added another Zigbee dongle that supports Thread, and it happens to share the same VID:PID combo as the old dongle. Because of how these were mapped into the guest OS, all my light switches stopped working, so I had to fix the issue fast.
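The quorum problem from the aside comes down to simple arithmetic. Here is a minimal sketch, assuming the usual strict-majority rule where every node carries one vote; the function names `quorum_needed` and `has_quorum` are made up for illustration and are not Proxmox commands.

```shell
#!/bin/sh
# Votes needed for a strict majority of the total votes (integer division).
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Prints "yes" if the online votes reach the majority threshold, "no" otherwise.
# $1 = total votes in the cluster, $2 = votes currently online.
has_quorum() {
    if [ "$2" -ge "$(quorum_needed "$1")" ]; then
        echo yes
    else
        echo no
    fi
}

has_quorum 2 1   # two-node cluster with one node down: prints "no"
has_quorum 3 2   # three voters with one down: prints "yes"
```

With two nodes the majority threshold is two votes, so losing either node stalls the survivor. A third voter would let the cluster tolerate one failure.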

Thankfully I was able to reach the GRUB screen and pick the previous kernel, so I could boot into the machine. That was a plus, but trying to reboot into the new kernel still caused a panic.

Google suggested that the `unable to mount rootfs on unknown-block(0,0)` error indicates a missing initrd, which needs to be regenerated with `update-initramfs -u -k ${KERNEL_VERSION}`. It ran successfully, albeit with a somewhat cryptic `no /etc/kernel/proxmox-boot-uuids found` message. After the reboot it kernel-panicked again, even though the `/boot/initrd-${VERSION}` files were present. I guess that message was relevant after all. After another quick Google search I found this Reddit thread, which provided the steps to solve the issue.
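Before rebooting, it may be worth checking whether the file named in that message exists at all. This is a minimal sketch: the path comes from the message quoted above, while the helper name `esp_uuid_state` and the reading that a missing or empty file means no EFI partition has been registered are assumptions.

```shell
#!/bin/sh
# Path taken from the message quoted above; can be overridden for testing.
UUID_FILE=${UUID_FILE:-/etc/kernel/proxmox-boot-uuids}

# Prints "registered" if the file exists and is non-empty, "missing" otherwise.
esp_uuid_state() {
    if [ -s "$1" ]; then
        echo registered
    else
        echo missing
    fi
}

esp_uuid_state "$UUID_FILE"
```

If this prints `missing`, the regenerated initrd may never have been copied to the partition the machine actually boots from, which would match the symptoms described here.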

lsblk -o +FSTYPE | grep /boot/efi # understand which device the EFI partition is on
umount /boot/efi
proxmox-boot-tool init /dev/${DEVICE} # plug in device from step 1
mount /boot/efi
update-initramfs -u -k all
reboot

This generated the necessary file and after rebooting the system was able to boot again with the new kernel.
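The first step above, finding the device behind `/boot/efi`, can also be done without reading the `lsblk` tree by eye. This is a small sketch rather than part of the original fix: the helper name `device_for_mountpoint` is hypothetical, and it assumes a util-linux `lsblk` that supports the `-r`, `-n`, `-p` and `-o` options.

```shell
#!/bin/sh
# Reads "DEVICE MOUNTPOINT" pairs on stdin and prints the device mounted at $1.
device_for_mountpoint() {
    awk -v mp="$1" '$2 == mp { print $1 }'
}

# On a live system (assumption: util-linux lsblk), something like:
#   lsblk -rnpo NAME,MOUNTPOINT | device_for_mountpoint /boot/efi
# Demo with canned input so the sketch runs anywhere; device names are made up.
printf '%s\n' '/dev/sda' '/dev/sda1' '/dev/sda2 /boot/efi' '/dev/sda3' \
    | device_for_mountpoint /boot/efi
```

The printed path is what would go in place of `/dev/${DEVICE}` in the `proxmox-boot-tool init` step.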

While troubleshooting I had also uninstalled the Apex DKMS module, so now I had to reinstall it, but the build started failing with errors because of the kernel change.

Apparently some symbols/APIs were obsoleted and I had to patch the source code. Upstream seemingly did not have the fix, but I found the necessary changes:

diff --git a/src/gasket_core.c b/src/gasket_core.c
index b1c2726..88bd5b2 100644
--- a/src/gasket_core.c
+++ b/src/gasket_core.c
@@ -1373,7 +1373,9 @@ static long gasket_ioctl(struct file *filp, uint cmd, ulong arg)
 /* File operations for all Gasket devices. */
 static const struct file_operations gasket_file_ops = {
        .owner = THIS_MODULE,
+#if LINUX_VERSION_CODE < KERNEL_VERSION(6,0,0)
        .llseek = no_llseek,
+#endif
        .mmap = gasket_mmap,
        .open = gasket_open,
        .release = gasket_release,
diff --git a/src/gasket_page_table.c b/src/gasket_page_table.c
index c9067cb..0c2159d 100644
--- a/src/gasket_page_table.c
+++ b/src/gasket_page_table.c
@@ -54,7 +54,7 @@
 #include <linux/vmalloc.h>
 
 #if __has_include(<linux/dma-buf.h>)
-MODULE_IMPORT_NS(DMA_BUF);
+MODULE_IMPORT_NS("DMA_BUF");
 #endif
 
 #include "gasket_constants.h"

After doing this and re-running the build process (as outlined in the previous post), the driver installed and I was able to bring Frigate back.
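The `#if LINUX_VERSION_CODE < KERNEL_VERSION(6,0,0)` guard in the patch is an integer comparison. The sketch below mirrors that arithmetic in shell, assuming the classic `(a << 16) + (b << 8) + c` definition with the last component capped at 255, as newer kernel headers do. The helper names and the sample release strings are illustrative only.

```shell
#!/bin/sh
# Mirrors KERNEL_VERSION(a,b,c): a*65536 + b*256 + min(c, 255).
kernel_version_code() {
    patch=$3
    if [ "$patch" -gt 255 ]; then
        patch=255
    fi
    echo $(( $1 * 65536 + $2 * 256 + patch ))
}

# Converts a release string such as "6.14.8-2-pve" into a version code.
release_to_code() {
    base=${1%%-*}                  # drop the "-2-pve" style suffix
    major=${base%%.*}
    rest=${base#*.}
    minor=${rest%%.*}
    case "$rest" in
        *.*) patch=${rest#*.} ;;   # "14.8" -> "8"
        *)   patch=0 ;;            # no third component present
    esac
    kernel_version_code "$major" "$minor" "$patch"
}

# Prints "yes" if the guarded .llseek line would be compiled for this release.
llseek_line_compiled() {
    if [ "$(release_to_code "$1")" -lt "$(kernel_version_code 6 0 0)" ]; then
        echo yes
    else
        echo no
    fi
}

llseek_line_compiled "5.15.0"         # prints "yes"
llseek_line_compiled "6.14.8-2-pve"   # prints "no"
```

On a real machine, `llseek_line_compiled "$(uname -r)"` would show which branch the patched driver takes for the running kernel.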

Big thanks to /u/Dunadan-F for the solution.
