Exploring the Fragmentation of Wayland, an xdotool adventure

Original link: https://www.semicomplete.com/blog/xdotool-and-exploring-wayland-fragmentation/

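## xdotool and the Wayland challenge

xdotool is a UI automation tool that began in 2007 as a set of X11 scripts for keyboard, mouse, and window management. For years it has relied on mature X11 features such as XTest and EWMH to control desktop environments reliably. The arrival of Wayland, however, brought major challenges. Wayland aims to replace X11, but it initially lacked key features such as screen sharing, which had been available on X11 for decades. These features were dropped for security reasons, and the result was fragmentation: different compositors (GNOME, KDE, wlroots) implemented workarounds with incompatible solutions. Today, a basic task such as sending keystrokes means navigating a complex web of protocols (XDG Portal, libei, Xwayland) and often triggers confusing permission prompts, even for local processes. The lack of a unified approach forces developers to build compositor-specific tools, such as kdotool for KDE. As the maintainer of xdotool, the author is frustrated by this fragmentation and hopes for a consistent protocol for UI automation on Wayland. Despite the difficulties, demand for these features remains, and the way forward is still unclear.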

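A recent Hacker News discussion focused on the challenges of Wayland, the intended successor to the X Window System. While acknowledging Wayland's current problems and fragmented development, commenters pointed to the difficulty inherent in maintaining the large and technically suboptimal Xorg codebase. One user argued that Wayland's development is driven by the fact that, given the abilities of the experts who have already moved to Wayland, *fixing* Xorg is now more challenging than building something new. Another commenter, however, criticized the proliferation of competing Wayland implementations and the endless arguments over extensions, contrasting this with Microsoft's smoother transition to a modern compositor in Windows. At its core, the discussion highlighted the lack of leadership to unify Wayland's development and put user value ahead of internal fragmentation.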

Original text

In 2007, I was spending my northern-hemisphere summer experimenting with UI automation. Born of those efforts, xdotool came into being when I separated it from another project. The goal was modest - write some scripts that execute common keyboard, mouse, and window management tasks.

The first commit had only a few basic commands - basic mouse and keyboard actions, plus a few window management actions like movement, focus, and searching. Xdotool sprouted new features as time rolled on. Today, the project is 18 years old, and still going!

Time’s forward progress also brought external changes: Wayland came along hoping to replace X11, and later Ubuntu tried to take a bite by launching Mir. Noise about Wayland, both good and bad, floated around for years before major distros began shipping it. It wasn’t until 2016 that Fedora became the first distribution to ship it at all, and even then it was only for GNOME. It would be another five years before Fedora shipped KDE support for Wayland. Ubuntu defaulted to Wayland in 2017, almost a decade after Wayland began, but switched back to X11 on the next release because screen sharing and remote desktop weren’t available.

Screen sharing and remote desktop. Features that have existed for decades on other systems, that we all knew would be needed? They weren’t available and distros were shipping a default Wayland experience without them. It was a long time before you could join a Zoom call and share your screen. Awkward.

All this to say, Wayland has been a long, bumpy road.

Back to xdotool: xdotool relies on a few X11 features that have existed since before I even started using Linux:

- Standard X11 operations - Searching for windows by their title, moving windows, resizing them, closing, etc.
- XTest - A means to “test” things on X11 without requiring user action. This provides a way to perform mouse and keyboard actions from your code. You can type, you can move the mouse, you can click.
- EWMH - “Extended Window Manager Hints” - A specification for how to communicate with the window manager. This allows xdotool to switch virtual desktops, move windows to other desktops, find processes that own a window, etc.

All of the above is old, stable, and well supported.
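As a rough illustration of those three feature groups, the sketch below builds xdotool command lines from Python. The subcommands used (`search --name`, `windowmove`, `type`, `set_desktop`) are ones xdotool provides; the helper function names are made up for this example, and actually running the commands assumes an X11 session with xdotool installed.

```python
import subprocess

def search_by_title(pattern):
    """Command line to find windows whose title matches a pattern (plain X11)."""
    return ["xdotool", "search", "--name", pattern]

def move_window(window_id, x, y):
    """Command line to move a window to (x, y) (plain X11)."""
    return ["xdotool", "windowmove", str(window_id), str(x), str(y)]

def type_text(text):
    """Command line to type text into the focused window (XTest)."""
    return ["xdotool", "type", text]

def switch_desktop(index):
    """Command line to switch to another virtual desktop (EWMH)."""
    return ["xdotool", "set_desktop", str(index)]

def run(argv):
    """Run a command and return its output split on whitespace.

    Only works where xdotool and an X11 display are available.
    """
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.split()
```

On an X11 desktop, something like `for win in run(search_by_title("Firefox")): run(move_window(win, 0, 0))` would move every matching window to the top-left corner, whatever window manager is running.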

Wayland comes along and eliminates everything xdotool can do. Some of that elimination is excused as being “for security”, with little acknowledgment of what is being elided and why. It’s fine, I guess… but we ended up with Linux distros shipping without significant features that have, over time, been somewhat addressed. For example, I can share my screen on video calls, now.

So what happened to all of the features elided in the name of security?

Fragmentation is what happened. Buckle up. Almost 10 years after Fedora’s first Wayland release, those elided features are still missing or have multiple implementation proposals with very few maintainers agreeing on what to support. I miss EWMH.

Do you want to send keystrokes and mouse actions?

GNOME 48:
- Xwayland can send keystrokes to the compositor using XTEST. That’s kinda nice, but every few minutes you get a popup with almost zero context stating “Allow remote interaction” with a toggle switch. It’s confusing, because sending keystrokes from a local command does not feel like “remote interaction”.
- You can write code that uses XDG Portal’s RemoteDesktop session to request access and then use libei to send keystrokes and mouse actions. Documentation is sparse as this is still quite new. However, it still prompts you as above and there appears to be no permanent way to grant permission, despite the portal API documenting such an option.

KDE:
- Xwayland performs similarly when XTEST is used. This time, it pops up “Remote control requested. transient requested access to remotely control: input devices”. It’s confusingly written, with hardly any context, especially since these popups are new.

Some other compositors support Wayland protocol extensions which permit things like virtual keyboard input. Fragmentation continues as there are many protocol extension proposals which add virtual text input, keyboard actions, and mouse/pointer actions. Which ones work, or don’t, depends entirely on what window manager / compositor you are using.

Outside of Wayland, Linux has uinput which allows a program to create, and use, a virtual keyboard and mouse, but it requires root permissions. Further, a keyboard device sends key codes, not symbols, which makes for another layer of difficulty. In order to send the key symbol ‘a’ we need to know what keycode (or key sequence) sends that symbol. In order to do that, you need the keyboard mapping. There are several ways to get it, and it’s not clear which one to use: Wayland’s wl_keyboard, X11’s XkbGetMap via Xwayland, or XDG RemoteDesktop’s ConnectToEIS, which allows you to fetch the keyboard mapping with libei but will cause the confusing Remote Desktop access prompt.
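To make the keycode-versus-symbol problem concrete, here is a minimal sketch of the reverse lookup a tool has to do before it can "type" anything through a virtual keyboard. The three key codes are the evdev values from `linux/input-event-codes.h`; the two-entry keymap and the function names are invented for illustration, and a real tool would load the full mapping from one of the sources listed above.

```python
# Evdev key codes as defined in linux/input-event-codes.h.
KEY_1 = 2
KEY_A = 30
KEY_LEFTSHIFT = 42

# Toy keymap for illustration: key code -> (symbol, symbol with Shift held).
# A real keymap would come from the compositor or the X server.
US_KEYMAP = {
    KEY_1: ("1", "!"),
    KEY_A: ("a", "A"),
}

def find_key(symbol, keymap):
    """Return (keycode, needs_shift) for a symbol, or None if unmapped."""
    for keycode, (plain, shifted) in keymap.items():
        if symbol == plain:
            return keycode, False
        if symbol == shifted:
            return keycode, True
    return None

def key_events(text, keymap):
    """Translate text into (keycode, pressed) events for a virtual keyboard."""
    events = []
    for symbol in text:
        found = find_key(symbol, keymap)
        if found is None:
            raise KeyError(f"no key produces {symbol!r} in this keymap")
        keycode, needs_shift = found
        if needs_shift:
            events.append((KEY_LEFTSHIFT, True))
        events.append((keycode, True))    # key press
        events.append((keycode, False))   # key release
        if needs_shift:
            events.append((KEY_LEFTSHIFT, False))
    return events
```

With a different layout loaded, the same symbol can live on a different key code or need a different modifier, which is why the keyboard mapping has to be fetched from the running session rather than hard-coded. X11 and XKB add one more wrinkle: their keycodes are the evdev codes plus 8.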

Window management is also quite weird. Wayland appears to have no built-in protocol for one program (like xdotool) asking a window to do anything - be moved, resized, maximized, or closed.

- GNOME offers window management only through GNOME Shell Extensions: JavaScript apps that you install in GNOME and that have access to a GNOME-specific JavaScript API. Invoking any of these from a shell command isn’t possible without doing some wild maneuvers: GNOME JavaScript allows you to access DBus, and you can write code that moves a window and expose that method over DBus. I’m not the first to consider this, as there are a few published extensions that already do this, such as Focused Window D-Bus. GNOME has a DBus method for executing JavaScript (org.gnome.Shell.Eval), but it’s disabled by default.
- KDE has a concept similar to GNOME’s, but completely incompatible. Luckily, I suppose, KDE also has a DBus method for invoking JavaScript and, at time of writing, it is enabled by default. kdotool, a KDE+Wayland-specific derivative of xdotool, does exactly this to provide a command-line tool which allows you to manage your windows.
- Outside of KDE and GNOME, you might have luck with some third-party Wayland protocol extensions. If your compositor is based on wlroots, it’ll likely be usable with wlrctl, a command line tool similar to xdotool and wmctrl. wlrctl only works if your compositor supports specific, non-default Wayland protocols, such as wlr-foreign-toplevel-management.
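The practical consequence is that a script has to work out which desktop it is running on before it can pick a tool. The sketch below shows one way that choice might look. It reads the `XDG_SESSION_TYPE` and `XDG_CURRENT_DESKTOP` environment variables; the function name and the returned labels are made up for this example, and falling back to wlrctl for every other Wayland compositor is only a guess, since there is no standard way to tell whether a compositor is wlroots-based.

```python
import os

def pick_window_tool(env):
    """Guess which window-management tool fits the current session.

    `env` is a mapping such as os.environ. Returns a label, or None
    when the session type is unknown.
    """
    session = env.get("XDG_SESSION_TYPE", "").lower()
    # XDG_CURRENT_DESKTOP is a colon-separated list, e.g. "ubuntu:GNOME".
    desktops = env.get("XDG_CURRENT_DESKTOP", "").lower().split(":")

    if session == "x11":
        return "xdotool"                 # works with almost any window manager
    if session != "wayland":
        return None
    if "kde" in desktops:
        return "kdotool"                 # KDE-specific scripting over DBus
    if "gnome" in desktops:
        return "gnome-shell-extension"   # needs an extension exposing DBus methods
    return "wlrctl"                      # assumption: a wlroots-based compositor

if __name__ == "__main__":
    print(pick_window_tool(os.environ))
```

Every branch after the first leads to a different tool with a different feature set, which is the fragmentation in miniature.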

If we contrast the above with xdotool, today, on X11, perhaps my confusion and wonder become more clear - xdotool works with almost any window manager in X11 - typing, window movement, window search, etc. On Wayland, each compositor will need its own unique implementation as shown above with kdotool which only works on Wayland+KDE, not GNOME or anything else.

The fragmentation is perhaps a natural outcome of Wayland promising to focus on the smallest replacement for X11, and that smallness elides a great deal of functionality. The missing features are actually still necessary, like screen sharing, and with no apparent central leadership or community, the outcome feels predictable.

Of third-party Wayland protocols, there are just so many input-related ones: input method v1, input method v2, text input v3, KDE fake input, and virtual keyboard. And that’s just Wayland protocols - the KDE and GNOME XDG RemoteDesktop thingy isn’t even Wayland-related at all.

The weirdest thing I’ve learned, here, is the newer XTEST support in Xwayland. The component chain is really wild:

- An X11 client sends a key event using XTEST (normal).
- Xwayland receives it and initiates a Remote Desktop XDG Portal session to … your own system (???)
- XDG Portal uses DBus in an odd way, with many method calls receiving responses via signals because DBus isn’t designed for long asynchronous methods.
- Once the Remote Desktop portal session is set up, Xwayland asks for a file descriptor to talk to a libei server (emulated input server).
- After that, libei is used to send events, query the keyboard map, etc.
- You can ask libei for the keyboard mapping (keycodes to keysyms, etc). You get another file descriptor and process that with yet another library, libxkbcommon.
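The "responses via signals" step is easier to see written down. As I understand the portal documentation, each of the calls `CreateSession`, `SelectDevices`, and `Start` on `org.freedesktop.portal.RemoteDesktop` returns a request object, and the actual result arrives later as a `Response` signal on that object's path, so a client works out the path in advance and subscribes before making the call. The sketch below only computes those paths and lists the call order; it makes no DBus calls, and the token names and the `plan` helper are invented for illustration.

```python
def request_path(unique_name, handle_token):
    """Object path on which the portal emits the Response signal.

    The sender part is the caller's unique bus name with the leading ':'
    dropped and every '.' replaced by '_'.
    """
    sender = unique_name[1:] if unique_name.startswith(":") else unique_name
    sender = sender.replace(".", "_")
    return f"/org/freedesktop/portal/desktop/request/{sender}/{handle_token}"

def plan(unique_name):
    """List the RemoteDesktop calls in order, with the path to listen on.

    ConnectToEIS has no request path here because it hands back the
    file descriptor directly.
    """
    steps = []
    for index, method in enumerate(("CreateSession", "SelectDevices", "Start")):
        token = f"sketch{index}"  # hypothetical handle_token values
        steps.append((method, request_path(unique_name, token)))
    steps.append(("ConnectToEIS", None))
    return steps

if __name__ == "__main__":
    for method, path in plan(":1.42"):
        print(method, path or "(returns a file descriptor)")
```

That is three round trips, each one a method call paired with a signal, before a single key event can be sent. The permission prompt shows up somewhere in the middle of that sequence.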

If Conway’s Law applies to this, then I really want to learn more of the system (of people) that builds this kind of Rube-Goldberg device. Looking back, Wayland folks sent virtual input into the “never, because security!” dumpster bin, so is this the path that routes around those nay-sayers? Wild.

(With respect, the documentation for libei is pretty nice, and I find the C code easy to read - I have no complaints there!)

I’m not alone in finding the path to Wayland very slow. Synergy only delivered experimental support for Wayland a year ago, 8 years after Fedora first defaulted to Wayland, and it only happened after GNOME and friends implemented this weird XDG Portal Remote Desktop thing plus libei, which seems to have landed in Xwayland around 2023-ish.

As I learned about libei and XDG Portal recently, I wrote some test code to send some keyboard events. Writing my own software, running on my own machine, GNOME still prompted me “Allow remote interaction?” with seemingly no way to permanently allow my own requests. I’m not the only one confused by GNOME and KDE making such prompts.

Meanwhile, on X11, xdotool runs correctly and without restrictions… The fragmentation is upsetting for me, as the xdotool maintainer, because I want to make this work but am at a loss for how to proceed with all the fragmentation. I don’t mind what the protocol is, but I sure would love to have any protocol that does what I need. Is it worth it to continue?
