XMPP and Metadata

Original link: https://blog.mathieui.net/xmpp-and-metadata.html

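## XMPP and Metadata: Summary

This talk, originally given at the Chaos Communication Congress, examines the metadata problem in the XMPP messaging protocol. Although XMPP offers extensibility and federation, every message sent leaks information (sender, receiver, and time) to the servers involved, even when end-to-end encryption (E2EE) is used. Trusting your server is essential, because a server compromise exposes all metadata. The talk outlines four main metadata threats: server compromise, real-time data correlation, exploitation of static server data, and network-level observation. Several potential solutions are discussed, including serverless messaging (XEP-0174), which bypasses servers but sacrifices encryption, and XTLS, for direct, encrypted client-to-client connections. Cryptographic identities (XEP-0416) offer another way to reduce trust in servers. XMPP is compared with other protocols: Signal is cryptographically strong, but it is centralized and in a position to collect metadata; Matrix replicates data across servers, which worsens the metadata problem; SimpleX prioritizes privacy, with cryptography and onion routing built in. Ultimately, improving XMPP's metadata handling requires sustained effort despite limited resources. Adoption may be slow, but incremental improvements can strengthen the protocol and the ecosystem.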


Original text

I had the pleasure of giving a talk on "XMPP and Metadata" during the last Chaos Communication Congress, in the Critical Decentralization Cluster area. It was my first public presentation in a very long while (also in English), so the talk went okay-ish at best. The end of the year was also hectic and I did not manage to prepare or rehearse as much as I would have liked to.

This blog post will be a longer, more complete version of the talk. You can nonetheless find the talk slides on the CDC pretalx. Thanks a lot to the people who proofread the blog post to fix stuff or suggest additional content.

The talk was about metadata, but also more broadly about data retention and what the server sees.

Obvious message workflow

This might be too obvious for most people, but for clarity’s sake, I want to assert that to send a message to another entity, you need:

a sender
a message
a receiver

This is not a technical requirement; it is baked into the concept of sending a message. Those elements will always be present somewhere in the workflow. Assuming a working encryption system, the message data itself will not be considered here.

There are, however, some technical tricks that can hide a lot of things from the infrastructure layer.

XMPP

I cannot really make this an introduction to XMPP but to summarize, XMPP is an extensible federated protocol for messaging and presence. It uses XML for the most part, but nobody should care (except trolls, I guess). It started in 1999 as Jabber, and grew to be an IETF standard under the name XMPP after Jabber got bought by Cisco (we can still use the Jabber name in many ways).

The protocol started server-heavy with light clients (in fact, you will read as much in the book "XMPP: The Definitive Guide"), but the trend has reversed in the last decade due to the rise of mobile clients, which can be updated very often, and other circumstances.

There are clients and servers, and it is therefore a protocol made of client-to-server interactions and server-to-server interactions, each with their own privacy implications.

The key elements to remember in order to assess threats in the XMPP network fabric would be:

Your server is the only entity sending data to other servers. Every single bit of XML your XMPP client sends goes through your server.
Other servers will only see your interactions with their own users.

Those two points are true in most non-P2P models. Centralized models can be thought of as a special case of this model, with only one single server.
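To make those two points concrete, here is a minimal sketch of which servers get to see a stanza in a federated setup. The addresses, the `traffic` list, and the helper names are made up for illustration; this is a model of the routing rule described above, not of any real server implementation.

```python
def domain(jid: str) -> str:
    """Return the domain part of a bare or full JID (local@domain/resource)."""
    bare = jid.split("/", 1)[0]       # drop the resource, if any
    return bare.split("@", 1)[-1]     # drop the local part, if any


def servers_on_path(sender: str, receiver: str) -> list[str]:
    """Servers that relay (and therefore see) a stanza from sender to receiver."""
    src, dst = domain(sender), domain(receiver)
    return [src] if src == dst else [src, dst]


def seen_by(server: str, traffic: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Subset of the (sender, receiver) pairs that pass through `server`."""
    return [(s, r) for s, r in traffic if server in servers_on_path(s, r)]


# Hypothetical traffic generated by a single user, alice@a.example.
traffic = [
    ("alice@a.example/phone", "bob@b.example"),
    ("alice@a.example/phone", "carol@c.example"),
    ("alice@a.example/phone", "dave@a.example"),
]

for server in ("a.example", "b.example", "c.example"):
    print(server, "sees", len(seen_by(server, traffic)), "of", len(traffic), "stanzas")
```

In this toy model, Alice's own server sees all of her stanzas, while each remote server only sees the ones addressed to its own users.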

That is why it is essential to rely on a server you can trust, either operated by people you trust, or at least who have some accountability in place, for example the services listed on providers.xmpp.net.

Threats

I can roughly point out four types of "passive" threats on metadata for XMPP:

A server compromise (present)
- Correlation of data streams in real time
A server compromise (future)
- Exploitation of the static data available on the server
An attacker present on the server network
- Can see what the server does (both with clients and servers)
An attacker on the client (your) network
- Can see what your client does
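As an illustration of what an observer on the server network might attempt without breaking TLS, the toy sketch below pairs inbound client traffic with outbound server-to-server traffic purely by timing. The heuristic, the 0.5 second window, the IP addresses, and the event lists are all assumptions made up for this example; a real attacker would work with noisier data and could only ever obtain guesses this way.

```python
from bisect import bisect_left


def correlate(inbound, outbound, window=0.5):
    """Pair each inbound event with outbound events that follow within `window` seconds.

    inbound:  list of (timestamp, client_ip) seen arriving at the server
    outbound: list of (timestamp, remote_server) seen leaving the server
    Returns a list of (client_ip, [candidate remote servers]).
    """
    outbound = sorted(outbound)
    out_times = [t for t, _ in outbound]
    guesses = []
    for t_in, client in sorted(inbound):
        # First outbound event at or after the inbound one.
        i = bisect_left(out_times, t_in)
        candidates = []
        while i < len(outbound) and outbound[i][0] - t_in <= window:
            candidates.append(outbound[i][1])
            i += 1
        guesses.append((client, candidates))
    return guesses


# Hypothetical observations (timestamps in seconds).
inbound = [(10.00, "203.0.113.7"), (42.00, "198.51.100.9")]
outbound = [(10.05, "b.example"), (30.00, "c.example"), (42.10, "c.example")]

for client, candidates in correlate(inbound, outbound):
    print(client, "->", candidates)
```

On a busy server, several outbound events would fall inside each window, which is why this kind of matching stays approximate for a network observer.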

Server compromise

Live stream interception

As stated earlier, choosing a server you trust is the very first step, and if you do not trust your server (and operators) at all, why are you there?

If the server itself is compromised, the point-to-point TLS encryption between clients and servers, and between servers, becomes very much useless, which means "server network attacker" scenarios can now be matched with 100% precision, and more:

Correlation between sender and receiver is exact
The user’s XMPP address can be mapped to an IP and port
Stanza types are exposed: whether <message/>, <iq/>, or <presence/> stanzas are sent, everything is known
Activity patterns can get a lot more detailed
E2EE still works to protect data (but can get disrupted)
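The points above can be illustrated with a small sketch of what remains readable on a stanza whose body is end-to-end encrypted. The `urn:example:e2ee` namespace, the `<encrypted/>` and `<payload/>` elements, and the addresses are placeholders invented for this example (real E2EE extensions use their own element names and namespaces); only the `jabber:client` namespace and the `from`, `to`, `type`, and `id` attributes come from XMPP itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical stanza: the payload is opaque, the envelope is not.
STANZA = """
<message xmlns='jabber:client' type='chat'
         from='alice@a.example/phone' to='bob@b.example' id='m1'>
  <encrypted xmlns='urn:example:e2ee'>
    <payload>3q2+7w==</payload>
  </encrypted>
</message>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def visible_metadata(raw: str) -> dict:
    """Collect what a server can read without decrypting anything."""
    text = raw.strip()
    stanza = ET.fromstring(text)
    return {
        "stanza": local_name(stanza.tag),   # message, iq, or presence
        "type": stanza.get("type"),
        "from": stanza.get("from"),
        "to": stanza.get("to"),
        "children": [local_name(child.tag) for child in stanza],
        "size_bytes": len(text.encode("utf-8")),
    }


print(visible_metadata(STANZA))
```

Combined with the arrival time and the connection the stanza came in on, this is the kind of information a compromised server holds for every stanza, regardless of E2EE.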

Some solutions for XMPP

Other services/protocols

Signal

Signal is a pioneer in encryption systems at scale, and keeps pushing the boundaries of what is possible to do securely. Nonetheless, their messenger is centralized, with systems running on AWS and Azure (as far as I can tell), which makes them very dependent on the US political wasteland as well as the tech landlords’ whims.

They do a lot of things to ensure things are as secure as they can be within the constraints they imposed on themselves, and as such while I trust them for now, their servers certainly have a lot of opportunities to collect and store a ton of metadata, simply due to the fact that this is a centralized system. While their cryptography work is class-leading, which makes my data secure (as long as someone does not bust the secure enclave which protects my "recovery code" I guess), keeping my metadata volatile and secure there is only a leap of faith on my part, as I can have no guarantees.

(Figure: a diagram showing every client connecting to a single entity, because Signal is centralized.)

Matrix

Matrix is a federated protocol which has many of the same flaws as XMPP with regard to metadata.

One notable difference is that Matrix is more like a distributed database with built-in conflict resolution, which leads to every participant's server replicating the state (data and metadata) of the rooms they are in. This creates a more difficult situation for metadata than XMPP, because while XMPP servers can see what goes through them, in Matrix the servers are required to store this information.

Matrix also has two different sets of APIs for client-to-server and server-to-server communication, which should allow it to batch messages when appropriate.

SimpleX

SimpleX is a protocol with a lot of cryptography baked in, and has interesting properties such as the absence of user accounts and therefore identifiers (which means very little data on the servers can be compromised).

One of its more interesting properties is that it has 2-stage onion routing baked in, which allows it to sidestep many issues around metadata due to connecting to servers.

(credit: Wikipedia)

The whitepaper stresses that it is still important to choose your servers well, but that is still less critical than in XMPP since you can easily switch at very little cost.

(P.S.: calling your protocol "SMP" is not nice if it is not based on the Socialist Millionaire Protocol, I haven’t checked but skimming did not reveal any mention of it)

Conclusion

As Daniel noted on Mastodon, there is some low-hanging fruit for improving the metadata issues around XMPP (and some higher-hanging fruit as well). I agree that this is not going to be even a blip on XMPP adoption, but we should do what we can nonetheless to improve the situation, in order to improve the standard and the ecosystem. That said, I can perfectly understand that since a lot of the work is purely volunteer-driven and our time and energy are limited, it can appear to be a waste of time to dedicate them to removing bits of metadata here and there.