Harnessing the Universal Geometry of Embeddings

Original link: https://arxiv.org/abs/2505.12540

Rishi Jha and colleagues present a novel unsupervised method for translating text embeddings between different vector spaces without paired data, encoders, or predefined sets of matches. Their approach exploits a universal latent representation that, per the Platonic Representation Hypothesis, is conjectured to reflect a "universal semantic structure." The method translates embeddings between models with different architectures, parameter counts, and training data while maintaining high cosine similarity, meaning an embedding from one model can be accurately represented in another model's vector space. The work also highlights a potential security vulnerability in vector databases: even with access only to embedding vectors, an attacker can extract sensitive information about the source documents, enabling classification and attribute inference. The paper was posted on May 18, 2025 and revised on May 20, 2025.

Jack, an author of a new paper on arXiv, discusses a method for translating text embeddings between vector spaces without paired data, a feat some experts believed impossible. This has implications for embedding security, as it demonstrates that embeddings are not inherently encrypted, even without model access. The method builds upon the "Platonic Representation Hypothesis" and applies to models trained on large, diverse internet datasets. The research tackles a harder problem than simply aligning embeddings of the same data: given an embedding of unknown text, it aims to generate a comparable embedding in a different model's space. Questions arose about detecting concepts encodable in one model but not another, and about modifying models to prevent them from representing specific ideas. While alignment methods exist, the authors argue their approach is unique in translating embeddings without requiring seed dictionaries or pre-matched data. Concerns were raised about the generality of the name "vec2vec" and the lack of comparison to existing alignment algorithms. The author also acknowledges the possibility of watermarking embeddings, noting the tradeoff with embedding quality.
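To make the contrast with existing alignment methods concrete, here is a minimal sketch of the classical supervised baseline, orthogonal Procrustes alignment, which recovers a rotation between two embedding spaces but, unlike vec2vec, requires paired embeddings of the same texts. All data below is synthetic and the setup is illustrative, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Simulate "model A" embeddings and a hidden orthogonal map to "model B".
X = rng.normal(size=(100, d))                  # embeddings from model A
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # unknown orthogonal map
Y = X @ Q                                      # paired embeddings from model B

# Orthogonal Procrustes: W minimizing ||XW - Y||_F comes from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Translated embeddings align almost perfectly with the targets.
sims = [cos(x @ W, y) for x, y in zip(X, Y)]
print(min(sims))  # ~1.0
```

The key point is the supervision: the SVD step consumes the paired matrix `X.T @ Y`, which is exactly the "seed dictionary" that vec2vec claims to do without.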
Related Articles
  • Embeddings: What They Are and Why They Matter 2023-10-26
  • Word Embeddings Are Underrated 2025-05-12
  • (Comments) 2025-05-18
  • (Comments) 2024-06-08
  • (Comments) 2024-04-18

  • Original Article

    View a PDF of the paper titled Harnessing the Universal Geometry of Embeddings, by Rishi Jha and 3 other authors

    Abstract: We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. Our unsupervised approach translates any embedding to and from a universal latent representation (i.e., a universal semantic structure conjectured by the Platonic Representation Hypothesis). Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets.
    The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for the security of vector databases. An adversary with access only to embedding vectors can extract sensitive information about the underlying documents, sufficient for classification and attribute inference.
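The attribute-inference risk described above can be illustrated with a toy attack: an adversary holding only embedding vectors (no encoder, no text) trains a simple classifier to recover a sensitive attribute. Everything here is synthetic and hypothetical; it only shows why geometry-preserving embeddings leak document properties.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Simulate embeddings whose distribution shifts with a sensitive attribute
# (e.g. topic or author group), as real text embeddings do.
mu0, mu1 = rng.normal(size=d), rng.normal(size=d)
X0 = mu0 + 0.5 * rng.normal(size=(200, d))  # attribute = 0
X1 = mu1 + 0.5 * rng.normal(size=(200, d))  # attribute = 1

# Nearest-centroid "attack": estimate class centroids from a small
# labelled subset, then classify the remaining raw vectors.
c0, c1 = X0[:20].mean(axis=0), X1[:20].mean(axis=0)

def infer(x):
    return 0 if np.linalg.norm(x - c0) < np.linalg.norm(x - c1) else 1

preds = [infer(x) for x in np.vstack([X0[20:], X1[20:]])]
labels = [0] * 180 + [1] * 180
accuracy = np.mean([p == t for p, t in zip(preds, labels)])
print(accuracy)  # well above the 0.5 chance baseline
```

Because translation preserves geometry, the same centroid attack works after the vectors are mapped into a space the attacker understands, which is the security concern the abstract raises.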
    From: Rishi Jha
    [v1] Sun, 18 May 2025 20:37:07 UTC (3,179 KB)
    [v2] Tue, 20 May 2025 15:38:41 UTC (3,180 KB)
    Contact us: contact @ memedata.com