Exploring the internal representations of Pangram 3.3.2

Original link: https://www.pangram.com/pangram-space

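This study examines "humanizers": adversarial tools that modify AI-generated text to evade detection. To analyze them, the authors assembled a dataset of 1,900 samples, each processed by one of ten undisclosed humanizer services. Although standard detectors struggle to tell humanized text from genuine human writing, the study shows that the model's internal representations tell a different story. In the model's embedding space, humanized text does not blend into the human or AI samples; it forms distinct, isolated clusters. The researchers hypothesize that the model internally recognizes humanized text as a separate category, but that the final output layer cannot classify it consistently. To test this, they trained a three-way linear probe to separate human, AI, and humanized text. The probe reached 98% accuracy, confirming that despite adversarial modification the model retains an intrinsic ability to distinguish humanized content from genuine human writing.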


Humanizers are a class of adversarial tools designed to modify AI-generated text so that it evades AI detectors. We previously published a paper on these tools. To see where humanized text sits relative to human and AI text in activation space, we created a separate humanizers dataset of roughly 1,900 samples, approximately balanced across three generative models (Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-5), ten different humanizer services, and the same source domains as the original interpretability dataset. Because of the adversarial risks, we do not disclose which services we use.

How the Model Reads Out Humanizers

Certain samples from our humanizer dataset are indeed challenging for our model to detect. Here, we reuse the linear probe from the human/AI task, with humanized text labelled as AI, as in the original training setup. Even from the first layer, humanized text is consistently read out as more human than its direct AI counterpart.

Fig. 8. Humanizer probe delta across layers: mean P(AI) difference between direct AI samples and their humanized counterparts.
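To make the per-layer quantity concrete, here is a minimal sketch of how a mean P(AI) gap could be computed with a binary linear probe. It is not our actual pipeline: the activations are synthetic Gaussians, and the names `fit_logistic_probe` and `mean_p_ai_gap`, the layer count, and the size of the shift toward the human region are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_probe(X, y, lr=0.5, steps=300):
    """Fit a binary linear probe (logistic regression) by gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        grad = sigmoid(X @ w + b) - y      # dLoss/dlogit for cross-entropy
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def mean_p_ai_gap(w, b, X_ai, X_humanized):
    """Mean P(AI) on direct AI samples minus mean P(AI) on humanized samples."""
    return float(sigmoid(X_ai @ w + b).mean() - sigmoid(X_humanized @ w + b).mean())

# Synthetic stand-in for per-layer activations (hypothetical, not real data).
rng = np.random.default_rng(0)
n, d, n_layers = 200, 16, 4
direction = np.zeros(d)
direction[0] = 1.0                          # assumed human-vs-AI axis

gaps = []
for layer in range(n_layers):
    X_human = rng.normal(size=(n, d)) - 1.5 * direction
    X_ai = rng.normal(size=(n, d)) + 1.5 * direction
    # Assumption: humanized text is pulled part of the way toward human text.
    X_humanized = rng.normal(size=(n, d)) + 0.3 * direction
    X = np.vstack([X_human, X_ai, X_humanized])
    # Humanized samples are labelled as AI (1), matching the setup in the text.
    y = np.concatenate([np.zeros(n), np.ones(n), np.ones(n)])
    w, b = fit_logistic_probe(X, y)
    gaps.append(mean_p_ai_gap(w, b, X_ai, X_humanized))

print([round(g, 3) for g in gaps])
```

A positive gap at a layer means the probe reads humanized text as more human than its direct AI counterpart at that layer.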

Where Humanizers Exist in Embedding Space

However, when we look beneath the final readout, we find a much richer representation of humanized text. Below, we apply our dimensionality reduction methods to the human, AI, and humanized texts. Qualitatively, we can observe that humanizers tend to occupy separate parts of activation space, and form clusters outside of the human and AI regions.

Our hypothesis is that despite not having labels for humanized text, the model is capable of distinguishing between humanized, human, and AI text. However, in the final readout, the model is forced to collapse that signal and does so inconsistently.

Fig. 9. Human, AI, and humanized text at layer 40 via t-SNE, PCA, and UMAP. Humanized text occupies a distinct region separate from the main human and AI clusters.
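As a rough illustration of the projection step, the sketch below runs PCA (one of the three methods named above) on synthetic activations with three classes and compares class centroids in the 2D projection. The data, the cluster centers, and the helper `pca_project` are assumptions for illustration only; t-SNE and UMAP would need additional libraries and are not shown.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Synthetic stand-in for layer activations (hypothetical cluster centers).
rng = np.random.default_rng(0)
d, n_per_class = 32, 150
centers = {name: np.zeros(d) for name in ("human", "ai", "humanized")}
centers["human"][0] = -4.0
centers["ai"][0] = 4.0
centers["humanized"][1] = 6.0               # assumed to sit off the human/AI axis

blocks, labels = [], []
for name, center in centers.items():
    blocks.append(rng.normal(size=(n_per_class, d)) + center)
    labels += [name] * n_per_class
X = np.vstack(blocks)
labels = np.array(labels)

Z = pca_project(X, n_components=2)
centroids = {name: Z[labels == name].mean(axis=0) for name in centers}

def centroid_distance(a, b):
    return float(np.linalg.norm(centroids[a] - centroids[b]))

for pair in (("human", "ai"), ("human", "humanized"), ("ai", "humanized")):
    print(pair, round(centroid_distance(*pair), 2))
```

Centroid distances are only a crude summary; the qualitative claim in the text rests on inspecting the projected points themselves.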

Probe

To validate this hypothesis, we train a three-way linear probe with labels for AI, human, and humanized text. The probe reaches high top-1 accuracy early in the network and eventually flattens out at 98%.
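A three-way linear probe of this kind amounts to multinomial logistic regression on frozen activations. The sketch below is one possible implementation under stated assumptions, not our training code: the activations are synthetic, and `fit_softmax_probe`, `top1_accuracy`, the learning rate, and the train/test split are hypothetical choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax_probe(X, y, n_classes, lr=0.5, steps=500):
    """Fit a multi-class linear probe (softmax regression) by gradient descent."""
    n, d = X.shape
    W, b = np.zeros((d, n_classes)), np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                # one-hot targets
    for _ in range(steps):
        G = (softmax(X @ W + b) - Y) / n    # gradient of mean cross-entropy
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def top1_accuracy(W, b, X, y):
    return float((np.argmax(X @ W + b, axis=1) == y).mean())

# Synthetic activations: 0 = human, 1 = AI, 2 = humanized (hypothetical data).
rng = np.random.default_rng(0)
d, n_per_class = 32, 300
centers = np.zeros((3, d))
centers[0, 0], centers[1, 0], centers[2, 1] = -4.0, 4.0, 6.0
X = np.vstack([rng.normal(size=(n_per_class, d)) + c for c in centers])
y = np.repeat(np.arange(3), n_per_class)

# Shuffle, then hold out 20% of the samples for evaluation.
order = rng.permutation(len(y))
X, y = X[order], y[order]
split = int(0.8 * len(y))
W, b = fit_softmax_probe(X[:split], y[:split], n_classes=3)
test_accuracy = top1_accuracy(W, b, X[split:], y[split:])
print(round(test_accuracy, 3))
```

Repeating the fit on activations from each layer would give an accuracy-by-layer curve like the one described above; the accuracy on this synthetic data says nothing about the real model.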