Achieving 10,000x training data reduction with high-fidelity labels

原始链接: https://research.google/blog/achieving-10000x-training-data-reduction-with-high-fidelity-labels/

This study investigates how a data curation process improves the performance of small LLMs. The researchers fine-tuned two Gemini Nano models (1.8B and 3.25B parameters) on tasks of differing complexity, using both raw crowdsourced data and iteratively curated datasets. Each crowdsourced dataset contained ~100K labels, heavily skewed toward "benign" responses (~95%). The curation process involved multiple rounds of expert review and data selection, aimed at improving data quality and balance (reaching ~40% positive examples). Although the models never reached expert-level alignment (Cohen's Kappa .81/.78), curation consistently improved performance. The crowdsourced data showed only moderate agreement with the experts (Kappa .59/.41), highlighting the value of curation in refining noisy labels and strengthening model training. The study stopped after 5-6 iterations because model performance plateaued before reaching expert level.

## Hacker News discussion summary: data reduction in AI training

A recent Google Research post on achieving a 10,000x reduction in training data sparked discussion on Hacker News. The core idea is to use LLMs to identify and prioritize the data points most informative for human annotation, dramatically reducing the need for massive datasets.

The conversation quickly drifted to the prevalence of online scams and clickbait ads. Users shared experiences with fraudulent ads (especially on platforms like Facebook and Google) and debated whether such ads make up a small or a large share of online advertising. Some argued that scammers exploit people's credulity, while others held that the real problem is how cheaply scams can spread online.

Further discussion touched on active learning techniques, the importance of uncertainty quantification in model training, and the challenges of defining and detecting clickbait. Many commenters stressed the need for careful data selection and the potential benefit of focusing human annotation effort on ambiguous cases.

There was also debate over the clustering methods used in the research and whether the visualizations were deliberately vague.

## Original text

We wanted to understand which models and tasks would benefit most from our curation process. As baselines for our experiments, we fine-tuned two LLMs of different sizes (Gemini Nano-1 with 1.8B parameters and Nano-2 with 3.25B parameters) on two tasks of different complexity (lower and higher, based on expert alignment) using crowdsourced labels. Each crowdsourced data set has ~100K annotations and a strong class imbalance, with around 95% benign labels on average.
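The ~95% benign skew matters because it makes raw accuracy nearly meaningless as a progress signal. A quick illustration (label counts taken from the post; the "trivial classifier" framing is ours, not the authors'):

```python
# With ~95% benign labels, a classifier that always predicts "benign"
# already scores ~95% accuracy -- so accuracy alone says little about
# model quality, and a more balanced curated set (~40% positive) gives
# a far more informative evaluation signal.
labels = [0] * 95_000 + [1] * 5_000  # ~100K crowdsourced labels, 0 = benign

always_benign_acc = sum(l == 0 for l in labels) / len(labels)
print(always_benign_acc)  # 0.95
```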

We compared each of these four baseline conditions against the corresponding curated condition, in which each model (Nano-1 and Nano-2) is fine-tuned over multiple rounds using the curation process described above. At each iteration, we selected our curated set of examples and used them for model evaluation and fine-tuning. All models plateaued before reaching parity with the experts’ internal alignment, so we stopped at 6 iterations (~400 fine-tuning and ~250 evaluation samples) for the lower complexity task and 5 iterations (~250 fine-tuning and ~150 evaluation samples) for the higher complexity task. (Note that the lower complexity task had a larger variety of examples, which may account for the longer time needed to converge.) Both data sets had a final class balance of ~40% positive examples.
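The excerpt refers to a curation process "described above" that is not included here. As a rough sketch only: one common pattern for this kind of loop is to surface examples where the current model disagrees with the crowdsourced label and route those to experts for relabeling. The function below illustrates that pattern; the disagreement-based selection, the `pool` record shape, and the `model_predict`/`expert_label` callables are all our assumptions, not the authors' actual method.

```python
import random

def curation_round(pool, model_predict, expert_label, k=100):
    """One illustrative curation iteration: find examples where the model's
    prediction disagrees with the crowdsourced label, sample up to k of them,
    and return expert-relabeled records for fine-tuning/evaluation."""
    disagreements = [ex for ex in pool
                     if model_predict(ex["text"]) != ex["crowd_label"]]
    batch = random.sample(disagreements, min(k, len(disagreements)))
    return [{"text": ex["text"], "label": expert_label(ex["text"])}
            for ex in batch]

# Hypothetical usage: a pool of crowdsourced records, a stub model that
# always predicts 0 (benign), and a stub expert that always answers 1.
pool = [{"text": f"example {i}", "crowd_label": i % 2} for i in range(10)]
curated = curation_round(pool, model_predict=lambda t: 0,
                         expert_label=lambda t: 1, k=3)
```

Each round of such a loop concentrates expert effort on the ambiguous or mislabeled cases, which is consistent with the small per-iteration sample counts (hundreds, not thousands) reported in the post.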

The table below provides an overview of the scale and quality of the data used in each condition. Experts reached an average pairwise Cohen’s Kappa of .81 (on the lower complexity task) and .78 (on the higher complexity task) through the curation process. We consider these the ceiling for model performance. To assess the quality of our crowdsourced data, we calculated Kappa alignment between crowdsourced annotations and experts based on our full curated set, which was .59 (lower complexity) and .41 (higher complexity).
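The Cohen's Kappa figures above measure chance-corrected agreement between two raters: observed agreement minus the agreement expected if both labeled at random according to their own label frequencies. For categorical labels it can be computed directly; a minimal stdlib sketch (not the authors' evaluation code):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two raters' label sequences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label
    # independently, given each rater's marginal label frequencies.
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)
```

For example, two raters who agree perfectly score 1.0, while two raters whose agreement is exactly what their label frequencies would produce by chance score 0.0; the .59/.41 crowd-vs-expert values fall in the "moderate" range on the usual interpretation scale.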
