Reinforcement Learning from Human Feedback

Original link: https://arxiv.org/abs/2504.12501

Nathan Lambert's book Reinforcement Learning from Human Feedback (RLHF) provides a comprehensive introduction to this increasingly important technique for deploying modern machine learning systems. Aimed at readers with a quantitative background, the book traces the origins of RLHF across fields such as economics and control theory, and establishes the foundational definitions and mathematical tools. The core of the book details the complete RLHF optimization process: starting with instruction tuning, moving through reward model training, and culminating in techniques such as rejection sampling and direct alignment. It covers both established methods and emerging algorithms. Finally, the book explores advanced topics, including synthetic data, evaluation challenges, and open research questions in the field, offering a valuable resource for understanding and contributing to the future of RLHF. Since its initial release in April 2025, the work has been iteratively revised, up to the current January 2026 version.
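The reward-modeling stage mentioned above is typically trained on pairwise human preference data. As a rough illustration (not code from the book), the sketch below shows the standard Bradley-Terry pairwise loss in PyTorch, assuming the reward model already outputs a scalar score for each preferred ("chosen") and dispreferred ("rejected") response. Minimizing this loss increases the margin by which the chosen response out-scores the rejected one.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: maximize P(chosen preferred over rejected)
    modeled as sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example: scalar reward-model scores for a batch of 3 preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
print(bradley_terry_loss(chosen, rejected))  # lower loss = better fit to the preferences
```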

Hacker News: Reinforcement Learning from Human Feedback (arxiv.org), 13 points, by onurkanbkrc, 1 hour ago, 1 comment. klelatti: Web version with links etc.: https://rlhfbook.com/

Original

Abstract:Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.
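Among the "direct alignment algorithms" mentioned in the abstract, Direct Preference Optimization (DPO) is a representative example. The following is a minimal illustrative sketch (not taken from the book) of the DPO objective, assuming summed per-response log-probabilities have already been computed under both the trained policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: push the policy to raise the log-probability of the
    preferred response relative to a frozen reference model, and lower it
    for the rejected response, scaled by the temperature beta."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy example: summed sequence log-probabilities for 2 preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -9.4]))
print(loss)
```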
From: Nathan Lambert
[v1] Wed, 16 Apr 2025 21:36:46 UTC (5,200 KB)
[v2] Wed, 11 Jun 2025 15:15:22 UTC (7,032 KB)
[v3] Sun, 2 Nov 2025 20:03:47 UTC (7,093 KB)
[v4] Fri, 2 Jan 2026 00:09:40 UTC (8,065 KB)
[v5] Sat, 17 Jan 2026 17:17:41 UTC (8,732 KB)