Fugu-MT Paper Translation (Abstract): Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only

Paper overview: Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only

arxiv url: http://arxiv.org/abs/2510.21090v1
Date: Fri, 24 Oct 2025 02:02:13 GMT
Status: Translation complete
Last updated in system: 2025-10-28 09:00:15.356785
Title: Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only
Title (reference translation): Self-Rewarding PPO: Aligning Large Language Models Using Only Demonstrations
Authors: Qingru Zhang, Liang Qiu, Ilgee Hong, Zhenghao Xu, Tianyi Liu, Shiyang Li, Rongzhi Zhang, Zheng Li, Lihong Li, Bing Yin, Chao Zhang, Jianshu Chen, Haoming Jiang, Tuo Zhao
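Abstract summary: Supervised fine-tuning (SFT) has emerged as a crucial method for aligning large language models with human-annotated demonstrations. This paper proposes Self-Rewarding PPO.
Reference score (independently computed attention score): 70.43369087819332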
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Supervised fine-tuning (SFT) has emerged as a crucial method for aligning large language models (LLMs) with human-annotated demonstrations. However, SFT, being an off-policy approach similar to behavior cloning, often struggles with overfitting and poor out-of-domain generalization, especially in limited-data scenarios. To address these limitations, we propose Self-Rewarding PPO, a novel fine-tuning method that leverages on-policy techniques to enhance generalization performance. Our approach combines the strengths of SFT and proximal policy optimization (PPO) to achieve more effective alignment from demonstration data. At its core is a reward function designed as the log policy ratio between the SFT model and the pretrained base model. This function serves as an implicit reward signal, using the pretrained policy as a baseline and the SFT policy as a target. By doing so, it enables on-policy fine-tuning without relying on human preference annotations. The integration of this self-rewarding mechanism with PPO addresses key limitations of SFT, improving generalization, data efficiency, and robustness. Our empirical evaluation across a range of natural language processing tasks demonstrates that Self-Rewarding PPO consistently outperforms traditional SFT methods. The results highlight the effectiveness of our approach in aligning LLMs using demonstration data, particularly in scenarios where high-quality annotated data is scarce.
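Abstract (reference translation): Supervised fine-tuning (SFT) has emerged as a crucial method for aligning large language models (LLMs) with human-annotated demonstrations. However, SFT is an off-policy approach similar to behavior cloning, and it often struggles with overfitting and out-of-domain generalization, especially in limited-data scenarios. To address these limitations, we propose Self-Rewarding PPO. The proposed method combines the strengths of SFT and PPO to achieve more effective alignment from demonstration data. At its core is a reward function designed as the log policy ratio between the SFT model and the pretrained base model. This function serves as an implicit reward signal, with the pretrained policy as the baseline and the SFT policy as the target. This enables on-policy fine-tuning without relying on human preference annotations. Integrating this self-rewarding mechanism with PPO addresses key limitations of SFT and improves generalization, data efficiency, and robustness. Empirical evaluation on natural language processing tasks shows that Self-Rewarding PPO consistently outperforms traditional SFT methods. The results highlight the effectiveness of the approach for aligning LLMs with demonstration data, particularly in scenarios where high-quality annotated data is scarce.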
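
The abstract describes the reward as the log policy ratio between the SFT model and the pretrained base model. A minimal sketch of that quantity follows, assuming the ratio is taken at the sequence level as r(x, y) = log π_SFT(y|x) − log π_base(y|x), computed by summing per-token log-probabilities. The function names and the sample probabilities are hypothetical, and the abstract does not say whether the paper uses token-level rewards, a scaling coefficient, or an additional KL term, so none of those are modeled here.

```python
import math
from typing import Sequence


def sequence_log_prob(token_log_probs: Sequence[float]) -> float:
    """Sum per-token log-probabilities to get log pi(y | x) for a whole response."""
    return math.fsum(token_log_probs)


def log_ratio_reward(
    sft_token_log_probs: Sequence[float],
    base_token_log_probs: Sequence[float],
) -> float:
    """Implicit reward r(x, y) = log pi_sft(y | x) - log pi_base(y | x).

    Assumption: sequence-level reward; both inputs score the same response tokens.
    """
    if len(sft_token_log_probs) != len(base_token_log_probs):
        raise ValueError("both models must score the same response tokens")
    return sequence_log_prob(sft_token_log_probs) - sequence_log_prob(base_token_log_probs)


# Hypothetical per-token probabilities for one sampled three-token response.
sft_lp = [math.log(0.5), math.log(0.4), math.log(0.8)]
base_lp = [math.log(0.25), math.log(0.4), math.log(0.4)]

# Positive when the SFT model finds the response more likely than the base model does.
reward = log_ratio_reward(sft_lp, base_lp)
```

In an on-policy loop, the response would be sampled from the policy being trained and then scored by both frozen models. A reward above zero means the SFT model prefers the response more strongly than the base model does, which fits the abstract's framing of the pretrained policy as the baseline and the SFT policy as the target.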

Related papers