Fugu-MT paper translation (abstract): Beyond Imitation: Recovering Dense Rewards from Demonstrations

Paper overview: Beyond Imitation: Recovering Dense Rewards from Demonstrations

arxiv url: http://arxiv.org/abs/2510.02493v1
Date: Thu, 02 Oct 2025 18:58:26 GMT
Status: translation complete
Last updated in system: 2025-10-06 16:35:52.137684
Title: Beyond Imitation: Recovering Dense Rewards from Demonstrations
Title (reference translation): Beyond Imitation: Recovering Dense Rewards from Demonstrations
Authors: Jiangnan Li, Thuy-Trang Vu, Ehsan Abbasnejad, Gholamreza Haffari
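Abstract summary: Supervised fine-tuning (SFT) is conventionally treated as a simple imitation learning process that only trains a policy to imitate expert behavior on a dataset. The authors show that the SFT process learns not only a policy but also an implicit, dense, token-level reward model that explains the expert demonstrations. Dense-Path REINFORCE consistently outperforms the original SFT models on instruction-following benchmarks.
Reference score (attention level computed independently by the site): 64.05543657441218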
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Conventionally, supervised fine-tuning (SFT) is treated as a simple imitation learning process that only trains a policy to imitate expert behavior on demonstration datasets. In this work, we challenge this view by establishing a fundamental equivalence between SFT and Inverse Reinforcement Learning. We prove that the SFT objective is a special case of Inverse Q-Learning, which implies that the SFT process does not just learn a policy, but also an implicit, dense, token-level reward model that explains the expert demonstrations. We then show how to recover this dense reward signal directly from the SFT model by formulating a baseline-relative reward function. The availability of such a dense reward model offers numerous benefits, providing granular credit assignment for each token generated. We demonstrate one key application by using these recovered rewards to further improve the policy with reinforcement learning. Our method, Dense-Path REINFORCE, consistently outperforms the original SFT models on instruction-following benchmarks. This work reframes SFT not merely as policy imitation but as a powerful reward learning mechanism, opening new possibilities for leveraging expert demonstrations.
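Abstract (reference translation): Supervised fine-tuning (SFT) has conventionally been treated as a simple imitation learning process that only trains a policy to imitate expert behavior on demonstration datasets. This work challenges that view by establishing a fundamental equivalence between SFT and inverse reinforcement learning. We prove that the SFT objective is a special case of Inverse Q-Learning, which implies that the SFT process learns not only a policy but also an implicit, dense, token-level reward model that explains the expert demonstrations. We then show how to recover this dense reward signal directly from the SFT model by formulating a baseline-relative reward function. Having such a dense reward model brings numerous benefits, including granular credit assignment for each generated token. We demonstrate one key application: using the recovered rewards to further improve the policy with reinforcement learning. Our method, Dense-Path REINFORCE, consistently outperforms the original SFT models on instruction-following benchmarks. This work reframes SFT not as mere policy imitation but as a powerful reward learning mechanism, opening new possibilities for leveraging expert demonstrations.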
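
The abstract mentions a "baseline-relative reward function" and token-level credit assignment but does not give the formula. As a minimal sketch only, assume the reward for each token is its log-probability under the SFT model minus its log-probability under some baseline model (for example, an earlier checkpoint). This definition, the function names `dense_rewards` and `returns_to_go`, and the numbers below are illustrative assumptions, not the paper's stated method. The second function computes the return-to-go that a REINFORCE-style update would typically use to weight each token.

```python
from typing import List


def dense_rewards(logp_sft: List[float], logp_base: List[float]) -> List[float]:
    """Per-token reward under the ASSUMED definition:
    log-prob under the SFT model minus log-prob under a baseline model."""
    if len(logp_sft) != len(logp_base):
        raise ValueError("sequences must have the same length")
    return [s - b for s, b in zip(logp_sft, logp_base)]


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of rewards from each position to the end of the sequence."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each position accumulates everything after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Hypothetical log-probabilities of three sampled tokens under each model.
logp_sft = [-0.5, -1.0, -0.25]
logp_base = [-1.5, -1.0, -2.25]

rewards = dense_rewards(logp_sft, logp_base)   # one reward per token
weights = returns_to_go(rewards)               # per-token REINFORCE weights
print(rewards, weights)
```

In this toy case the second token gets zero reward because both models assign it the same probability, which is the kind of per-token distinction a single sequence-level reward cannot express.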

Related papers