Fugu-MT Paper Translation (Abstract): On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

Paper abstract: On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

arxiv url: http://arxiv.org/abs/2508.11408v1
Date: Fri, 15 Aug 2025 11:20:03 GMT
Status: Translation complete
Last updated in system: 2025-08-18 14:51:23.932323
Title: On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
Title (reference translation): On-Policy RL: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
Authors: Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou
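Abstract summary: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often risk disrupting established model patterns and overfitting to expert data. The paper proposes CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting.
Reference score (independently computed attention score): 71.64063986651819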
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from expert tokens, which preserves on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on widely used benchmarks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.
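Abstract (reference translation): Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often risk disrupting established model patterns and overfitting to expert data. This paper therefore examines a unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which recasts SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of the influence of off-policy expert data at both holistic and granular levels, we incorporate a dual-control mechanism into CHORD. Specifically, a global coefficient first guides the transition from off-policy imitation to on-policy exploration; a token-wise weighting function is then applied that enables granular learning from expert tokens, preserving on-policy exploration and mitigating disruption from off-policy data. We conduct extensive experiments on widely used benchmarks and provide empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD shows significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.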
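
The dual-control mechanism described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the linear decay schedule for the global coefficient, the token weight `p * (1 - p)`, the default values, and all function names are hypothetical and are not taken from the released implementation.

```python
import math


def global_coefficient(step, total_steps, mu_start=0.9, mu_end=0.05):
    """Hypothetical linear decay of the global coefficient mu.

    A large mu early on favors off-policy imitation of expert data;
    a small mu later on favors on-policy exploration.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return mu_start + (mu_end - mu_start) * frac


def token_weight(p):
    """Hypothetical token-wise weight for an expert token.

    p is the current policy's probability of the expert token. The
    weight p * (1 - p) is small both for tokens the policy already
    predicts confidently and for tokens it finds very unlikely.
    """
    return p * (1.0 - p)


def weighted_sft_loss(expert_token_probs):
    """Mean token-weighted negative log-likelihood over expert tokens.

    Each probability must lie in (0, 1]. In a gradient-based trainer
    the weight would normally be treated as a constant.
    """
    terms = [-token_weight(p) * math.log(p) for p in expert_token_probs]
    return sum(terms) / len(terms)


def mixed_loss(rl_loss, expert_token_probs, step, total_steps):
    """Combine an on-policy RL loss with the weighted SFT term."""
    mu = global_coefficient(step, total_steps)
    return (1.0 - mu) * rl_loss + mu * weighted_sft_loss(expert_token_probs)


if __name__ == "__main__":
    probs = [0.9, 0.5, 0.1]  # made-up policy probabilities of expert tokens
    for step in (0, 50, 100):
        print(step, mixed_loss(1.0, probs, step, total_steps=100))
```

Under these assumptions, the SFT term acts as an auxiliary objective whose influence shrinks as training proceeds, while the token weight limits how strongly any single expert token can pull the policy.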

Related papers