Fugu-MT Paper Translation (Summary): Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions

Paper summary: Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions

arxiv url: http://arxiv.org/abs/2506.07527v1
Date: Mon, 09 Jun 2025 08:11:20 GMT
Status: Translation complete
Last updated in system: 2025-06-10 21:10:47.125766
Title: Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions
Title (reference translation): Learning What Reinforcement Learning Cannot: Interleaving Online Fine-Tuning for the Hardest Questions
Authors: Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Bin Cui, Wentao Zhang
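Abstract summary: Large language model (LLM) reasoning has shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). The paper introduces ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is trained primarily with RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and the training process alternates between the two.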
Reference score (independently computed attention score): 28.962415274754537
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and the training process alternates between RL and fine-tuning to enhance the model's reasoning abilities. ReLIFT achieves an average improvement of over +5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscore its significant potential.
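Abstract (reference translation): Recent advances in large language model (LLM) reasoning show that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). Despite these successes, RL in its current form is insufficient to induce capabilities beyond the limits of the base model, because it optimizes on the model's existing knowledge rather than promoting the acquisition of new information. To address this limitation, supervised fine-tuning (SFT) is used to learn what RL cannot, making it possible to incorporate new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT in LLM reasoning and show that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is effective at enabling progress on questions beyond the model's current scope. Motivated by the complementary strengths of RL and SFT, we introduce a new training method, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is trained primarily with RL, but when it encounters difficult questions, high-quality solutions are collected for fine-tuning, and training alternates between RL and fine-tuning to improve the model's reasoning ability. Compared with other zero-RL models, ReLIFT achieves an average improvement of more than 5.2 points across five competition-level benchmarks and one out-of-distribution benchmark. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide evidence that ReLIFT overcomes the fundamental limitations of RL and underscore its significant potential.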
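
Illustrative sketch (not from the paper): the interleaving described in the abstract can be pictured as a toy scheduler that runs an RL update on every batch, buffers questions judged hard together with a collected solution, and runs a fine-tuning update once the buffer fills. Everything concrete here is an assumption made for illustration: the class name `InterleavedTrainer`, the callbacks `rl_step`, `sft_step`, and `get_demonstration`, the buffer size, and above all the hardness rule (zero rollout accuracy), which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (question, high-quality solution)


@dataclass
class InterleavedTrainer:
    """Toy scheduler: RL on every batch, SFT whenever the hard-question buffer fills.

    All names and thresholds are hypothetical; they are not taken from the paper.
    """
    rl_step: Callable[[List[str]], None]        # placeholder for an RL update
    sft_step: Callable[[List[Demo]], None]      # placeholder for a fine-tuning update
    get_demonstration: Callable[[str], str]     # placeholder source of solutions
    sft_batch_size: int = 2                     # assumed buffer size
    buffer: List[Demo] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def train_on_batch(self, questions: List[str], rollout_accuracy: List[float]) -> None:
        # 1) The model is trained primarily with RL.
        self.rl_step(questions)
        self.log.append("rl")
        # 2) Assumed hardness rule: no sampled answer was correct.
        for question, accuracy in zip(questions, rollout_accuracy):
            if accuracy == 0.0:
                self.buffer.append((question, self.get_demonstration(question)))
        # 3) Interleave a fine-tuning step whenever enough hard questions are buffered.
        while len(self.buffer) >= self.sft_batch_size:
            batch = self.buffer[: self.sft_batch_size]
            del self.buffer[: self.sft_batch_size]
            self.sft_step(batch)
            self.log.append("sft")


def demo():
    """Run three toy batches and return the step log and the SFT batches seen."""
    sft_batches: List[List[Demo]] = []
    trainer = InterleavedTrainer(
        rl_step=lambda questions: None,
        sft_step=sft_batches.append,
        get_demonstration=lambda q: f"demo for {q}",
    )
    trainer.train_on_batch(["q1", "q2"], [1.0, 0.0])
    trainer.train_on_batch(["q3", "q4"], [0.0, 0.5])
    trainer.train_on_batch(["q5", "q6"], [1.0, 1.0])
    return trainer.log, sft_batches


if __name__ == "__main__":
    print(demo())
```

In this toy run, only the questions with zero rollout accuracy reach the fine-tuning callback, so demonstrations are requested for a subset of the data rather than for every question. The actual selection criterion and schedule used by ReLIFT should be taken from the paper itself.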


Related papers