Fugu-MT Paper Translation (Summary): Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization

Paper summary: Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization

arxiv url: http://arxiv.org/abs/2603.24093v1
Date: Wed, 25 Mar 2026 08:52:56 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.217361
Title: Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization
Title (reference translation): Toward Effective Experiential Learning: Dual Guidance for Utilization and Internalization
Authors: Fei Bai, Zhipeng Chen, Chuan Hao, Ming Yang, Ran Tao, Bryan Dai, Wayne Xin Zhao, Jian Yang, Hongteng Xu
Abstract summary: We propose Dual Guidance Optimization (DGO) to improve training effectiveness.
Reference score (independently computed attention score): 71.41478888201401
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, reinforcement learning (RL) has become an important approach for improving the capabilities of large language models (LLMs). In particular, reinforcement learning from verifiable rewards (RLVR) has emerged as a promising paradigm for reasoning tasks. However, existing RL-based training remains only a rough approximation of human learning. Human learners leverage both external and internal experience to guide exploration and gradually internalize useful trajectories into stable knowledge. Motivated by this gap, we ask: how can LLMs better utilize and internalize experience during RLVR training? To answer this question, we propose Dual Guidance Optimization (DGO), a unified framework that leverages external and internal experience to improve training effectiveness. Specifically, DGO first constructs an experience bank from previously explored trajectories. The policy then performs exploration under the joint guidance of the experience bank and the model's internal knowledge. The resulting trajectories are further used to refine the experience bank and optimize model parameters, forming a closed loop of experience utilization and internalization. Experiments show that DGO consistently outperforms baseline methods, suggesting that better utilization and internalization of experience lead to more effective reasoning.
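Abstract (reference translation): In recent years, reinforcement learning (RL) has become an important approach for improving the capabilities of large language models (LLMs). In particular, reinforcement learning from verifiable rewards (RLVR) has emerged as a promising paradigm for reasoning tasks. However, existing RL-based training remains only a rough approximation of human learning. Human learners draw on both external and internal experience to guide exploration and gradually internalize useful trajectories into stable knowledge. How can LLMs better utilize and internalize experience during RLVR training? To answer this question, we propose Dual Guidance Optimization (DGO), a unified framework that leverages external and internal experience to improve training effectiveness. Specifically, DGO first constructs an experience bank from previously explored trajectories. The policy then explores under the joint guidance of the experience bank and the model's internal knowledge. The resulting trajectories are then used to refine the experience bank and optimize the model parameters, forming a closed loop of experience utilization and internalization. Experiments show that DGO consistently outperforms baseline methods, suggesting that better utilization and internalization of experience lead to more effective reasoning.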
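
The closed loop described in the abstract (retrieve from an experience bank, explore under joint guidance, score with a verifiable reward, then refine the bank and update the policy) can be illustrated with a toy sketch. This is not the authors' implementation: the names `ExperienceBank`, `ToyPolicy`, `verifiable_reward`, and `train`, the integer-addition task, the scalar `skill` parameter, and the update rule are all assumptions chosen for illustration. The abstract does not specify how the bank is built, how the two kinds of guidance are combined, or what objective is optimized.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ExperienceBank:
    """Hypothetical store of previously explored trajectories, keyed by task."""
    entries: dict = field(default_factory=dict)

    def retrieve(self, task):
        # External experience: trajectories already recorded for this task.
        return self.entries.get(task, [])

    def refine(self, task, trajectory, reward):
        # Assumed refinement rule: keep only verified, non-duplicate trajectories.
        if reward > 0 and trajectory not in self.entries.get(task, []):
            self.entries.setdefault(task, []).append(trajectory)


def verifiable_reward(task, answer):
    """RLVR-style check on a toy task: 1.0 only if the sum is exactly right."""
    a, b = task
    return 1.0 if answer == a + b else 0.0


class ToyPolicy:
    """Stand-in for an LLM policy; `skill` plays the role of model parameters."""

    def __init__(self, skill=0.2, lr=0.1, seed=0):
        self.skill = skill
        self.lr = lr
        self.rng = random.Random(seed)

    def explore(self, task, hints):
        # External guidance: reuse the most recent banked trajectory if any.
        if hints:
            return hints[-1]
        # Internal guidance: answer correctly with probability `skill`.
        a, b = task
        return a + b if self.rng.random() < self.skill else a + b + 1

    def update(self, reward):
        # Internalization (assumed rule): move `skill` toward the observed reward.
        self.skill += self.lr * (reward - self.skill)


def train(tasks, epochs=5):
    bank, policy = ExperienceBank(), ToyPolicy()
    for _ in range(epochs):
        for task in tasks:
            hints = bank.retrieve(task)               # utilize experience
            answer = policy.explore(task, hints)      # guided exploration
            reward = verifiable_reward(task, answer)  # verifiable reward
            bank.refine(task, answer, reward)         # refine the bank
            policy.update(reward)                     # optimize parameters
    return bank, policy


if __name__ == "__main__":
    bank, policy = train([(1, 2), (3, 4)], epochs=20)
    print(bank.entries, round(policy.skill, 3))
```

In this sketch the bank and the policy feed each other: a verified trajectory enters the bank, later exploration on the same task reuses it, and the resulting rewards pull the policy parameter upward. A real system would replace each piece with the paper's actual components.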
