Fugu-MT Paper Translation (Summary): Context Bootstrapped Reinforcement Learning

Paper summary: Context Bootstrapped Reinforcement Learning

arxiv url: http://arxiv.org/abs/2603.18953v1
Date: Thu, 19 Mar 2026 14:23:59 GMT
Status: Translation complete
Last updated in system: 2026-03-20 17:19:06.195374
Title: Context Bootstrapped Reinforcement Learning
Authors: Saaket Agashe, Jayanth Srinivasa, Gaowen Liu, Ramana Kompella, Xin Eric Wang
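Abstract summary: Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency. We propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by prepending few-shot demonstrations to training prompts. CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic.
Reference score (independently computed attention score): 51.213972559315486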
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum that starts high to bootstrap early exploration, then anneals to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL's practical applicability on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.
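The injection mechanism described in the abstract can be illustrated with a minimal sketch. The abstract states only that the injection probability starts high and anneals to zero; the linear schedule, the default values, and the names `injection_probability`, `build_prompt`, `p_start`, `anneal_steps`, and `k` are assumptions made here for illustration and are not taken from the paper.

```python
import random


def injection_probability(step, anneal_steps, p_start=0.5):
    """Hypothetical linear schedule: p_start at step 0, zero from anneal_steps on.

    The abstract does not specify the schedule shape; linear decay is an assumption.
    """
    remaining = max(0.0, 1.0 - step / max(1, anneal_steps))
    return p_start * remaining


def build_prompt(task_prompt, demonstrations, step, anneal_steps,
                 rng, p_start=0.5, k=2):
    """Stochastically prepend up to k few-shot demonstrations to a training prompt."""
    p = injection_probability(step, anneal_steps, p_start)
    if demonstrations and rng.random() < p:
        # Sample demonstrations without replacement and place them before the task.
        shots = rng.sample(demonstrations, min(k, len(demonstrations)))
        return "\n\n".join(shots + [task_prompt])
    # After annealing (p == 0) the model always sees the bare prompt.
    return task_prompt


if __name__ == "__main__":
    rng = random.Random(0)
    demos = ["Q: 1+1?\nA: 2", "Q: 2+3?\nA: 5", "Q: 4+4?\nA: 8"]
    for step in (0, 40, 80):
        print(step, injection_probability(step, anneal_steps=80))
        print(build_prompt("Q: 7+6?\nA:", demos, step, 80, rng))
```

In an RLVR loop, `build_prompt` would be called once per sampled training prompt before generating rollouts, while evaluation prompts would always be passed through unchanged, which matches the stated goal that the model must ultimately succeed without assistance.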


Related papers