Fugu-MT Paper Translation (Abstract): ExpRL: Exploratory RL for LLM Mid-Training

Paper abstract: ExpRL: Exploratory RL for LLM Mid-Training

arxiv url: http://arxiv.org/abs/2606.17024v1
Date: Mon, 15 Jun 2026 17:50:44 GMT
Status: Translation complete
Last updated in system: 2026-06-16 18:36:05.127531
Title: ExpRL: Exploratory RL for LLM Mid-Training
Title (reference translation): ExpRL: Exploratory RL for LLM Mid-Training
Authors: Violet Xiang, Amrith Setlur, Chase Blagden, Nick Haber, Aviral Kumar
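Abstract summary: Sparse-reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning. We study a more automated approach: RL-based mid-training using large corpora of human-written question-answer data. References are hidden from the policy and used only to construct problem-specific grading rubrics.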
Reference score (independently computed attention measure): 40.4311030968937
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparse reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning, but its success depends critically on the coverage present in the base model. In practice, models are often primed for RL through mid-training on curated reasoning traces that teach useful primitive skills such as decomposition, verification, or self-correction. Although effective, this strategy requires manually specifying what the model should learn, and it remains unclear whether such primitive coverage is enough for much harder problems, which require combining these skills into broader solution strategies. We study a more automated approach: RL-based mid-training using large corpora of human-written question-answer data. Rather than treating reference solutions as targets to imitate, our method, ExpRL, uses them as reward scaffolds: references are hidden from the policy and used only to construct problem-specific grading rubrics for judging on-policy reasoning traces. The policy samples from the original problem prompt, while an LLM judge compares the sampled reasoning trace against the reference solution and assigns outcome-level or process-level dense rewards. This lets ExpRL reinforce partial progress, useful intermediate reductions, and productive reasoning behaviors that sparse final-answer rewards often fail to upweight. On challenging math reasoning tasks, ExpRL yields stronger RL priming than SFT, sparse-reward GRPO, and self-distillation, and provides a better initialization for subsequent sparse-reward RL. Additional mixed-domain experiments further suggest that ExpRL can extend beyond the original math-only setting.
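Abstract (reference translation): Sparse-reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning, but its success depends heavily on the coverage present in the base model. In practice, models are often primed for RL through mid-training on curated reasoning traces that teach useful primitive skills such as decomposition, verification, and self-correction. Although effective, this strategy requires manually specifying what the model should learn, and it is unclear whether such primitive coverage is sufficient for harder problems. We study a more automated approach: RL-based mid-training using large corpora of human-written question-answer data. Rather than treating reference solutions as targets to imitate, ExpRL uses them as reward scaffolds. An LLM judge compares the sampled reasoning trace against the reference solution and assigns outcome-level or process-level dense rewards. This lets ExpRL reinforce partial progress, useful intermediate reductions, and productive reasoning behaviors that sparse final-answer rewards often fail to upweight. On challenging math reasoning tasks, ExpRL yields stronger RL priming than SFT, sparse-reward GRPO, and self-distillation, and provides a better initialization for subsequent sparse-reward RL. Additional mixed-domain experiments further suggest that ExpRL can extend beyond the original math-only setting.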
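
The contrast the abstract draws between sparse final-answer rewards and rubric-based dense rewards can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `RubricItem` structure, the weights, the example traces, and especially `keyword_judge`, a toy substring check standing in for the LLM judge. The abstract does not specify the paper's actual rubric format or judge prompts, so this should not be read as the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RubricItem:
    """One hypothetical grading criterion derived from a reference solution."""
    description: str  # what the judge should look for in the trace
    weight: float     # relative importance of this criterion


def sparse_reward(final_answer: str, reference_answer: str) -> float:
    """Final-answer reward: 1.0 only when the answers match exactly."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def rubric_reward(trace: str, rubric: List[RubricItem],
                  judge: Callable[[str, RubricItem], bool]) -> float:
    """Dense reward: weighted fraction of rubric items the judge marks as met."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if judge(trace, item))
    return earned / total


def keyword_judge(trace: str, item: RubricItem) -> bool:
    """Toy stand-in for an LLM judge (assumption): case-insensitive substring match."""
    return item.description.lower() in trace.lower()


# Hypothetical rubric built from a reference solution; the policy never sees it.
RUBRIC = [
    RubricItem("complete the square", 1.0),
    RubricItem("x = 3", 1.0),
]
REFERENCE_ANSWER = "x = 3"

# (reasoning trace, final answer) pairs sampled from the policy (made up here).
SAMPLES = [
    ("I will complete the square, but then I make an arithmetic slip.", "x = 5"),
    ("Guessing without any work.", "x = 7"),
    ("Complete the square to get (x - 3)^2 = 0, so x = 3.", "x = 3"),
]

dense = [rubric_reward(trace, RUBRIC, keyword_judge) for trace, _ in SAMPLES]
sparse = [sparse_reward(answer, REFERENCE_ANSWER) for _, answer in SAMPLES]

for (trace, _), d, s in zip(SAMPLES, dense, sparse):
    print(f"dense={d:.2f} sparse={s:.2f} | {trace}")
```

In this toy setup the first trace earns partial credit under the rubric reward but nothing under the sparse reward, which is the kind of partial progress the abstract says sparse final-answer rewards often fail to upweight.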


Related papers