Fugu-MT paper translation (abstract): Exploiting Tree Structure for Credit Assignment in RL Training of LLMs

Paper overview: Exploiting Tree Structure for Credit Assignment in RL Training of LLMs

arxiv url: http://arxiv.org/abs/2509.18314v2
Date: Thu, 02 Oct 2025 20:38:22 GMT
Status: translation complete
Last updated in system: 2025-10-06 14:21:29.884376
Title: Exploiting Tree Structure for Credit Assignment in RL Training of LLMs
Title (reference translation): A Study of Tree Structure for Credit Assignment in RL Training of LLMs
Authors: Hieu Tran, Zonghai Yao, Hong Yu
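Abstract summary: Reinforcement learning improves reasoning, but sparse delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. We propose TEMPO (Tree-Estimated Mean Prefix Value for Policy Optimization).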
Reference score (independently computed attention score): 11.64053639889468
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Reinforcement learning improves LLM reasoning, yet sparse delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA align with this setup, where only a few decision tokens significantly impact the outcome. PPO offers token-level advantages with a learned value model, but it is complex to train both the actor and critic models simultaneously, and it is not easily generalizable, as the token-level values from the critic model can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but spreads a single sequence-level return across tokens and ignores branching. We introduce Prefix-to-Tree (P2T), a simple procedure that converts a group of responses into a prefix tree and computes nonparametric prefix values \(V(s)\) by aggregating descendant outcomes. Built on P2T, we propose TEMPO (Tree-Estimated Mean Prefix Value for Policy Optimization), a critic-free algorithm that augments the group-relative outcome signal of GRPO with branch-gated temporal-difference corrections derived from the tree. At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO reduces to GRPO; at branching tokens, it supplies precise token-level credit without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy with roughly the same wall-clock time.
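Abstract (reference translation): Reinforcement learning improves LLM reasoning, but sparse, delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, in which the final answer can be checked and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA fit this setting, where only a few decision tokens strongly affect the outcome. PPO provides token-level advantages through a learned value model, but training the actor and critic models at the same time is complex, and it does not generalize easily because the critic's token-level values can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but it spreads a single sequence-level return across all tokens and ignores branching. We introduce Prefix-to-Tree (P2T), a simple procedure that converts a group of responses into a prefix tree and computes nonparametric prefix values \(V(s)\) by aggregating descendant outcomes. Built on P2T, TEMPO (Tree-Estimated Mean Prefix Value for Policy Optimization) is a critic-free algorithm that augments GRPO's group-relative outcome signal with branch-gated temporal-difference corrections derived from the tree. At non-branch tokens the temporal-difference (TD) term is zero, so TEMPO reduces to GRPO; at branching tokens it supplies precise token-level credit without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy in roughly the same wall-clock time.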
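
The prefix-value idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `prefix_values` and `td_corrections`, the toy tokens, and the use of a flat dictionary keyed by prefix tuples in place of an explicit tree are all assumptions made for illustration. It assumes each response is a token list with one scalar verifiable reward, takes \(V(s)\) to be the mean reward of all sampled responses that share prefix \(s\), and takes the TD term for a token to be the change in \(V\) when that token is appended. How TEMPO scales this term and combines it with the GRPO group-relative signal is not stated in the abstract, so that step is left out.

```python
from collections import defaultdict


def prefix_values(responses, rewards):
    """Map each prefix (as a tuple) to the mean reward of responses sharing it.

    Hypothetical helper: a dict of prefixes stands in for the prefix tree.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for tokens, reward in zip(responses, rewards):
        # Every prefix of a response, including the empty one (the prompt root).
        for t in range(len(tokens) + 1):
            prefix = tuple(tokens[:t])
            total[prefix] += reward
            count[prefix] += 1
    return {prefix: total[prefix] / count[prefix] for prefix in total}


def td_corrections(responses, rewards):
    """Per-token V(prefix + token) - V(prefix) for every response.

    When a token does not sit at a branch, both prefixes cover the same set
    of responses, so the difference is zero.
    """
    values = prefix_values(responses, rewards)
    return [
        [
            values[tuple(tokens[: t + 1])] - values[tuple(tokens[:t])]
            for t in range(len(tokens))
        ]
        for tokens in responses
    ]


# Toy group: three responses to one prompt, only the first is correct.
responses = [["a", "b", "c"], ["a", "b", "d"], ["a", "e", "f"]]
rewards = [1.0, 0.0, 0.0]
td = td_corrections(responses, rewards)
```

In this toy group the shared first token "a" receives zero correction in every response, while the tokens where the responses diverge ("b" versus "e", then "c" versus "d") receive nonzero corrections, which matches the abstract's statement that the TD term vanishes away from branches.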


List of related papers