Fugu-MT Paper Translation (Abstract): No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

Paper overview: No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

arxiv url: http://arxiv.org/abs/2509.21880v1
Date: Fri, 26 Sep 2025 05:03:54 GMT
Status: Translation complete
Last updated in system: 2025-09-29 20:57:54.192561
Title: No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping
Authors: Thanh-Long V. Le, Myeongho Jeon, Kim Vu, Viet Lai, Eunho Yang
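Abstract summary: We introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO.
Reference score (independently computed attention score): 35.34724727629745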
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model's responses to the same input differ in correctness, while ignoring those where all responses receive the same reward (so-called zero-variance prompts). In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
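As a rough illustration of why zero-variance prompts carry no signal under group-normalized advantages, the sketch below computes GRPO-style advantages for a group of responses and shows that they collapse to zero when every reward is identical. The second half is only a hypothetical stand-in for the idea of entropy-guided shaping: the function `zvp_token_advantages`, its `alpha` parameter, and the "sign times entropy" rule are assumptions made for this example and are not taken from the paper, whose abstract does not state the actual formula.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def zvp_token_advantages(rewards, token_entropies, alpha=0.1):
    """Hypothetical shaping for a zero-variance group (NOT the paper's formula).

    Every token gets a positive advantage if the whole group is correct and a
    negative one if the whole group is wrong, scaled by that token's entropy.
    """
    if len(set(rewards)) != 1:
        raise ValueError("expected a zero-variance group (identical rewards)")
    sign = 1.0 if rewards[0] > 0 else -1.0
    return [[sign * alpha * h for h in ents] for ents in token_entropies]


# All four responses are correct, so the group-normalized signal vanishes.
print(grpo_advantages([1, 1, 1, 1]))  # [0.0, 0.0, 0.0, 0.0]

# Assumed per-token entropies for two responses of different lengths.
entropies = [[0.0, token_entropy([0.5, 0.5])], [1.0]]
print(zvp_token_advantages([1, 1], entropies))
```

In this toy version, a token the model was already certain about (entropy 0) receives no update, while uncertain tokens receive a larger push, which is one plausible reading of "modulating feedback with token-level characteristics".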

Related papers