Fugu-MT Paper Translation (Summary): Pass@k Metric for RLVR: A Diagnostic Tool of Exploration, But Not an Objective

Paper summary: Pass@k Metric for RLVR: A Diagnostic Tool of Exploration, But Not an Objective

arxiv url: http://arxiv.org/abs/2511.16231v1
Date: Thu, 20 Nov 2025 10:58:21 GMT
Status: Translation complete
System update date: 2025-11-21 17:08:52.585311
Title: Pass@k Metric for RLVR: A Diagnostic Tool of Exploration, But Not an Objective
Authors: Yang Yu
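Abstract summary: We analyze the pass@k metric, which measures the probability of obtaining at least one correct solution in k independent samples. Our analysis shows that the pass@k objective provides a vanishing learning signal in the regimes where exploration is most critical. We conclude that while pass@k is a useful diagnostic tool, it may be unsuitable as a direct optimization objective.
Reference score (attention score computed independently by this site): 3.79187263097166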
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The ability of Large Language Models (LLMs) to perform complex, multi-step reasoning is a central focus of modern AI research. To evaluate and enhance this capability, the pass@k metric, which measures the probability of obtaining at least one correct solution in k independent samples, has received significant attention. Its intuitive appeal has led to its adoption not only as an evaluation standard but also as a direct optimization objective in reinforcement learning. In this paper, we analyze the pass@k objective, derive its gradient, and demonstrate that it is fundamentally a per-example positive reweighting of the simpler pass@1 objective. Our analysis reveals that the pass@k objective provides a vanishing learning signal in regimes where exploration is most critical. We further analyze the dynamics of "exploration collapse", showing that as the policy concentrates probability mass, the gap between pass@k and pass@1 diminishes. We conclude that while pass@k is a useful diagnostic tool, it may be an unsuitable direct objective for optimization. Instead, mechanisms explicitly encouraging efficient exploration could offer a more effective path forward for reinforcement learning in reasoning tasks.
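The relationship between pass@k and pass@1 described in the abstract can be illustrated with a small sketch. It assumes a single example whose k samples succeed independently with the same probability p (so pass@1 = p and pass@k = 1 - (1 - p)^k); under that assumption the chain rule gives d pass@k / d p = k (1 - p)^(k - 1), a non-negative factor multiplying the pass@1 gradient. The function names are hypothetical, and this is derived from the definition quoted above, not taken from the paper's own derivation.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent samples,
    each succeeding with probability p (assumed i.i.d.)."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_weight(p: float, k: int) -> float:
    """Derivative of pass@k with respect to p: k * (1 - p)^(k - 1).

    By the chain rule, grad(pass@k) = weight * grad(pass@1), so this is
    the per-example factor that rescales the pass@1 gradient.
    """
    return k * (1.0 - p) ** (k - 1)


if __name__ == "__main__":
    k = 8  # illustrative choice, not a value from the paper
    for p in (0.01, 0.1, 0.5, 0.9, 0.99):
        gap = pass_at_k(p, k) - p  # pass@k minus pass@1
        weight = pass_at_k_weight(p, k)
        print(f"p={p:.2f}  pass@{k}={pass_at_k(p, k):.4f}  "
              f"gap={gap:.4f}  weight={weight:.6f}")
```

Running the script prints, for each p, the gap between pass@k and pass@1 alongside the reweighting factor, which makes it easy to see how both shrink as p approaches 1, that is, as the policy concentrates on solutions it already finds.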


Related papers