Fugu-MT Paper Translation (Abstract): Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries

Paper summary: Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries

arxiv url: http://arxiv.org/abs/2510.08325v1
Date: Thu, 09 Oct 2025 15:14:58 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:15.157774
Title: Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries
Title (reference translation): Beyond Pass@k - Breadth-Depth Metrics for Reasoning Boundaries
Authors: Marius Dragoi, Ioana Pintilie, Florin Gogianu, Florin Brad
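Abstract summary: We propose Cover@tau, which measures the fraction of problems for which at least a tau proportion of a model's completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold. We evaluate several RLVR models using Cover@tau-based metrics and show how the relative rankings of algorithms change compared to Pass@1.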
Reference score (independently computed attention score): 2.9807229517491827
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve) researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the increasingly higher chance of success in the limit of the number of trials rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems that a model can solve for which at least a tau proportion of completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.
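Abstract (reference translation): Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for improving large language models on reasoning tasks such as coding, math, and logic. To assess the reasoning boundary (the fraction of problems a model can solve), researchers often report Pass@k at large sampling budgets. RLVR models outperform the base model at small k, but the base model usually outperforms them when a very large number of completions is sampled. This has been interpreted as evidence that base models have a larger reasoning boundary. On tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the growing chance of success as the number of trials increases rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems for which at least a tau proportion of a model's completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold. We evaluate several RLVR models using Cover@tau-based metrics and show how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.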
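
The contrast the abstract draws between Pass@k and Cover@tau can be illustrated with a minimal sketch. It assumes that each problem is sampled n times and that we only know how many of those completions were correct. The Pass@k function uses the widely used unbiased estimator 1 - C(n-c, k)/C(n, k); whether the paper uses this exact estimator is an assumption. The Cover@tau function is a plain empirical reading of the abstract's definition (the share of problems whose fraction of correct completions is at least tau) and may differ from the paper's precise formulation. The function names and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem Pass@k from n samples with c correct.

    Uses the common unbiased estimator 1 - C(n-c, k) / C(n, k)
    (assumed here; not confirmed by the abstract).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average per-problem Pass@k over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


def cover_at_tau(correct_counts: list[int], n: int, tau: float) -> float:
    """Fraction of problems whose empirical success rate c/n is at least tau.

    A straightforward empirical reading of the abstract's definition
    (an assumption, not the paper's reference implementation).
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must be in (0, 1]")
    covered = sum(1 for c in correct_counts if c / n >= tau)
    return covered / len(correct_counts)


# Hypothetical benchmark: 4 problems, 100 samples each.
# Problem 3 is solved once in 100 tries, which could be a lucky guess.
counts = [100, 60, 1, 0]
n = 100

print(mean_pass_at_k(counts, n, k=100))  # 0.75: the single lucky hit counts fully
print(cover_at_tau(counts, n, tau=0.5))  # 0.5: only reliably solved problems count
print(cover_at_tau(counts, n, tau=1.0))  # 0.25: only the always-solved problem counts
```

In this toy data, the problem solved once in 100 attempts contributes fully to Pass@100 but drops out of Cover@tau as soon as tau exceeds 0.01, which mirrors the abstract's claim that models relying on guessing degrade rapidly as tau increases.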


Related papers