Fugu-MT Paper Translation (Abstract): Ranking Reasoning LLMs under Test-Time Scaling

Paper abstract: Ranking Reasoning LLMs under Test-Time Scaling

arxiv url: http://arxiv.org/abs/2603.10960v1
Date: Wed, 11 Mar 2026 16:47:41 GMT
Status: Translation complete
System update date: 2026-03-12 16:22:33.058452
Title: Ranking Reasoning LLMs under Test-Time Scaling
Title (reference translation): Ranking Reasoning LLMs under Test-Time Scaling
Authors: Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary
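Abstract summary: Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt. Scorio is a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods.
Reference score (independently computed attention score): 10.821119744235302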
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $τ_b = 0.93$--$0.95$), and $19$--$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $τ_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$--$52\%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.
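Abstract (reference translation): Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but how to rank models in this regime is still underexplored. The paper formalizes dense benchmark ranking under test-time scaling and introduces Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. For $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $τ_b = 0.93$--$0.95$), and $19$--$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $τ_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$--$52\%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high-budget and low-budget test-time scaling. Scorio is released as an open-source library at https://github.com/mohsenhariri/scorio.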
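The abstract compares rankings with Kendall's $τ_b$ and scores models with a Bayesian estimator under a uniform prior. The following is a minimal, hedged Python sketch of both ideas on synthetic data. It is not Scorio's API and not necessarily the paper's exact definition of $\mathrm{Bayes}_{\mathcal{U}}@N$: it assumes binary (correct/incorrect) outcomes, a Beta(1, 1) prior per question, and hypothetical model accuracies. The function names `kendall_tau_b` and `uniform_prior_score` are illustrative only.

```python
import math
import random
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (tie-corrected)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + tied_y_only
    untied_in_y = concordant + discordant + tied_x_only
    denom = math.sqrt(untied_in_x * untied_in_y)
    return (concordant - discordant) / denom if denom else float("nan")


def uniform_prior_score(trials_per_question):
    """Average over questions of the Beta(1, 1) posterior mean (s + 1) / (n + 2).

    Assumption: binary outcomes and an independent uniform prior per question.
    """
    per_question = [(sum(t) + 1) / (len(t) + 2) for t in trials_per_question]
    return sum(per_question) / len(per_question)


# Synthetic demo: five hypothetical models, 30 questions, 80 trials each.
rng = random.Random(0)
true_acc = [0.35, 0.45, 0.55, 0.65, 0.75]  # made-up accuracies, not paper data
n_questions, n_trials = 30, 80
outcomes = [
    [[int(rng.random() < p) for _ in range(n_trials)] for _ in range(n_questions)]
    for p in true_acc
]

full = [uniform_prior_score(m) for m in outcomes]                    # all 80 trials
single = [uniform_prior_score([q[:1] for q in m]) for m in outcomes]  # first trial only
print("tau_b (single trial vs. all trials):", kendall_tau_b(single, full))
```

In this simplified binary setting, when every model has the same number of trials per question, the uniform-prior score is a monotone function of the total number of correct answers, so it ranks models the same way as mean accuracy. The prior only changes the ordering once an informative prior (such as the greedy-decoding prior mentioned in the abstract) or unequal trial counts are involved.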


Related papers