Fugu-MT Paper Translation (Abstract): Don't Pass$\mathtt{@}k$: A Bayesian Framework for Large Language Model Evaluation

Paper summary: Don't Pass$\mathtt{@}k$: A Bayesian Framework for Large Language Model Evaluation

arxiv url: http://arxiv.org/abs/2510.04265v1
Date: Sun, 05 Oct 2025 16:14:03 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.548234
Title: Don't Pass$\mathtt{@}k$: A Bayesian Framework for Large Language Model Evaluation
Title (reference translation): Don't Pass$\mathtt{@}k$: A Bayesian Framework for Large Language Model Evaluation
Authors: Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary
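Abstract summary: Pass$@k$ is widely used to report LLM reasoning performance, but it often yields unstable, misleading rankings. This paper proposes a principled Bayesian evaluation framework that replaces Pass$@k$ with posterior estimates of a model's underlying success probability and credible intervals.
Reference score (independently computed attention score): 4.082208996639461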
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Pass$@k$ is widely used to report performance for LLM reasoning, but it often yields unstable, misleading rankings, especially when the number of trials (samples) is limited and compute is constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Code is available at https://mohsenhariri.github.io/bayes-kit
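Abstract (reference translation): Pass$@k$ is widely used to report LLM reasoning performance, but it is often unstable and misleading, especially when the number of trials (samples) is limited and compute is constrained. This paper proposes a Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's success probability and credible intervals. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, which gives closed-form expressions for the posterior mean and uncertainty of a weighted rubric and allows prior evidence to be used when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$). In simulations and on AIME'24/'25, HMMT'25, and BrUMO'25, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies whether an observed gap is statistically meaningful (non-overlapping credible intervals) or noise, and it extends naturally to graded, rubric-based evaluations. These results recommend replacing Pass$@k$ for LLM evaluation with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Code is available at https://mohsenhariri.github.io/bayes-kit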
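The closed-form posterior described in the abstract can be illustrated with a minimal sketch. It is not the authors' bayes-kit code: the function names `rubric_posterior` and `credible_interval` are hypothetical, and the interval uses a normal approximation chosen here for simplicity, which may differ from the paper's procedure. The sketch relies only on standard Dirichlet moments: with counts $n_c$ and prior $\alpha_c$, the posterior is Dirichlet($\alpha + n$), and a weighted rubric $\sum_c w_c p_c$ has a closed-form mean and variance.

```python
import math


def rubric_posterior(counts, weights, prior=None):
    """Posterior mean and standard deviation of sum_c weights[c] * p[c],
    where p ~ Dirichlet(prior + counts).

    counts  : number of trials observed in each outcome category
    weights : rubric score assigned to each category (e.g. [0, 1] for binary)
    prior   : Dirichlet concentration per category; uniform (all ones)
              if omitted (an assumption of this sketch)
    """
    if prior is None:
        prior = [1.0] * len(counts)
    a = [c + p for c, p in zip(counts, prior)]   # posterior concentrations
    total = sum(a)
    m = [x / total for x in a]                   # posterior mean of each p[c]
    mean = sum(w * mi for w, mi in zip(weights, m))
    second = sum(w * w * mi for w, mi in zip(weights, m))
    # Var(w.p) = (sum w^2 m - (sum w m)^2) / (total + 1) for a Dirichlet
    var = (second - mean * mean) / (total + 1.0)
    return mean, math.sqrt(max(var, 0.0))


def credible_interval(mean, sd, z=1.96):
    """Approximate 95% interval via a normal approximation (assumption)."""
    return mean - z * sd, mean + z * sd


# Binary case: 7 correct out of 10 trials, categories = [wrong, correct].
mean_a, sd_a = rubric_posterior([3, 7], [0.0, 1.0])
print("binary:", mean_a, credible_interval(mean_a, sd_a))

# Graded case: categories = [wrong, partial, correct] scored 0, 0.5, 1.
mean_g, sd_g = rubric_posterior([2, 3, 5], [0.0, 0.5, 1.0])
print("graded:", mean_g, credible_interval(mean_g, sd_g))
```

In the binary case with a uniform prior, the posterior mean is $(s + 1)/(N + 2)$ for $s$ successes in $N$ trials. For fixed $N$ this increases strictly with $s$, which is one way to see the order-equivalence with average accuracy that the abstract states.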

Related papers