Fugu-MT Paper Translation (Abstract): Prompt-Dependent Ranking of Large Language Models with Uncertainty Quantification

Paper abstract: Prompt-Dependent Ranking of Large Language Models with Uncertainty Quantification

arxiv url: http://arxiv.org/abs/2603.03336v1
Date: Wed, 11 Feb 2026 18:16:24 GMT
Status: Translation complete
Last updated in system: 2026-03-09 01:20:08.166465
Title: Prompt-Dependent Ranking of Large Language Models with Uncertainty Quantification
Authors: Angel Rodrigo Avelar Menendez, Yufeng Liu, Xiaowu Dai
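Abstract summary: We study prompt-dependent ranking inference under pairwise human preferences. We develop a framework for decision-safe rankings with statistically valid uncertainty guarantees.
Reference score (independently computed attention score): 9.99813918008511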
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Rankings derived from pairwise comparisons are central to many economic and computational systems. In the context of large language models (LLMs), rankings are typically constructed from human preference data and presented as leaderboards that guide deployment decisions. However, existing approaches rely on point estimates, implicitly treating rankings as fixed objects despite substantial estimation noise and context-dependent performance variation. Acting on such rankings can lead to misallocation and welfare loss when apparent differences are not statistically meaningful. We study prompt-dependent ranking inference under pairwise human preferences and develop a framework for decision-safe rankings with statistically valid uncertainty guarantees. We model preferences using a contextual Bradley-Terry-Luce model in which the latent utility of each model depends on the input prompt. Rather than targeting point estimates of utilities, we directly conduct inference on induced rankings, constructing confidence sets based on simultaneous confidence intervals for pairwise utility differences. This approach yields statistically valid marginal and simultaneous confidence sets for prompt-specific ranks. Our framework connects recent advances in rank inference to contextual preference learning and provides tools for robust ranking-based decision-making. Empirically, using large-scale human preference data from LLM evaluations, we show that rankings vary substantially across prompt characteristics and that many apparent rank differences are not statistically distinguishable. We further demonstrate how uncertainty-aware rankings identify dominance only when supported by the data and otherwise return partial orders.
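The abstract's core idea, turning simultaneous confidence intervals for pairwise utility differences into confidence sets for ranks, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the linear form of the prompt-dependent utility, and the single common interval half-width are assumptions made for illustration only; in practice the half-width would come from a calibrated critical value, which is not shown here.

```python
import math
from typing import List, Sequence, Tuple


def contextual_utility(weights: Sequence[float], prompt_features: Sequence[float]) -> float:
    """Latent utility of one model for one prompt.

    Assumption: utility is linear in the prompt features. The abstract only
    says that utility depends on the prompt, not how.
    """
    return sum(w * x for w, x in zip(weights, prompt_features))


def btl_win_probability(theta_i: float, theta_j: float) -> float:
    """Bradley-Terry-Luce probability that model i is preferred over model j."""
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))


def intervals_from_estimates(
    theta_hat: Sequence[float], half_width: float
) -> Tuple[List[List[float]], List[List[float]]]:
    """Build interval bounds for every difference theta_i - theta_j.

    Assumption: one common half-width for all pairs, standing in for a
    simultaneous critical value times a standard error.
    """
    n = len(theta_hat)
    lower = [[theta_hat[i] - theta_hat[j] - half_width for j in range(n)] for i in range(n)]
    upper = [[theta_hat[i] - theta_hat[j] + half_width for j in range(n)] for i in range(n)]
    return lower, upper


def rank_confidence_sets(
    lower: List[List[float]], upper: List[List[float]]
) -> List[Tuple[int, int]]:
    """Return (best_rank, worst_rank) for each model, where rank 1 is best.

    Model i is confidently worse than j when the whole interval for
    theta_i - theta_j lies below zero, and confidently better when it lies
    above zero. Pairs whose interval contains zero stay unresolved, which is
    what widens the rank set.
    """
    n = len(lower)
    sets = []
    for i in range(n):
        worse_than = sum(1 for j in range(n) if j != i and upper[i][j] < 0)
        better_than = sum(1 for j in range(n) if j != i and lower[i][j] > 0)
        sets.append((1 + worse_than, n - better_than))
    return sets
```

For example, with estimated utilities `[1.0, 0.9, -0.5]` at a given prompt and a half-width of `0.3`, `rank_confidence_sets(*intervals_from_estimates([1.0, 0.9, -0.5], 0.3))` returns `[(1, 2), (1, 2), (3, 3)]`: the first two models cannot be separated, while the third is confidently last. This mirrors the abstract's point that dominance is reported only when the data support it and a partial order is returned otherwise.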


Related papers