Fugu-MT Paper Translation (Abstract): Demystifying LLM-as-a-Judge: Analytically Tractable Model for Inference-Time Scaling

Paper summary: Demystifying LLM-as-a-Judge: Analytically Tractable Model for Inference-Time Scaling

arxiv url: http://arxiv.org/abs/2512.19905v1
Date: Mon, 22 Dec 2025 22:13:06 GMT
Status: Translation complete
Last updated in system: 2025-12-24 19:17:49.676258
Title: Demystifying LLM-as-a-Judge: Analytically Tractable Model for Inference-Time Scaling
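Title (reference translation): Demystifying LLM-as-a-Judge: An Analytically Tractable Model for Inference-Time Scaling
Authors: Indranil Halder, Cengiz Pehlevan
Abstract summary: We introduce an analytically tractable model of inference-time scaling. We experimentally verify these facts in large language model inference with an additional large language model as a judge.
Reference score (independently computed attention score): 34.69440744042684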
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent developments in large language models have shown advantages in reallocating a notable share of computational resource from training time to inference time. However, the principles behind inference time scaling are not well understood. In this paper, we introduce an analytically tractable model of inference-time scaling: Bayesian linear regression with a reward-weighted sampler, where the reward is determined from a linear model, modeling LLM-as-a-judge scenario. We study this problem in the high-dimensional regime, where the deterministic equivalents dictate a closed-form expression for the posterior predictive mean and variance. We analyze the generalization error when training data are sampled from a teacher model. We draw $k$ inference-time samples and select via softmax at a temperature applied to a quadratic reward. When the reward is not too different from the teacher, the generalization error decreases monotonically with increasing inference time samples $k$. However, the specific reward that optimizes inference-time selection generally differs from the teacher. In contrast, substantial reward misspecification induces a finite optimal $k$ beyond which more sampling can increase the generalization error. For fixed $k$, there exists an optimal sampling temperature. We experimentally verify these facts in large language model inference with an additional large language model as a judge. In the "best-of-$k$" limit with the teacher as reward, we theoretically show that the generalization error decays as $Θ(1/k^2)$ and determine the leading coefficient via extreme value theory. These formulas delineate domains where scaling inference-time computation is provably preferable to collecting more data. Finally, we demonstrate that when task difficulty increases, the previously mentioned advantage of inference-time compute degrades.
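Abstract (reference translation): Recent developments in large language models have shown the benefit of reallocating a notable share of computational resources from training time to inference time. However, the principles behind inference-time scaling are not well understood. In this paper, we propose an analytically tractable model of inference-time scaling: Bayesian linear regression with a reward-weighted sampler, in which the reward is determined from a linear model, modeling the LLM-as-a-judge scenario. We study this problem in the high-dimensional regime, where deterministic equivalents give closed-form expressions for the posterior predictive mean and variance. We analyze the generalization error when the training data are sampled from a teacher model. We draw $k$ inference-time samples and select among them via a softmax at a given temperature applied to a quadratic reward. When the reward is not too different from the teacher, the generalization error decreases monotonically as the number of inference-time samples $k$ increases. However, the specific reward that optimizes inference-time selection generally differs from the teacher. In contrast, substantial reward misspecification induces a finite optimal $k$, beyond which more sampling can increase the generalization error. For fixed $k$, there exists an optimal sampling temperature. We experimentally verify these facts in large language model inference with an additional large language model as a judge. In the "best-of-$k$" limit with the teacher as the reward, we show theoretically that the generalization error decays as $Θ(1/k^2)$ and determine the leading coefficient via extreme value theory. These formulas delineate the domains where scaling inference-time computation is provably preferable to collecting more data. Finally, we show that as task difficulty increases, the aforementioned advantage of inference-time compute degrades.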
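
The selection procedure described in the abstract (draw $k$ samples, then pick one via a softmax over a quadratic reward) can be illustrated with a small Monte Carlo sketch. This is not the authors' code or their analytical result. It assumes a unit Gaussian prior, a known noise variance, samples drawn from the posterior predictive including noise, and a reward of the form $-(y - x^\top w_R)^2$. All function names and parameter values (`posterior`, `reward_weighted_sample`, `generalization_error`, `noise_var`, the dimensions) are hypothetical choices made for illustration.

```python
import numpy as np


def posterior(X, y, noise_var=0.25, prior_var=1.0):
    """Gaussian posterior over weights for Bayesian linear regression."""
    d = X.shape[1]
    precision = np.eye(d) / prior_var + X.T @ X / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return mean, cov


def reward_weighted_sample(rng, mu, var, target, k, temperature):
    """Draw k predictive samples and select one by softmax over a quadratic reward."""
    samples = rng.normal(mu, np.sqrt(var), size=k)
    rewards = -(samples - target) ** 2      # assumed quadratic reward
    logits = rewards / temperature
    logits -= logits.max()                  # stabilise the softmax numerically
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(samples, p=probs)


def generalization_error(k, temperature, w_reward, w_teacher,
                         mean, cov, noise_var, rng, n_test=2000):
    """Monte Carlo estimate of the squared error against the noiseless teacher."""
    d = w_teacher.size
    errs = np.empty(n_test)
    for i in range(n_test):
        x = rng.normal(size=d) / np.sqrt(d)
        mu = x @ mean                        # posterior predictive mean
        var = x @ cov @ x + noise_var        # posterior predictive variance
        y_hat = reward_weighted_sample(rng, mu, var, x @ w_reward, k, temperature)
        errs[i] = (y_hat - x @ w_teacher) ** 2
    return errs.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n, noise_var = 20, 40, 0.25
    w_teacher = rng.normal(size=d)
    X = rng.normal(size=(n, d)) / np.sqrt(d)
    y = X @ w_teacher + rng.normal(scale=np.sqrt(noise_var), size=n)
    mean, cov = posterior(X, y, noise_var)

    # Reward equal to the teacher versus a perturbed (misspecified) reward.
    w_bad = w_teacher + 2.0 * rng.normal(size=d)
    for k in (1, 4, 16, 64):
        good = generalization_error(k, 0.05, w_teacher, w_teacher, mean, cov, noise_var, rng)
        bad = generalization_error(k, 0.05, w_bad, w_teacher, mean, cov, noise_var, rng)
        print(f"k={k:3d}  teacher reward: {good:.4f}  misspecified reward: {bad:.4f}")
```

Running the script prints the estimated error for several values of $k$ under both rewards, which gives a rough way to probe the qualitative claims about reward misspecification and sampling temperature. The numbers depend on the assumptions above and are not expected to reproduce the paper's closed-form expressions.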

Related papers