Demystifying LLM-as-a-Judge: Analytically Tractable Model for Inference-Time Scaling
- URL: http://arxiv.org/abs/2512.19905v1
- Date: Mon, 22 Dec 2025 22:13:06 GMT
- Title: Demystifying LLM-as-a-Judge: Analytically Tractable Model for Inference-Time Scaling
- Authors: Indranil Halder, Cengiz Pehlevan
- Abstract summary: We introduce an analytically tractable model of inference-time scaling. We experimentally verify these facts in large language model inference with an additional large language model as a judge.
- Score: 34.69440744042684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent developments in large language models have shown advantages in reallocating a notable share of computational resources from training time to inference time. However, the principles behind inference-time scaling are not well understood. In this paper, we introduce an analytically tractable model of inference-time scaling: Bayesian linear regression with a reward-weighted sampler, where the reward is determined from a linear model, modeling the LLM-as-a-judge scenario. We study this problem in the high-dimensional regime, where the deterministic equivalents dictate a closed-form expression for the posterior predictive mean and variance. We analyze the generalization error when training data are sampled from a teacher model. We draw $k$ inference-time samples and select via a softmax at a temperature applied to a quadratic reward. When the reward is not too different from the teacher, the generalization error decreases monotonically with increasing inference-time samples $k$. However, the specific reward that optimizes inference-time selection generally differs from the teacher. In contrast, substantial reward misspecification induces a finite optimal $k$ beyond which more sampling can increase the generalization error. For fixed $k$, there exists an optimal sampling temperature. We experimentally verify these facts in large language model inference with an additional large language model as a judge. In the "best-of-$k$" limit with the teacher as reward, we theoretically show that the generalization error decays as $\Theta(1/k^2)$ and determine the leading coefficient via extreme value theory. These formulas delineate domains where scaling inference-time computation is provably preferable to collecting more data. Finally, we demonstrate that when task difficulty increases, the previously mentioned advantage of inference-time compute degrades.
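As a hedged illustration of the selection step the abstract describes (not the authors' code), the following sketch draws $k$ candidate predictions, scores each with a quadratic reward from a judge whose weights may be misspecified, and selects via a softmax at an inverse temperature `beta`. All dimensions, noise scales, and the toy judge are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_select(candidates, reward, beta, rng):
    """Pick one candidate with probability proportional to exp(beta * reward)."""
    scores = np.array([reward(c) for c in candidates])
    scores = scores - scores.max()          # shift for numerical stability
    probs = np.exp(beta * scores)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Toy teacher model and a single test input (illustrative sizes).
d, k, beta = 10, 32, 5.0
w_star = rng.normal(size=d)                 # teacher weights
x = rng.normal(size=d)
y_true = w_star @ x

# k inference-time samples of the prediction (stand-ins for posterior predictive draws).
candidates = y_true + rng.normal(scale=1.0, size=k)

# Quadratic reward from a judge whose weights are slightly misspecified.
w_judge = w_star + 0.1 * rng.normal(size=d)
target = w_judge @ x
reward = lambda y: -(y - target) ** 2

y_hat = softmax_select(candidates, reward, beta, rng)
# Taking beta -> infinity recovers the "best-of-k" selection rule under the judge's reward.
```

At very large `beta` the softmax concentrates on the argmax of the reward, which is the "best-of-$k$" limit studied in the abstract.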
Related papers
- On the Power of (Approximate) Reward Models for Inference-Time Scaling [3.540245474029962]
Inference-time scaling has emerged as a powerful paradigm for improving the reasoning capability of large language models. All deployed systems rely on approximate reward models, raising a fundamental question: why and when do approximate reward models suffice for effective inference-time scaling? We identify the Bellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling.
arXiv Detail & Related papers (2026-02-01T18:28:42Z) - Learning Shrinks the Hard Tail: Training-Dependent Inference Scaling in a Solvable Linear Model [2.7074235008521246]
We analyze neural scaling laws in a solvable model of last-layer fine-tuning where targets have intrinsic, instance-heterogeneous difficulty. We show that learning shrinks the "hard tail" of the error distribution.
arXiv Detail & Related papers (2026-01-07T10:00:17Z) - Compute-Optimal LLMs Provably Generalize Better With Scale [102.29926217670926]
We develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. We produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
arXiv Detail & Related papers (2025-04-21T16:26:56Z) - Leveraging Sparsity for Sample-Efficient Preference Learning: A Theoretical Perspective [16.610925506252716]
The minimax optimal estimation error rate $\Theta(d/n)$ in classical estimation theory requires that the number of samples $n$ scale linearly with the dimensionality of the feature space $d$. The high dimensionality of the feature space and the high cost of collecting human-annotated data challenge the efficiency of traditional estimation methods. We show that under the sparse random utility model, where the parameter of the reward function is $k$-sparse, the minimax optimal rate can be reduced to $\Theta(k \log(d/k)/n)$.
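To make the sample-complexity gap above concrete, the dense rate $d/n$ and the sparse rate $k \log(d/k)/n$ can be compared numerically. This is a hedged illustration only: $\Theta(\cdot)$ is treated as equality and all constants are suppressed; the specific values of $d$, $k$, and $n$ are assumptions.

```python
from math import log

def dense_rate(d, n):
    # Classical minimax estimation rate, Theta(d / n), constants suppressed.
    return d / n

def sparse_rate(k, d, n):
    # Rate under a k-sparse reward parameter, Theta(k log(d/k) / n).
    return k * log(d / k) / n

# With d = 10_000 features, k = 10 active coordinates, n = 1_000 samples,
# sparsity cuts the rate by orders of magnitude.
r_dense = dense_rate(10_000, 1_000)         # 10.0
r_sparse = sparse_rate(10, 10_000, 1_000)   # ~0.069
```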
arXiv Detail & Related papers (2025-01-30T11:41:13Z) - Lifted Coefficient of Determination: Fast model-free prediction intervals and likelihood-free model comparison [0.0]
We derive model-free prediction intervals that become tighter as the correlation between predictions and observations increases.
These intervals motivate the Lifted Coefficient of Determination, a model comparison criterion for arbitrary loss functions.
We extend the prediction intervals to more general error distributions, and propose a fast model-free outlier detection algorithm for regression.
arXiv Detail & Related papers (2024-10-11T16:27:31Z) - Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite-dimensional linear regression setup. We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$, where $M$ is the model size and $N$ the number of samples. Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
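For intuition, the two-term error scaling quoted above can be evaluated numerically. The sketch below is illustrative only: the exponent $a$ and the unit prefactors are assumptions, not values from the paper.

```python
# Hedged sketch of the reducible test-error scaling Theta(M^{-(a-1)} + N^{-(a-1)/a});
# the exponent a > 1 and the unit prefactors are illustrative assumptions.
def reducible_error(M, N, a=2.0):
    # First term: model-size bottleneck; second term: data bottleneck.
    return M ** -(a - 1) + N ** -((a - 1) / a)

e_small = reducible_error(M=100, N=100)       # 100**-1 + 100**-0.5 = 0.11
e_big = reducible_error(M=10_000, N=10_000)   # both terms shrink with scale
```

Growing either $M$ or $N$ alone only shrinks one term, which is why compute-optimal training balances the two.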
arXiv Detail & Related papers (2024-06-12T17:53:29Z) - Amortizing intractable inference in diffusion models for vision, language, and control [89.65631572949702]
This paper studies amortized sampling of the posterior over data, $\mathbf{x} \sim p^{\rm post}(\mathbf{x}) \propto p(\mathbf{x}) r(\mathbf{x})$, in a model that consists of a diffusion generative model prior $p(\mathbf{x})$ and a black-box constraint or function $r(\mathbf{x})$. We prove the correctness of a data-free learning objective, relative trajectory balance, for training a diffusion model that samples from this posterior.
arXiv Detail & Related papers (2024-05-31T16:18:46Z) - Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions. We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance. Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z) - Scaling and renormalization in high-dimensional regression [72.59731158970894]
We present a unifying perspective on recent results on ridge regression. We use the basic tools of random matrix theory and free probability, aimed at readers with backgrounds in physics and deep learning. Our results extend and provide a unifying perspective on earlier models of scaling laws.
arXiv Detail & Related papers (2024-05-01T15:59:00Z) - A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z) - Precise Learning Curves and Higher-Order Scaling Limits for Dot Product Kernel Regression [41.48538038768993]
We focus on the problem of kernel ridge regression for dot-product kernels.
We observe a peak in the learning curve whenever $m \approx d^r/r!$ for any integer $r$, leading to multiple sample-wise descents and nontrivial behavior at multiple scales.
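As a quick illustration of where those learning-curve peaks sit, the thresholds $m \approx d^r/r!$ can be tabulated. The choice $d = 100$ and the integer truncation are illustrative assumptions.

```python
from math import factorial

def peak_locations(d, r_max):
    """Approximate sample sizes m ~ d**r / r! where learning-curve peaks occur."""
    return [d ** r // factorial(r) for r in range(1, r_max + 1)]

# For d = 100 the first few peaks sit near m = 100, 5000, 166666.
peaks = peak_locations(100, 3)
```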
arXiv Detail & Related papers (2022-05-30T04:21:31Z) - Optimization Variance: Exploring Generalization Properties of DNNs [83.78477167211315]
The test error of a deep neural network (DNN) often demonstrates double descent.
We propose a novel metric, optimization variance (OV), to measure the diversity of model updates.
arXiv Detail & Related papers (2021-06-03T09:34:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.