On the Power of (Approximate) Reward Models for Inference-Time Scaling
- URL: http://arxiv.org/abs/2602.01381v1
- Date: Sun, 01 Feb 2026 18:28:42 GMT
- Title: On the Power of (Approximate) Reward Models for Inference-Time Scaling
- Authors: Youheng Zhu, Yiping Lu,
- Abstract summary: Inference-time scaling has emerged as a powerful paradigm for improving the reasoning capability of large language models.<n>All deployed systems rely on approximate reward models, raising a fundamental question: Why and when do approximate reward models suffice for effective inference-time scaling?<n>We identify the Bellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling.
- Score: 3.540245474029962
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inference-time scaling has recently emerged as a powerful paradigm for improving the reasoning capability of large language models. Among various approaches, Sequential Monte Carlo (SMC) has become a particularly important framework, enabling iterative generation, evaluation, rejection, and resampling of intermediate reasoning trajectories. A central component in this process is the reward model, which evaluates partial solutions and guides the allocation of computation during inference. However, in practice, true reward models are never available. All deployed systems rely on approximate reward models, raising a fundamental question: Why and when do approximate reward models suffice for effective inference-time scaling? In this work, we provide a theoretical answer. We identify the Bellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling. For a reasoning process of length $T$, we show that if the Bellman error of the approximate reward model is bounded by $O(1/T)$, then combining this reward model with SMC reduces the computational complexity of reasoning from exponential in $T$ to polynomial in $T$. This yields an exponential improvement in inference efficiency despite using only approximate rewards.
Related papers
- Nonparametric Bayesian Optimization for General Rewards [4.696963700743491]
We propose the first BO algorithm that achieves no-regret guarantee in a general reward setting, requiring only Lipschitz continuity of the objective function.<n>We develop a new TS regret analysis framework for general rewards, which relates the regret to the total variation distance between the surrogate model and the true reward distribution.<n> Empirical results demonstrate state-of-the-art performance, particularly in settings with non-stationary, heavy-tailed, or other ill-conditioned rewards.
arXiv Detail & Related papers (2026-02-07T07:01:33Z) - Demystifying LLM-as-a-Judge: Analytically Tractable Model for Inference-Time Scaling [34.69440744042684]
We introduce an analytically tractable model of inference-time scaling.<n>We experimentally verify these facts in large language model inference with an additional large language model as a judge.
arXiv Detail & Related papers (2025-12-22T22:13:06Z) - Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models [57.49136894315871]
New paradigm of test-time scaling has yielded remarkable breakthroughs in reasoning models and generative vision models.<n>We propose one solution to the problem of integrating test-time scaling knowledge into a model during post-training.<n>We replace reward guided test-time noise optimization in diffusion models with a Noise Hypernetwork that modulates initial input noise.
arXiv Detail & Related papers (2025-08-13T17:33:37Z) - Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling [12.835376812101323]
We introduce the hypothesis that PRMs are also Partial Reward Models.<n>This allows for principled early rejection based on intermediate token-level signals.<n>On math reasoning benchmarks, our method achieves up to 1.4$times$-9$times$ reduction in inference FLOPs without degrading final performance.
arXiv Detail & Related papers (2025-08-04T00:58:56Z) - Intention-Conditioned Flow Occupancy Models [80.42634994902858]
Large-scale pre-training has fundamentally changed how machine learning research is done today.<n>Applying this same framework to reinforcement learning is appealing because it offers compelling avenues for addressing core challenges in RL.<n>Recent advances in generative AI have provided new tools for modeling highly complex distributions.
arXiv Detail & Related papers (2025-06-10T15:27:46Z) - RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution [50.171320156632866]
Reinforcement learning from human feedback offers a promising approach to aligning large language models with human preferences.<n>Current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence.<n>We propose a more fine-grained, token-level guidance approach for RL training.
arXiv Detail & Related papers (2024-11-13T02:45:21Z) - Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis [16.288866201806382]
We develop a model-free RLHF best policy identification algorithm, called $mathsfBSAD$, without explicit reward model inference.<n>The algorithm identifies the optimal policy directly from human preference information in a backward manner.
arXiv Detail & Related papers (2024-06-11T17:01:41Z) - Scalable Ensembling For Mitigating Reward Overoptimisation [24.58937616758007]
Reinforcement Learning from Human Feedback has enabled significant advancements within language modeling for powerful, instruction-following models.
The alignment of these models remains a pressing challenge as the policy tends to overfit the learned proxy" reward model past an inflection point of utility.
arXiv Detail & Related papers (2024-06-03T05:46:53Z) - RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z) - A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an em early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.