Generalist Reward Models: Found Inside Large Language Models
- URL: http://arxiv.org/abs/2506.23235v1
- Date: Sun, 29 Jun 2025 13:45:54 GMT
- Title: Generalist Reward Models: Found Inside Large Language Models
- Authors: Yi-Chen Li, Tian Xu, Yang Yu, Xuqin Zhang, Xiong-Hui Chen, Zhongxiang Ling, Ningjing Chao, Lei Yuan, Zhi-Hua Zhou
- Abstract summary: We show that a powerful reward model is already latently present within any Large Language Model (LLM) trained via standard next-token prediction. We prove that this endogenous reward is not a heuristic, but is theoretically equivalent to a reward function learned through offline inverse reinforcement learning. We also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model.
- Score: 50.7432354447554
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The alignment of Large Language Models (LLMs) is critically dependent on reward models trained on costly human preference data. While recent work explores bypassing this cost with AI feedback, these methods often lack a rigorous theoretical foundation. In this paper, we discover that a powerful generalist reward model is already latently present within any LLM trained via standard next-token prediction. We prove that this endogenous reward is not a heuristic, but is theoretically equivalent to a reward function learned through offline inverse reinforcement learning. This connection allows us to directly elicit a high-quality reward signal from a base (pre-trained or supervised fine-tuned) model without any further training. Critically, we also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model. To the best of our knowledge, this is the first theoretical proof of the effectiveness of reinforcement learning for LLMs. Our experiments validate this theory, demonstrating that our method not only outperforms existing LLM-as-a-judge approaches but can also surpass explicitly trained reward models. These findings suggest that the reward modeling stage can be replaced by a principled method of eliciting the knowledge already captured during pre-training, heralding a more efficient, powerful, and scalable paradigm for the alignment of LLMs as well as multi-modal models.
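The practical claim, eliciting a reward signal directly from a base model without any further training, can be illustrated with a minimal sketch. The paper derives its endogenous reward from the model's logits via an inverse-RL argument; the snippet below only shows the simplest flavor of that idea, scoring a candidate response by the base model's average token log-probability using Hugging Face transformers. The model name, prompts, and the log-probability proxy are illustrative assumptions, not the paper's exact construction.

```python
# A minimal sketch (NOT the paper's exact construction): elicit a reward-like
# score from a frozen base LLM by measuring the average log-probability it
# assigns to a candidate response, with no reward-model training at all.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # illustrative; any causal LM checkpoint works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def endogenous_score(prompt: str, response: str) -> float:
    """Average log-probability of `response` given `prompt` under the base model,
    used here as a crude proxy for an endogenous reward."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits                      # (1, T, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (1, T-1)
    start = prompt_ids.shape[1] - 1  # approximate boundary: score only response tokens
    return token_lp[0, start:].mean().item()

# Rank two candidate answers to the same prompt by the elicited score.
prompt = "Q: What is the capital of France?\nA:"
print(endogenous_score(prompt, " Paris."), endogenous_score(prompt, " Berlin."))
```

Under this proxy, candidate responses can be ranked without any preference data, which is the workflow the abstract describes; the paper's actual reward additionally carries the inverse-RL guarantees summarized above.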
Related papers
- One Token to Fool LLM-as-a-Judge [31.421917676213415]
Generative reward models (also known as LLMs-as-judges) are increasingly adopted in reinforcement learning. We show that generative reward models exhibit surprising vulnerabilities to superficial manipulations. We introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness.
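As a rough illustration of the kind of augmentation that summary describes, the sketch below pairs each prompt with contentless "opener" responses labeled as rejected, so a judge trained on the augmented data learns not to reward superficial tokens. The specific strings and data format are assumptions, not the paper's.

```python
# Hedged sketch: generate extra rejected pairs from superficial non-answers.
SUPERFICIAL_RESPONSES = ["Solution:", "Thought process:", "Let's solve this step by step."]

def augment_with_superficial_negatives(example: dict) -> list[dict]:
    """Given a {prompt, chosen} example, emit extra preference pairs whose
    rejected 'responses' are contentless openers."""
    return [
        {"prompt": example["prompt"], "chosen": example["chosen"], "rejected": s}
        for s in SUPERFICIAL_RESPONSES
    ]

example = {"prompt": "Compute 17 * 24.", "chosen": "17 * 24 = 408."}
for pair in augment_with_superficial_negatives(example):
    print(pair)
```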
arXiv Detail & Related papers (2025-07-11T17:55:22Z)
- ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning [12.83211408922535]
Reinforcement learning-style post-training improves reasoning by optimizing model outputs based on reward or preference signals. GRPO-style approaches implement this by using self-generated samples labeled by an outcome-based verifier. We propose Self-Explanation Policy Optimization (ExPO), a simple and modular framework that generates such samples by conditioning on the ground-truth answer.
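A hedged sketch of that conditioning idea: the policy is prompted with the ground-truth answer and asked to produce an explanation, which is then reused as a positive training sample. The prompt template and the `generate` callable are hypothetical, not ExPO's exact interface.

```python
# Illustrative sketch of conditioning sample generation on the ground truth.
def build_self_explanation_prompt(question: str, ground_truth: str) -> str:
    return (
        f"Question: {question}\n"
        f"The correct answer is: {ground_truth}\n"
        "Explain, step by step, how to arrive at this answer:"
    )

def make_positive_sample(question: str, ground_truth: str, generate) -> dict:
    """`generate` is any callable str -> str (e.g. a wrapped LLM)."""
    explanation = generate(build_self_explanation_prompt(question, ground_truth))
    return {"prompt": f"Question: {question}", "response": explanation, "label": 1}
```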
arXiv Detail & Related papers (2025-07-03T17:44:55Z)
- Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective [6.069069082518759]
We study the Zero-Reward Assumption in reinforcement learning for large language models (LLMs). We show that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model. We propose a new algorithm: Token-Reinforced Policy Optimization (TRePO).
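The claim that a response-level reward suffices can be made concrete with a generic REINFORCE-style estimator in which the scalar reward weights the whole sequence log-probability. The PyTorch sketch below implements that generic estimator, not the TRePO algorithm itself; the mean-reward baseline in particular is an assumption.

```python
# Generic response-level policy-gradient loss:
#   loss = -E[(R - b) * sum_t log pi(y_t | x, y_<t)]
import torch

def response_level_pg_loss(token_logps: torch.Tensor,
                           response_mask: torch.Tensor,
                           rewards: torch.Tensor) -> torch.Tensor:
    """token_logps: (B, T) log pi(y_t | x, y_<t); response_mask: (B, T) with 1
    on response tokens; rewards: (B,) scalar response-level rewards."""
    seq_logp = (token_logps * response_mask).sum(dim=1)   # (B,)
    advantages = rewards - rewards.mean()                 # simple baseline (assumption)
    return -(advantages.detach() * seq_logp).mean()
```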
arXiv Detail & Related papers (2025-06-03T07:44:31Z)
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [54.4392552373835]
Reward models (RMs) are crucial for the training and inference-time scaling of large language models (LLMs). We propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals to provide reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference-time best-of-n search on real-world downstream tasks.
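A minimal sketch of the combination described above, assuming a scalar reward-model score and a set of boolean verifiers (e.g. unit tests or exact-match checks); the blending weight and the verifier interface are illustrative rather than the paper's design.

```python
# Hedged sketch: blend a learned preference score with verifiable checks.
from typing import Callable

def agentic_reward(prompt: str, response: str,
                   rm_score: Callable[[str, str], float],
                   verifiers: list[Callable[[str, str], bool]],
                   alpha: float = 0.5) -> float:
    """Weighted mix of a reward-model score and the fraction of verifiers passed."""
    preference = rm_score(prompt, response)
    correctness = sum(v(prompt, response) for v in verifiers) / max(len(verifiers), 1)
    return alpha * preference + (1 - alpha) * correctness
```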
arXiv Detail & Related papers (2025-02-26T17:19:12Z)
- Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model.
We show that the proposed algorithms converge to the stationary solutions of the IRL problem.
Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
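Schematically, one alternating step of such an IRL-style procedure might look like the sketch below: the reward is pushed to rank demonstrations above the policy's own samples, and the policy is then updated by REINFORCE against that reward. The losses are generic stand-ins, not the paper's exact objectives.

```python
# Generic alternating reward/policy step (stand-in, not the paper's algorithm).
import torch
import torch.nn.functional as F

def irl_alternating_step(demo_rewards: torch.Tensor,
                         sample_rewards: torch.Tensor,
                         sample_logps: torch.Tensor):
    """demo_rewards, sample_rewards, sample_logps: (B,) tensors.
    Returns (reward_loss, policy_loss) for one alternating update."""
    # (1) Reward step: rank demonstrations above current policy samples.
    reward_loss = -F.logsigmoid(demo_rewards - sample_rewards).mean()
    # (2) Policy step: REINFORCE on the current (frozen) reward estimates.
    policy_loss = -(sample_rewards.detach() * sample_logps).mean()
    return reward_loss, policy_loss
```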
arXiv Detail & Related papers (2024-05-28T07:11:05Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
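The evaluation implied by such trios is pairwise accuracy: how often the reward model scores the chosen response above the rejected one. The sketch below assumes a generic `score` callable and toy data; it is not the RewardBench codebase.

```python
# Minimal pairwise-accuracy sketch for evaluating a reward model.
def pairwise_accuracy(trios: list[dict], score) -> float:
    """`score` is any callable (prompt, response) -> float."""
    correct = sum(
        score(t["prompt"], t["chosen"]) > score(t["prompt"], t["rejected"])
        for t in trios
    )
    return correct / len(trios)

trios = [{"prompt": "2+2?", "chosen": "4", "rejected": "5"}]
# pairwise_accuracy(trios, my_reward_model)  # my_reward_model is hypothetical
```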
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use the reward model's attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
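A hedged sketch of that redistribution step, assuming a vector of attention mass over completion tokens is available from the reward model; the normalisation choice is an assumption, not the paper's exact recipe.

```python
# Hedged sketch: spread a scalar reward over tokens in proportion to attention.
import torch

def redistribute_reward(scalar_reward: float,
                        attention_to_tokens: torch.Tensor) -> torch.Tensor:
    """attention_to_tokens: (T,) non-negative attention mass on each completion
    token (e.g. from the reward model's scoring position). Returns (T,)
    per-token rewards that sum to the original scalar reward."""
    weights = attention_to_tokens / attention_to_tokens.sum().clamp_min(1e-8)
    return scalar_reward * weights
```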
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
- Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons [79.98542868281473]
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF).
We show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions.
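For background, the pairwise-comparison model underlying this framework is the standard Bradley-Terry formulation below; this is the usual RLHF reward-learning objective rather than the paper's exact notation, and the pessimistic variant roughly replaces the point estimate with a conservative choice over rewards consistent with the data.

$$P\bigl(y_w \succ y_l \mid x\bigr) = \sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr), \qquad \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \sum_{(x,\, y_w,\, y_l)} \log \sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr)$$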
arXiv Detail & Related papers (2023-01-26T18:07:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.