Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective
- URL: http://arxiv.org/abs/2506.02553v1
- Date: Tue, 03 Jun 2025 07:44:31 GMT
- Title: Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective
- Authors: Shenghua He, Tian Xia, Xuan Zhou, Hui Wei,
- Abstract summary: We study the Zero-Reward Assumption in reinforcement learning for large language models (LLMs)<n>We show that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model.<n>We propose a new algorithm: Token-Reinforced Policy Optimization (TRePO)
- Score: 6.069069082518759
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study a common challenge in reinforcement learning for large language models (LLMs): the Zero-Reward Assumption, where non-terminal actions (i.e., intermediate token generations) receive zero task-specific immediate reward, while only the final token receives a reward for the entire response. This assumption arises frequently in practice, as precise token-level rewards are often difficult or infeasible to obtain in LLM applications. In this work, we provide a unifying theoretical perspective. We introduce the Trajectory Policy Gradient Theorem, which shows that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model, regardless of whether the Zero-Reward Assumption holds or not, for algorithms in the REINFORCE and Actor-Critic families. This result reveals that widely used methods such as PPO, GRPO, ReMax, and RLOO inherently possess the capacity to model token-level reward signals, offering a theoretical justification for response-level reward approaches. Our findings pave the way for more practical, efficient LLM fine-tuning, allowing developers to treat training algorithms as black boxes and focus on improving the response-level reward model with auxiliary sub-models. We also offer a detailed analysis of popular RL and non-RL methods, comparing their theoretical foundations and practical advantages across common LLM tasks. Finally, we propose a new algorithm: Token-Reinforced Policy Optimization (TRePO), a theoretically grounded method that is simpler than PPO, matches GRPO in memory efficiency, and holds promise for broad applicability.
Related papers
- The Peril of Preference: Why GRPO fails on Ordinal Rewards [0.8937905773981699]
We introduce Correctness Relative Policy Optimization (CoRPO), a new formulation that solves this flaw.<n>CoRPO uses an adaptive baseline that enforces a minimum quality threshold, ensuring failed solutions are never positively reinforced.<n>We empirically validate CoRPO on a code verification task, where it demonstrates more stable convergence and better out-of-domain generalization.
arXiv Detail & Related papers (2025-11-06T15:12:50Z) - Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs)<n>We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference.<n>We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z) - Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents [28.145430029174577]
Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments.<n>Existing approaches typically rely on outcome-based rewards that are only provided at the final answer.<n>In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training.
arXiv Detail & Related papers (2025-10-16T17:59:32Z) - CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment [39.965170904699974]
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback.<n>Current RLVR methods treat whole responses as single actions, assigning the same reward to every token.<n>This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure.
arXiv Detail & Related papers (2025-08-04T11:06:08Z) - Revisiting LLM Reasoning via Information Bottleneck [57.519119962528166]
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR)<n>We present a theoretical characterization of LLM reasoning grounded in information bottleneck (IB) principle.<n>We propose IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable.
arXiv Detail & Related papers (2025-07-24T13:14:25Z) - Generalist Reward Models: Found Inside Large Language Models [50.7432354447554]
We show that a powerful reward model is already latently present within any Large Language Models (LLMs) trained via standard next-token prediction.<n>We prove that this endogenous reward is not a reward function learned through offline inverse reinforcement learning.<n>We also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model.
arXiv Detail & Related papers (2025-06-29T13:45:54Z) - A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a reinforce-like algorithm perspective and analyze its core components.<n>We find that a simple rejection sampling baseline, RAFT, yields competitive performance than GRPO and PPO.<n>Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
arXiv Detail & Related papers (2025-04-15T16:15:02Z) - VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [66.80143024475635]
We propose VinePPO, a straightforward approach to compute unbiased Monte Carlo-based estimates.
We show that VinePPO consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets.
arXiv Detail & Related papers (2024-10-02T15:49:30Z) - DPO Meets PPO: Reinforced Token Optimization for RLHF [35.638723885233475]
We introduce an algorithm that learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal.<n>Experiments demonstrate that textttRTO performs better than PPO and other direct preference learning algorithms.
arXiv Detail & Related papers (2024-04-29T17:58:30Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.<n>In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.<n>We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
arXiv Detail & Related papers (2024-02-01T17:10:35Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.