Related papers: Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood

Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood

URL: http://arxiv.org/abs/2510.09369v1
Date: Fri, 10 Oct 2025 13:25:28 GMT
Title: Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood
Authors: Xingyu Lin, Yilin Wen, En Wang, Du Su, Wenbin Liu, Chenfu Bao, Zhonghou Lv,
Abstract summary: We propose TEPO, a novel token-level framework that incorporates Markov Likelihood (sequence likelihood) links group-level rewards with tokens via token-level aggregation.<n>Experiments show that TEPO consistently outperforms existing baselines across key metrics.<n>It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.
Score: 9.335167757513046
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT). Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that incorporates Markov Likelihood (sequence likelihood) links group-level rewards with tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.

Related papers

Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models [49.65762241649762]
We propose a framework that treats sequences of K consecutive tokens as unified semantic actions.<n> Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy gradient baselines.
arXiv Detail & Related papers (2026-02-16T01:28:38Z)
Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs)<n>We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint.<n>DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
Owen-Shapley Policy Optimization (OSPO): A Principled RL Algorithm for Generative Search LLMs [12.75200353208858]
Owen-Shapley Policy Optimization (OSPO) is a framework that redistributes sequence-level advantages based on tokens' marginal contributions to outcomes.<n>Unlike value-model-based methods requiring additional computation, OSPO employs potential-based reward shaping via Shapley-Owen attributions to assign segment-level credit.<n> Experiments on Amazon ESCI and H&M Fashion datasets show consistent gains over baselines.
arXiv Detail & Related papers (2026-01-13T10:17:46Z)
Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR [31.43482175098666]
Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising framework for optimizing large language models in reasoning tasks.<n>Existing RLVR algorithms focus on different granularities, and each has complementary strengths and limitations.<n>We propose Dynamic Hybrid Policy Optimization (DHPO) to bridge GRPO and GSPO within a single clipped surrogate objective.
arXiv Detail & Related papers (2026-01-09T07:57:40Z)
Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment [61.80228667422234]
VGPO redefines value estimation across both temporal and group dimensions.<n>It transforms the sparse terminal reward into dense, process-aware value estimates.<n>It replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal.
arXiv Detail & Related papers (2025-12-13T16:31:26Z)
Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective [85.06838178922791]
Reinforcement Learning (RL) has proven highly effective for autoregressive language models.<n>But adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges.<n>We propose a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy.
arXiv Detail & Related papers (2025-12-03T13:05:32Z)
Token-Regulated Group Relative Policy Optimization for Stable Reinforcement Learning in Large Language Models [18.785063555637613]
Group Relative Policy Optimization (GRPO) has demonstrated strong performance.<n>It suffers from a critical issue: low-probability tokens disproportionately dominate gradient updates.<n>This imbalance leads to unstable training and suppresses the contribution of high-probability tokens.
arXiv Detail & Related papers (2025-10-29T08:07:47Z)
GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy [5.691990020513277]
We propose Dynamic Entropy Weighting, a novel mechanism that facilitates fine-grained rewards through two new algorithms.<n>By repurposing policy entropy for reward shaping, we achieve true per-token credit assignment.
arXiv Detail & Related papers (2025-08-06T11:42:47Z)
TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization [73.16975077770765]
Recent advancements in reinforcement learning have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO)<n>It is challenging to leverage such token-level reward as guidance for Direct Preference Optimization (DPO)<n>This work decomposes PPO into a sequence of token-level proximal policy optimization problems and then frames the problem of token-level PPO with token-level reward guidance.
arXiv Detail & Related papers (2025-06-17T14:30:06Z)
Discriminative Policy Optimization for Token-Level Reward Models [55.98642069903191]
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs)<n>Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations.<n>Reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH.
arXiv Detail & Related papers (2025-05-29T11:40:34Z)
KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning [19.25257653416883]
Key-token Advantage Estimation (KTAE) is a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models.<n>We show that models trained with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-05-22T16:00:33Z)
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes.<n>We present a novel framework for identifying these tokens through rollout sampling.<n>We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.