Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models
- URL: http://arxiv.org/abs/2602.14386v1
- Date: Mon, 16 Feb 2026 01:28:38 GMT
- Title: Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models
- Authors: Mufan Xu, Kehai Chen, Xuefeng Bai, Zhengyu Niu, Muyun Yang, Tiejun Zhao, Min Zhang
- Abstract summary: We propose a framework that treats sequences of K consecutive tokens as unified semantic actions. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy gradient baselines.
- Score: 49.65762241649762
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing policy-gradient methods for auto-regressive language models typically select subsequent tokens one at a time as actions in the policy. While effective for many generation tasks, such an approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens--for example, when defining variables or composing equations. This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy gradient baselines, highlighting the limitations of token-level policy gradients for complex reasoning and motivating future research to look beyond token-level granularity for reasoning-intensive language tasks.
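The abstract's core idea lends itself to a short sketch. Below is a minimal, hedged illustration of a block-level policy-gradient loss in PyTorch: K consecutive token log-probabilities are summed into one action log-probability before a REINFORCE-style update. The block size K, the zero-padding scheme, and the trajectory-level advantage are illustrative assumptions based on the abstract alone, not the authors' released implementation.

```python
# Minimal sketch of a block-level (multi-token) policy-gradient loss,
# following the abstract's description: K consecutive tokens form one
# semantic action whose log-probability is the sum of its token
# log-probabilities. K, the padding, and the advantage estimator are
# illustrative choices, not the authors' implementation.
import torch

def block_policy_gradient_loss(token_logps: torch.Tensor,
                               advantages: torch.Tensor,
                               K: int) -> torch.Tensor:
    """token_logps: (B, T) per-token log-probs of sampled completions.
    advantages:  (B,) scalar advantage per trajectory (e.g. reward - baseline).
    K:           number of consecutive tokens treated as one action."""
    B, T = token_logps.shape
    pad = (-T) % K                           # right-pad so T divides by K
    if pad:
        token_logps = torch.nn.functional.pad(token_logps, (0, pad))
    # Sum log-probs within each K-token block: one log-prob per block action.
    block_logps = token_logps.view(B, -1, K).sum(dim=-1)   # (B, T/K)
    # REINFORCE-style objective: each block action shares the trajectory advantage.
    return -(advantages.unsqueeze(1) * block_logps).mean()
```

With K = 1 this reduces to the standard token-level policy gradient, which is exactly the baseline family the paper compares against.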
Related papers
- Learning Policy Representations for Steerable Behavior Synthesis [80.4542176039074]
Given a Markov decision process (MDP), we seek to learn representations for a range of policies to facilitate behavior steering at test time. We show that these representations can be approximated uniformly for a range of policies using a set-based architecture. We use a variational generative approach to induce a smooth latent space, and further shape it with contrastive learning so that latent distances align with differences in value functions.
arXiv Detail & Related papers (2026-01-29T21:52:06Z)
- Policy-Conditioned Policies for Multi-Agent Task Solving [53.67744322553693]
In this work, we propose a paradigm shift that bridges the gap by representing policies as human-interpretable source code. We reformulate the learning problem by utilizing Large Language Models (LLMs) as approximate interpreters. We formalize this process as Programmatic Iterated Best Response (PIBR), an algorithm where the policy code is optimized by textual gradients.
arXiv Detail & Related papers (2025-12-24T07:42:10Z)
- Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood [9.335167757513046]
We propose TEPO, a novel token-level framework that incorporates Markov likelihood (sequence likelihood) to link group-level rewards with tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics. It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.
arXiv Detail & Related papers (2025-10-10T13:25:28Z)
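A hedged sketch of how a single group-level reward can be linked to token-level terms, under a GRPO-flavoured reading of this abstract: rewards are normalized within a group of sampled completions, and the resulting advantage multiplies each token log-probability (the sequence likelihood factorizes over tokens). The normalization and the uniform broadcast to tokens are assumptions; TEPO's actual Markov-likelihood aggregation may differ.

```python
# Hedged sketch of linking one group-level reward per completion to
# per-token loss terms, in the spirit of the TEPO abstract. The group
# normalization and uniform broadcast are assumptions (a GRPO-flavoured
# reading); the paper's actual aggregation may differ.
import torch

def grouped_token_loss(token_logps: torch.Tensor,
                       rewards: torch.Tensor) -> torch.Tensor:
    """token_logps: (G, T) log-probs of G sampled completions of one prompt.
    rewards:     (G,) one scalar reward per completion."""
    # Normalize rewards within the group into per-completion advantages.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # (G,)
    # The sequence (Markov) likelihood factorizes into token log-probs,
    # so the group-level signal aggregates as a sum of token-level terms.
    per_token = -(adv.unsqueeze(1) * token_logps)               # (G, T)
    return per_token.mean()
```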
- Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs). We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. We identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
arXiv Detail & Related papers (2025-07-21T10:52:14Z)
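For context, the vector-quantization (VQ) step that the surveyed variants build on can be sketched in a few lines: a nearest-codebook lookup plus a straight-through gradient estimator, which is also where the codebook-collapse and unstable-gradient issues mentioned above originate. The shapes and the codebook here are illustrative, not tied to any specific surveyed method.

```python
# Minimal VQ lookup: map each continuous feature to its nearest codebook
# entry, with a straight-through estimator so gradients flow past the
# discrete step. Sizes are arbitrary illustrative values.
import torch

def vq_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """z: (N, D) continuous features; codebook: (C, D) learnable codes."""
    # Squared distances between every feature and every code: (N, C).
    d = (z.pow(2).sum(1, keepdim=True)
         - 2 * z @ codebook.t()
         + codebook.pow(2).sum(1))
    idx = d.argmin(dim=1)              # nearest code per feature
    q = codebook[idx]                  # quantized features, (N, D)
    # Straight-through: forward uses q, backward copies gradients to z --
    # the gradient-estimation issue the survey highlights.
    q_st = z + (q - z).detach()
    return q_st, idx
```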
- From Language to Logic: A Bi-Level Framework for Structured Reasoning [6.075080928704587]
Structured reasoning over natural language inputs remains a core challenge in artificial intelligence. We propose a novel framework that maps language to logic through a two-stage process: high-level task abstraction and low-level logic generation. Our approach significantly outperforms existing baselines, with accuracy gains reaching as high as 40%.
arXiv Detail & Related papers (2025-07-11T11:24:09Z)
- Enhancing Decision-Making of Large Language Models via Actor-Critic [28.870961806283425]
Large Language Models (LLMs) have achieved remarkable advancements in natural language processing tasks. Existing methods either rely on short-term auto-regressive action generation or face limitations in accurately simulating rollouts and assessing outcomes. This paper introduces a novel LLM-based Actor-Critic framework, termed LAC, that effectively improves LLM policies with long-term action evaluations.
arXiv Detail & Related papers (2025-06-04T14:58:27Z)
- Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
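A generic entropy-augmented token-level objective of the kind ETPO's name suggests can be sketched as follows. Each token is an action, and a per-token entropy bonus regularizes the policy-gradient term; the entropy weight beta and the per-token advantages are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of an entropy-regularized token-level objective in the spirit
# of ETPO. `beta` is an illustrative hyperparameter; this is a generic
# entropy-augmented loss, not the paper's exact one.
import torch

def entropy_regularized_loss(logits: torch.Tensor,
                             actions: torch.Tensor,
                             advantages: torch.Tensor,
                             beta: float = 0.01) -> torch.Tensor:
    """logits: (T, V) per-step vocab logits; actions: (T,) sampled token ids;
    advantages: (T,) per-token advantage estimates."""
    logp_all = torch.log_softmax(logits, dim=-1)                 # (T, V)
    logp = logp_all.gather(1, actions.unsqueeze(1)).squeeze(1)   # (T,)
    entropy = -(logp_all.exp() * logp_all).sum(dim=-1)           # (T,)
    # Policy-gradient term plus an entropy bonus at every token.
    return (-(advantages * logp) - beta * entropy).mean()
```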
- Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy-gradient algorithm for TMDPs via a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
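Since the entry above describes a simple extension of PPO, here is the standard clipped PPO surrogate being extended, for reference. The TMDP-specific topological machinery is not shown, and epsilon is the usual clipping hyperparameter.

```python
# The well-known PPO clipped surrogate (the base objective the paper
# extends); the topological-constraint handling is not sketched here.
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  epsilon: float = 0.2) -> torch.Tensor:
    """All inputs are (N,) per-action tensors; logp_old is detached."""
    ratio = (logp_new - logp_old).exp()
    clipped = ratio.clamp(1 - epsilon, 1 + epsilon)
    # Pessimistic min over clipped and unclipped surrogates.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```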