Related papers: Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

URL: http://arxiv.org/abs/2602.05261v1
Date: Thu, 05 Feb 2026 03:35:38 GMT
Title: Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR
Authors: Fanfan Liu, Youyang Yin, Peng Shi, Siqi Yang, Zhixiong Zeng, Haibo Qiu,
Abstract summary: An increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. This paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm.
Score: 11.820526438759238
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.
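As a rough illustration of where response length enters the GSPO objective that the abstract refers to, the sketch below computes GSPO's length-normalized sequence importance ratio, exp of the mean per-token log-probability difference between the new and old policies. The function name and the toy log-probabilities are hypothetical, chosen only for illustration. The abstract does not spell out how LUSPO corrects the bias, so this sketch covers only the GSPO ratio and not the LUSPO loss.

```python
import math


def gspo_sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence importance ratio used by GSPO.

    logp_new / logp_old are per-token log-probabilities of one response
    under the current and the old policy (hypothetical toy inputs here).
    """
    if not logp_new or len(logp_new) != len(logp_old):
        raise ValueError("need two non-empty sequences of equal length")
    # Mean (not sum) of token log-ratios: this is the 1/|y| normalization.
    mean_log_ratio = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
    return math.exp(mean_log_ratio)


# Two responses with the same per-token shift but different lengths
# get the same sequence-level ratio because of the 1/|y| factor.
short_ratio = gspo_sequence_ratio([-1.0] * 2, [-1.1] * 2)
long_ratio = gspo_sequence_ratio([-1.0] * 8, [-1.1] * 8)
```

Because the log-ratio is averaged over tokens, each token of a longer response contributes a proportionally smaller share to the sequence-level ratio; this is one place where length enters the objective, offered here as context rather than as the paper's own analysis.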

Related papers

LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards [57.993003392037174]
LongR is a framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism. LongR consistently enhances performance across diverse RL algorithms.
arXiv Detail & Related papers (2026-02-05T15:26:47Z)
Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models [71.9060068259379]
We propose cascaded domain-wise reinforcement learning to build general-purpose reasoning models. Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6 Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI).
arXiv Detail & Related papers (2025-12-15T18:02:35Z)
Rectifying LLM Thought from Lens of Optimization [48.98086817378953]
Long chain-of-thought (CoT) prompting enables thorough exploration and deliberation. Despite advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors. We introduce RePro, a novel approach to refine LLM reasoning during post-training.
arXiv Detail & Related papers (2025-12-01T17:41:08Z)
Making Mathematical Reasoning Adaptive [61.45161826629692]
We propose the AdaR framework to enable adaptive reasoning in large language models (LLMs). AdaR synthesizes logically equivalent queries by varying variable values, and trains models with RLVR on these data to penalize spurious logic. Experimental results demonstrate that AdaR improves robustness and generalization, achieving substantial improvement in mathematical reasoning.
arXiv Detail & Related papers (2025-10-06T09:30:05Z)
LSPO: Length-aware Dynamic Sampling for Policy Optimization in LLM Reasoning [20.48365890565577]
We propose a novel meta-RLVR algorithm that dynamically selects training data at each step based on the average response length. We evaluate LSPO across multiple base models and datasets, demonstrating that it consistently improves learning effectiveness.
arXiv Detail & Related papers (2025-10-01T20:57:22Z)
Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning [10.255235456427037]
We propose a simple yet effective two-stage reinforcement learning framework for achieving concise reasoning in Large Language Models (LLMs). The first stage, using more training steps, aims to incentivize the model's reasoning capabilities via Group Relative Policy Optimization. The second stage, using fewer training steps, explicitly enforces conciseness and improves efficiency via Length-aware Group Relative Policy Optimization.
arXiv Detail & Related papers (2025-05-27T13:29:51Z)
TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs [50.820065021136024]
DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). Recent methods have attempted to replicate R1's reasoning capabilities in multimodal settings. We propose TACO, a novel reinforcement learning algorithm for visual reasoning.
arXiv Detail & Related papers (2025-05-27T06:30:48Z)
DGRO: Enhancing LLM Reasoning via Exploration-Exploitation Control and Reward Variance Management [18.953750405635393]
Decoupled Group Reward Optimization (DGRO) is a general RL algorithm for Large Language Model (LLM) reasoning. We show that DGRO achieves state-of-the-art performance on the Logic dataset with an average accuracy of 96.9%, and demonstrates strong generalization across mathematical benchmarks.
arXiv Detail & Related papers (2025-05-19T10:44:49Z)
An In-depth Investigation of Sparse Rate Reduction in Transformer-like Models [32.04194224236952]
We propose an information-theoretic objective function called Sparse Rate Reduction (SRR). We show that SRR has a positive correlation coefficient and outperforms other baseline measures, such as path-norm and sharpness-based ones. We show that generalization can be improved using SRR as regularization on benchmark image classification datasets.
arXiv Detail & Related papers (2024-11-26T07:44:57Z)
A Long Way to Go: Investigating Length Correlations in RLHF [59.49656695716066]
This paper demonstrates, in three diverse settings, that optimizing for response length is a significant factor behind RLHF's reported improvements. We find that improvements in reward are largely driven by increasing response length rather than by other features. Even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models.
arXiv Detail & Related papers (2023-10-05T17:38:28Z)
Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning [24.09547181095033]
A causal graph is a structure built upon the relation between objects and events. We propose a framework with theoretical performance guarantees that alternates between two steps. Our performance improvement is attributed to the virtuous cycle of causal discovery, transition modeling, and policy training.
arXiv Detail & Related papers (2022-07-19T05:31:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.