LACONIC: Length-Aware Constrained Reinforcement Learning for LLM
- URL: http://arxiv.org/abs/2602.14468v1
- Date: Mon, 16 Feb 2026 05:09:40 GMT
- Title: LACONIC: Length-Aware Constrained Reinforcement Learning for LLM
- Authors: Chang Liu, Yiran Zhao, Lawrence Liu, Yaoqi Ye, Csaba Szepesvári, Lin F. Yang
- Abstract summary: LACONIC is a reinforcement learning method that enforces a target token budget during training. It preserves or improves pass@1 while reducing output length by over 50%. It maintains out-of-domain performance on general knowledge and multilingual benchmarks with 44% fewer tokens.
- Score: 29.383977698780374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) has enhanced the capabilities of large language models (LLMs) through reward-driven training. Nevertheless, this process can introduce excessively long responses, inflating inference latency and computational overhead. Prior length-control approaches typically rely on fixed heuristic reward shaping, which can misalign with the task objective and require brittle tuning. In this work, we propose LACONIC, a reinforcement learning method that enforces a target token budget during training. Specifically, we update policy models using an augmented objective that combines the task reward with a length-based cost. To balance brevity and task performance, the cost scale is adaptively adjusted throughout training. This yields robust length control while preserving task reward. We provide a theoretical guarantee that supports the method. Across mathematical reasoning models and datasets, LACONIC preserves or improves pass@1 while reducing output length by over 50%. It maintains out-of-domain performance on general knowledge and multilingual benchmarks with 44% fewer tokens. Moreover, LACONIC integrates into standard RL-tuning with no inference changes and minimal deployment overhead.
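The abstract describes an augmented objective (task reward minus a scaled length cost) whose cost scale is adapted during training. The sketch below is one plausible reading of that idea, not the paper's algorithm: it assumes a cost equal to the normalized number of tokens over budget and a simple dual-ascent style update of the cost scale. The class name `LengthBudgetController`, its fields, and the step size are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LengthBudgetController:
    """Hypothetical helper, not from the paper: adaptive cost scale for a token budget."""

    budget: int               # target token budget (assumed hyperparameter)
    step_size: float = 0.01   # assumed step size for the cost-scale update
    cost_scale: float = 0.0   # adaptive multiplier, kept non-negative

    def length_cost(self, num_tokens: int) -> float:
        # Assumed cost: tokens over budget, normalized by the budget.
        return max(0, num_tokens - self.budget) / self.budget

    def shaped_reward(self, task_reward: float, num_tokens: int) -> float:
        # Augmented objective: task reward minus the scaled length cost.
        return task_reward - self.cost_scale * self.length_cost(num_tokens)

    def update(self, batch_lengths: Sequence[int]) -> float:
        # Raise the scale when the batch mean length exceeds the budget,
        # lower it (down to zero) when the batch is under budget.
        mean_len = sum(batch_lengths) / len(batch_lengths)
        violation = (mean_len - self.budget) / self.budget
        self.cost_scale = max(0.0, self.cost_scale + self.step_size * violation)
        return self.cost_scale
```

In an RL loop under these assumptions, `shaped_reward` would replace the raw task reward for each sampled response, and `update` would be called once per batch with the response lengths, so the penalty grows only while responses overshoot the budget.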
Related papers
- QuRL: Efficient Reinforcement Learning with Quantized Rollout [23.326106976928898]
Reinforcement learning with verifiable rewards (RLVR) has become a trending paradigm for training reasoning large language models (LLMs). Due to the autoregressive decoding nature of LLMs, the rollout process becomes the efficiency bottleneck of RL training, accounting for up to 70% of the total training time. We propose Quantized Reinforcement Learning (QuRL) that uses a quantized actor for accelerating the rollout.
arXiv Detail & Related papers (2026-02-15T01:48:10Z) - Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL [34.12869266614113]
We introduce RC, an iterative decoding algorithm that replaces standard autoregressive decoding during both training and inference. RC exploits an asymmetry between the response generation and summarization capabilities of LLMs to construct reasoning chains that consistently improve. Empirically, training a 4B model with RC using a 16k-token training budget improves performance on HMMT 2025 from 40% to nearly 70% with 0.5m tokens at test time.
arXiv Detail & Related papers (2026-02-03T17:34:04Z) - CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs [31.371566320424552]
CoBA-RL is a reinforcement learning algorithm designed to adaptively allocate rollout budgets based on the model's evolving capability. Our approach effectively orchestrates the trade-off between exploration and exploitation, delivering consistent generalization improvements.
arXiv Detail & Related papers (2026-02-03T03:14:36Z) - Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates [53.3717573880076]
We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. Experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods.
arXiv Detail & Related papers (2026-01-26T14:16:51Z) - From Supervision to Exploration: What Does Protein Language Model Learn During Reinforcement Learning? [76.288870982181]
Protein language models (PLMs) have advanced computational protein science through large-scale pretraining and scalable architectures. Reinforcement learning (RL) has broadened exploration and enabled precise multi-objective optimization in protein design. We ask if RL improves sampling efficiency and, more importantly, if it reveals capabilities not captured by supervised learning.
arXiv Detail & Related papers (2025-10-02T01:31:10Z) - Train Long, Think Short: Curriculum Learning for Efficient Reasoning [51.506559652495476]
We propose a curriculum learning strategy for length-controlled reasoning. Our method starts with generous token budgets and gradually tightens them over training. Experiments on GSM8K, MATH500, SVAMP, College Math, and GSM+ demonstrate that curriculum-based training consistently outperforms fixed-budget baselines.
arXiv Detail & Related papers (2025-08-12T13:48:03Z) - Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z) - Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint [104.53687944498155]
Reinforcement learning (RL) has been widely used in training large language models (LLMs).
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design the token-level RL objective for training and an imitation-based regularization for stabilizing RL process.
arXiv Detail & Related papers (2024-01-11T17:58:41Z) - Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of LRFs as a pretraining signal for reinforcement learning.
Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.