Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning
- URL: http://arxiv.org/abs/2510.05251v1
- Date: Mon, 06 Oct 2025 18:15:43 GMT
- Title: Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning
- Authors: Chenghao Yang, Lin Gui, Chenxiao Yang, Victor Veitch, Lizhu Zhang, Zhuokai Zhao,
- Abstract summary: Reinforcement learning with verifiable rewards (RLVR) is a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Standard fixed-temperature sampling is simple, but it struggles to balance these competing demands, as high temperatures degrade sample quality and low temperatures limit discovery. We propose a simpler and more effective strategy, Exploratory Annealed Decoding (EAD), grounded in the insight that exploration is most impactful on early tokens.
- Score: 29.277754405630205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning with verifiable rewards (RLVR) is a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs), yet its success hinges on effective exploration. An ideal exploration strategy must navigate two fundamental challenges: it must preserve sample quality while also ensuring training stability. While standard fixed-temperature sampling is simple, it struggles to balance these competing demands, as high temperatures degrade sample quality and low temperatures limit discovery. In this work, we propose a simpler and more effective strategy, Exploratory Annealed Decoding (EAD), grounded in the insight that exploration is most impactful on early tokens which define a sequence's semantic direction. EAD implements an intuitive **explore-at-the-beginning, exploit-at-the-end** strategy by annealing the sampling temperature from high to low during generation. This dynamic schedule encourages meaningful, high-level diversity at the start, then gradually lowers the temperature to preserve sample quality and keep the sampling distribution close to the target policy, which is essential for stable training. We demonstrate that EAD is a lightweight, plug-and-play method that significantly improves sample efficiency, consistently outperforming fixed-temperature sampling across various RLVR algorithms and model sizes. Our work suggests that aligning exploration with the natural dynamics of sequential generation offers a robust path to improving LLM reasoning.
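The explore-at-the-beginning, exploit-at-the-end idea in the abstract can be illustrated with a minimal sketch. This listing does not state the paper's actual annealing schedule, so the linear decay, the default values `t_start=1.2` and `t_end=0.6`, and all function names below are assumptions chosen for illustration only, not the authors' implementation.

```python
import math
import random


def annealed_temperature(step, max_steps, t_start=1.2, t_end=0.6):
    """Hypothetical linear schedule: t_start at the first token, t_end at the last."""
    if max_steps <= 1:
        return t_end
    frac = min(step, max_steps - 1) / (max_steps - 1)
    return t_start + (t_end - t_start) * frac


def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def generate(next_logits, max_steps, rng, t_start=1.2, t_end=0.6):
    """Sample max_steps tokens, lowering the temperature as generation proceeds.

    next_logits is a stand-in for a language model: it maps the tokens
    generated so far to a list of logits over the vocabulary.
    """
    tokens = []
    for step in range(max_steps):
        temperature = annealed_temperature(step, max_steps, t_start, t_end)
        tokens.append(sample_token(next_logits(tokens), temperature, rng))
    return tokens
```

With a toy model such as `generate(lambda toks: [2.0, 1.0, 0.0], 8, random.Random(0))`, early positions are drawn from a flatter distribution and later positions from a sharper one, which is the qualitative behavior the abstract describes.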
Related papers
- Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL [30.357975264905978]
We propose a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme.
arXiv Detail & Related papers (2026-02-13T15:42:59Z) - AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models [54.56296715999545]
Reinforcement learning from human feedback shows promise for aligning diffusion and flow models. Policy optimization methods such as GRPO suffer from inefficient and static sampling strategies. We propose Adaptive Entropy-Guided Policy Optimization (AEGPO), a novel dual-signal, dual-level adaptive optimization strategy.
arXiv Detail & Related papers (2026-02-06T16:09:50Z) - Tailored Primitive Initialization is the Secret Key to Reinforcement Learning [61.29280885291581]
Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). We argue that initializing LLMs with diverse, high-quality reasoning primitives is essential for achieving stable and sample-efficient RL training. We propose Tailor, a finetuning pipeline that automatically discovers and curates novel reasoning primitives.
arXiv Detail & Related papers (2025-11-16T03:12:40Z) - PACR: Progressively Ascending Confidence Reward for LLM Reasoning [55.06373646059141]
We propose Progressively Ascending Confidence Reward (PACR). PACR is a dense, model-intrinsic reward computed directly from the model's evolving belief in the correct answer. Our results suggest that dense, model-intrinsic shaping signals can make RLVR training more effective and reliable.
arXiv Detail & Related papers (2025-10-25T11:25:35Z) - Control the Temperature: Selective Sampling for Diverse and High-Quality LLM Outputs [26.477037145228735]
Temperature-based sampling is a common strategy to increase diversity. But uncontrolled high temperature sampling, e.g., min-$p$ or top-$p$, degrades reasoning quality. We propose selective sampling, a method that switches between greedy and high-temperature sampling.
arXiv Detail & Related papers (2025-09-20T15:16:27Z) - Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z) - From Data-Centric to Sample-Centric: Enhancing LLM Reasoning via Progressive Optimization [7.531052649961168]
Reinforcement learning with verifiable rewards (RLVR) has recently advanced the reasoning capabilities of large language models (LLMs). We investigate RLVR from a sample-centric perspective and introduce LPPO, a framework of progressive optimization techniques. Our work addresses a critical question: how to best leverage a small set of trusted, high-quality demonstrations, rather than simply scaling up data volume.
arXiv Detail & Related papers (2025-07-09T06:05:28Z) - Ctrl-Z Sampling: Diffusion Sampling with Controlled Random Zigzag Explorations [17.357140159249496]
We propose a novel sampling strategy that adaptively detects and escapes steep local maxima. We show that Ctrl-Z Sampling substantially improves generation quality while requiring only about 7.72 times the NFEs of the original.
arXiv Detail & Related papers (2025-06-25T10:01:00Z) - From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision [18.555485444818835]
We construct an innovative Progressive Active Learning (PAL) framework for single point supervision. We propose a model pre-start concept, which focuses on automatically selecting a portion of easy samples. We show that existing SIRST detection networks equipped with our PAL framework have achieved state-of-the-art (SOTA) results on multiple public datasets.
arXiv Detail & Related papers (2024-12-15T11:08:49Z) - Learning Off-policy with Model-based Intrinsic Motivation For Active Online Exploration [15.463313629574111]
This paper investigates how to achieve sample-efficient exploration in continuous control tasks.
We introduce an RL algorithm that incorporates a predictive model and off-policy learning elements.
We derive an intrinsic reward without incurring parameter overhead.
arXiv Detail & Related papers (2024-03-31T11:39:11Z) - Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z) - Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective [142.36200080384145]
We propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent.
We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods.
arXiv Detail & Related papers (2022-09-18T03:51:58Z) - Rethinking Sampling Strategies for Unsupervised Person Re-identification [59.47536050785886]
We analyze the reasons for the performance differences between various sampling strategies under the same framework and loss function. Group sampling is proposed, which gathers samples from the same class into groups. Experiments on Market-1501, DukeMTMC-reID and MSMT17 show that group sampling achieves performance comparable to state-of-the-art methods.
arXiv Detail & Related papers (2021-07-07T05:39:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.