Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification
- URL: http://arxiv.org/abs/2601.21244v2
- Date: Mon, 02 Feb 2026 13:30:50 GMT
- Title: Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification
- Authors: Yiju Guo, Tianyi Hu, Zexu Sun, Yankai Lin
- Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. We propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens.
- Score: 44.681296696564004
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over 1.6$\times$ speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
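The two-stage recipe in the abstract (sample on a purified prompt, then supervise on the original noisy one) can be made concrete with a minimal sketch. Everything below is illustrative, not the authors' code: `purify`, `rollout`, and `verifiable_reward` are hypothetical stubs standing in for the interference-token filter, the policy sampler, and the RLVR verifier.

```python
# Minimal LENS-style two-stage loop (illustrative stubs, not the paper's code).
import random

def purify(prompt: str, interference_tokens: set[str]) -> str:
    """Stage 1: drop tokens flagged as interference from the prompt."""
    return " ".join(t for t in prompt.split() if t not in interference_tokens)

def rollout(prompt: str, n: int) -> list[str]:
    """Placeholder policy sampler: returns n candidate reasoning traces."""
    return [f"trace-{i} for: {prompt}" for i in range(n)]

def verifiable_reward(trace: str) -> float:
    """Placeholder RLVR verifier: 1.0 if the trace checks out, else 0.0."""
    return float(random.random() > 0.5)

def lens_step(noisy_prompt: str, interference_tokens: set[str], n_rollouts: int = 8):
    # Explore on the purified prompt, where sampling succeeds more often.
    clean_prompt = purify(noisy_prompt, interference_tokens)
    successes = [t for t in rollout(clean_prompt, n_rollouts)
                 if verifiable_reward(t) > 0]
    # Transfer: pair each successful trace with the ORIGINAL noisy prompt,
    # so policy optimization teaches the model to ignore the interference.
    return [(noisy_prompt, t) for t in successes]
```

The transfer step is the point: updates are taken against the noisy prompt the model will actually see at deployment, with the purified prompt used only to make exploration cheap.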
Related papers
- SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent [39.43590030917357]
SIGHT is a framework that enhances search-based reasoning through Self-Evidence Support and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches.
arXiv Detail & Related papers (2026-02-12T04:16:55Z)
- Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs). We argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z)
- Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards [16.22162269278471]
PSN-RLVR perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration. We propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty.
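The core mechanism, perturbing weights once and holding that perturbation fixed for an entire trajectory, can be sketched as a context manager. This is a minimal sketch under assumptions: the noise scale `sigma` is supplied externally (the paper's adaptive scheduler is not reproduced), and the policy is any torch module that also exposes an HF-style `generate`.

```python
# Parameter-space noise for one rollout: perturb once, sample a whole
# trajectory under that perturbation, then undo it exactly. Sketch only.
import contextlib
import torch

@contextlib.contextmanager
def parameter_noise(policy: torch.nn.Module, sigma: float):
    noises = []
    with torch.no_grad():
        for p in policy.parameters():
            eps = sigma * torch.randn_like(p)  # one draw per rollout,
            p.add_(eps)                        # shared by every token
            noises.append(eps)
    try:
        yield policy
    finally:
        with torch.no_grad():
            for p, eps in zip(policy.parameters(), noises):
                p.sub_(eps)                    # restore original weights

# usage (assumes a HuggingFace-style .generate(); illustrative only):
# with parameter_noise(policy, sigma=0.01):
#     trajectory = policy.generate(prompt_ids, max_new_tokens=256)
```

Because the same draw of noise persists across the whole generation, exploration is consistent at the trajectory level, unlike per-token sampling noise.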
arXiv Detail & Related papers (2026-01-30T13:10:30Z)
- LLM Optimization Unlocks Real-Time Pairwise Reranking [6.0141312590967635]
Pairwise Reranking Prompting (PRP) has emerged as a promising plug-and-play approach due to its usability and effectiveness. This paper presents a focused study on pairwise reranking, demonstrating that carefully applied optimization methods can significantly mitigate its latency issues. We achieve a latency reduction of up to 166$\times$, from 61.36 seconds to 0.37 seconds per query, with an insignificant drop in performance as measured by Recall@k.
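PRP itself is simple to sketch: an LLM is prompted with a query and two candidate passages and asked which better answers the query, and a sliding pass of such comparisons bubbles the best candidates toward the top. In the sketch below, `llm_prefers_first` is a hypothetical stand-in for that pairwise prompt call (replaced by a keyword-overlap heuristic so the code runs).

```python
# Pairwise reranking via a sliding bubble pass (illustrative sketch).
def llm_prefers_first(query: str, a: str, b: str) -> bool:
    """Stub for an LLM prompt of the form 'Which passage better answers
    the query?'; here a naive keyword-overlap score keeps it runnable."""
    score = lambda d: sum(w in d.lower() for w in query.lower().split())
    return score(a) >= score(b)

def prp_rerank(query: str, docs: list[str], passes: int = 1) -> list[str]:
    docs = list(docs)
    for _ in range(passes):                    # each pass needs O(n) calls
        for i in range(len(docs) - 1, 0, -1):  # bubble preferred docs upward
            if llm_prefers_first(query, docs[i], docs[i - 1]):
                docs[i], docs[i - 1] = docs[i - 1], docs[i]
    return docs
```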
arXiv Detail & Related papers (2025-11-10T19:04:41Z)
- PACR: Progressively Ascending Confidence Reward for LLM Reasoning [55.06373646059141]
We propose Progressively Ascending Confidence Reward (PACR). PACR is a dense, model-intrinsic reward computed directly from the model's evolving belief in the correct answer. Our results suggest that dense, model-intrinsic shaping signals can make RLVR training more effective and reliable.
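One natural reading of "evolving belief in the correct answer" is a stepwise confidence delta: a reasoning step earns reward when it raises the model's probability of the gold answer. The sketch below is that reading, not necessarily the paper's exact formulation; `answer_prob` is a hypothetical hook (e.g., teacher-forced likelihood of the gold answer given the reasoning prefix).

```python
# PACR-style dense shaping signal (one plausible formulation, sketch only).
def pacr_rewards(step_prefixes: list[str], answer: str, answer_prob) -> list[float]:
    """r_t = p(answer | prefix_t) - p(answer | prefix_{t-1}):
    positive exactly when a step moves belief toward the correct answer."""
    probs = [answer_prob(prefix, answer) for prefix in step_prefixes]
    return [probs[t] - probs[t - 1] for t in range(1, len(probs))]
```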
arXiv Detail & Related papers (2025-10-25T11:25:35Z)
- Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration [61.350777880329815]
Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. We show that RLVR's full potential is hindered by two under-explored dimensions: depth, the hardest problem a model can sample, and breadth, the number of instances consumed in a single iteration. We introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts.
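The re-weighting idea can be sketched as a two-stage rollout budget: problems whose first-stage success rate is low receive extra rollouts so hard prompts still contribute positive trajectories. The budgets and threshold below are illustrative assumptions, not the paper's schedule.

```python
# Difficulty-adaptive rollout budgeting in the spirit of DARS (sketch only).
def adaptive_rollouts(problems, sample_and_score, base_n=8, extra_n=24,
                      hard_threshold=0.125):
    """sample_and_score(problem, n) -> list of n verifier scores (0 or 1)."""
    batches = {}
    for prob in problems:
        scores = sample_and_score(prob, base_n)        # stage 1: cheap probe
        if sum(scores) / base_n <= hard_threshold:     # hard: successes rare
            scores += sample_and_score(prob, extra_n)  # stage 2: re-weight hard
        batches[prob] = scores
    return batches
```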
arXiv Detail & Related papers (2025-08-19T11:51:40Z)
- NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation [66.36912000442608]
NoisyRollout is a simple yet effective data augmentation method. It mixes training trajectories from both clean and moderately distorted images. It achieves state-of-the-art performance among open-source RL-tuned models.
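The mixing itself is a small change to the rollout loop: part of each group of rollouts is sampled from a distorted copy of the image, and the clean and noisy trajectories are trained on together. A sketch under assumptions, with `rollout` and `distort` as placeholder hooks.

```python
# NoisyRollout-style mixed sampling (illustrative hooks, not the paper's code).
import random

def mixed_rollouts(image, question, rollout, distort, n=8, noisy_frac=0.5):
    n_noisy = int(n * noisy_frac)
    clean = [rollout(image, question) for _ in range(n - n_noisy)]
    noisy = [rollout(distort(image), question) for _ in range(n_noisy)]
    batch = clean + noisy
    random.shuffle(batch)  # one mixed group, e.g., for a single GRPO update
    return batch
```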
arXiv Detail & Related papers (2025-04-17T16:10:13Z)
- ROPO: Robust Preference Optimization for Large Language Models [59.10763211091664]
We propose an iterative alignment approach that integrates noise-tolerance and filtering of noisy samples without the aid of external models.
Experiments on three widely-used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods.
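Since ROPO filters noisy samples "without the aid of external models", one self-contained reading is to score each preference pair by the policy's own log-likelihood margin between chosen and rejected responses, and drop the least consistent pairs. The margin rule and keep ratio below are assumptions for the sketch, not ROPO's actual criterion.

```python
# Self-filtering of noisy preference pairs (one plausible reading, sketch only).
def filter_preferences(pairs, logp, keep_ratio=0.8):
    """pairs: (prompt, chosen, rejected) triples; logp(prompt, resp) -> float."""
    scored = [(logp(p, c) - logp(p, r), (p, c, r)) for p, c, r in pairs]
    scored.sort(key=lambda m: m[0], reverse=True)  # most self-consistent first
    kept = scored[: int(len(scored) * keep_ratio)]
    return [pair for _, pair in kept]
```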
arXiv Detail & Related papers (2024-04-05T13:58:51Z)