Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards
- URL: http://arxiv.org/abs/2602.02555v1
- Date: Fri, 30 Jan 2026 13:10:30 GMT
- Title: Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards
- Authors: Bizhe Bai, Xinyue Wang, Peng Ye, Tao Chen
- Abstract summary: PSN-RLVR perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration. We propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty.
- Score: 16.22162269278471
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning, yet growing evidence indicates an exploration ceiling: it often reweights existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets (e.g., pass-at-256). We address this limitation with PSN-RLVR, which perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that better preserves long-horizon chain-of-thought coherence than action-space noise. To mitigate the resulting sampling-update mismatch, we incorporate truncated importance sampling (TIS). To avoid expensive KL-based adaptive noise control, we propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty. Instantiated on GRPO, a widely used RLVR method, PSN-GRPO consistently expands the effective reasoning capability boundary across multiple mathematical reasoning benchmarks and model families, yielding higher pass-at-k under large sampling budgets and outperforming prior exploration-oriented RLVR methods (e.g., Pass-at-k-style training) while remaining orthogonal and thus composable for additional gains.
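The core rollout mechanism lends itself to a compact sketch. The Python below illustrates the two moving parts named in the abstract, parameter-space noise applied before generation and a truncated importance sampling correction; the `policy` object with `generate` and `logprob` methods, the noise scale `sigma`, and the truncation cap `tis_cap` are hypothetical stand-ins, not the paper's implementation.

```python
# Minimal sketch: parameter-space noise (PSN) rollouts with truncated
# importance sampling (TIS). `policy.generate` / `policy.logprob` are
# hypothetical interfaces assumed for illustration.
import copy
import torch

def psn_rollouts(policy, prompts, sigma, n_rollouts=8, tis_cap=2.0):
    # 1) Perturb a *copy* of the policy's parameters with Gaussian noise.
    noisy = copy.deepcopy(policy)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))

    # 2) Generate with the fixed perturbed weights, so the noise is
    #    temporally consistent across each whole trajectory.
    rollouts = [noisy.generate(q) for q in prompts for _ in range(n_rollouts)]

    # 3) TIS: correct the sampling-update mismatch between the noisy
    #    behavior policy and the clean policy being optimized.
    weights = []
    for traj in rollouts:
        log_ratio = policy.logprob(traj) - noisy.logprob(traj)
        weights.append(torch.clamp(torch.exp(log_ratio), max=tis_cap))
    return rollouts, weights
```

Because the perturbed weights stay fixed for an entire generation, every token of a trajectory is sampled from the same perturbed policy; this is what distinguishes trajectory-level exploration from per-step action-space noise and why it better preserves long-horizon chain-of-thought coherence.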
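The adaptive noise scheduler can be sketched in the same spirit. Here a surrogate score mixes semantic diversity (mean pairwise embedding similarity of a rollout group, inverted) with normalized self-certainty (mean per-token probability), and `sigma` is nudged multiplicatively toward a target; the mixing weight `alpha`, the `target` level, and the multiplicative control law are all assumptions, since the abstract only names the two ingredients.

```python
# Hedged sketch of the lightweight surrogate-driven noise scheduler.
# The exact combination rule and control law are assumptions.
import torch
import torch.nn.functional as F

def surrogate_score(embeddings, token_logprobs, alpha=0.5):
    """Semantic diversity, mixed with (one minus) normalized self-certainty."""
    E = F.normalize(embeddings, dim=-1)          # (n_rollouts, d)
    sims = E @ E.T
    n = E.shape[0]
    mean_sim = (sims.sum() - sims.diagonal().sum()) / (n * (n - 1))
    diversity = 1.0 - mean_sim                   # high when rollouts differ

    # Normalized self-certainty: mean per-token probability per rollout.
    certainty = torch.stack([lp.mean().exp() for lp in token_logprobs]).mean()
    return alpha * diversity + (1.0 - alpha) * (1.0 - certainty)

def update_sigma(sigma, score, target=0.3, up=1.05, down=0.95):
    """Grow noise when measured exploration falls below target, else shrink."""
    return sigma * (up if score < target else down)
```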
Related papers
- Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance. CalibRL increases policy entropy in a guided manner and clarifies the target distribution. Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z)
- Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards [69.74686029941881]
Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. We propose a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
arXiv Detail & Related papers (2026-02-09T10:51:58Z)
- Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs). We argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z)
- Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification [44.681296696564004]
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. We propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens.
arXiv Detail & Related papers (2026-01-29T04:08:24Z)
- Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training [47.26632817047513]
Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates. We propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process (see the allocation sketch after this list).
arXiv Detail & Related papers (2025-10-06T16:34:09Z)
- Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration [8.839121572048018]
We propose RAPO, an algorithm to promote broader yet focused exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset. Results show that RAPO consistently improves problem-solving performance.
arXiv Detail & Related papers (2025-10-04T16:22:19Z)
- G$^2$RPO: Granular GRPO for Precise Reward in Flow Models [74.21206048155669]
We propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions. We introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales. Our G$^2$RPO significantly outperforms existing flow-based GRPO baselines.
arXiv Detail & Related papers (2025-10-02T12:57:12Z)
- Risk-Sensitive RL for Alleviating Exploration Dilemmas in Large Language Models [22.50153462109328]
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs). We introduce a Risk-Sensitive Reinforcement Learning framework. Our approach employs a risk-seeking objective that interpolates between mean and maximum rewards, leading to a novel algorithm (a sketch of one such objective appears after this list). Remarkably, RS-GRPO is simple to implement, requiring only minor code modifications.
arXiv Detail & Related papers (2025-09-29T04:12:20Z)
- Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration [61.350777880329815]
Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. We show that RLVR's full potential is hindered by two under-explored dimensions: depth (the hardest problem a model can sample) and breadth (the number of instances consumed in a single iteration). We introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts.
arXiv Detail & Related papers (2025-08-19T11:51:40Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based ZO algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization, instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimates by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios (see the estimator sketch after this list).
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
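A few of the related methods above are concrete enough to sketch. For Reinforce-Ada, the loop below interleaves estimation and sampling in a successive-elimination style: a prompt keeps receiving rollouts until its reward group carries gradient signal (all-correct or all-wrong groups yield zero group-relative advantage) or a budget is hit. The elimination rule and the `rollout_fn` interface are illustrative assumptions, not the paper's exact algorithm.

```python
# Hedged sketch of online successive elimination for rollout allocation.
# `rollout_fn(prompt)` is assumed to return (trajectory, reward) with a
# verifiable 0/1 reward; the stopping rule is an illustrative guess.
def adaptive_sample(prompts, rollout_fn, min_n=4, max_n=16):
    """Interleave sampling and estimation: drop prompts once informative."""
    groups = {p: [] for p in prompts}
    active = set(prompts)
    while active:
        for p in list(active):
            traj, reward = rollout_fn(p)     # one more rollout for p
            groups[p].append((traj, reward))
            rewards = [r for _, r in groups[p]]
            n = len(rewards)
            mixed = 0 < sum(rewards) < n     # both successes and failures seen
            if (n >= min_n and mixed) or n >= max_n:
                active.discard(p)            # enough signal: stop sampling p
    return groups
```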
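For the risk-sensitive objective of RS-GRPO, one standard form that interpolates between the mean and the maximum of a reward group is the exponential-utility (log-sum-exp) value; treating it as RS-GRPO's exact objective is an assumption.

```python
# Exponential-utility value of a reward group: recovers the mean as
# beta -> 0 and the maximum as beta -> inf. One standard risk-seeking
# form; its use by RS-GRPO specifically is an assumption.
import math
import torch

def risk_seeking_value(rewards: torch.Tensor, beta: float) -> torch.Tensor:
    """(1/beta) * log mean exp(beta * R)."""
    if beta == 0.0:
        return rewards.mean()
    return (torch.logsumexp(beta * rewards, dim=0) - math.log(rewards.numel())) / beta

rewards = torch.tensor([0.0, 0.0, 1.0, 0.2])
for beta in (0.0, 1.0, 10.0, 100.0):
    print(beta, risk_seeking_value(rewards, beta).item())
# As beta grows, the value moves from the group mean (0.3) toward the max (1.0).
```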
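Finally, the ZO-RL entry builds on the classic two-point zeroth-order gradient estimator, with the perturbation direction drawn from a learned sampling policy rather than an isotropic Gaussian. The estimator below is standard; `sample_direction` is a hypothetical stand-in for the learned policy.

```python
# Two-point zeroth-order gradient estimate. Replacing the random
# direction with a learned `sample_direction` is ZO-RL's idea; the
# function used here is a generic stand-in.
import numpy as np

def zo_gradient(f, x, sample_direction, mu=1e-3):
    """g ~ (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u."""
    u = sample_direction(x)                      # perturbation direction
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u

# With a Gaussian direction this reduces to the classic random estimator.
f = lambda v: float(np.sum(v ** 2))              # toy objective, grad = 2x
g = zo_gradient(f, np.ones(4), lambda v: np.random.randn(v.size))
```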
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.