Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
- URL: http://arxiv.org/abs/2510.03865v2
- Date: Fri, 31 Oct 2025 06:08:26 GMT
- Title: Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
- Authors: Wenhao Deng, Long Wei, Chenglei Yu, Tailin Wu
- Abstract summary: We propose RAPO, an algorithm to promote broader yet focused exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset. Results show that RAPO consistently improves problem-solving performance.
- Score: 8.839121572048018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning with verifiable rewards (RLVR) has recently enhanced the reasoning capabilities of large language models (LLMs), particularly for mathematical problem solving. However, a fundamental limitation remains: as the sampling budget increases, the advantage of RLVR-trained models over their pretrained bases often diminishes or even vanishes, revealing a strong dependence on the base model's restricted search space. We attribute this phenomenon to the widespread use of the reverse Kullback-Leibler (KL) divergence regularizer, whose mode-seeking behavior keeps the policy trapped inside the base model's support region and hampers wider exploration. To address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an algorithm to promote broader yet focused exploration. Our method (i) utilizes the forward KL penalty to replace the reverse KL penalty for out-of-distribution exploration, and (ii) reweights the reference policy to facilitate adaptive in-distribution exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset, without supervised fine-tuning, and evaluate them on AIME2024 and AIME2025. Results show that RAPO consistently improves problem-solving performance. Notably, RAPO enables models to surpass the base model's performance ceiling and solves previously intractable problems, advancing the frontier of RLVR for challenging reasoning tasks.
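The abstract's core argument is that a reverse-KL regularizer is mode-seeking, while a forward-KL penalty tolerates policy mass outside the base model's support. A minimal numerical sketch of that asymmetry on a toy discrete distribution (all probabilities here are illustrative, not from the paper):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions, natural log."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

base = [0.90, 0.09, 0.01]    # reference policy: one dominant mode
policy = [0.50, 0.30, 0.20]  # exploratory policy spreading mass wider

# Reverse KL(policy || base): heavily penalizes policy mass on tokens
# the base model rarely produces, pulling the policy back into the
# base's support region.
reverse_kl = kl(policy, base)

# Forward KL(base || policy): penalizes the policy for missing the
# base's high-probability mass, but is far more tolerant of extra mass
# on low-probability (out-of-distribution) tokens.
forward_kl = kl(base, policy)

print(reverse_kl, forward_kl)  # reverse penalty exceeds forward here
```

For this exploratory policy the reverse-KL penalty is noticeably larger than the forward-KL one, which illustrates (under these hypothetical numbers) why swapping regularizers can free the policy to explore beyond the base model's restricted search space.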
Related papers
- Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance. CalibRL increases policy entropy in a guided manner and clarifies the target distribution. Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z) - Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards [16.22162269278471]
PSN-RLVR perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration. We propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty.
arXiv Detail & Related papers (2026-01-30T13:10:30Z) - Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models [33.214586668992965]
Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning. We propose RECAP, a replay strategy with dynamic objective reweighting for general knowledge. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning.
arXiv Detail & Related papers (2025-10-24T19:08:48Z) - Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning [49.290631188365786]
Scaf-GRPO is a training framework that intervenes when a model's independent learning has plateaued. It boosts the pass@1 score of the Qwen2.5-Math-7B model by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach.
arXiv Detail & Related papers (2025-10-22T17:41:30Z) - Risk-Sensitive RL for Alleviating Exploration Dilemmas in Large Language Models [22.50153462109328]
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs). We introduce a Risk-Sensitive Reinforcement Learning framework. Our approach employs a risk-seeking objective that interpolates between mean and maximum rewards, leading to a novel algorithm. Remarkably, RS-GRPO is simple to implement, requiring only minor code modifications.
arXiv Detail & Related papers (2025-09-29T04:12:20Z) - RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [86.30192066451256]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z) - DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization [50.91849555841057]
Group Relative Policy Optimization is a reinforcement learning method for large reasoning models (LRMs). We introduce a new Discriminative Constrained Optimization framework for reinforcing LRMs, grounded in the principle of discriminative learning. DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO.
arXiv Detail & Related papers (2025-05-18T11:08:32Z) - Learning to Reason under Off-Policy Guidance [40.27817638425237]
We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments RLVR with off-policy reasoning traces. LUFFY balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training.
arXiv Detail & Related papers (2025-04-21T08:09:13Z) - Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? [67.30809748319486]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). This study critically examines the current state of RLVR. We find that the current training setup does not elicit fundamentally new reasoning patterns.
arXiv Detail & Related papers (2025-04-18T17:59:56Z) - A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
arXiv Detail & Related papers (2025-04-15T16:15:02Z) - SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF [22.88031166401938]
This paper presents SALSA, a novel approach designed to overcome limitations by creating a more flexible and better located reference model.
We show that SALSA fosters better exploration, achieving higher rewards and improving model robustness, out-of-distribution generalization, and performance.
arXiv Detail & Related papers (2024-11-04T04:53:43Z)
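Several of the summaries above (Reinforce-Rej in particular) describe filtering out prompt groups whose sampled responses are entirely correct or entirely incorrect, since uniform rewards carry no signal for a group-relative update. A hypothetical sketch of that filtering step, with illustrative data (the real implementations may differ):

```python
def filter_groups(groups):
    """Keep only groups with mixed binary rewards.

    A group where every response scores 1 (or every response scores 0)
    has zero within-group advantage, so it contributes nothing to a
    group-relative policy-gradient update and can be dropped.
    """
    return [
        rewards for rewards in groups
        if 0 < sum(rewards) < len(rewards)  # some right, some wrong
    ]

batches = [
    [1, 1, 1, 1],  # entirely correct: filtered out
    [0, 0, 0, 0],  # entirely incorrect: filtered out
    [1, 0, 1, 0],  # mixed rewards: kept
]
print(filter_groups(batches))  # [[1, 0, 1, 0]]
```

This is only a sketch of the shared idea, not any single paper's exact algorithm; each method layers its own objective on top of the retained samples.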
This list is automatically generated from the titles and abstracts of the papers in this site.