Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration
- URL: http://arxiv.org/abs/2508.13755v4
- Date: Mon, 06 Oct 2025 11:12:22 GMT
- Title: Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration
- Authors: Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Yiwei Wang, Xiaodan Liang, Jing Tang
- Abstract summary: Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. We show that RLVR's full potential is hindered by two under-explored dimensions: Depth, the hardest problem a model can sample, and Breadth, the number of instances consumed in a single iteration. We introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts.
- Score: 61.350777880329815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models, yet its full potential is hindered by two under-explored dimensions: Depth, the hardest problem a model can sample, and Breadth, the number of instances consumed in a single iteration. We dissect the popular GRPO algorithm and reveal a systematic bias: the cumulative advantage disproportionately weights samples of medium accuracy while down-weighting the low-accuracy instances that are crucial for pushing reasoning boundaries. To rectify this neglect of depth, we introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts, thereby increasing the number of positive rollouts for hard problems. Empirically, naively enlarging the rollout size only accelerates convergence and can even hurt Pass@K. DARS, in contrast, delivers consistent Pass@K gains without extra inference cost at convergence. Just as we adaptively expand the depth of exploration, we ask whether aggressively scaling the breadth of training data can further amplify reasoning gains. To this end, we aggressively scale the batch size and replace PPO's mini-batch iterations with full-batch updates over multiple epochs. Increasing breadth significantly enhances Pass@1 performance, and large-breadth training sustains high token-level entropy, indicating continued exploration and reduced gradient noise. We further present DARS-B, which augments DARS with large breadth, and demonstrate simultaneous gains in Pass@K and Pass@1. The results confirm that breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, and that both are key to unleashing its reasoning power.
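To make the reported medium-accuracy bias concrete, here is a minimal sketch (our illustration, not the authors' code) of GRPO-style group-normalized advantages under binary verifiable rewards. For a group of G rollouts on a problem with accuracy p, the total advantage magnitude works out to 2*G*sqrt(p(1-p)), which peaks at p = 0.5 and vanishes for the hardest problems; this is the depth neglect the abstract describes:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

G = 16  # rollouts per problem
for k in [1, 4, 8, 12, 15]:  # number of correct rollouts out of G
    adv = grpo_advantages([1.0] * k + [0.0] * (G - k))
    # For binary rewards, sum(|A|) = 2 * G * sqrt(p * (1 - p)):
    # maximal at p = 0.5, near zero for very hard (p -> 0) problems.
    print(f"accuracy {k/G:.2f}: total |A| = {np.abs(adv).sum():.2f}")
```

On the DARS side, the abstract only states that hard problems receive targeted multi-stage rollouts. A hypothetical one-shot proxy for that idea is sketched below: size each problem's rollout budget so the expected number of positive rollouts stays roughly constant. The budget rule, `target_pos`, and the function name are our assumptions, not the paper's actual procedure:

```python
import math

def adaptive_rollout_budget(acc_estimates, base_n=8, target_pos=4, max_n=64):
    """Give hard (low-accuracy) problems a larger rollout budget so the
    expected number of positives, n * p, stays near target_pos.
    Illustrative stand-in for DARS's multi-stage re-weighting."""
    budgets = {}
    for pid, p in acc_estimates.items():
        if p <= 0.0:
            budgets[pid] = max_n  # never-solved problems get the full budget
        else:
            budgets[pid] = min(max_n, max(base_n, math.ceil(target_pos / p)))
    return budgets

print(adaptive_rollout_budget({"easy": 0.9, "medium": 0.5, "hard": 0.05}))
# -> {'easy': 8, 'medium': 8, 'hard': 64}
```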
Related papers
- Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards [16.22162269278471]
PSN-RLVR perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration. We propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty.
arXiv Detail & Related papers (2026-01-30T13:10:30Z) - DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning [134.03095505580276]
Doing Length pEnalty Right (DLER) is a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy-efficiency trade-offs, cutting output length by over 70 percent while surpassing the accuracy of all previous baselines.
arXiv Detail & Related papers (2025-10-16T20:05:57Z) - BroRL: Scaling Reinforcement Learning via Broadened Exploration [88.69554867685243]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key ingredient for unlocking complex reasoning capabilities in large language models. Recent work on ProRL has shown promise in scaling RL by increasing the number of training steps. We investigate a complementary paradigm for scaling RL, BroRL, which increases the number of rollouts per example to hundreds.
arXiv Detail & Related papers (2025-10-01T17:59:02Z) - Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding [59.60915947702282]
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs). Existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data's difficulty and the model's capability. We propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region.
arXiv Detail & Related papers (2025-09-08T17:36:21Z) - TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression [55.37723860832064]
We propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations. We validate our approach on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B across a diverse set of benchmarks with varying difficulty levels.
arXiv Detail & Related papers (2025-06-03T09:23:41Z) - NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation [66.36912000442608]
NoisyRollout is a simple yet effective data augmentation method that mixes training trajectories from both clean and moderately distorted images. It achieves state-of-the-art performance among open-source RL-tuned models.
arXiv Detail & Related papers (2025-04-17T16:10:13Z) - Dissecting Deep RL with High Update Ratios: Combatting Value Divergence [21.282292112642747]
We show that deep reinforcement learning algorithms can retain their ability to learn without resetting network parameters.
We employ a simple unit-ball normalization that enables learning under large update ratios.
arXiv Detail & Related papers (2024-03-09T19:56:40Z) - DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER on the CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z) - The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter [113.35761858962522]
This paper studies induced sparse patterns across multiple large pre-trained vision and language transformers.
We propose the existence of essential sparsity defined with a sharp dropping point beyond which the performance declines much faster.
We also find essential sparsity to hold valid for N:M sparsity patterns as well as on modern-scale large language models.
arXiv Detail & Related papers (2023-06-06T15:49:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.