DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning
- URL: http://arxiv.org/abs/2602.19895v1
- Date: Mon, 23 Feb 2026 14:37:01 GMT
- Title: DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning
- Authors: Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang
- Abstract summary: Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule.
- Score: 34.572542770447974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR. Code is available at https://github.com/SUSTechBruce/DSDR.
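The abstract names three components (a global diversity term over correct trajectories, a length-invariant token-entropy term, and a global-to-local allocation rule) without giving formulas. As a rough illustration only, here is a minimal PyTorch sketch of how such a dual-scale bonus could be wired together; the function name, the cosine-similarity diversity measure, the trajectory embeddings, and the coefficients `alpha`/`beta` are all assumptions, not the authors' formulation (see the linked repository for the actual code).

```python
import torch
import torch.nn.functional as F

def dsdr_bonus(token_logits, token_mask, is_correct, traj_emb,
               alpha=0.1, beta=0.01):
    """Hypothetical dual-scale diversity bonus (illustrative, not the paper's code).

    token_logits: (G, T, V) per-token logits for G sampled trajectories
    token_mask:   (G, T) 1.0 on real tokens, 0.0 on padding
    is_correct:   (G,) 1.0 where the verifier accepted the trajectory
    traj_emb:     (G, D) trajectory-level representations (assumed given)
    Returns a scalar to ADD to the RL objective (subtract from the loss).
    """
    correct = is_correct.bool()
    if correct.sum() < 2:                      # need >= 2 correct trajectories
        return token_logits.new_zeros(())

    # Global scale: reward dissimilarity among correct trajectories.
    emb = F.normalize(traj_emb[correct], dim=-1)          # (C, D)
    sim = emb @ emb.T                                     # cosine similarities
    C = emb.size(0)
    # Diagonal of sim is 1 after normalization; subtract it before averaging.
    mean_sim = (sim - torch.eye(C, device=sim.device)).sum(-1) / (C - 1)
    distinct = 1.0 - mean_sim                             # per-trajectory distinctiveness
    global_bonus = distinct.mean()

    # Local scale: token-level entropy, normalized by length so the term is
    # length-invariant, restricted to correct trajectories.
    logp = F.log_softmax(token_logits[correct], dim=-1)   # (C, T, V)
    ent = -(logp.exp() * logp).sum(dim=-1)                # (C, T)
    m = token_mask[correct]
    traj_ent = (ent * m).sum(-1) / m.sum(-1).clamp(min=1.0)

    # Coupling: allocate more local regularization to more distinctive
    # correct trajectories (one possible global-to-local rule).
    weights = distinct / distinct.sum().clamp(min=1e-8)
    local_bonus = (weights * traj_ent).sum()

    return alpha * global_bonus + beta * local_bonus
```

In a group-based optimizer such as GRPO, this bonus would simply be subtracted from the policy loss for each sampled group; restricting both terms to verifier-accepted trajectories is what, per the abstract, lets the regularizer encourage exploration without trading away correctness.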
Related papers
- Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance.
CalibRL increases policy entropy in a guided manner and clarifies the target distribution.
Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z) - Beyond Mode Elicitation: Diversity-Preserving Reinforcement Learning via Latent Diffusion Reasoner [23.232437167002406]
Latent Diffusion Reasoning with Reinforcement Learning (LaDi-RL) is a framework that conducts exploration directly in a continuous latent space.
By modeling exploration via guided diffusion, multi-step denoising distributes stochasticity and preserves multiple coexisting solution modes.
Experiments on code generation and mathematical reasoning benchmarks demonstrate consistent improvements in both pass@1 and pass@k over discrete RL.
arXiv Detail & Related papers (2026-02-02T06:26:31Z) - Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents [90.45197506653341]
Large language model (LLM) agents increasingly rely on external tools such as search engines to solve complex, multi-step problems.
The trajectories of search agents are structurally heterogeneous: variations in the number, placement, and outcomes of search calls lead to fundamentally different answer directions and reward distributions.
Standard policy gradient methods, which use a single global baseline, suffer from what we identify and formalize as cross-stratum bias.
We propose Stratified GRPO, whose central component, Stratified Advantage Normalization (SAN), partitions trajectories into homogeneous strata based on their structural properties and computes advantages locally within each stratum.
arXiv Detail & Related papers (2025-10-07T17:59:13Z) - The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward [57.56453588632619]
A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance.
This is often accompanied by catastrophic forgetting, where models lose previously acquired skills.
We argue that standard RLVR objectives lack a crucial mechanism for knowledge retention.
arXiv Detail & Related papers (2025-09-09T06:34:32Z) - Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation [20.106116218594266]
DIVER is an end-to-end autonomous driving framework that integrates reinforcement learning and diffusion-based generation.
We show that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.
arXiv Detail & Related papers (2025-07-05T14:19:19Z) - Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences.
The proposed method is tested on a variety of tasks in DMControl and Meta-world.
It shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z) - Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z) - G$^2$DA: Geometry-Guided Dual-Alignment Learning for RGB-Infrared Person Re-Identification [3.909938091041451]
RGB-IR person re-identification aims to retrieve person-of-interest between heterogeneous modalities.
This paper presents a Geometry-Guided Dual-Alignment learning framework (G$^2$DA) to tackle sample-level modality difference.
arXiv Detail & Related papers (2021-06-15T03:14:31Z) - Semi-Supervised Learning with Variational Bayesian Inference and Maximum Uncertainty Regularization [62.21716612888669]
We propose two generic methods for improving semi-supervised learning (SSL).
The first integrates weight perturbation (WP) into existing "consistency regularization" (CR) based methods.
The second introduces a novel consistency loss called "maximum uncertainty regularization" (MUR).
arXiv Detail & Related papers (2020-12-03T09:49:35Z)