Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning
- URL: http://arxiv.org/abs/2602.00759v1
- Date: Sat, 31 Jan 2026 14:48:23 GMT
- Title: Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning
- Authors: Zhipeng Chen, Xiaobo Qin, Wayne Xin Zhao, Youbin Wu, Ji-Rong Wen,
- Abstract summary: A$^2$D is an Adaptive Ability Decomposing method for enhancing the effectiveness of Reinforcement Learning with verifiable rewards. We first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance.
- Score: 82.91265691530351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown great potential to enhance the reasoning ability of large language models (LLMs). However, due to the limited amount of information provided during the RLVR process, the model can only engage in largely blind exploration, which often results in failure on challenging problems. To provide additional information for the RLVR process without relying on a teacher model, we propose A$^2$D, an Adaptive Ability Decomposing method for enhancing the effectiveness of RLVR. Specifically, we first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance. To better understand A$^2$D, we first compare its performance with competitive baselines, showing its effectiveness. Next, we observe that our method functions as a plug-and-play module that can be applied to different RLVR algorithms. Furthermore, we conduct an analysis of the decomposer, revealing how the RLVR process affects its performance and behavior, and which type of guidance is better suited for enhancing the reasoner's exploration and exploitation abilities.
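The method is a two-stage RLVR pipeline. Below is a minimal, runnable sketch of the data flow only; every interface here (`rlvr_train`, the toy policies, the exact-match reward) is a hypothetical stand-in, not the paper's implementation.

```python
"""Sketch of the A^2D two-stage pipeline; all interfaces are hypothetical."""
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, verifiable answer)

def rlvr_train(generate: Callable[[str], str],
               data: List[Example],
               reward: Callable[[str, str], float]) -> Callable[[str], str]:
    """Placeholder RLVR loop (e.g., GRPO/PPO): sample rollouts, score them
    with a verifiable reward, update the policy. Updates are omitted here."""
    for question, answer in data:
        rollout = generate(question)
        _ = reward(rollout, answer)  # 1.0 if verifiably correct, else 0.0
    return generate

def exact_match(rollout: str, answer: str) -> float:
    return float(answer in rollout)

# Toy policies standing in for LLMs.
decompose = lambda q: f"Sub-questions for: {q}"
reason = lambda q: f"Answer to: {q}"

train_set: List[Example] = [("What is 12 * 12 + 1?", "145")]

# Stage 1: train the decomposer with RLVR, without teacher distillation.
decompose = rlvr_train(decompose, train_set, exact_match)

# Stage 2: annotate each question with sub-questions, then train the
# reasoner under RLVR with this guidance included in the prompt.
annotated = [(f"{q}\n{decompose(q)}", a) for q, a in train_set]
reason = rlvr_train(reason, annotated, exact_match)
```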
Related papers
- LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards [51.45138356629732]
We introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information. LongRLVR consistently and significantly outperforms standard RLVR across all models and benchmarks.
arXiv Detail & Related papers (2026-03-02T18:07:53Z)
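A minimal sketch of this reward shaping, assuming a setting where gold evidence chunks are known so the context reward is verifiable; the weight `alpha` and the overlap measure are assumptions, not LongRLVR's published design.

```python
def longrlvr_reward(response: str, gold_answer: str,
                    cited_chunks: set, gold_chunks: set,
                    alpha: float = 0.5) -> float:
    """Hypothetical combination of a sparse answer reward with a dense,
    verifiable context reward over evidence-chunk selection."""
    answer_r = float(gold_answer in response)       # sparse, verifiable
    overlap = len(cited_chunks & gold_chunks)
    context_r = overlap / max(len(gold_chunks), 1)  # dense: did the model
    return answer_r + alpha * context_r             # ground on the right spans?

# Example: the model cited chunks {2, 7}; gold evidence lives in {2, 5}.
print(longrlvr_reward("... the answer is 42", "42", {2, 7}, {2, 5}))  # 1.25
```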
- Detecting RLVR Training Data via Structural Convergence of Reasoning [31.260852555788205]
Reinforcement learning with verifiable rewards (RLVR) is central to training modern reasoning models. We show that RLVR induces a distinctive behavioral signature: reasoning traces converge structurally on questions seen in training. We introduce Min-$k$NN Distance, a simple black-box detector that quantifies this collapse.
arXiv Detail & Related papers (2026-02-12T10:17:32Z)
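A sketch of how such a detector might be computed, assuming access to embeddings of repeatedly sampled reasoning traces; the paper's exact statistic and threshold may differ.

```python
import numpy as np

def min_knn_distance(traces: np.ndarray, k: int = 3) -> float:
    """Mean distance to the k nearest neighbours among sampled reasoning
    traces (rows are trace embeddings). If RLVR collapsed the model onto a
    few templates for a trained-on question, repeated samples cluster
    tightly and this statistic is small."""
    d = np.linalg.norm(traces[:, None, :] - traces[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)      # exclude self-distance
    knn = np.sort(d, axis=1)[:, :k]  # k nearest neighbours per trace
    return float(knn.mean())

rng = np.random.default_rng(0)
member = rng.normal(size=(16, 8)) * 0.1  # tight cluster: likely training data
non_member = rng.normal(size=(16, 8))    # diverse traces: unseen question
print(min_knn_distance(member) < min_knn_distance(non_member))  # True
```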
- Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards [69.74686029941881]
Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. We propose a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
arXiv Detail & Related papers (2026-02-09T10:51:58Z)
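One way to read "adaptively selects high-value rollouts" is as a bandit over rollout sources. The sketch below uses plain UCB with scalar utilities, deliberately simplifying away the paper's neural, contextual scoring.

```python
import math, random

class RolloutBandit:
    """Hypothetical UCB scheduler: each arm is a prompt (or rollout source);
    its value is the observed training utility of rollouts drawn from it."""
    def __init__(self, n_arms: int, c: float = 1.0):
        self.n = [0] * n_arms
        self.value = [0.0] * n_arms
        self.c, self.t = c, 0

    def select(self) -> int:
        self.t += 1
        def ucb(i: int) -> float:
            if self.n[i] == 0:
                return float("inf")  # try every arm once
            return self.value[i] + self.c * math.sqrt(math.log(self.t) / self.n[i])
        return max(range(len(self.n)), key=ucb)

    def update(self, arm: int, utility: float) -> None:
        self.n[arm] += 1
        self.value[arm] += (utility - self.value[arm]) / self.n[arm]

bandit = RolloutBandit(n_arms=4)
for _ in range(100):
    arm = bandit.select()
    # Stand-in utility, e.g. advantage magnitude of the sampled rollout group.
    bandit.update(arm, random.random() * (arm + 1) / 4)
```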
- When Is Compositional Reasoning Learnable from Verifiable Rewards? [10.475998113861896]
We study the learnability of compositional problems in autoregressive models under RLVR training. We show that compositional problems where correct intermediate steps provide a clear advantage are efficiently learnable with RLVR. On the negative side, when this structural advantage is not present, RLVR may converge to suboptimal compositions.
arXiv Detail & Related papers (2026-02-08T14:31:54Z)
- Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards [13.064343544668283]
We propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering". We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500.
arXiv Detail & Related papers (2025-11-21T18:23:04Z)
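A sketch of how the two self-supervised tasks can be constructed from a step-split solution; the helper names and reward are illustrative, not MR-RLVR's exact recipe.

```python
import random

def masked_then_fill(steps: list[str], i: int) -> tuple[str, str]:
    """Mask step i of a solution; the model must reconstruct it, and the
    reward is verifiable by comparison with the held-out step."""
    masked = steps.copy()
    target = masked[i]
    masked[i] = "[MASK]"
    return "\n".join(masked), target

def step_reordering(steps: list[str], seed: int = 0) -> tuple[str, list[str]]:
    """Shuffle the solution steps; the model must restore the original order."""
    shuffled = steps.copy()
    random.Random(seed).shuffle(shuffled)
    return "\n".join(shuffled), steps

steps = ["Let x = 3.", "Then x^2 = 9.", "So x^2 + 1 = 10."]
prompt, gold = masked_then_fill(steps, 1)
candidate = "Then x^2 = 9."        # model's proposed fill
reward = float(candidate == gold)  # verifiable: 1.0 here
```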
- Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration [33.02780998281276]
Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of large language models. This study investigates how simply leveraging intrinsic data properties, an almost free benefit during training, can improve data efficiency for RLVR.
arXiv Detail & Related papers (2025-11-02T04:16:47Z)
- Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning [3.437656066916039]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing mathematical reasoning capabilities. We investigate RLVR on two problems with fully verifiable solutions. We find that RLVR improves evaluation metrics, but often by reinforcing superficial patterns rather than acquiring new reasoning strategies.
arXiv Detail & Related papers (2025-10-30T23:16:02Z)
- The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models [31.773914661815393]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving Large Language Models' reasoning capabilities. Recent evidence suggests it may paradoxically shrink the reasoning boundary rather than expand it. This paper investigates the shrinkage issue of RLVR by analyzing its learning dynamics.
arXiv Detail & Related papers (2025-10-02T17:17:27Z)
- CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models [85.315711639214]
We introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. For the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses.
arXiv Detail & Related papers (2025-09-11T17:59:17Z)
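A sketch of the two curiosity bonuses as described above, in PyTorch; the bonus weights and the additive combination with the verifiable reward are assumptions.

```python
import torch

def actor_curiosity_bonus(token_logprobs: torch.Tensor) -> torch.Tensor:
    """Perplexity of the actor's own response: high perplexity marks
    responses the policy finds surprising, encouraging exploration."""
    return torch.exp(-token_logprobs.mean())

def critic_curiosity_bonus(head_values: torch.Tensor) -> torch.Tensor:
    """Disagreement (variance) across a multi-head critic's value
    estimates: high variance marks states with uncertain value."""
    return head_values.var(dim=0)

logprobs = torch.tensor([-0.2, -1.5, -0.7])   # per-token log-probs
values = torch.tensor([[0.8], [0.3], [0.6]])  # 3 value heads, 1 state
reward = 1.0 + 0.1 * actor_curiosity_bonus(logprobs) \
             + 0.1 * critic_curiosity_bonus(values)
```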
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? [66.61292196146016]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). This study critically examines the current state of RLVR. We find that the current training setup does not elicit fundamentally new reasoning patterns.
arXiv Detail & Related papers (2025-04-18T17:59:56Z)
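Questions about reasoning capacity beyond the base model are usually probed with pass@k at large k. For reference, a sketch using the standard unbiased estimator (Chen et al., 2021):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    from n generated samples of which c are correct, solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# If RLVR raises pass@1 but the base model matches or exceeds it at large
# k, RLVR sharpened sampling rather than adding new solutions.
print(pass_at_k(n=64, c=2, k=1), pass_at_k(n=64, c=2, k=32))
```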
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing the mathematical reasoning and coding performance of large language models (LLMs). We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education. We utilize a generative scoring technique that yields soft, model-based reward signals to overcome the limitations of binary verification.
arXiv Detail & Related papers (2025-03-31T08:22:49Z)
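A sketch of generative scoring, where a judge model's rating replaces the binary verifier; the prompt format and the 0-10 scale are assumptions, and `judge` is a hypothetical callable returning the judge model's text completion.

```python
def soft_reward(judge, question: str, reference: str, response: str) -> float:
    """Soft, model-based reward in [0, 1] from a generative judge."""
    prompt = (f"Question: {question}\nReference answer: {reference}\n"
              f"Response: {response}\n"
              "Rate agreement with the reference from 0 to 10. Reply with a number.")
    try:
        score = float(judge(prompt).strip())
    except ValueError:
        return 0.0  # unparseable judgement earns no reward
    return max(0.0, min(score / 10.0, 1.0))

# Free-form medical or economics answers rarely match a reference string
# exactly, so binary verification under-rewards correct paraphrases.
print(soft_reward(lambda p: "8", "Define GDP.", "Total output...", "The sum of..."))
```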
- LSTSVR-PI: Least square twin support vector regression with privileged information [0.0]
We propose a new least square twin support vector regression model using privileged information (LSTSVR-PI). It integrates the LUPI paradigm to incorporate additional sources of information into least square twin support vector regression. The proposed model fills the gap between the contemporary LUPI paradigm and classical LSTSVR.
arXiv Detail & Related papers (2023-12-05T09:15:10Z)
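For context, the classical LUPI idea (Vapnik's SVM+) replaces free slack variables with a correcting function over privileged features $x_i^*$ available only at training time; LSTSVR-PI carries this idea into the least-squares twin SVR objectives. The standard SVM+ form, for reference:

```latex
% Classical LUPI (SVM+): slacks are estimated from privileged features
% x_i^* via a correcting function, rather than optimized freely.
\min_{w,\, b,\, w^*,\, b^*} \;
  \frac{1}{2}\|w\|^2 + \frac{\gamma}{2}\|w^*\|^2
  + C \sum_{i=1}^{m} \bigl( \langle w^*, x_i^* \rangle + b^* \bigr)
\quad \text{s.t.} \quad
  y_i \bigl( \langle w, x_i \rangle + b \bigr)
    \ge 1 - \bigl( \langle w^*, x_i^* \rangle + b^* \bigr),
\qquad
  \langle w^*, x_i^* \rangle + b^* \ge 0 .
```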
- RvS: What is Essential for Offline RL via Supervised Learning? [77.91045677562802]
Recent work has shown that supervised learning alone, without temporal difference (TD) learning, can be remarkably effective for offline RL. In every environment suite considered, simply maximizing likelihood with a two-layer feedforward MLP is competitive. The results also probe the limits of existing RvS methods, which are comparatively weak on random data.
arXiv Detail & Related papers (2021-12-20T18:55:16Z)
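A minimal sketch of the RvS recipe: outcome-conditioned behavior cloning with a two-layer MLP trained by maximum likelihood. The dimensions and the Gaussian action head are illustrative assumptions.

```python
import torch
import torch.nn as nn

# RvS-style offline RL as supervised learning: a two-layer MLP maps
# (state, conditioning) to an action distribution, trained by maximum
# likelihood on the dataset. Conditioning is an outcome such as the
# trajectory's return-to-go or a goal state.
state_dim, cond_dim, act_dim, hidden = 17, 1, 6, 256

policy = nn.Sequential(
    nn.Linear(state_dim + cond_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, hidden), nn.ReLU(),
    nn.Linear(hidden, act_dim),  # mean of a Gaussian action head
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

states = torch.randn(128, state_dim)  # batch from an offline dataset
conds = torch.rand(128, cond_dim)     # e.g. normalized returns-to-go
actions = torch.randn(128, act_dim)

opt.zero_grad()
mean = policy(torch.cat([states, conds], dim=-1))
loss = ((mean - actions) ** 2).mean()  # Gaussian NLL up to a constant
loss.backward()
opt.step()
```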
This list is automatically generated from the titles and abstracts of the papers on this site.