Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective
- URL: http://arxiv.org/abs/2505.17652v2
- Date: Thu, 29 May 2025 11:48:49 GMT
- Title: Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective
- Authors: Deyang Kong, Qi Guo, Xiangyu Xi, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye
- Abstract summary: Reinforcement learning exhibits potential in enhancing the reasoning abilities of large language models. Existing methods attempt to improve efficiency by scheduling problems based on problem difficulty. This paper introduces $\textbf{C}$ompetence-$\textbf{D}$ifficulty $\textbf{A}$lignment $\textbf{S}$ampling ($\textbf{CDAS}$), which enables accurate and stable estimation of problem difficulty.
- Score: 27.94738910330893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning exhibits potential in enhancing the reasoning abilities of large language models, yet it is hard to scale because of low sample efficiency during the rollout phase. Existing methods attempt to improve efficiency by scheduling problems based on problem difficulty. However, these approaches suffer from unstable and biased difficulty estimates and fail to capture the alignment between model competence and problem difficulty during RL training, leading to suboptimal results. To tackle these limitations, this paper introduces $\textbf{C}$ompetence-$\textbf{D}$ifficulty $\textbf{A}$lignment $\textbf{S}$ampling ($\textbf{CDAS}$), which estimates problem difficulty accurately and stably by aggregating each problem's historical performance discrepancies. Model competence is then quantified with a fixed-point system so that problems whose difficulty matches the model's current competence can be selected adaptively. Experimental results across a range of challenging mathematical benchmarks show that CDAS improves both accuracy and efficiency: it attains the highest average accuracy among the baselines and offers a significant speed advantage over Dynamic Sampling, a competitive strategy in DAPO that is 2.33 times slower than CDAS.
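The abstract describes the method only at a high level, so the snippet below is a minimal, illustrative Python sketch of a competence-aligned sampler in that spirit. All names (`CDASSampler`, `window`, `pass_rate`) and the windowed pass-rate formulas for difficulty and competence are assumptions for illustration; they are not the paper's actual equations or its fixed-point procedure.

```python
import random


class CDASSampler:
    """Illustrative sketch of competence-difficulty aligned sampling.

    Difficulty is a placeholder: 1 minus the problem's recent average pass
    rate. Competence is a placeholder: the model's average pass rate over
    all attempted problems. Neither formula is taken from the paper.
    """

    def __init__(self, problem_ids, window=8):
        # Per-problem history of observed rollout pass rates.
        self.history = {pid: [] for pid in problem_ids}
        self.window = window

    def update(self, pid, pass_rate):
        """Record the fraction of correct rollouts for one problem."""
        self.history[pid].append(pass_rate)

    def difficulty(self, pid):
        """Aggregate recent performance discrepancies (placeholder formula)."""
        h = self.history[pid][-self.window:]
        return 1.0 if not h else 1.0 - sum(h) / len(h)

    def competence(self):
        """Placeholder competence estimate: mean recent pass rate overall."""
        rates = [r for h in self.history.values() for r in h[-self.window:]]
        return sum(rates) / len(rates) if rates else 0.5

    def sample(self, k):
        """Pick the k problems whose difficulty best matches current competence."""
        target = 1.0 - self.competence()
        ranked = sorted(self.history,
                        key=lambda pid: abs(self.difficulty(pid) - target))
        return ranked[:k]


# Toy usage: seed the sampler with random rollout results, then draw a batch.
sampler = CDASSampler(problem_ids=range(100))
for pid in range(100):
    sampler.update(pid, pass_rate=random.random())
print(sampler.sample(k=16))
```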
Related papers
- GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning [15.43938821214447]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs). This paper introduces Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework. GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance.
arXiv Detail & Related papers (2025-07-14T08:10:00Z)
- TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs [50.820065021136024]
DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). Recent methods have attempted to replicate R1's reasoning capabilities in multimodal settings. We propose TACO, a novel reinforcement learning algorithm for visual reasoning.
arXiv Detail & Related papers (2025-05-27T06:30:48Z)
- HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation [23.476410355434655]
Self-taught reasoners (STaRs) enhance the mathematical reasoning abilities of large language models (LLMs) by leveraging self-generated responses for self-training. We propose HS-STaR, a Hierarchical Sampling framework for Self-Taught Reasoners.
arXiv Detail & Related papers (2025-05-26T11:50:16Z)
- PATS: Process-Level Adaptive Thinking Mode Switching [53.53401063490537]
Current large language models (LLMs) typically adopt a fixed reasoning strategy, either simple or complex, for all questions, regardless of their difficulty. This neglect of variation in task and reasoning process complexity leads to an imbalance between performance and efficiency. Existing methods attempt to implement training-free fast-slow thinking system switching to handle problems of varying difficulty, but are limited by coarse-grained solution-level strategy adjustments. We propose a novel reasoning paradigm: Process-Level Adaptive Thinking Mode Switching (PATS), which enables LLMs to dynamically adjust their reasoning strategy based on the difficulty of each step, optimizing the balance between performance and efficiency.
arXiv Detail & Related papers (2025-05-25T17:58:50Z)
- Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs' Reasoning [81.50681925980135]
We propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps. It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making. Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results.
arXiv Detail & Related papers (2025-05-23T12:42:50Z)
- Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning [69.64809103333839]
We investigate how explicitly modeling a problem's difficulty prior shapes the effectiveness of reinforcement-learning-based fine-tuning for multimodal reasoning. Our approach demonstrates significant performance across various multi-modal mathematical reasoning benchmarks with only 2K+0.6K two-stage training data.
arXiv Detail & Related papers (2025-05-19T15:43:10Z)
- DAST: Difficulty-Aware Self-Training on Large Language Models [68.30467836807362]
Large Language Model (LLM) self-training methods consistently under-sample challenging queries. This work proposes a difficulty-aware self-training framework that focuses on improving the quantity and quality of self-generated responses.
arXiv Detail & Related papers (2025-03-12T03:36:45Z)
- Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization [15.212942734663514]
CrossQ has demonstrated state-of-the-art sample efficiency with a low update-to-data (UTD) ratio of 1. We identify challenges in the training dynamics, which are emphasized by higher UTD ratios. Our proposed approach reliably scales with increasing UTD ratios, achieving competitive performance across 25 challenging continuous control tasks.
arXiv Detail & Related papers (2025-02-11T12:55:32Z)
- SeMi: When Imbalanced Semi-Supervised Learning Meets Mining Hard Examples [54.760757107700755]
Semi-Supervised Learning (SSL) can leverage abundant unlabeled data to boost model performance. The class-imbalanced data distribution in real-world scenarios poses great challenges to SSL, resulting in performance degradation. We propose a method that enhances the performance of Imbalanced Semi-Supervised Learning by Mining Hard Examples (SeMi).
arXiv Detail & Related papers (2025-01-10T14:35:16Z)
- Mitigating Tail Narrowing in LLM Self-Improvement via Socratic-Guided Sampling [38.7578639980701]
Self-improvement methods enable large language models to generate solutions themselves. We find that models tend to over-sample on easy queries and under-sample on queries they have yet to master. We introduce Guided Self-Improvement (GSI), a strategy aimed at improving the efficiency of sampling challenging, heavy-tailed data.
arXiv Detail & Related papers (2024-11-01T17:18:45Z)
- Is Difficulty Calibration All We Need? Towards More Practical Membership Inference Attacks [16.064233621959538]
We propose a query-efficient and computation-efficient MIA that directly $\textbf{R}$e-lever$\textbf{A}$ges the original membershi$\textbf{P}$ scores to m$\textbf{I}$tigate the errors in $\textbf{D}$ifficulty calibration.
arXiv Detail & Related papers (2024-08-31T11:59:42Z)
- Large Language Models are Miscalibrated In-Context Learners [22.30783674111999]
In this work, we deliver an in-depth analysis of the behavior across different choices of learning methods. We observe that the miscalibration problem exists across all learning methods in low-resource setups. We find that self-ensembling with max probability produces robust and calibrated predictions.
arXiv Detail & Related papers (2023-12-21T11:55:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.