Online Difficulty Filtering for Reasoning Oriented Reinforcement Learning
- URL: http://arxiv.org/abs/2504.03380v1
- Date: Fri, 04 Apr 2025 11:52:05 GMT
- Title: Online Difficulty Filtering for Reasoning Oriented Reinforcement Learning
- Authors: Sanghwan Bae, Jiwoo Hong, Min Young Lee, Hanbyul Kim, JeongYeon Nam, Donghyun Kwak,
- Abstract summary: Reasoning-Oriented Reinforcement Learning (RORL) enhances the reasoning ability of Large Language Models (LLMs)<n>We show that curating the batch with the problems that the training model achieves intermediate accuracy on the fly can maximize the effectiveness of RORL training.
- Score: 8.537540092998311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning-Oriented Reinforcement Learning (RORL) enhances the reasoning ability of Large Language Models (LLMs). However, due to the sparsity of rewards in RORL, effective training is highly dependent on the selection of problems of appropriate difficulty. Although curriculum learning attempts to address this by adjusting difficulty, it often relies on static schedules, and even recent online filtering methods lack theoretical grounding and a systematic understanding of their effectiveness. In this work, we theoretically and empirically show that curating the batch with the problems that the training model achieves intermediate accuracy on the fly can maximize the effectiveness of RORL training, namely balanced online difficulty filtering. We first derive that the lower bound of the KL divergence between the initial and the optimal policy can be expressed with the variance of the sampled accuracy. Building on those insights, we show that balanced filtering can maximize the lower bound, leading to better performance. Experimental results across five challenging math reasoning benchmarks show that balanced online filtering yields an additional 10% in AIME and 4% improvements in average over plain GRPO. Moreover, further analysis shows the gains in sample efficiency and training time efficiency, exceeding the maximum reward of plain GRPO within 60% training time and the volume of the training set.
Related papers
- Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training [63.34044358216334]
ACTOR-CURATOR is a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models.<n> Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines.
arXiv Detail & Related papers (2026-02-24T04:19:48Z) - Human-in-the-loop Online Rejection Sampling for Robotic Manipulation [55.99788088622936]
Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning.<n>Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training.
arXiv Detail & Related papers (2025-10-30T11:53:08Z) - DARO: Difficulty-Aware Reweighting Policy Optimization [18.07946696398167]
Group Relative Policy Optimization ( GRPO) has emerged as the de facto approach for Reinforcement Learning with Verifiable Rewards (RLVR)<n>We provide a unified view, demonstrating that their reliance on static or overly simplistic weighting schemes tied to sample difficulty prevents adaptation to a model's evolving capabilities.<n>We introduce bfbfDifficulty-Aware Reweighting Policy Optimization (DARO), a method that dynamically adjusts the loss contribution of each difficulty group based on the model's learning state.
arXiv Detail & Related papers (2025-10-10T04:57:15Z) - CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models.<n>We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z) - Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding [59.60915947702282]
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs)<n>Existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data's difficulty and the model's capability.<n>We propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region.
arXiv Detail & Related papers (2025-09-08T17:36:21Z) - Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward [54.708851958671794]
We propose a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection.<n>In offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty.<n>During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential.
arXiv Detail & Related papers (2025-09-01T10:04:20Z) - Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [53.239242017802056]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language model (MLLM)<n>However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing.<n>We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z) - GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning [15.43938821214447]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs)<n>This paper introduces Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework.<n>GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance.
arXiv Detail & Related papers (2025-07-14T08:10:00Z) - A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning [0.40964539027092906]
Supervised Fine-Tuning and Reinforcement Learning are the dominant training paradigms.<n>This paper introduces a practical and effective training recipe that strategically integrates extended SFT with RL from online inference.<n>Our experiments reveal that extending SFT for as many as 10 epochs is crucial for performance breakthroughs.<n>This work provides the community with a battle-tested blueprint for developing state-of-the-art mathematical reasoners.
arXiv Detail & Related papers (2025-07-11T02:26:01Z) - SPEED-RL: Faster Training of Reasoning Models via Online Curriculum Learning [30.90778400005588]
Training large language models with reinforcement learning (RL) against verifiable rewards significantly enhances their reasoning abilities.<n>We introduce Selective Prompting with Efficient Estimation of Difficulty (SPEED), an adaptive online RL curriculum that selectively chooses training examples of intermediate difficulty to maximize learning efficiency.<n> Empirically, our efficient implementation leads to 2x to 6x faster training without degrading accuracy, requires no manual tuning, and integrates seamlessly into standard RL algorithms.
arXiv Detail & Related papers (2025-06-10T17:42:42Z) - Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay [61.823835392216544]
Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs)<n>We propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay.<n>Our method reduces RL fine-tuning time by 25% to 65% to reach the same level of performance as the original GRPO algorithm.
arXiv Detail & Related papers (2025-06-05T17:55:43Z) - Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [82.43575191712726]
We introduce a fine-grained analytic framework to dissect the impact ofReinforcement learning on reasoning.<n>Our framework specifically investigates key elements that have been hypothesized to benefit from RL training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z) - On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm.<n>OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration.<n>Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z) - A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a reinforce-like algorithm perspective and analyze its core components.
We find that a simple rejection sampling baseline, RAFT, yields competitive performance than GRPO and PPO.
Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
arXiv Detail & Related papers (2025-04-15T16:15:02Z) - DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training [15.74527731339671]
We present a principled curriculum learning framework grounded in the notion of distribution-level learnability.
Our framework prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration)
Our experiments show that our framework significantly improves convergence speed and final performance.
arXiv Detail & Related papers (2025-04-13T20:10:27Z) - Fast Adaptation with Behavioral Foundation Models [82.34700481726951]
Unsupervised zero-shot reinforcement learning has emerged as a powerful paradigm for pretraining behavioral foundation models.
Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process.
We propose fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies.
arXiv Detail & Related papers (2025-04-10T16:14:17Z) - Efficient Reinforcement Finetuning via Adaptive Curriculum Learning [24.52451100497884]
Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs)
AdaRFT dynamically adjusts the difficulty of training problems based on the model's recent reward signals.
AdaRFT reduces training time by up to 2x and improves accuracy by a considerable margin, offering a more scalable and effective RFT framework.
arXiv Detail & Related papers (2025-04-07T21:31:31Z) - How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study [16.441081996257576]
This paper presents a rigorous experimental investigation into how difficulty-aware staged reinforcement learning strategies can substantially improve reasoning performance.<n>We show that strategically selecting training data according to well-defined difficulty levels markedly enhances RL optimization.<n>We will open-source our datasets on GitHub and Hugging Face.
arXiv Detail & Related papers (2025-04-01T14:18:38Z) - Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs)
We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data.
We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
arXiv Detail & Related papers (2024-09-19T17:16:21Z) - Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (namely PCRM) training method to mitigate this problem.
PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins.
Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
arXiv Detail & Related papers (2024-04-01T07:49:11Z) - PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation [61.57833648734164]
We propose a novel Parallel Yielding Re-Activation (PYRA) method for training-inference efficient task adaptation.
PYRA outperforms all competing methods under both low compression rate and high compression rate.
arXiv Detail & Related papers (2024-03-14T09:06:49Z) - RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences [23.414135977983953]
Preference-based Reinforcement Learning (PbRL) circumvents the need for reward engineering by harnessing human preferences as the reward signal.
We present RIME, a robust PbRL algorithm for effective reward learning from noisy preferences.
arXiv Detail & Related papers (2024-02-27T07:03:25Z) - Parameter-Efficient Learning for Text-to-Speech Accent Adaptation [58.356667204518985]
This paper presents a parameter-efficient learning (PEL) to develop a low-resource accent adaptation for text-to-speech (TTS)
A resource-efficient adaptation from a frozen pre-trained TTS model is developed by using only 1.2% to 0.8% of original trainable parameters.
Experiment results show that the proposed methods can achieve competitive naturalness with parameter-efficient decoder fine-tuning.
arXiv Detail & Related papers (2023-05-18T22:02:59Z) - Off-policy Reinforcement Learning with Optimistic Exploration and
Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - Curriculum Learning: A Regularization Method for Efficient and Stable
Billion-Scale GPT Model Pre-Training [18.640076155697415]
We present a study of a curriculum learning based approach, which helps improve the pre-training convergence speed of autoregressive models.
Our evaluations demonstrate that curriculum learning enables training GPT-2 models with 8x larger batch size and 4x larger learning rate.
arXiv Detail & Related papers (2021-08-13T06:32:53Z) - SASL: Saliency-Adaptive Sparsity Learning for Neural Network
Acceleration [20.92912642901645]
We propose a Saliency-Adaptive Sparsity Learning (SASL) approach for further optimization.
Our method can reduce 49.7% FLOPs of ResNet-50 with very negligible 0.39% top-1 and 0.05% top-5 accuracy degradation.
arXiv Detail & Related papers (2020-03-12T16:49:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.