Related papers: Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training

URL: http://arxiv.org/abs/2602.20532v1
Date: Tue, 24 Feb 2026 04:19:48 GMT
Title: Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training
Authors: Zhengyao Gu, Jonathan Light, Raul Astudillo, Ziyu Ye, Langzhou He, Henry Peng Zou, Wei Cheng, Santiago Paternain, Philip S. Yu, Yisong Yue
Abstract summary: ACTOR-CURATOR is a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines.
Score: 63.34044358216334
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.
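To make the bandit view of problem selection in the abstract concrete, here is a minimal sketch of an EXP3-style sampler, which is online mirror descent with an entropy regularizer under partial feedback. It is not the paper's neural curator or its loss function; the function name `omd_bandit_curriculum`, the callback `improvement_fn`, and the parameters `step_size` and `explore` are all hypothetical, and the toy reward below stands in for a measured policy improvement.

```python
import math
import random


def omd_bandit_curriculum(improvement_fn, num_problems, num_rounds,
                          step_size=0.1, explore=0.05, seed=0):
    """Entropic mirror descent (EXP3-style) over a bank of training problems.

    improvement_fn(problem_index, round_index) -> float in [0, 1] is a
    hypothetical stand-in for the observed policy improvement.
    """
    rng = random.Random(seed)
    log_weights = [0.0] * num_problems
    for t in range(num_rounds):
        # Softmax of the log-weights, mixed with uniform exploration.
        peak = max(log_weights)
        exp_w = [math.exp(w - peak) for w in log_weights]
        total = sum(exp_w)
        probs = [(1 - explore) * e / total + explore / num_problems
                 for e in exp_w]
        # Sample one problem; only its feedback is observed (partial feedback).
        chosen = rng.choices(range(num_problems), weights=probs, k=1)[0]
        gain = improvement_fn(chosen, t)
        # Importance weighting keeps the per-problem gain estimate unbiased.
        log_weights[chosen] += step_size * gain / probs[chosen]
    # Return the final sampling distribution without the exploration mix.
    peak = max(log_weights)
    exp_w = [math.exp(w - peak) for w in log_weights]
    total = sum(exp_w)
    return [e / total for e in exp_w]


def toy_improvement(problem, _round):
    # Assumed toy signal: problem 2 yields the largest improvement.
    return 0.9 if problem == 2 else 0.1


final_probs = omd_bandit_curriculum(toy_improvement, num_problems=4,
                                    num_rounds=500)
```

In this toy run the sampler concentrates its probability mass on the problem with the largest simulated improvement. The uniform exploration term is the only concession to non-stationarity in this sketch; a real curriculum would need a mechanism for tracking rewards that drift as the policy changes.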

Related papers

Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning [12.863583402455008]
Batch Adaptation Policy Optimization (BAPO) is an off-policy RLVR framework to improve the data efficiency in large language models post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones. BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks.
arXiv Detail & Related papers (2026-02-24T09:35:43Z)
Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards [69.74686029941881]
Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. We propose a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
arXiv Detail & Related papers (2026-02-09T10:51:58Z)
Resource-Efficient Reinforcement for Reasoning Large Language Models via Dynamic One-Shot Policy Refinement [21.073482007189504]
Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks. Reinforcement learning under verifiable rewards (RLVR) is emerging as a principled framework for aligning model behavior with reasoning chains. Despite its promise, RLVR remains prohibitively resource-intensive, requiring extensive reward signals and incurring substantial rollout costs during training.
arXiv Detail & Related papers (2026-01-31T16:51:50Z)
Human-in-the-loop Online Rejection Sampling for Robotic Manipulation [55.99788088622936]
Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning. Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training.
arXiv Detail & Related papers (2025-10-30T11:53:08Z)
Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR [110.90317717368264]
We propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training. This strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR.
arXiv Detail & Related papers (2025-08-19T17:42:45Z)
Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward [85.84943447589511]
This paper introduces AC3 (Actor-Critic for Continuous Chunks), a novel RL framework that learns to generate high-dimensional, continuous action sequences. To make this learning process stable and data-efficient, AC3 incorporates targeted stabilization mechanisms for both the actor and the critic.
arXiv Detail & Related papers (2025-08-15T01:27:15Z)
GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning [15.43938821214447]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs). This paper introduces Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework. GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance.
arXiv Detail & Related papers (2025-07-14T08:10:00Z)
Online Difficulty Filtering for Reasoning Oriented Reinforcement Learning [8.537540092998311]
Reasoning-Oriented Reinforcement Learning (RORL) enhances the reasoning ability of Large Language Models (LLMs). We show that curating each batch on the fly with problems on which the model being trained achieves intermediate accuracy can maximize the effectiveness of RORL training.
arXiv Detail & Related papers (2025-04-04T11:52:05Z)
Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach [2.743898388459522]
In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps.
arXiv Detail & Related papers (2024-10-16T14:15:28Z)
Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs). We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
arXiv Detail & Related papers (2024-09-19T17:16:21Z)
Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interactions between the agent and the environment. We propose a new method to solve it, using unsupervised model-based RL, for pre-training the agent. We show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.