Continual Policy Distillation from Distributed Reinforcement Learning Teachers
- URL: http://arxiv.org/abs/2601.22475v1
- Date: Fri, 30 Jan 2026 02:40:34 GMT
- Title: Continual Policy Distillation from Distributed Reinforcement Learning Teachers
- Authors: Yuxuan Li, Qijun He, Mingqi Yuan, Wen-Tse Chen, Jeff Schneider, Jiayu Chen,
- Abstract summary: Continual Reinforcement Learning aims to develop lifelong learning agents to continuously acquire knowledge across diverse tasks.<n>This requires efficiently managing the stability-plasticity dilemma and leveraging prior experience to rapidly generalize to novel tasks.<n>We propose a novel teacher-student framework that decouples CRL into two independent processes: training single-task teacher models through distributed RL and continually distilling them into a central generalist model.
- Score: 14.879372764916154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continual Reinforcement Learning (CRL) aims to develop lifelong learning agents to continuously acquire knowledge across diverse tasks while mitigating catastrophic forgetting. This requires efficiently managing the stability-plasticity dilemma and leveraging prior experience to rapidly generalize to novel tasks. While various enhancement strategies for both aspects have been proposed, achieving scalable performance by directly applying RL to sequential task streams remains challenging. In this paper, we propose a novel teacher-student framework that decouples CRL into two independent processes: training single-task teacher models through distributed RL and continually distilling them into a central generalist model. This design is motivated by the observation that RL excels at solving single tasks, while policy distillation -- a relatively stable supervised learning process -- is well aligned with large foundation models and multi-task learning. Moreover, a mixture-of-experts (MoE) architecture and a replay-based approach are employed to enhance the plasticity and stability of the continual policy distillation process. Extensive experiments on the Meta-World benchmark demonstrate that our framework enables efficient continual RL, recovering over 85% of teacher performance while constraining task-wise forgetting to within 10%.
Related papers
- Reinforcement-aware Knowledge Distillation for LLM Reasoning [63.53679456364683]
Reinforcement learning (RL) post-training has recently driven gains in long chain-of-thought reasoning large language models (LLMs)<n>Most existing knowledge distillation methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization.<n>We propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update.
arXiv Detail & Related papers (2026-02-26T00:20:39Z) - ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning [75.73135757250806]
Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks.<n>Despite encouraging early results, ARL remains highly unstable, often leading to training collapse.<n>In this paper, we first propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled and reproducible setting.
arXiv Detail & Related papers (2026-02-25T03:43:34Z) - Sample-Efficient Neurosymbolic Deep Reinforcement Learning [49.60927398960061]
We propose a neuro-symbolic Deep RL approach that integrates background symbolic knowledge to improve sample efficiency.<n>Online reasoning is performed to guide the training process through two mechanisms.<n>We show improved performance over a state-of-the-art reward machine baseline.
arXiv Detail & Related papers (2026-01-06T09:28:53Z) - Demonstration-Guided Continual Reinforcement Learning in Dynamic Environments [8.818727691237656]
Reinforcement learning (RL) excels in various applications but struggles in dynamic environments where the underlying Markov decision process evolves.<n>We propose demonstration-guided continual reinforcement learning (DGCRL), which stores prior knowledge in an external, self-evolving demonstration repository.<n>Experiments on 2D navigation and MuJoCo locomotion tasks demonstrate its superior average performance, enhanced knowledge transfer, mitigation of forgetting, and training efficiency.
arXiv Detail & Related papers (2025-12-21T10:13:21Z) - Causal-Paced Deep Reinforcement Learning [4.728991543521559]
Causal-Paced Deep Reinforcement Learning (CP-DRL) is a curriculum learning framework aware of SCM differences between tasks based on interaction data approximation.<n> Empirically, CP-DRL outperforms existing curriculum methods on the Point Mass benchmark.
arXiv Detail & Related papers (2025-06-24T20:15:01Z) - Bigger, Regularized, Categorical: High-Capacity Value Functions are Efficient Multi-Task Learners [60.75160178669076]
We show that the use of high-capacity value models trained via cross-entropy and conditioned on learnable task embeddings addresses the problem of task interference in online reinforcement learning.<n>We test our approach on 7 multi-task benchmarks with over 280 unique tasks, spanning high degree-of-freedom humanoid control and discrete vision-based RL.
arXiv Detail & Related papers (2025-05-29T06:41:45Z) - SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks [110.20297293596005]
Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks.<n>Existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs.<n>We propose a novel RL algorithm, SWEET-RL, that uses a carefully designed optimization objective to train a critic model with access to additional training-time information.<n>Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms.
arXiv Detail & Related papers (2025-03-19T17:55:08Z) - Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds [25.30163649182171]
In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on-the-fly from their interactive experiences.<n>We propose the procedurally generated Markov Decision Processes, named AnyMDP.<n>Through a carefully designed randomization process, AnyMDP is capable of generating high-quality tasks on a large scale while maintaining relatively low structural biases.<n>Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize tasks that were not considered in the training set through versatile in-context learning paradigms.
arXiv Detail & Related papers (2025-02-05T03:59:13Z) - Continual Task Learning through Adaptive Policy Self-Composition [54.95680427960524]
CompoFormer is a structure-based continual transformer model that adaptively composes previous policies via a meta-policy network.
Our experiments reveal that CompoFormer outperforms conventional continual learning (CL) methods, particularly in longer task sequences.
arXiv Detail & Related papers (2024-11-18T08:20:21Z) - Exploiting Estimation Bias in Clipped Double Q-Learning for Continous Control Reinforcement Learning Tasks [5.968716050740402]
This paper focuses on addressing and exploiting estimation biases in Actor-Critic methods for continuous control tasks.
We design a Bias Exploiting (BE) mechanism to dynamically select the most advantageous estimation bias during training of the RL agent.
Most State-of-the-art Deep RL algorithms can be equipped with the BE mechanism, without hindering performance or computational complexity.
arXiv Detail & Related papers (2024-02-14T10:44:03Z) - Persistent Reinforcement Learning via Subgoal Curricula [114.83989499740193]
Value-accelerated Persistent Reinforcement Learning (VaPRL) generates a curriculum of initial states.
VaPRL reduces the interventions required by three orders of magnitude compared to episodic reinforcement learning.
arXiv Detail & Related papers (2021-07-27T16:39:45Z) - Knowledge Transfer in Multi-Task Deep Reinforcement Learning for
Continuous Control [65.00425082663146]
We present a Knowledge Transfer based Multi-task Deep Reinforcement Learning framework (KTM-DRL) for continuous control.
In KTM-DRL, the multi-task agent first leverages an offline knowledge transfer algorithm to quickly learn a control policy from the experience of task-specific teachers.
The experimental results well justify the effectiveness of KTM-DRL and its knowledge transfer and online learning algorithms, as well as its superiority over the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-10-15T03:26:47Z) - Self-Paced Deep Reinforcement Learning [42.467323141301826]
Curriculum reinforcement learning (CRL) improves the learning speed and stability of an agent by exposing it to a tailored series of tasks throughout learning.
Despite empirical successes, an open question in CRL is how to automatically generate a curriculum for a given reinforcement learning (RL) agent, avoiding manual design.
We propose an answer by interpreting the curriculum generation as an inference problem, where distributions over tasks are progressively learned to approach the target task.
This approach leads to an automatic curriculum generation, whose pace is controlled by the agent, with solid theoretical motivation and easily integrated with deep RL algorithms.
arXiv Detail & Related papers (2020-04-24T15:48:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.