Same State, Different Task: Continual Reinforcement Learning without
Interference
- URL: http://arxiv.org/abs/2106.02940v1
- Date: Sat, 5 Jun 2021 17:55:10 GMT
- Title: Same State, Different Task: Continual Reinforcement Learning without
Interference
- Authors: Samuel Kessler, Jack Parker-Holder, Philip Ball, Stefan Zohren,
Stephen J. Roberts
- Abstract summary: A key challenge in Continual Learning (CL) is catastrophic forgetting, which arises when performance on a previously mastered task is reduced when learning a new task.
We show that existing CL methods based on single neural network predictors with shared replay buffers fail in the presence of interference.
We propose a simple method, OWL, to address this challenge. OWL learns a factorized policy, using shared feature extraction layers but separate heads, each specializing in a new task.
- Score: 21.560701568064864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continual Learning (CL) considers the problem of training an agent
sequentially on a set of tasks while seeking to retain performance on all
previous tasks. A key challenge in CL is catastrophic forgetting, which arises
when performance on a previously mastered task is reduced when learning a new
task. While a variety of methods exist to combat forgetting, in some cases
tasks are fundamentally incompatible with each other and thus cannot be learnt
by a single policy. This can occur in reinforcement learning (RL), when an
agent may be rewarded for achieving different goals from the same observation.
In this paper we formalize this "interference" as distinct from the problem
of forgetting. We show that existing CL methods based on single neural network
predictors with shared replay buffers fail in the presence of interference.
Instead, we propose a simple method, OWL, to address this challenge. OWL learns
a factorized policy, using shared feature extraction layers but separate
heads, each specializing in a new task. The separate heads in OWL are used to
prevent interference. At test time, we formulate policy selection as a
multi-armed bandit problem, and show it is possible to select the best policy
for an unknown task using feedback from the environment. The use of bandit
algorithms allows the OWL agent to constructively re-use different continually
learnt policies at different times during an episode. We show in multiple RL
environments that existing replay based CL methods fail, while OWL is able to
achieve close to optimal performance when training sequentially.
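The two pieces of OWL described in the abstract are straightforward to picture in code: a shared feature trunk with one policy head per task, and a bandit that picks a head at test time from the reward it observes. Below is a minimal sketch under assumptions not stated in the abstract (a discrete action space, UCB1 as the bandit rule, and illustrative class, layer, and parameter names); it is not the authors' implementation.

```python
# Minimal sketch of the OWL idea as described in the abstract (not the
# authors' reference implementation): a shared feature extractor ("trunk")
# with one policy head per task, plus UCB1-style bandit selection of the head
# at test time using environment reward as feedback. Names, sizes, and the
# choice of UCB1 are illustrative assumptions.
import math
import torch
import torch.nn as nn


class FactorizedPolicy(nn.Module):
    """Shared feature layers with a separate action head per task."""

    def __init__(self, obs_dim: int, act_dim: int, n_tasks: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, act_dim) for _ in range(n_tasks)]
        )

    def forward(self, obs: torch.Tensor, head_idx: int) -> torch.Tensor:
        # Only the selected head is used, so the other tasks' heads receive
        # no gradients from this task's data, which prevents interference.
        return self.heads[head_idx](self.trunk(obs))


class UCB1HeadSelector:
    """Treats each head as a bandit arm; reward feedback picks the best head."""

    def __init__(self, n_heads: int):
        self.counts = [0] * n_heads
        self.values = [0.0] * n_heads
        self.t = 0

    def select(self) -> int:
        self.t += 1
        # Play every arm once before applying the UCB1 rule.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        scores = [
            v + math.sqrt(2.0 * math.log(self.t) / c)
            for v, c in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental mean of observed rewards for this arm.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


if __name__ == "__main__":
    policy = FactorizedPolicy(obs_dim=8, act_dim=4, n_tasks=3)
    selector = UCB1HeadSelector(n_heads=3)
    obs = torch.randn(1, 8)
    for _ in range(10):  # a few dummy bandit rounds
        head = selector.select()
        logits = policy(obs, head)
        action = torch.distributions.Categorical(logits=logits).sample()
        reward = float(action.item() == 0)  # stand-in for environment reward
        selector.update(head, reward)
```

At training time only the head for the current task receives gradients, while at test time the selector's running reward statistics let the agent switch heads within an episode, mirroring the bandit formulation described in the abstract.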
Related papers
- Offline Reinforcement Learning from Datasets with Structured Non-Stationarity [50.35634234137108]
Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy.
We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within each episode.
We propose a method based on Contrastive Predictive Coding that identifies this non-stationarity in the offline dataset, accounts for it when training a policy, and predicts it during evaluation.
arXiv Detail & Related papers (2024-05-23T02:41:36Z)
- Learning impartial policies for sequential counterfactual explanations using Deep Reinforcement Learning [0.0]
Recently, Reinforcement Learning (RL) methods have been proposed that seek to learn policies for discovering sequential counterfactuals (SCFs), thereby enhancing scalability.
In this work, we identify shortcomings in existing methods that can result in policies with undesired properties, such as a bias towards specific actions.
We propose to use the output probabilities of the classifier to create a more informative reward, to mitigate this effect.
arXiv Detail & Related papers (2023-11-01T13:50:47Z)
- Prior-Free Continual Learning with Unlabeled Data in the Wild [24.14279172551939]
We propose a Prior-Free Continual Learning (PFCL) method to incrementally update a trained model on new tasks.
PFCL learns new tasks without knowing the task identity or any previous data.
Our experiments show that our PFCL method significantly mitigates forgetting in all three learning scenarios.
arXiv Detail & Related papers (2023-10-16T13:59:56Z)
- SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration [21.764280583041703]
Skill reuse is one of the most common approaches, but current methods have considerable limitations.
We introduce an alternative approach to mitigate these problems.
Our approach learns to sequence existing temporally-extended skills for exploration but learns the final policy directly from the raw experience.
arXiv Detail & Related papers (2022-11-24T18:05:01Z)
- You Only Live Once: Single-Life Reinforcement Learning [124.1738675154651]
In many real-world situations, the goal might not be to learn a policy that can do the task repeatedly, but simply to perform a new task successfully once in a single trial.
We formalize this problem setting, where an agent must complete a task within a single episode without interventions.
We propose an algorithm, $Q$-weighted adversarial learning (QWALE), which employs a distribution matching strategy.
arXiv Detail & Related papers (2022-10-17T09:00:11Z)
- Meta Reinforcement Learning with Successor Feature Based Context [51.35452583759734]
We propose a novel meta-RL approach that achieves competitive performance compared to existing meta-RL algorithms.
Our method not only learns high-quality policies for multiple tasks simultaneously but can also quickly adapt to new tasks with a small amount of training.
arXiv Detail & Related papers (2022-07-29T14:52:47Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
- SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration (see the sketch after this list).
arXiv Detail & Related papers (2020-07-09T17:08:44Z)
- Hierarchical Reinforcement Learning as a Model of Human Task Interleaving [60.95424607008241]
We develop a hierarchical model of supervisory control driven by reinforcement learning.
The model reproduces known empirical effects of task interleaving.
The results support hierarchical RL as a plausible model of task interleaving.
arXiv Detail & Related papers (2020-01-04T17:53:28Z)
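To make the SUNRISE entry above more concrete, here is a toy NumPy illustration of the two ingredients as they are summarized in that entry: upper-confidence-bound action selection over a Q-ensemble, and down-weighting of Bellman targets on which the ensemble members disagree. The specific weighting function and exploration coefficient are assumptions for illustration, not the paper's exact formulas.

```python
# Toy illustration of the two SUNRISE ingredients named in the summary above,
# using NumPy only. The weighting function and the exploration coefficient
# are illustrative assumptions, not the paper's formulas.
import numpy as np

rng = np.random.default_rng(0)
n_ensemble, n_actions = 5, 4

# Pretend each ensemble member produced Q-value estimates for one state.
q_ensemble = rng.normal(loc=1.0, scale=0.3, size=(n_ensemble, n_actions))
q_mean, q_std = q_ensemble.mean(axis=0), q_ensemble.std(axis=0)

# (b) Optimistic action selection: pick the highest upper-confidence bound.
lam = 1.0  # exploration coefficient (assumed)
action = int(np.argmax(q_mean + lam * q_std))

# (a) Uncertainty-weighted Bellman backup: targets with high ensemble
# disagreement contribute less to the TD loss.
reward, gamma = 0.5, 0.99
target = reward + gamma * q_mean[action]
weight = 1.0 / (1.0 + q_std[action])    # assumed weighting; smaller when uncertain
q_pred = q_ensemble[:, action]          # each member's prediction for the chosen action
weighted_td_loss = weight * np.mean((q_pred - target) ** 2)

print(action, round(weighted_td_loss, 4))
```

In the actual algorithm these operations sit inside an off-policy actor-critic training loop; the snippet only shows the single-state arithmetic.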
This list is automatically generated from the titles and abstracts of the papers in this site.