Lifetime policy reuse and the importance of task capacity
- URL: http://arxiv.org/abs/2106.01741v3
- Date: Fri, 20 Oct 2023 14:02:24 GMT
- Title: Lifetime policy reuse and the importance of task capacity
- Authors: David M. Bossens and Adam J. Sobey
- Abstract summary: Policy reuse and other multi-policy reinforcement learning techniques can learn multiple tasks but may generate many policies.
This paper presents two novel contributions: 1) Lifetime Policy Reuse, a model-agnostic policy reuse algorithm that avoids generating many policies by optimising a fixed number of near-optimal policies; and 2) the task capacity, a measure of the maximal number of tasks that a policy can accurately solve.
The results demonstrate the importance of Lifetime Policy Reuse and task-capacity-based pre-selection on an 18-task partially observable Pacman domain and a Cartpole domain of up to 125 tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A long-standing challenge in artificial intelligence is lifelong
reinforcement learning, where learners are given many tasks in sequence and
must transfer knowledge between tasks while avoiding catastrophic forgetting.
Policy reuse and other multi-policy reinforcement learning techniques can learn
multiple tasks but may generate many policies. This paper presents two novel
contributions, namely 1) Lifetime Policy Reuse, a model-agnostic policy reuse
algorithm that avoids generating many policies by optimising a fixed number of
near-optimal policies through a combination of policy optimisation and adaptive
policy selection; and 2) the task capacity, a measure for the maximal number of
tasks that a policy can accurately solve. Comparing two state-of-the-art
base-learners, the results demonstrate the importance of Lifetime Policy Reuse
and task capacity based pre-selection on an 18-task partially observable Pacman
domain and a Cartpole domain of up to 125 tasks.
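To make the abstract's description more concrete, the following is a minimal, non-authoritative sketch of a fixed-size policy library with adaptive policy selection. The class name, the epsilon-greedy selector over running mean returns, and the run_episode_and_update interface are illustrative assumptions, not the authors' implementation.
```python
import random
from collections import defaultdict


class LifetimePolicyReuse:
    """Illustrative sketch of reusing a fixed library of K policies across many tasks.

    For each incoming task, an adaptive selector picks one of the K policies
    (here, epsilon-greedy over the running mean return of each (task, policy) pair),
    and only the selected policy is optimised on that task's experience.
    """

    def __init__(self, policies, epsilon=0.1):
        self.policies = policies               # fixed library of K base-learner policies
        self.epsilon = epsilon
        self.mean_return = defaultdict(float)  # (task_id, policy_idx) -> running mean return
        self.counts = defaultdict(int)

    def select_policy(self, task_id):
        """Adaptive policy selection: explore occasionally, otherwise exploit."""
        if random.random() < self.epsilon:
            return random.randrange(len(self.policies))
        return max(range(len(self.policies)),
                   key=lambda i: self.mean_return[(task_id, i)])

    def train_on_task(self, task_id, env, episodes=100):
        """Interleave policy selection with base-learner updates on the chosen policy."""
        for _ in range(episodes):
            idx = self.select_policy(task_id)
            # assumed base-learner interface: run one episode on `env`, update in place,
            # and return the episode return
            episode_return = self.policies[idx].run_episode_and_update(env)
            key = (task_id, idx)
            self.counts[key] += 1
            self.mean_return[key] += (episode_return - self.mean_return[key]) / self.counts[key]
```
Under this reading, the task capacity of contribution 2) would indicate how many tasks a single policy in the library can be expected to solve accurately, and hence guide the choice of the library size K before training.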
Related papers
- IOB: Integrating Optimization Transfer and Behavior Transfer for Multi-Policy Reuse [50.90781542323258]
Reinforcement learning (RL) agents can transfer knowledge from source policies to a related target task.
Previous methods introduce additional components, such as hierarchical policies or estimations of source policies' value functions.
We propose a novel transfer RL method that selects the source policy without training extra components.
arXiv Detail & Related papers (2023-08-14T09:22:35Z)
- Residual Q-Learning: Offline and Online Policy Customization without Value [53.47311900133564]
Imitation Learning (IL) is a widely used framework for learning imitative behavior from demonstrations.
We formulate a new problem setting called policy customization.
We propose a novel framework, Residual Q-learning, which can solve the formulated MDP by leveraging the prior policy.
arXiv Detail & Related papers (2023-06-15T22:01:19Z)
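A hedged sketch of the residual-value idea suggested by the Residual Q-Learning entry above: the customized action value is assumed to decompose additively into the prior policy's value and a learned residual term for the added objective; the function names and the decomposition are illustrative assumptions, not the paper's exact formulation.
```python
import numpy as np


def customized_action(state, actions, q_prior, q_residual):
    """Greedy action under a prior value plus a learned residual value (illustrative only).

    q_prior(s, a):    action value associated with the prior policy (assumed given)
    q_residual(s, a): learned value of the additional customization reward
    """
    scores = [q_prior(state, a) + q_residual(state, a) for a in actions]
    return actions[int(np.argmax(scores))]
```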
- On the Value of Myopic Behavior in Policy Reuse [67.37788288093299]
Leveraging learned strategies in unfamiliar scenarios is fundamental to human intelligence.
In this work, we present a framework called Selective Myopic bEhavior Control (SMEC).
SMEC adaptively aggregates the sharable short-term behaviors of prior policies and the long-term behaviors of the task policy, leading to coordinated decisions.
arXiv Detail & Related papers (2023-05-28T03:59:37Z)
- Safety-Constrained Policy Transfer with Successor Features [19.754549649781644]
We propose a Constrained Markov Decision Process (CMDP) formulation that enables the transfer of policies and adherence to safety constraints.
Our approach relies on a novel extension of generalized policy improvement to constrained settings via a Lagrangian formulation.
Our experiments in simulated domains show that our approach is effective; it visits unsafe states less frequently and outperforms alternative state-of-the-art methods when taking safety constraints into account.
arXiv Detail & Related papers (2022-11-10T06:06:36Z)
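A hedged sketch of what generalized policy improvement with a Lagrangian safety term could look like, following the entry above; the per-policy reward and cost value functions and the single multiplier lam are illustrative assumptions, not the paper's equations.
```python
def constrained_gpi_action(state, actions, q_reward, q_cost, lam):
    """Generalized policy improvement with a Lagrangian safety penalty (illustrative).

    q_reward[i](s, a): reward value function of source policy i
    q_cost[i](s, a):   expected cumulative constraint cost under source policy i
    lam:               Lagrange multiplier trading task reward against constraint cost
    """
    def score(action):
        # GPI step: take the best source policy for this state-action pair,
        # after penalising its expected constraint cost.
        return max(qr(state, action) - lam * qc(state, action)
                   for qr, qc in zip(q_reward, q_cost))

    return max(actions, key=score)
```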
- Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z)
- Towards an Understanding of Default Policies in Multitask Policy Optimization [29.806071693039655]
Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms.
We take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization.
We then derive a principled RPO algorithm for multitask learning with strong performance guarantees.
arXiv Detail & Related papers (2021-11-04T16:45:15Z)
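A hedged sketch of a regularized policy optimisation loss with a default policy, as discussed in the entry above; the coefficient alpha, the KL direction, and the function signature are illustrative assumptions rather than the paper's formulation.
```python
import torch
import torch.nn.functional as F


def rpo_loss(log_probs, advantages, policy_logits, default_logits, alpha=0.1):
    """Regularized policy optimisation loss in one common form (illustrative).

    A standard policy-gradient term plus a KL penalty that keeps the per-task policy
    close to a shared default policy; alpha and the KL direction are assumptions.
    """
    pg_term = -(advantages.detach() * log_probs).mean()
    kl_term = F.kl_div(F.log_softmax(policy_logits, dim=-1),
                       F.log_softmax(default_logits, dim=-1),
                       reduction="batchmean", log_target=True)
    return pg_term + alpha * kl_term
```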
- Goal-Conditioned Reinforcement Learning with Imagined Subgoals [89.67840168694259]
We propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks.
Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic.
We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
arXiv Detail & Related papers (2021-07-01T15:30:59Z)
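A hedged sketch of the two-level architecture suggested by the entry above, with a high-level policy that imagines subgoals and a goal-conditioned low-level policy; the network sizes and the way the subgoal is consumed are illustrative assumptions, and the paper may use imagined subgoals only to shape training rather than at execution time.
```python
import torch
import torch.nn as nn


class SubgoalActor(nn.Module):
    """Illustrative two-level actor: a high-level policy imagines an intermediate subgoal
    between the current state and the final goal, and a goal-conditioned low-level policy
    acts toward that subgoal. Layer sizes and the subgoal interface are assumptions."""

    def __init__(self, state_dim, goal_dim, action_dim, hidden=256):
        super().__init__()
        self.high_level = nn.Sequential(     # predicts the imagined subgoal
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, goal_dim))
        self.low_level = nn.Sequential(      # acts toward the imagined subgoal
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())

    def forward(self, state, goal):
        subgoal = self.high_level(torch.cat([state, goal], dim=-1))
        action = self.low_level(torch.cat([state, subgoal], dim=-1))
        return action, subgoal
```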
- Lifelong Policy Gradient Learning of Factored Policies for Faster Training Without Forgetting [26.13332231423652]
We provide a novel method for lifelong policy gradient learning that trains lifelong function approximators directly via policy gradients.
We show empirically that our algorithm learns faster and converges to better policies than single-task and lifelong learning baselines.
arXiv Detail & Related papers (2020-07-14T13:05:42Z)
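A hedged sketch of one way to factor policy parameters for lifelong policy-gradient learning, as suggested by the title in the entry above; the linear factorisation theta_t = L @ s_t and the joint update are illustrative assumptions, not the paper's algorithm.
```python
import numpy as np


class FactoredPolicyParameters:
    """Illustrative factored parameterisation for lifelong policy-gradient learning:
    each task's policy parameters are a linear combination of a small shared basis,
    theta_t = L @ s_t."""

    def __init__(self, param_dim, num_components, seed=0):
        rng = np.random.default_rng(seed)
        self.L = rng.normal(scale=0.1, size=(param_dim, num_components))  # shared basis
        self.task_coeffs = {}                                             # s_t for each task

    def params_for_task(self, task_id):
        s = self.task_coeffs.setdefault(
            task_id, np.full(self.L.shape[1], 1.0 / self.L.shape[1]))
        return self.L @ s

    def apply_policy_gradient(self, task_id, grad_theta, lr=1e-2):
        """Gradient-ascent step on both factors via the chain rule through theta_t = L @ s_t."""
        s = self.task_coeffs.setdefault(
            task_id, np.full(self.L.shape[1], 1.0 / self.L.shape[1]))
        self.L += lr * np.outer(grad_theta, s)                         # dJ/dL = grad_theta s_t^T
        self.task_coeffs[task_id] = s + lr * (self.L.T @ grad_theta)   # dJ/ds_t = L^T grad_theta
```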
- Accelerating Safe Reinforcement Learning with Constraint-mismatched Policies [34.555500347840805]
We consider the problem of reinforcement learning when provided with a baseline control policy and a set of constraints that the learner must satisfy.
We propose an iterative policy optimization algorithm that alternates between maximizing expected return on the task, minimizing distance to the baseline policy, and projecting the policy onto the constraint-satisfying set.
Our algorithm consistently outperforms several state-of-the-art baselines, achieving 10 times fewer constraint violations and 40% higher reward on average.
arXiv Detail & Related papers (2020-06-20T20:20:47Z)
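A hedged, schematic rendering of the alternating loop described in the entry above; each callable stands in for the corresponding sub-update and is an assumption, not the paper's exact operator.
```python
def iterative_safe_policy_optimization(policy, baseline_policy, num_iterations,
                                       reward_step, distance_step, constraint_projection):
    """Schematic of the alternating three-stage loop (illustrative stand-ins only)."""
    for _ in range(num_iterations):
        policy = reward_step(policy)                     # 1) improve expected task return
        policy = distance_step(policy, baseline_policy)  # 2) move closer to the baseline policy
        policy = constraint_projection(policy)           # 3) project onto the constraint-satisfying set
    return policy
```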
- Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization [100.72335252255989]
We study the problem of learning exploration-exploitation strategies that effectively adapt to dynamic environments.
We propose a novel algorithm that regularizes the training of an RNN-based policy using informed policies trained to maximize the reward in each task.
arXiv Detail & Related papers (2020-05-06T16:14:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.