Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer
- URL: http://arxiv.org/abs/2206.11326v1
- Date: Wed, 22 Jun 2022 19:00:08 GMT
- Title: Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer
- Authors: Lucas N. Alegre and Ana L. C. Bazzan and Bruno C. da Silva
- Abstract summary: We introduce an SF-based extension of the Optimistic Linear Support algorithm to learn a set of policies whose SFs form a convex coverage set.
We prove that policies in this set can be combined via generalized policy improvement to construct optimal behaviors for any new linearly-expressible task.
- Score: 7.970144204429356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many real-world applications, reinforcement learning (RL) agents might
have to solve multiple tasks, each one typically modeled via a reward function.
If reward functions are expressed linearly, and the agent has previously
learned a set of policies for different tasks, successor features (SFs) can be
exploited to combine such policies and identify reasonable solutions for new
problems. However, the identified solutions are not guaranteed to be optimal.
We introduce a novel algorithm that addresses this limitation. It allows RL
agents to combine existing policies and directly identify optimal policies for
arbitrary new problems, without requiring any further interactions with the
environment. We first show (under mild assumptions) that the transfer learning
problem tackled by SFs is equivalent to the problem of learning to optimize
multiple objectives in RL. We then introduce an SF-based extension of the
Optimistic Linear Support algorithm to learn a set of policies whose SFs form a
convex coverage set. We prove that policies in this set can be combined via
generalized policy improvement to construct optimal behaviors for any new
linearly-expressible task, without requiring any additional training samples.
We empirically show that our method outperforms state-of-the-art competing
algorithms in both discrete and continuous domains under value function
approximation.
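To make the mechanism concrete, below is a minimal sketch (not the authors' implementation) of the two ingredients the abstract combines: generalized policy improvement (GPI) over stored successor features, and the corner weights that an optimistic-linear-support-style loop queries when growing a convex coverage set. Tabular SFs, two reward objectives, and all variable names are illustrative assumptions.

```python
import itertools
import numpy as np

def gpi_action(psis, w, state):
    """GPI over previously learned policies, using no new environment samples.

    psis: list of SF tables; psis[i][state, action] is the vector
          psi^{pi_i}(s, a), so Q_i(s, a) = psi^{pi_i}(s, a) . w.
    w:    reward-weight vector of the new (linearly expressible) task.
    """
    # Q-values of every stored policy on the new task: (n_policies, n_actions).
    q = np.array([psi[state] @ w for psi in psis])
    # Act greedily w.r.t. the best stored policy at this state.
    return int(q.max(axis=0).argmax())

def corner_weights_2d(value_vectors):
    """Corner weights of a partial convex coverage set for d = 2 objectives.

    Each v in value_vectors is the expected SF of one learned policy at the
    start state, so its value on task w = (lam, 1 - lam) is the line v . w.
    Corner weights are where the upper envelope of these lines can change
    slope; an OLS-style loop queries them for potential improvements.
    """
    corners = {0.0, 1.0}
    for a, b in itertools.combinations(value_vectors, 2):
        denom = (a[0] - a[1]) - (b[0] - b[1])
        if abs(denom) > 1e-12:  # skip parallel value lines
            lam = (b[1] - a[1]) / denom
            if 0.0 < lam < 1.0:
                corners.add(lam)
    return [np.array([lam, 1.0 - lam]) for lam in sorted(corners)]
```

The SF-based extension of OLS in the paper repeatedly learns an optimal policy (and its SFs) for the corner weight with the largest optimistic improvement; once no corner weight can yield an improvement, the stored SFs form a convex coverage set and GPI as in `gpi_action` is optimal for every linear task.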
Related papers
- Planning with a Learned Policy Basis to Optimally Solve Complex Tasks [26.621462241759133]
We propose to use successor features to learn a policy basis so that each (sub)policy in it solves a well-defined subproblem.
In a task described by a finite state automaton (FSA) that involves the same set of subproblems, the combination of these (sub)policies can then be used to generate an optimal solution without additional learning.
arXiv Detail & Related papers (2024-03-22T15:51:39Z)
- Distributional Successor Features Enable Zero-Shot Policy Optimization [36.53356539916603]
This work proposes a novel class of models: Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs).
DiSPOs learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset.
By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions.
arXiv Detail & Related papers (2024-03-10T22:27:21Z)
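As a rough illustration of the zero-shot step described above, assuming the distribution over achievable successor features and the outcome-conditioned policy have already been fit offline (all names are hypothetical):

```python
import numpy as np

def zero_shot_action(state, candidate_psis, outcome_policy, w):
    """DiSPOs-style zero-shot action selection (sketch).

    candidate_psis: array (n_samples, d) of successor features sampled from
                    the learned distribution of outcomes achievable from
                    `state` within the dataset.
    outcome_policy: learned function mapping (state, target_psi) to an
                    action that works toward realizing that outcome.
    w:              reward weights of the new task, with reward = psi . w.
    """
    # Score each achievable long-term outcome under the new reward weights,
    # then commit to the best one and act so as to realize it.
    best_psi = candidate_psis[np.argmax(candidate_psis @ w)]
    return outcome_policy(state, best_psi)
```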
- Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with Expert Guidance [74.31779732754697]
We propose a novel plug-in approach named Guided Offline RL (GORL).
GORL employs a guiding network, along with only a few expert demonstrations, to adaptively determine the relative importance of the policy improvement and policy constraint for every sample.
Experiments on various environments suggest that GORL can be easily combined with most offline RL algorithms, yielding statistically significant performance improvements.
arXiv Detail & Related papers (2023-09-04T08:59:04Z)
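The plug-in structure described above can be pictured as a per-sample weighted loss. This is only a sketch of the idea, not the paper's exact objective; the guiding network and its training from expert demonstrations are the paper's contribution.

```python
def gorl_style_policy_loss(improvement_loss, constraint_loss, guidance_weights):
    """Per-sample balance between improvement and constraint (sketch).

    improvement_loss: per-sample policy-improvement term, shape (batch,),
                      e.g. -Q(s, pi(s)) in an actor-critic offline method.
    constraint_loss:  per-sample policy-constraint term, shape (batch,),
                      e.g. a behavior-cloning penalty toward the dataset.
    guidance_weights: per-sample weights in [0, 1] produced by a guiding
                      network trained with a few expert demonstrations.
    All arguments are torch tensors of the same shape.
    """
    return (guidance_weights * improvement_loss
            + (1.0 - guidance_weights) * constraint_loss).mean()
```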
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
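The per-agent update referred to above is the standard PPO clipped surrogate; a minimal version is shown below for reference. The paper's contribution is the convergence analysis, not this loss.

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """Vanilla PPO clipped surrogate loss, applied to one agent's local
    policy while the other agents' policies are held fixed."""
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Maximize the surrogate, i.e. minimize its negation.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```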
- Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization [8.836422771217084]
Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences.
We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes.
We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks.
arXiv Detail & Related papers (2023-01-18T20:54:40Z)
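A generic sketch of how GPI can induce a priority over candidate weight vectors; the paper derives its schemes formally, and this only conveys the shape of the idea (all names illustrative):

```python
import numpy as np

def gpi_improvement_priority(psis_s0, w, optimistic_value):
    """Gap between an optimistic bound and what GPI already guarantees.

    psis_s0:          SFs of the learned policies at the start state,
                      shape (n_policies, d).
    w:                candidate weight vector (preference) to train on next.
    optimistic_value: an upper bound on the optimal value for task w.
    """
    gpi_value = float((psis_s0 @ w).max())
    # A large gap means task w is under-served by the current policy set.
    return optimistic_value - gpi_value
```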
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Bellman Residual Orthogonalization for Offline Reinforcement Learning [53.17258888552998]
We introduce a new reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a test function space.
We exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class.
arXiv Detail & Related papers (2022-03-24T01:04:17Z)
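In symbols, the principle amounts to requiring the Bellman residual to vanish against every test function in a chosen class under the data distribution; this is a paraphrase of the abstract, not the paper's exact formulation:

```latex
\mathbb{E}_{(s,a,r,s')\sim\mu}\big[\, f(s,a)\,\big(Q(s,a) - r - \gamma\, Q(s',\pi(s'))\big) \,\big] = 0
\qquad \text{for all } f \in \mathcal{F}.
```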
- Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with convergence guarantees to globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
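CRPO is a primal method that alternates between the objective and the constraints; a minimal sketch of one update step follows, with hypothetical method names on the policy object:

```python
def crpo_step(policy, constraint_estimates, limits, tolerance=0.0):
    """One CRPO-style primal update (sketch).

    If some constraint's estimated cost exceeds its limit by more than
    `tolerance`, take a policy step that reduces that cost; otherwise
    take a policy step that increases the expected reward.
    """
    for i, (cost, limit) in enumerate(zip(constraint_estimates, limits)):
        if cost > limit + tolerance:
            policy.step_to_minimize_cost(i)   # hypothetical update call
            return
    policy.step_to_maximize_reward()          # hypothetical update call
```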
- First Order Constrained Optimization in Policy Space [19.00289722198614]
We propose a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS).
FOCOPS maximizes an agent's overall reward while ensuring the agent satisfies a set of cost constraints.
We provide empirical evidence that our simple approach achieves better performance on a set of constrained robotic locomotion tasks.
arXiv Detail & Related papers (2020-02-16T05:07:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information shown and is not responsible for any consequences of its use.