Soft Action Priors: Towards Robust Policy Transfer
- URL: http://arxiv.org/abs/2209.09882v1
- Date: Tue, 20 Sep 2022 17:36:28 GMT
- Title: Soft Action Priors: Towards Robust Policy Transfer
- Authors: Matheus Centa and Philippe Preux
- Abstract summary: We use the action prior from the Reinforcement Learning as Inference framework to recover state-of-the-art policy distillation techniques.
Then, we propose a class of adaptive methods that can robustly exploit action priors by combining reward shaping and auxiliary regularization losses.
We show that the proposed methods achieve state-of-the-art performance and surpass it when learning from suboptimal priors.
- Score: 9.860944032009847
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite success in many challenging problems, reinforcement learning (RL) is
still confronted with sample inefficiency, which can be mitigated by
introducing prior knowledge to agents. However, many transfer techniques in
reinforcement learning make the limiting assumption that the teacher is an
expert. In this paper, we use the action prior from the Reinforcement Learning
as Inference framework - that is, a distribution over actions at each state
which resembles a teacher policy, rather than a Bayesian prior - to recover
state-of-the-art policy distillation techniques. Then, we propose a class of
adaptive methods that can robustly exploit action priors by combining reward
shaping and auxiliary regularization losses. In contrast to prior work, we
develop algorithms for leveraging suboptimal action priors that may
nevertheless impart valuable knowledge - which we call soft action priors. The
proposed algorithms adapt by adjusting the strength of teacher feedback
according to an estimate of the teacher's usefulness in each state. We perform
tabular experiments, which show that the proposed methods achieve
state-of-the-art performance and surpass it when learning from suboptimal
priors. Finally, we demonstrate the robustness of the adaptive algorithms in
continuous-action deep RL problems, where they considerably improve stability
compared to existing policy distillation methods.
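A minimal tabular sketch of the recipe described in the abstract, assuming a toy chain MDP, a random Dirichlet prior in place of a real teacher, and a crude agreement-based proxy for the teacher's per-state usefulness; only the overall structure (reward shaping with the action prior, plus a per-state adaptive weight on the teacher's feedback) follows the abstract:

```python
# Minimal sketch (not the authors' code): tabular Q-learning that exploits a
# soft action prior via reward shaping, with a per-state adaptive weight.
import numpy as np

n_states, n_actions = 5, 2
rng = np.random.default_rng(0)

# A (possibly suboptimal) teacher: a distribution over actions at each state.
prior = rng.dirichlet(np.ones(n_actions), size=n_states)

Q = np.zeros((n_states, n_actions))
beta = np.ones(n_states)          # per-state strength of teacher feedback
alpha, gamma = 0.1, 0.95

def step(s, a):
    """Toy chain MDP used only to make the sketch runnable."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

for episode in range(200):
    s = 0
    for t in range(20):
        a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[s].argmax())
        s_next, r = step(s, a)

        # Reward shaping with the soft action prior: a bonus ~ log pi_prior(a|s),
        # scaled by the adaptive per-state weight beta[s].
        shaped_r = r + beta[s] * np.log(prior[s, a] + 1e-8)

        td_target = shaped_r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])

        # Crude usefulness estimate (an assumption of this sketch, not the paper's
        # estimator): shrink the teacher's influence where its advice disagrees
        # with the currently learned greedy action.
        agreement = prior[s, int(Q[s].argmax())]
        beta[s] = 0.99 * beta[s] + 0.01 * agreement
        s = s_next
```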
Related papers
- Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets [6.5472155063246085]
Constrained reinforcement learning has achieved promising progress in safety-critical fields where both rewards and constraints are considered.
We propose Adversarial Constrained Policy Optimization (ACPO), which enables the simultaneous optimization of reward and the adaptation of cost budgets during training.
arXiv Detail & Related papers (2024-10-28T07:04:32Z)
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
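A minimal sketch of the idea summarized above, assuming made-up Q-values and a closed-form KL-regularized improvement step in place of the paper's actual algorithm; the only point carried over from the summary is that the reference policy used for behavior regularization is itself refined each iteration:

```python
# Minimal sketch (illustrative, not the paper's algorithm): behavior-regularized
# policy improvement where the KL reference policy is refined every iteration.
import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(1)

Q = rng.normal(size=(n_states, n_actions))                 # pretend critic
pi_ref = np.full((n_states, n_actions), 1.0 / n_actions)   # initial reference policy
lam = 1.0                                                   # KL regularization weight

for k in range(5):
    # KL-regularized improvement has the closed form
    #   pi(a|s) proportional to pi_ref(a|s) * exp(Q(s,a) / lam).
    logits = np.log(pi_ref + 1e-8) + Q / lam
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

    # Iterative refinement: the improved policy becomes the next reference,
    # so the regularization target conservatively tracks the learned policy.
    pi_ref = pi
```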
- Efficient Exploration Using Extra Safety Budget in Constrained Policy Optimization [15.483557012655927]
We propose an algorithm named Constrained Policy Optimization with Extra Safety Budget (ESB-CPO) to strike a balance between exploration efficiency and constraint satisfaction.
Our method achieves a remarkable performance improvement under the same cost limit compared with baselines.
arXiv Detail & Related papers (2023-02-28T06:16:34Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
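A minimal sketch of the optimistic-exploration idea summarized above, assuming two toy critic heads and a simple disagreement term as the uncertainty proxy; the DICE-based distribution correction ratio is not shown:

```python
# Minimal sketch (assumptions throughout): an "optimistic" action choice that
# maximizes an approximate upper confidence bound built from two critics.
import numpy as np

rng = np.random.default_rng(2)
n_actions = 4

# Stand-ins for two critic heads evaluated on candidate actions at one state.
q1 = rng.normal(size=n_actions)
q2 = rng.normal(size=n_actions)

beta = 1.0                                  # optimism coefficient
mean_q = 0.5 * (q1 + q2)
disagreement = 0.5 * np.abs(q1 - q2)        # crude epistemic-uncertainty proxy
ucb = mean_q + beta * disagreement          # approximate upper confidence bound

exploration_action = int(ucb.argmax())              # exploration policy is greedy on the UCB
target_action = int(np.minimum(q1, q2).argmax())    # pessimistic target policy, for contrast
# (The DICE-style distribution correction used for off-policy training is omitted here.)
```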
- The Information Geometry of Unsupervised Reinforcement Learning [133.20816939521941]
Unsupervised skill discovery is a class of algorithms that learn a set of policies without access to a reward function.
We show that unsupervised skill discovery algorithms do not learn skills that are optimal for every possible reward function.
arXiv Detail & Related papers (2021-10-06T13:08:36Z)
- State Augmented Constrained Reinforcement Learning: Overcoming the Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
arXiv Detail & Related papers (2021-02-23T21:07:35Z)
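A minimal sketch of the state-augmentation idea summarized above, with a placeholder policy and made-up per-step costs; only the dual-ascent update of the multiplier and the augmented policy input are taken from the summary:

```python
# Minimal sketch (illustrative only): augmenting the state with a Lagrange
# multiplier and updating that multiplier with a primal-dual style dual ascent step.
import numpy as np

rng = np.random.default_rng(3)
threshold = 0.2        # constraint: average cost should stay below this
lam = 0.0              # Lagrange multiplier, now also fed to the policy
eta = 0.05             # dual step size

def policy(augmented_state):
    """Placeholder policy that sees (state, lambda) as its input."""
    state, multiplier = augmented_state
    return rng.integers(2)

for t in range(100):
    state = rng.normal(size=3)
    action = policy((state, lam))
    cost = float(rng.random() < 0.3)        # pretend per-step constraint cost

    # Dual ascent: grow lambda when the constraint is violated, shrink it otherwise.
    lam = max(0.0, lam + eta * (cost - threshold))
```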
- Escaping from Zero Gradient: Revisiting Action-Constrained Reinforcement Learning via Frank-Wolfe Policy Optimization [5.072893872296332]
Action-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications.
We propose a learning algorithm that decouples the action constraints from the policy parameter update.
We show that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.
arXiv Detail & Related papers (2021-02-22T14:28:03Z)
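A minimal sketch of how a generic Frank-Wolfe step can keep actions feasible without backpropagating through a projection layer; the L2-ball constraint is an illustrative assumption, not the constraint sets considered in the paper:

```python
# Minimal sketch (a generic Frank-Wolfe routine, not the paper's algorithm):
# mapping an unconstrained policy output onto a feasible action set.
import numpy as np

def frank_wolfe_project(a_raw, radius=1.0, iters=50):
    """Approximately solve min_x ||x - a_raw||^2 over the L2-ball of given radius."""
    x = np.zeros_like(a_raw)                 # feasible starting point
    for k in range(iters):
        grad = 2.0 * (x - a_raw)
        # Linear minimization oracle over the L2-ball: the extreme point
        # most aligned with the negative gradient.
        s = -radius * grad / (np.linalg.norm(grad) + 1e-12)
        gamma = 2.0 / (k + 2.0)              # standard Frank-Wolfe step-size schedule
        x = x + gamma * (s - x)
    return x

a_raw = np.array([1.5, -0.7, 2.0])           # unconstrained policy output
a_feasible = frank_wolfe_project(a_raw)      # constrained action actually executed
```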
- Average Reward Adjusted Discounted Reinforcement Learning: Near-Blackwell-Optimal Policies for Real-World Applications [0.0]
Reinforcement learning aims at finding the best stationary policy for a given Markov Decision Process.
This paper provides deep theoretical insights into the widely applied standard discounted reinforcement learning framework.
We establish a novel near-Blackwell-optimal reinforcement learning algorithm.
arXiv Detail & Related papers (2020-04-02T08:05:18Z)
- Efficient Deep Reinforcement Learning via Adaptive Policy Transfer [50.51637231309424]
The Policy Transfer Framework (PTF) is proposed to accelerate reinforcement learning (RL).
Our framework learns when and which source policy is best to reuse for the target policy, and when to terminate it.
Experimental results show it significantly accelerates the learning process and surpasses state-of-the-art policy transfer methods.
arXiv Detail & Related papers (2020-02-19T07:30:57Z)
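A loose sketch of the "which source policy, and when to terminate" decision summarized above, with invented details (an epsilon-greedy bandit over per-policy return estimates and a fixed termination probability standing in for a learned one):

```python
# Minimal sketch (invented details, only loosely following the summary above):
# picking which source policy to reuse and when to stop reusing it.
import numpy as np

rng = np.random.default_rng(4)
n_source = 3
reuse_value = np.zeros(n_source)   # running estimate of the return when reusing each source policy
reuse_count = np.zeros(n_source)
p_terminate = 0.1                  # placeholder for a learned termination function

for episode in range(100):
    # "Which": epsilon-greedy choice of the source policy to reuse this episode.
    k = rng.integers(n_source) if rng.random() < 0.2 else int(reuse_value.argmax())

    # "When to terminate": stop following the source policy at a random step.
    steps_reused = 0
    while rng.random() > p_terminate and steps_reused < 20:
        steps_reused += 1

    episode_return = rng.normal(loc=float(k))        # pretend outcome of the episode
    reuse_count[k] += 1
    reuse_value[k] += (episode_return - reuse_value[k]) / reuse_count[k]
```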
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
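A minimal sketch of the reward-conditioned idea summarized above, assuming synthetic logged trajectories and a least-squares fit in place of a neural policy; the only mechanism carried over is that the policy is trained by supervised learning and takes the desired return as an extra input:

```python
# Minimal sketch (illustrative, not the paper's method): fit a policy by
# supervised learning on the agent's own data, conditioned on achieved return,
# then ask for a high return at test time.
import numpy as np

rng = np.random.default_rng(5)

# Pretend logged experience: (state, achieved_return) -> action taken.
states = rng.normal(size=(500, 3))
returns = rng.uniform(0.0, 1.0, size=500)
actions = (states[:, 0] + returns > 0.5).astype(float)   # synthetic "behavior"

# Policy input is the state augmented with the target return.
X = np.hstack([states, returns[:, None]])

# Ordinary least squares as a stand-in for the supervised policy fit.
w, *_ = np.linalg.lstsq(X, actions, rcond=None)

def policy(state, target_return):
    """Act by conditioning on the return we would like to achieve."""
    return float(np.concatenate([state, [target_return]]) @ w)

print(policy(np.zeros(3), target_return=1.0))   # condition on the best observed return
```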