Useful Policy Invariant Shaping from Arbitrary Advice
- URL: http://arxiv.org/abs/2011.01297v1
- Date: Mon, 2 Nov 2020 20:29:09 GMT
- Title: Useful Policy Invariant Shaping from Arbitrary Advice
- Authors: Paniz Behboudian, Yash Satsangi, Matthew E. Taylor, Anna Harutyunyan,
Michael Bowling
- Abstract summary: A major challenge of RL research is to discover how to learn with less data.
Potential-based reward shaping (PBRS) holds promise, but it is limited by the need for a well-defined potential function.
The recently introduced dynamic potential-based advice (DPBA) method tackles this challenge by admitting arbitrary advice from a human or other agent.
- Score: 24.59807772487328
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning is a powerful learning paradigm in which agents can
learn to maximize sparse and delayed reward signals. Although RL has had many
impressive successes in complex domains, learning can take hours, days, or even
years of training data. A major challenge of contemporary RL research is to
discover how to learn with less data. Previous work has shown that domain
information can be successfully used to shape the reward; by adding additional
reward information, the agent can learn with much less data. Furthermore, if
the reward is constructed from a potential function, the optimal policy is
guaranteed to be unaltered. While such potential-based reward shaping (PBRS)
holds promise, it is limited by the need for a well-defined potential function.
Ideally, we would like to be able to take arbitrary advice from a human or
other agent and improve performance without affecting the optimal policy. The
recently introduced dynamic potential-based advice (DPBA) method tackles this
challenge by admitting arbitrary advice from a human or other agent and
improving performance without affecting the optimal policy. The main
contribution of this paper is to expose, theoretically and empirically, a flaw
in DPBA. As an alternative that achieves these ideal goals, we present a simple method
called policy invariant explicit shaping (PIES) and show theoretically and
empirically that PIES succeeds where DPBA fails.
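To make the potential-based reward shaping (PBRS) idea referenced in the abstract concrete, the sketch below adds the standard shaping term F(s, s') = γΦ(s') − Φ(s) (Ng et al., 1999) to tabular Q-learning on a toy chain task. This is a minimal illustrative sketch, not an implementation of DPBA or PIES from the paper; the chain environment, the potential function, and all names in the code are assumptions chosen for the example.

```python
# Minimal sketch of potential-based reward shaping (PBRS) in tabular Q-learning.
# The shaping term F(s, s') = gamma * phi(s') - phi(s) is added to the environment
# reward; by the PBRS result this leaves the optimal policy unchanged.
# Toy chain environment and potential function are illustrative assumptions,
# not the paper's DPBA or PIES method.
import random
from collections import defaultdict

GAMMA = 0.99
ALPHA = 0.1
EPSILON = 0.1
N_STATES = 10          # chain of states 0..9; reaching state 9 gives reward 1
ACTIONS = (-1, +1)     # move left or right along the chain


def step(state, action):
    """One transition in the toy chain MDP: sparse reward only at the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done


def potential(state):
    """Heuristic potential: fractional progress toward the goal (an assumed choice of phi)."""
    return state / (N_STATES - 1)


def shaped_reward(reward, state, next_state):
    """PBRS: r + gamma * phi(s') - phi(s); provably preserves the optimal policy."""
    return reward + GAMMA * potential(next_state) - potential(state)


def train(episodes=500, use_shaping=True):
    q = defaultdict(float)  # Q-values keyed by (state, action)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if random.random() < EPSILON:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            if use_shaping:
                reward = shaped_reward(reward, state, next_state)
            target = reward + (0.0 if done else GAMMA * max(q[(next_state, a)] for a in ACTIONS))
            q[(state, action)] += ALPHA * (target - q[(state, action)])
            state = next_state
    return q


if __name__ == "__main__":
    q_table = train()
    greedy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES)]
    print("Greedy action per state:", greedy)  # expected: mostly +1 (move right)
```

Because the shaping term telescopes along any trajectory, it shifts every policy's return by the same state-dependent constant, which is why the optimal policy is guaranteed to be unaltered even though the dense shaped reward speeds up learning.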
Related papers
- Hindsight PRIORs for Reward Learning from Human Preferences [3.4990427823966828]
Preference-based Reinforcement Learning (PbRL) removes the need to hand-specify a reward function by learning a reward from preference feedback over policy behaviors.
Current approaches to PbRL do not address the credit assignment problem inherent in determining which parts of a behavior most contributed to a preference.
We introduce a credit assignment strategy (Hindsight PRIOR) that uses a world model to approximate state importance within a trajectory and then guides rewards to be proportional to state importance.
arXiv Detail & Related papers (2024-04-12T21:59:42Z)
- Distributional Successor Features Enable Zero-Shot Policy Optimization [36.53356539916603]
This work proposes a novel class of models, i.e., Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs).
DiSPOs learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset.
By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions.
arXiv Detail & Related papers (2024-03-10T22:27:21Z)
- Adversarial Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
The policy represented by a deep neural network can overfit, which prevents a reinforcement learning agent from learning an effective policy.
Data augmentation can provide a performance boost to RL agents by mitigating the effect of overfitting.
We propose a novel RL algorithm to mitigate the above issue and improve the efficiency of the learned policy.
arXiv Detail & Related papers (2023-04-27T21:01:08Z)
- Flexible Attention-Based Multi-Policy Fusion for Efficient Deep Reinforcement Learning [78.31888150539258]
Reinforcement learning (RL) agents have long sought to approach the efficiency of human learning.
Prior studies in RL have incorporated external knowledge policies to help agents improve sample efficiency.
We present Knowledge-Grounded RL (KGRL), an RL paradigm fusing multiple knowledge policies and aiming for human-like efficiency and flexibility.
arXiv Detail & Related papers (2022-10-07T17:56:57Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL) -- inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Simplifying Deep Reinforcement Learning via Self-Supervision [51.2400839966489]
Self-Supervised Reinforcement Learning (SSRL) is a simple algorithm that optimizes policies with purely supervised losses.
We show that SSRL is surprisingly competitive with contemporary algorithms, with more stable performance and less running time.
arXiv Detail & Related papers (2021-06-10T06:29:59Z)
- Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, Information Directed Reward Learning (IDRL) naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z)