Addressing reward bias in Adversarial Imitation Learning with neutral reward functions
- URL: http://arxiv.org/abs/2009.09467v1
- Date: Sun, 20 Sep 2020 16:24:10 GMT
- Title: Addressing reward bias in Adversarial Imitation Learning with neutral reward functions
- Authors: Rohit Jena, Siddharth Agrawal, Katia Sycara
- Abstract summary: Imitation Learning suffers from the fundamental problem of reward bias stemming from the choice of reward functions used in the algorithm.
We provide a theoretical sketch of why existing reward functions would fail in imitation learning scenarios in task-based environments with multiple terminal states.
We propose a new reward function for GAIL which outperforms existing GAIL methods on task-based environments with single and multiple terminal states.
- Score: 1.7188280334580197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative Adversarial Imitation Learning suffers from the fundamental
problem of reward bias stemming from the choice of reward functions used in the
algorithm. Different types of biases also affect different types of
environments, which are broadly divided into survival and task-based
environments. We provide a theoretical sketch of why existing reward functions
would fail in imitation learning scenarios in task-based environments with
multiple terminal states. We also propose a new reward function for GAIL which
outperforms existing GAIL methods on task-based environments with single and
multiple terminal states and effectively overcomes both survival and
termination bias.
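To make the bias argument concrete, below is a minimal sketch (assuming a discriminator that outputs the probability a sample came from the expert) contrasting the sign behaviour of common discriminator-based GAIL rewards: a strictly positive reward pays the agent merely for staying alive (survival bias), a strictly negative reward makes early termination attractive (termination bias), and a zero-centred form is one generic "neutral" candidate. The zero-centred variant is shown for illustration only and is not necessarily the exact reward function proposed in this paper.

```python
import numpy as np

def discriminator_rewards(d):
    """Sign behaviour of common discriminator-based rewards in GAIL-style
    imitation learning. `d` is the discriminator's probability that a
    (state, action) pair came from the expert, assumed to lie in (0, 1)."""
    d = np.clip(d, 1e-8, 1 - 1e-8)  # numerical safety near 0 and 1
    return {
        # Strictly positive: the agent is rewarded merely for staying alive,
        # which induces survival bias.
        "-log(1-D), always positive": -np.log(1.0 - d),
        # Strictly negative: every step costs reward, so terminating early
        # looks attractive, which induces termination bias.
        "log(D), always negative": np.log(d),
        # Zero-centred: zero when D = 0.5, positive for expert-like samples,
        # negative otherwise. A generic neutral candidate shown only for
        # illustration; the paper's proposed reward may differ.
        "log(D) - log(1-D), zero-centred": np.log(d) - np.log(1.0 - d),
    }

if __name__ == "__main__":
    for name, r in discriminator_rewards(np.array([0.1, 0.5, 0.9])).items():
        print(f"{name}: {np.round(r, 3)}")
```

Running the snippet for D in {0.1, 0.5, 0.9} makes the sign pattern, and hence the direction of each bias, explicit.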
Related papers
- No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery [53.08822154199948]
Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks.
This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics.
We develop a method that directly trains on scenarios with high learnability.
arXiv Detail & Related papers (2024-08-27T14:31:54Z)
- EvIL: Evolution Strategies for Generalisable Imitation Learning [33.745657379141676]
In imitation learning (IL), the environment in which expert demonstrations are collected and the environment in which we want to deploy the learned policy are often not exactly the same.
Compared to policy-centric approaches to IL like behavioural cloning, reward-centric approaches like inverse reinforcement learning (IRL) often better replicate expert behaviour in new environments.
We find that modern deep IL algorithms frequently recover rewards which induce policies far weaker than the expert, even in the same environment the demonstrations were collected in.
We propose a novel evolution-strategies based method EvIL to optimise for a reward-shaping term that speeds up re-training in the target environment.
arXiv Detail & Related papers (2024-06-15T22:46:39Z)
- Expert Proximity as Surrogate Rewards for Single Demonstration Imitation Learning [51.972577689963714]
Single-demonstration imitation learning (IL) is a practical approach for real-world applications where acquiring multiple expert demonstrations is costly or infeasible.
In contrast to typical IL settings, single-demonstration IL involves an agent having access to only one expert trajectory.
We highlight the issue of sparse reward signals in this setting and propose Transition Discriminator-based IL (TDIL) to mitigate it.
arXiv Detail & Related papers (2024-02-01T23:06:19Z)
- Behavior Alignment via Reward Function Optimization [23.92721220310242]
We introduce a new framework that integrates auxiliary rewards reflecting a designer's domain knowledge with the environment's primary rewards.
We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges.
arXiv Detail & Related papers (2023-10-29T13:45:07Z)
- Transferable Reward Learning by Dynamics-Agnostic Discriminator Ensemble [8.857776147129464]
Recovering a reward function from expert demonstrations is a fundamental problem in reinforcement learning.
We present a dynamics-agnostic discriminator-ensemble reward learning method capable of learning both state-action and state-only reward functions.
arXiv Detail & Related papers (2022-06-01T05:16:39Z)
- Multi-Environment Meta-Learning in Stochastic Linear Bandits [49.387421094105136]
We consider the feasibility of meta-learning when task parameters are drawn from a mixture distribution instead of a single environment.
We propose a regularized version of the OFUL algorithm that achieves low regret on a new task without requiring knowledge of the environment from which the new task originates.
arXiv Detail & Related papers (2022-05-12T19:31:28Z)
- Invariance in Policy Optimisation and Partial Identifiability in Reward Learning [67.4640841144101]
We characterise the partial identifiability of the reward function given popular reward learning data sources.
We also analyse the impact of this partial identifiability for several downstream tasks, such as policy optimisation.
arXiv Detail & Related papers (2022-03-14T20:19:15Z)
- Reward function shape exploration in adversarial imitation learning: an empirical study [9.817069267241575]
In adversarial imitation learning algorithms (AILs), no true rewards are obtained from the environment for learning the policy.
We design several representative reward function shapes and compare their performances by large-scale experiments.
arXiv Detail & Related papers (2021-04-14T08:21:49Z)
- Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states. (A simplified classifier-as-reward sketch related to this idea appears after this list.)
arXiv Detail & Related papers (2021-03-23T16:19:55Z)
- Demonstration-efficient Inverse Reinforcement Learning in Procedurally Generated Environments [137.86426963572214]
Inverse Reinforcement Learning can extrapolate reward functions from expert demonstrations.
We show that our approach, DE-AIRL, is demonstration-efficient and still able to extrapolate reward functions which generalize to the fully procedural domain.
arXiv Detail & Related papers (2020-12-04T11:18:02Z)
- Reinforcement Learning with Goal-Distance Gradient [1.370633147306388]
Reinforcement learning typically trains agents using feedback rewards from the environment.
Most current methods struggle to achieve good performance in sparse-reward or reward-free environments.
We propose a model-free method that does not rely on environmental rewards to address the problem of sparse rewards in general environments.
arXiv Detail & Related papers (2020-01-01T02:37:34Z)
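As a companion to the "Replacing Rewards with Examples" entry above, the sketch below illustrates the general classifier-as-reward idea under simplifying assumptions: a binary classifier is trained to separate user-provided success states from states visited by the agent, and its predicted success probability serves as a surrogate reward. The `SuccessClassifier` name, the use of scikit-learn's logistic regression, and the synthetic data are illustrative assumptions; the paper's actual recursive classification update is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class SuccessClassifier:
    """Surrogate reward learned from examples of successful outcome states.

    Trains a binary classifier on success examples (label 1) versus states
    collected by the agent (label 0); its predicted success probability is
    then used as a per-state reward. A simplified stand-in for the recursive
    classification scheme described in the paper."""

    def __init__(self):
        self.clf = LogisticRegression(max_iter=1000)

    def fit(self, success_states, agent_states):
        X = np.vstack([success_states, agent_states])
        y = np.concatenate([np.ones(len(success_states)),
                            np.zeros(len(agent_states))])
        self.clf.fit(X, y)
        return self

    def reward(self, states):
        # Probability that each state resembles a successful outcome.
        return self.clf.predict_proba(states)[:, 1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    success = rng.normal(loc=2.0, size=(100, 4))  # hypothetical success states
    visited = rng.normal(loc=0.0, size=(500, 4))  # hypothetical agent states
    model = SuccessClassifier().fit(success, visited)
    print(model.reward(visited[:5]).round(3))
```

In a reinforcement learning loop, this surrogate reward would replace the environment reward and the classifier would be refit as new agent states are collected.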