PAGAR: Taming Reward Misalignment in Inverse Reinforcement
Learning-Based Imitation Learning with Protagonist Antagonist Guided
Adversarial Reward
- URL: http://arxiv.org/abs/2306.01731v3
- Date: Wed, 7 Feb 2024 18:41:12 GMT
- Title: PAGAR: Taming Reward Misalignment in Inverse Reinforcement
Learning-Based Imitation Learning with Protagonist Antagonist Guided
Adversarial Reward
- Authors: Weichao Zhou, Wenchao Li
- Abstract summary: We introduce a semi-supervised reward design paradigm called Protagonist Antagonist Guided Adversarial Reward (PAGAR).
PAGAR-based IL trains a policy to perform well under mixed reward functions instead of a single reward function as in IRL-based IL.
We show that our algorithm outperforms standard IL baselines in complex tasks and challenging transfer settings.
- Score: 8.83374617444803
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many imitation learning (IL) algorithms employ inverse reinforcement learning
(IRL) to infer the intrinsic reward function that an expert is implicitly
optimizing for based on their demonstrated behaviors. However, in practice,
IRL-based IL can fail to accomplish the underlying task due to a misalignment
between the inferred reward and the objective of the task. In this paper, we
address the susceptibility of IL to such misalignment by introducing a
semi-supervised reward design paradigm called Protagonist Antagonist Guided
Adversarial Reward (PAGAR). PAGAR-based IL trains a policy to perform well
under mixed reward functions instead of a single reward function as in
IRL-based IL. We identify the theoretical conditions under which PAGAR-based IL
can avoid the task failures caused by reward misalignment. We also present a
practical on-and-off policy approach to implementing PAGAR-based IL.
Experimental results show that our algorithm outperforms standard IL baselines
in complex tasks and challenging transfer settings.
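To make the idea concrete, the sketch below gives a toy reading of the minimax view suggested by the abstract: a protagonist policy is trained against whichever candidate reward function (standing in for the set of rewards an IRL step would find plausible) makes it look worst relative to an antagonist that acts optimally under that reward. The bandit setting, the candidate reward values, and the exponentiated-gradient update are illustrative assumptions, not the PAGAR algorithm itself.

    # Toy sketch of a protagonist-antagonist minimax over candidate rewards.
    # The reward set, the bandit setting, and the update rule are illustrative
    # assumptions; this is not the PAGAR algorithm from the paper.
    import numpy as np

    # Each row is one candidate reward over three actions, standing in for the
    # reward functions an IRL step might consider consistent with the expert
    # demonstrations (values made up for illustration).
    candidate_rewards = np.array([
        [1.0, 0.8, 0.0],
        [0.9, 1.0, 0.1],
        [0.7, 0.9, 0.2],
    ])

    logits = np.zeros(3)  # protagonist: softmax policy over the three actions

    def regret(policy, r):
        # Antagonist value (best action under r) minus protagonist value under r.
        return r.max() - policy @ r

    for _ in range(200):
        policy = np.exp(logits) / np.exp(logits).sum()
        # Adversarial reward: the candidate under which the protagonist looks
        # worst relative to the antagonist.
        worst = max(candidate_rewards, key=lambda r: regret(policy, r))
        # Softmax policy-gradient style step that reduces regret under that reward.
        logits += 0.5 * (worst - policy @ worst)

    policy = np.exp(logits) / np.exp(logits).sum()
    print("protagonist policy:", np.round(policy, 3))
    print("worst-case regret :", max(regret(policy, r) for r in candidate_rewards))

Driving the worst-case regret down over a set of plausible rewards, rather than maximizing return under a single inferred reward, is the informal sense in which training under mixed reward functions hedges against reward misalignment.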
Related papers
- Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching [23.600285251963395]
In inverse reinforcement learning (IRL), an agent seeks to replicate expert demonstrations through interactions with the environment.
Traditionally, IRL is treated as an adversarial game, where an adversary searches over reward models and a learner optimizes the reward through repeated RL procedures.
We propose a novel approach to IRL by direct policy optimization, exploiting a linear factorization of the return as the inner product of successor features and a reward vector (a worked numerical check of this factorization appears after this list).
arXiv Detail & Related papers (2024-11-11T14:05:50Z)
- Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment [7.477559660351106]
Imitation learning (IL) algorithms use inverse reinforcement learning (IRL) to infer a reward function that aligns with a demonstration.
We propose a novel framework for IRL-based IL that prioritizes task alignment over conventional data alignment.
arXiv Detail & Related papers (2024-10-31T07:08:14Z)
- RILe: Reinforced Imitation Learning [60.63173816209543]
RILe is a framework that combines the strengths of imitation learning and inverse reinforcement learning to learn a dense reward function efficiently.
Our framework produces high-performing policies in high-dimensional tasks where direct imitation fails to replicate complex behaviors.
arXiv Detail & Related papers (2024-06-12T17:56:31Z)
- Expert Proximity as Surrogate Rewards for Single Demonstration Imitation Learning [51.972577689963714]
Single-demonstration imitation learning (IL) is a practical approach for real-world applications where acquiring multiple expert demonstrations is costly or infeasible.
In contrast to typical IL settings, single-demonstration IL involves an agent having access to only one expert trajectory.
We highlight the issue of sparse reward signals in this setting and mitigate it with the proposed Transition Discriminator-based IL (TDIL) method.
arXiv Detail & Related papers (2024-02-01T23:06:19Z)
- REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Behavior Alignment via Reward Function Optimization [23.92721220310242]
We introduce a new framework that integrates auxiliary rewards reflecting a designer's domain knowledge with the environment's primary rewards.
We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges.
arXiv Detail & Related papers (2023-10-29T13:45:07Z)
- STARC: A General Framework For Quantifying Differences Between Reward Functions [52.69620361363209]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- CLARE: Conservative Model-Based Reward Learning for Offline Inverse Reinforcement Learning [26.05184273238923]
This work aims to tackle a major challenge in offline Inverse Reinforcement Learning (IRL).
We devise a principled algorithm (namely CLARE) that solves offline IRL efficiently via integrating "conservatism" into a learned reward function.
Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy.
arXiv Detail & Related papers (2023-02-09T17:16:29Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL) -- inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Regularized Inverse Reinforcement Learning [49.78352058771138]
Inverse Reinforcement Learning (IRL) aims to facilitate a learner's ability to imitate expert behavior.
Regularized IRL applies strongly convex regularizers to the learner's policy.
We propose tractable solutions, and practical methods to obtain them, for regularized IRL.
arXiv Detail & Related papers (2020-10-07T23:38:47Z)
- Off-Policy Adversarial Inverse Reinforcement Learning [0.0]
Adversarial Imitation Learning (AIL) is a class of reinforcement learning (RL) algorithms for imitating expert behavior.
We propose an Off-Policy Adversarial Inverse Reinforcement Learning (Off-policy-AIRL) algorithm that is sample efficient and achieves good imitation performance (a toy sketch of the discriminator-as-reward ingredient behind AIL appears after this list).
arXiv Detail & Related papers (2020-05-03T16:51:40Z)
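As flagged in the successor-feature entry above, the sketch below is a minimal numerical check of the factorization that entry relies on: when the reward is linear in state features, r = Phi @ w, the return equals the inner product of the policy's successor features and the weight vector w. The three-state chain, the discount factor, and the weights are made-up illustrative values, not that paper's setup.

    # Minimal check of the successor-feature factorization: with linear rewards
    # r = Phi @ w, the value function equals <successor features, w>.
    # The chain, features, and weights below are illustrative assumptions.
    import numpy as np

    gamma = 0.9
    # State-to-state transition matrix of a 3-state chain under a fixed policy.
    P = np.array([
        [0.5, 0.5, 0.0],
        [0.1, 0.6, 0.3],
        [0.0, 0.2, 0.8],
    ])
    Phi = np.eye(3)                # one-hot state features
    w = np.array([1.0, 0.0, 2.0])  # reward weights, so r = Phi @ w

    # Successor features: psi = sum_t gamma^t P^t Phi = (I - gamma * P)^(-1) Phi
    psi = np.linalg.solve(np.eye(3) - gamma * P, Phi)

    # Value function computed directly from the reward, for comparison.
    v_direct = np.linalg.solve(np.eye(3) - gamma * P, Phi @ w)

    print(np.allclose(psi @ w, v_direct))  # True: return = <successor features, w>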
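And for the Off-Policy Adversarial Inverse Reinforcement Learning entry, here is a toy sketch of the discriminator-as-reward ingredient that adversarial imitation methods build on: a discriminator is trained to separate expert samples from policy samples, and its log-odds serve as a surrogate reward. The 1-D data and the plain logistic discriminator are illustrative assumptions; this is not the off-policy algorithm proposed in that paper.

    # Toy sketch of adversarial imitation: train a discriminator to separate
    # expert from policy samples and use its log-odds as a surrogate reward.
    # Data and model are illustrative assumptions, not the cited algorithm.
    import numpy as np

    rng = np.random.default_rng(0)
    expert = rng.normal(loc=1.0, scale=0.3, size=200)   # stand-in expert samples
    policy = rng.normal(loc=-1.0, scale=0.3, size=200)  # stand-in policy samples

    x = np.concatenate([expert, policy])
    y = np.concatenate([np.ones(200), np.zeros(200)])   # 1 = expert, 0 = policy

    w, b = 0.0, 0.0
    for _ in range(500):  # logistic-regression discriminator via gradient ascent
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))
        w += 0.5 * np.mean((y - p) * x)
        b += 0.5 * np.mean(y - p)

    def surrogate_reward(sample):
        # log D - log(1 - D) is exactly the discriminator's logit.
        return w * sample + b

    print("reward near expert data:", round(surrogate_reward(1.0), 2))
    print("reward near policy data:", round(surrogate_reward(-1.0), 2))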