CLARE: Conservative Model-Based Reward Learning for Offline Inverse
Reinforcement Learning
- URL: http://arxiv.org/abs/2302.04782v1
- Date: Thu, 9 Feb 2023 17:16:29 GMT
- Title: CLARE: Conservative Model-Based Reward Learning for Offline Inverse
Reinforcement Learning
- Authors: Sheng Yue, Guanbo Wang, Wei Shao, Zhaofeng Zhang, Sen Lin, Ju Ren,
Junshan Zhang
- Abstract summary: This work aims to tackle a major challenge in offline Inverse Reinforcement Learning (IRL)
We devise a principled algorithm (namely CLARE) that solves offline IRL efficiently via integrating "conservatism" into a learned reward function.
Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy.
- Score: 26.05184273238923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work aims to tackle a major challenge in offline Inverse Reinforcement
Learning (IRL), namely the reward extrapolation error, where the learned reward
function may fail to explain the task correctly and misguide the agent in
unseen environments due to the intrinsic covariate shift. Leveraging both
expert data and lower-quality diverse data, we devise a principled algorithm
(namely CLARE) that solves offline IRL efficiently by integrating
"conservatism" into a learned reward function and utilizing an estimated
dynamics model. Our theoretical analysis provides an upper bound on the return
gap between the learned policy and the expert policy, based on which we
characterize the impact of covariate shift by examining the subtle two-tier
tradeoff between exploitation (of both expert and diverse data) and
exploration (of the estimated dynamics model). We show that CLARE can provably
alleviate the reward extrapolation error by striking the right
exploitation-exploration balance therein. Extensive experiments corroborate the
significant performance gains of CLARE over existing state-of-the-art
algorithms on MuJoCo continuous control tasks (especially with a small offline
dataset), and the learned reward is highly instructive for further learning.
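To make the conservatism idea concrete, a minimal sketch follows. It is an illustrative reading of the abstract, not the authors' implementation: CLARE additionally couples the reward with policy optimization under the estimated dynamics model, which is omitted here, and RewardNet, conservative_reward_loss, conservatism_weight, and the random stand-in batches are all hypothetical names and data.

    import torch
    import torch.nn as nn

    class RewardNet(nn.Module):
        """Small MLP mapping (state, action) pairs to scalar rewards."""
        def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, s, a):
            return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

    def conservative_reward_loss(reward_net, expert_s, expert_a,
                                 model_s, model_a,
                                 conservatism_weight: float = 1.0):
        """Raise the reward on expert transitions and lower it on
        state-actions rolled out from a learned dynamics model, so the
        reward cannot be optimistic on out-of-distribution inputs."""
        r_expert = reward_net(expert_s, expert_a).mean()
        r_model = reward_net(model_s, model_a).mean()
        return -(r_expert - conservatism_weight * r_model)

    if __name__ == "__main__":
        torch.manual_seed(0)
        net = RewardNet(state_dim=3, action_dim=2)
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        for _ in range(200):
            # Random tensors stand in for expert batches and model rollouts.
            expert_s, expert_a = torch.randn(32, 3), torch.randn(32, 2)
            model_s, model_a = torch.randn(32, 3), torch.randn(32, 2)
            loss = conservative_reward_loss(net, expert_s, expert_a,
                                            model_s, model_a)
            opt.zero_grad()
            loss.backward()
            opt.step()

The loss pushes the learned reward up on expert data and down on model rollouts, which is one simple way to keep the reward from extrapolating optimistically outside the data distribution, the failure mode the abstract calls the reward extrapolation error.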
Related papers
- Pessimistic Causal Reinforcement Learning with Mediators for Confounded Offline Data [17.991833729722288]
We propose a novel policy learning algorithm, PESsimistic CAusal Learning (PESCAL).
Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function.
We provide theoretical guarantees for the algorithms we propose, and demonstrate their efficacy through simulations, as well as real-world experiments utilizing offline datasets from a leading ride-hailing platform.
arXiv Detail & Related papers (2024-03-18T14:51:19Z) - A Simple Solution for Offline Imitation from Observations and Examples
with Possibly Incomplete Trajectories [122.11358440078581]
Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable.
We propose Trajectory-Aware Learning from Observations (TAILO) to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available.
arXiv Detail & Related papers (2023-11-02T15:41:09Z) - When Demonstrations Meet Generative World Models: A Maximum Likelihood
Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z) - Internally Rewarded Reinforcement Learning [22.01249652558878]
We study a class of reinforcement learning problems where the reward signals for policy learning are generated by an internal reward model.
We show that the proposed reward function can consistently stabilize the training process by reducing the impact of reward noise.
arXiv Detail & Related papers (2023-02-01T06:25:46Z) - Data-Driven Inverse Reinforcement Learning for Expert-Learner Zero-Sum
Games [30.720112378448285]
We formulate inverse reinforcement learning as an expert-learner interaction.
The optimal performance intent of an expert or target agent is unknown to a learner agent.
We develop an off-policy IRL algorithm that does not require knowledge of the expert and learner agent dynamics.
arXiv Detail & Related papers (2023-01-05T10:35:08Z) - Basis for Intentions: Efficient Inverse Reinforcement Learning using
Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL) -- inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z) - On Covariate Shift of Latent Confounders in Imitation and Reinforcement
Learning [69.48387059607387]
We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning.
We analyze the limitations of learning from confounded expert data with and without external reward.
We validate our claims empirically on challenging assistive healthcare and recommender system simulation tasks.
arXiv Detail & Related papers (2021-10-13T07:31:31Z) - f-IRL: Inverse Reinforcement Learning via State Marginal Matching [13.100127636586317]
We propose a method for learning the reward function (and the corresponding policy) to match the expert state density.
We present an algorithm, f-IRL, that recovers a stationary reward function from the expert density by gradient descent.
Our method outperforms adversarial imitation learning methods in terms of sample efficiency and the required number of expert trajectories.
arXiv Detail & Related papers (2020-11-09T19:37:48Z) - Provably Efficient Causal Reinforcement Learning with Confounded
Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z) - Off-Policy Adversarial Inverse Reinforcement Learning [0.0]
Adversarial Imitation Learning (AIL) is a class of algorithms in reinforcement learning (RL).
We propose an Off-Policy Adversarial Inverse Reinforcement Learning (Off-policy-AIRL) algorithm that is sample-efficient and gives good imitation performance.
arXiv Detail & Related papers (2020-05-03T16:51:40Z) - DisCor: Corrective Feedback in Reinforcement Learning via Distribution
Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to the optimal training distribution and uses it to re-weight the transitions used for training (a minimal sketch of this re-weighting appears after this list).
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
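As a purely illustrative example of the DisCor-style re-weighting referenced above, the tabular sketch below maintains a running error estimate Delta alongside Q and shrinks updates whose bootstrap targets are believed to be inaccurate. The toy MDP, the hyperparameters, and all names are assumptions, not the paper's implementation.

    import numpy as np

    n_states, n_actions = 5, 2
    gamma, tau, alpha = 0.99, 1.0, 0.1
    rng = np.random.default_rng(0)

    Q = np.zeros((n_states, n_actions))       # value estimates
    Delta = np.zeros((n_states, n_actions))   # running target-error estimate

    # Random deterministic toy MDP: next-state and reward tables.
    P = rng.integers(0, n_states, size=(n_states, n_actions))
    R = rng.normal(size=(n_states, n_actions))

    for _ in range(5000):
        s = rng.integers(n_states)
        a = rng.integers(n_actions)
        s2 = P[s, a]
        a2 = int(Q[s2].argmax())
        target = R[s, a] + gamma * Q[s2, a2]
        bellman_err = abs(target - Q[s, a])

        # DisCor-style weight: shrink updates whose bootstrap target is
        # itself believed to be inaccurate (large Delta at the next state).
        w = np.exp(-gamma * Delta[s2, a2] / tau)
        Q[s, a] += alpha * w * (target - Q[s, a])

        # Propagate the error estimate the same way values are bootstrapped.
        Delta[s, a] += alpha * (bellman_err + gamma * Delta[s2, a2] - Delta[s, a])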
This list is automatically generated from the titles and abstracts of the papers on this site.