The Virtues of Pessimism in Inverse Reinforcement Learning
- URL: http://arxiv.org/abs/2402.02616v2
- Date: Thu, 8 Feb 2024 16:50:16 GMT
- Title: The Virtues of Pessimism in Inverse Reinforcement Learning
- Authors: David Wu and Gokul Swamy and J. Andrew Bagnell and Zhiwei Steven Wu
and Sanjiban Choudhury
- Abstract summary: Inverse Reinforcement Learning is a powerful framework for learning complex behaviors from expert demonstrations.
It is desirable to reduce the exploration burden by leveraging expert demonstrations in the inner-loop RL.
We consider an alternative approach to speeding up the RL subroutine in IRL: pessimism, i.e., staying close to the expert's data distribution, instantiated via the use of offline RL algorithms.
- Score: 38.98656220917943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inverse Reinforcement Learning (IRL) is a powerful framework for learning
complex behaviors from expert demonstrations. However, it traditionally
requires repeatedly solving a computationally expensive reinforcement learning
(RL) problem in its inner loop. It is desirable to reduce the exploration
burden by leveraging expert demonstrations in the inner-loop RL. As an example,
recent work resets the learner to expert states in order to inform the learner
of high-reward expert states. However, such an approach is infeasible in the
real world. In this work, we consider an alternative approach to speeding up
the RL subroutine in IRL: \emph{pessimism}, i.e., staying close to the expert's
data distribution, instantiated via the use of offline RL algorithms. We
formalize a connection between offline RL and IRL, enabling us to use an
arbitrary offline RL algorithm to improve the sample efficiency of IRL. We
validate our theory experimentally by demonstrating a strong correlation
between the efficacy of an offline RL algorithm and how well it works as part
of an IRL procedure. By using a strong offline RL algorithm as part of an IRL
procedure, we are able to find policies that match expert performance
significantly more efficiently than the prior art.
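  A minimal sketch of the loop the abstract describes may help make "pessimism" concrete: alternate between fitting a reward that separates expert from learner data and improving the policy with an off-the-shelf offline RL algorithm run only on data already collected. The sketch below is a reading of the abstract, not the authors' implementation; collect_rollout, update_reward, and offline_rl_update are hypothetical stand-ins for concrete components.
```python
# A minimal sketch of IRL with an offline (pessimistic) RL inner loop.
# This illustrates the idea in the abstract, not the authors' code;
# collect_rollout, update_reward, and offline_rl_update are hypothetical hooks.
from typing import Callable, List, Tuple

import numpy as np

Transition = Tuple[np.ndarray, np.ndarray, np.ndarray]  # (state, action, next_state)
RewardFn = Callable[[np.ndarray, np.ndarray], float]


def pessimistic_irl(
    policy: object,
    expert_data: List[Transition],
    collect_rollout: Callable[[object], List[Transition]],                       # run the current policy in the env
    update_reward: Callable[[List[Transition], List[Transition]], RewardFn],     # discriminator-style reward step
    offline_rl_update: Callable[[object, List[Transition], RewardFn], object],   # any offline RL algorithm
    n_outer_iters: int = 100,
) -> object:
    """Alternate reward learning with an offline RL policy update.

    The policy-improvement step only uses data already in the buffer, so the
    learner stays close to the distribution of states it (and the expert) has
    visited -- the "pessimism" the abstract refers to.
    """
    buffer: List[Transition] = list(expert_data)       # seed the buffer with expert demonstrations
    for _ in range(n_outer_iters):
        buffer += collect_rollout(policy)                        # 1) gather on-policy data
        reward_fn = update_reward(expert_data, buffer)           # 2) reward that ranks expert data above learner data
        policy = offline_rl_update(policy, buffer, reward_fn)    # 3) pessimistic policy improvement on the buffer
    return policy
```
  Because step 3 never queries the environment beyond the rollouts gathered in step 1, a stronger offline RL algorithm should translate into fewer environment interactions, which is the correlation the abstract reports.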
Related papers
- Reinforcement Learning with Intrinsically Motivated Feedback Graph for Lost-sales Inventory Control [12.832009040635462]
Reinforcement learning (RL) has proven to perform well and to be general-purpose in the inventory control (IC) domain.
Online experience is expensive to acquire in real-world applications.
Online experience may not reflect the true demand due to the lost sales phenomenon typical in IC.
arXiv Detail & Related papers (2024-06-26T13:52:47Z) - Is Value Learning Really the Main Bottleneck in Offline RL? [70.54708989409409]
We show that the choice of a policy extraction algorithm significantly affects the performance and scalability of offline RL.
We propose two simple test-time policy improvement methods and show that these methods lead to better performance.
arXiv Detail & Related papers (2024-06-13T17:07:49Z) - Hybrid Inverse Reinforcement Learning [34.793570631021005]
The inverse reinforcement learning approach to imitation learning is a double-edged sword.
We propose using hybrid RL -- training on a mixture of online and expert data -- to curtail unnecessary exploration.
We derive both model-free and model-based hybrid inverse RL algorithms with strong policy performance guarantees.
arXiv Detail & Related papers (2024-02-13T23:29:09Z) - Is Inverse Reinforcement Learning Harder than Standard Reinforcement
Learning? A Theoretical Perspective [55.36819597141271]
Inverse Reinforcement Learning (IRL) -- the problem of learning reward functions from demonstrations of an expert policy -- plays a critical role in developing intelligent systems.
This paper provides the first line of results for efficient IRL in vanilla offline and online settings using polynomial samples and runtime.
As an application, we show that the learned rewards can transfer to another target MDP with suitable guarantees.
arXiv Detail & Related papers (2023-11-29T00:09:01Z) - Inverse Reinforcement Learning without Reinforcement Learning [40.7783129322142]
Inverse Reinforcement Learning (IRL) aims to learn a reward function that rationalizes expert demonstrations.
Traditional IRL methods require repeatedly solving a hard reinforcement learning problem as a subroutine.
In effect, this reduces the easier problem of imitation learning to repeatedly solving the harder problem of RL.
arXiv Detail & Related papers (2023-03-26T04:35:53Z) - Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning [92.18524491615548]
Contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL).
We study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions.
Under the online setting, we propose novel upper confidence bound (UCB)-type algorithms that incorporate such a contrastive loss with online RL algorithms for MDPs or MGs.
arXiv Detail & Related papers (2022-07-29T17:29:08Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks (see the roll-in sketch after this list).
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z) - RvS: What is Essential for Offline RL via Supervised Learning? [77.91045677562802]
Recent work has shown that supervised learning alone, without temporal difference (TD) learning, can be remarkably effective for offline RL.
In every environment suite we consider, simply maximizing likelihood with a two-layer feedforward network is competitive.
The experiments also probe the limits of existing RvS methods, which are comparatively weak on random data.
arXiv Detail & Related papers (2021-12-20T18:55:16Z) - Heuristic-Guided Reinforcement Learning [31.056460162389783]
Tabula rasa RL algorithms require environment interactions or computation that scales with the horizon of the decision-making task.
Our framework can be viewed as a horizon-based regularization for controlling bias and variance in RL under a finite interaction budget.
In particular, we introduce the novel concept of an "improvable heuristic" -- a heuristic that allows an RL agent to extrapolate beyond its prior knowledge.
arXiv Detail & Related papers (2021-06-05T00:04:09Z) - FOCAL: Efficient Fully-Offline Meta-Reinforcement Learning via Distance
Metric Learning and Behavior Regularization [10.243908145832394]
We study the offline meta-reinforcement learning (OMRL) problem, a paradigm which enables reinforcement learning (RL) algorithms to quickly adapt to unseen tasks.
This problem is still not fully understood, and two major challenges need to be addressed.
We provide analysis and insight showing that some simple design choices can yield substantial improvements over recent approaches.
arXiv Detail & Related papers (2020-10-02T17:13:39Z)
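As referenced in the Jump-Start Reinforcement Learning entry above, the core mechanism there is a roll-in scheme with two policies. The sketch below is an informal illustration of that scheme, not the paper's implementation; it assumes a Gymnasium-style environment API and hypothetical guide_policy / explore_policy callables.
```python
# Informal sketch of a JSRL-style rollout: a guide policy (e.g. derived from
# offline data or demonstrations) controls the first `guide_steps` steps of an
# episode, then the learning ("exploration") policy takes over. Shrinking
# `guide_steps` over training hands the task back to the learner gradually.
# Assumes a Gymnasium-style env API; the policies are hypothetical callables.
from typing import Callable, List, Tuple


def jsrl_rollout(
    env,
    guide_policy: Callable,
    explore_policy: Callable,
    guide_steps: int,
    max_steps: int = 1000,
) -> List[Tuple]:
    """Collect one episode, switching from the guide to the learner after guide_steps."""
    transitions = []
    obs, _ = env.reset()
    for t in range(max_steps):
        acting_policy = guide_policy if t < guide_steps else explore_policy
        action = acting_policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs))
        obs = next_obs
        if terminated or truncated:
            break
    return transitions
```
Roughly, the curriculum described in that paper decreases guide_steps once the learner's return at the current switch point is good enough, so exploration is always seeded from states the guide can reliably reach.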