Is Inverse Reinforcement Learning Harder than Standard Reinforcement
Learning? A Theoretical Perspective
- URL: http://arxiv.org/abs/2312.00054v2
- Date: Sat, 10 Feb 2024 07:38:32 GMT
- Title: Is Inverse Reinforcement Learning Harder than Standard Reinforcement
Learning? A Theoretical Perspective
- Authors: Lei Zhao, Mengdi Wang, Yu Bai
- Abstract summary: Inverse Reinforcement Learning (IRL) -- the problem of learning reward functions from demonstrations of an expert policy -- plays a critical role in developing intelligent systems.
This paper provides the first line of results for efficient IRL in vanilla offline and online settings using polynomial samples and runtime.
As an application, we show that the learned rewards can transfer to another target MDP with suitable guarantees.
- Score: 55.36819597141271
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inverse Reinforcement Learning (IRL) -- the problem of learning reward
functions from demonstrations of an \emph{expert policy} -- plays a critical
role in developing intelligent systems. While widely used in applications,
theoretical understandings of IRL present unique challenges and remain less
developed compared with standard RL. For example, it remains open how to do IRL
efficiently in standard \emph{offline} settings with pre-collected data, where
states are obtained from a \emph{behavior policy} (which could be the expert
policy itself), and actions are sampled from the expert policy.
This paper provides the first line of results for efficient IRL in vanilla
offline and online settings using polynomial samples and runtime. Our
algorithms and analyses seamlessly adapt the pessimism principle commonly used
in offline RL, and achieve IRL guarantees in stronger metrics than considered
in existing work. We provide lower bounds showing that our sample complexities
are nearly optimal. As an application, we also show that the learned rewards
can \emph{transfer} to another target MDP with suitable guarantees when the
target MDP satisfies certain similarity assumptions with the original (source)
MDP.
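To make the pessimism principle referenced in the abstract concrete, here is a toy sketch of offline reward learning from demonstrations: states come from a behavior policy, actions from the expert, and a count-based confidence width decides where the recovered quantities can be trusted. The estimator, the Hoeffding-style width, and the thresholded reward below are illustrative assumptions for exposition, not the algorithms or guarantees of the paper.

```python
# Toy illustration of the pessimism principle for offline IRL.
# NOT the paper's algorithm: the estimator, the confidence width,
# and the reward construction are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
S, A, N, delta = 6, 3, 500, 0.05           # states, actions, samples, failure prob.

# Offline demonstrations: states from a behavior policy, actions from the expert.
expert = rng.integers(0, A, size=S)         # (unknown) deterministic expert action per state
behavior_states = rng.choice(S, size=N, p=[0.4, 0.3, 0.2, 0.07, 0.02, 0.01])
dataset = [(s, expert[s]) for s in behavior_states]

# 1) Empirical estimate of the expert policy from action frequencies.
counts = np.zeros((S, A))
for s, a in dataset:
    counts[s, a] += 1
n_s = counts.sum(axis=1)
pi_hat = np.where(n_s[:, None] > 0, counts / np.maximum(n_s[:, None], 1), 1.0 / A)

# 2) Count-based confidence width (Hoeffding-style), large on rarely visited states.
width = np.sqrt(np.log(2 * S * A / delta) / np.maximum(n_s, 1))

# 3) Pessimism: only certify a reward on states where the estimate is trustworthy;
#    stay neutral on poorly covered states.
r_hat = np.zeros((S, A))
trusted = width < 0.5
r_hat[trusted] = (pi_hat[trusted] > 0.5).astype(float)

for s in range(S):
    print(f"state {s}: n={int(n_s[s])}, width={width[s]:.2f}, trusted={bool(trusted[s])}")
```

The printout shows the point of the principle: states rarely visited by the behavior policy get wide confidence intervals, so nothing is certified there.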
Related papers
- Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching [23.600285251963395]
In inverse reinforcement learning (IRL), an agent seeks to replicate expert demonstrations through interactions with the environment.
Traditionally, IRL is treated as an adversarial game, where an adversary searches over reward models and a learner optimizes the reward through repeated RL procedures.
We propose a novel approach to IRL by direct policy optimization, exploiting a linear factorization of the return as the inner product of successor features and a reward vector (a numerical sketch of this factorization follows this entry).
arXiv Detail & Related papers (2024-11-11T14:05:50Z)
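The linear factorization mentioned above -- a policy's return written as the inner product of its successor features with a reward vector -- can be checked numerically on a small random MDP. The sketch below is generic background (sizes, features, and the linear solve are illustrative assumptions), not the paper's method.

```python
# Numerical check: for r(s,a) = <phi(s,a), w>, the action-value satisfies
# Q^pi(s,a) = <psi^pi(s,a), w>, where psi^pi are discounted successor features.
import numpy as np

rng = np.random.default_rng(1)
S, A, d, gamma = 5, 3, 4, 0.9

P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)    # transition kernel
pi = rng.random((S, A));   pi /= pi.sum(axis=1, keepdims=True)  # a fixed policy
phi = rng.random((S, A, d))                                     # features
w = rng.random(d)                                               # reward vector
r = phi @ w                                                     # r(s,a) = <phi(s,a), w>

# State-action transition matrix under pi: P_pi[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s').
P_pi = np.einsum("sax,xb->saxb", P, pi).reshape(S * A, S * A)

# Q^pi and the successor features Psi both solve an (I - gamma * P_pi) linear system.
Q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A))
Psi = np.linalg.solve(np.eye(S * A) - gamma * P_pi, phi.reshape(S * A, d))

assert np.allclose(Q, Psi @ w)   # Q^pi(s,a) == <psi^pi(s,a), w>
print("max error:", np.abs(Q - Psi @ w).max())
```

The same identity is what allows the reward vector to be swapped out without recomputing the policy's successor features.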
- Statistical Guarantees for Lifelong Reinforcement Learning using PAC-Bayesian Theory [37.02104729448692]
EPIC is a novel algorithm designed for lifelong reinforcement learning.
It learns a shared policy distribution, referred to as the "world policy", which enables rapid adaptation to new tasks.
Experiments on a variety of environments demonstrate that EPIC significantly outperforms existing methods in lifelong RL.
arXiv Detail & Related papers (2024-11-01T07:01:28Z)
- Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences.
Despite its popularity, a (fixed) reward model may become inaccurate off-distribution, since policy optimization continuously shifts the LLM's data distribution.
We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
arXiv Detail & Related papers (2024-03-28T10:02:10Z)
- The Virtues of Pessimism in Inverse Reinforcement Learning [38.98656220917943]
Inverse Reinforcement Learning is a powerful framework for learning complex behaviors from expert demonstrations.
It is desirable to reduce the exploration burden by leveraging expert demonstrations in the inner-loop RL.
We consider an alternative approach to speeding up the RL in IRL: pessimism, i.e., staying close to the expert's data distribution, instantiated via the use of offline RL algorithms.
arXiv Detail & Related papers (2024-02-04T21:22:29Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies (a guide policy and an exploration policy) to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids learning merely mediocre behavior on mixed datasets but is also competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as the reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective (the generic soft Bellman backup behind this perspective is sketched after this entry).
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
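The "soft Q-learning perspective" referenced above builds on the entropy-regularized (soft) Bellman backup, which replaces the hard max over actions with a temperature-scaled log-sum-exp. The toy value iteration below illustrates only that generic backup on a random MDP; the paper's actual text-generation formulation is more involved, and the sizes and temperature here are illustrative assumptions.

```python
# Generic soft (entropy-regularized) value iteration on a toy MDP:
# V(s) = tau * logsumexp_a Q(s,a)/tau,  Q(s,a) = r(s,a) + gamma * E_s'[V(s')].
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)
S, A, gamma, tau = 4, 3, 0.95, 0.1

P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # transition kernel
r = rng.random((S, A))                                         # rewards

V = np.zeros(S)
for _ in range(2000):                        # iterate the soft Bellman backup to a fixed point
    Q = r + gamma * P @ V                    # shape (S, A)
    V = tau * logsumexp(Q / tau, axis=1)     # soft max over actions

pi = np.exp((Q - V[:, None]) / tau)          # softmax-optimal policy; rows sum to 1
print(np.round(pi, 3))
```

As the temperature tau goes to 0, the log-sum-exp approaches the hard max and the softmax policy approaches the greedy one.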
- Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function (a toy tabular instantiation is sketched below).
arXiv Detail & Related papers (2020-12-30T09:06:57Z)
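PEVI's template, value iteration with an uncertainty quantifier Gamma(s, a) subtracted from the estimated Bellman backup and the result clipped to the valid range, can be written down in tabular form. The count-based bonus below is just one possible instantiation of Gamma (the paper's analysis covers more general settings, e.g. linear function approximation), and the dataset, reward handling, scaling, and clipping are illustrative assumptions.

```python
# Tabular sketch of pessimistic value iteration: subtract an uncertainty
# quantifier Gamma(s,a) from the estimated Bellman backup, then clip.
import numpy as np

rng = np.random.default_rng(3)
S, A, H, N = 5, 2, 4, 300

# True (unknown) dynamics, used only to simulate an offline dataset.
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))                              # rewards assumed known, for simplicity
mu = rng.random((S, A)); mu /= mu.sum()             # behavior distribution over (s, a)

idx = rng.choice(S * A, size=N, p=mu.ravel())
s_b, a_b = np.unravel_index(idx, (S, A))
s_next = np.array([rng.choice(S, p=P[s, a]) for s, a in zip(s_b, a_b)])

# Empirical model and visit counts from the dataset.
n = np.zeros((S, A)); P_hat = np.zeros((S, A, S))
for s, a, sn in zip(s_b, a_b, s_next):
    n[s, a] += 1; P_hat[s, a, sn] += 1
P_hat = np.where(n[..., None] > 0, P_hat / np.maximum(n[..., None], 1), 1.0 / S)

# Count-based uncertainty quantifier (illustrative scaling), large where data is scarce.
Gamma = np.sqrt(np.log(2 * S * A * H / 0.05) / np.maximum(n, 1))

# Pessimistic value iteration, backwards over the horizon.
V = np.zeros(S)
for h in range(H, 0, -1):
    Q = np.clip(r + P_hat @ V - Gamma, 0.0, H - h + 1)   # penalize, then clip to valid range
    V = Q.max(axis=1)

print("pessimistic value estimates:", np.round(V, 3))
```

The penalty makes the algorithm distrust state-action pairs the offline dataset covers poorly, which is the same pessimism principle the main paper above adapts to IRL.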