Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial
Imitation Learning
- URL: http://arxiv.org/abs/2006.16785v4
- Date: Wed, 25 Oct 2023 13:21:42 GMT
- Title: Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial
Imitation Learning
- Authors: Lionel Blondé, Pablo Strasser, Alexandros Kalousis
- Abstract summary: We consider the case of off-policy generative adversarial imitation learning.
We show that forcing the learned reward function to be locally Lipschitz-continuous is a sine qua non condition for the method to perform well.
- Score: 52.50288418639075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent success of reinforcement learning in various domains,
these approaches remain, for the most part, deterringly sensitive to
hyper-parameters, and their success often hinges on essential engineering
feats. We consider the case of off-policy generative
adversarial imitation learning, and perform an in-depth review, qualitative and
quantitative, of the method. We show that forcing the learned reward function
to be locally Lipschitz-continuous is a sine qua non condition for the method to
perform well. We then study the effects of this necessary condition and provide
several theoretical results involving the local Lipschitzness of the
state-value function. We complement these guarantees with empirical evidence
attesting to the strong positive effect that the consistent satisfaction of the
Lipschitzness constraint on the reward has on imitation performance. Finally,
we tackle a generic pessimistic reward preconditioning add-on that spawns a
large class of reward shaping methods and provably makes the base method it is
plugged into more robust, as shown in several additional theoretical
guarantees. We then discuss these guarantees through a fine-grained lens and
share our insights.
Crucially, the guarantees derived and reported in this work are valid for any
reward satisfying the Lipschitzness condition; nothing is specific to
imitation. As such, these may be of independent interest.
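The paper ties imitation performance to the local Lipschitzness of the learned reward, a property that is commonly encouraged in adversarial imitation by adding a gradient penalty to the reward (discriminator) objective. The snippet below is a minimal sketch of such a penalty, in the WGAN-GP style, evaluated at interpolates of expert and policy state-action pairs; the names (reward_net, expert_sa, policy_sa) and the target norm k are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def gradient_penalty(reward_net, expert_sa, policy_sa, k=1.0):
    """Soft local-Lipschitzness regularizer for a learned reward.

    Penalizes deviations of the reward's input-gradient norm from a target k,
    evaluated at random interpolates of expert and policy state-action pairs.
    reward_net maps a (batch, dim) state-action tensor to (batch, 1) rewards.
    """
    eps = torch.rand(expert_sa.size(0), 1, device=expert_sa.device)
    interp = (eps * expert_sa + (1.0 - eps) * policy_sa).detach().requires_grad_(True)
    rewards = reward_net(interp)
    grads = torch.autograd.grad(
        outputs=rewards,
        inputs=interp,
        grad_outputs=torch.ones_like(rewards),
        create_graph=True,  # keep the graph so the penalty itself is differentiable
    )[0]
    grad_norms = grads.flatten(start_dim=1).norm(2, dim=1)
    return ((grad_norms - k) ** 2).mean()

# Typical usage: total_loss = adversarial_loss + gp_weight * gradient_penalty(...)
```

Gradient-penalty variants differ mainly in where the penalty is evaluated and whether it is one-sided; the template above stays the same.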
Related papers
- Reward Certification for Policy Smoothed Reinforcement Learning [14.804252729195513]
Reinforcement Learning (RL) has achieved remarkable success in safety-critical areas.
Recent studies have introduced "smoothed policies" in order to enhance its robustness.
It remains challenging, however, to establish a provable guarantee certifying a bound on the total reward.
arXiv Detail & Related papers (2023-12-11T15:07:58Z)
- Tight Performance Guarantees of Imitator Policies with Continuous Actions [45.3190496371625]
We provide theoretical guarantees on the performance of the imitator policy in the case of continuous actions.
We analyze noise injection, a common practice in which the expert action is executed in the environment after the application of a noise kernel.
arXiv Detail & Related papers (2022-12-07T19:32:11Z)
- Unpacking Reward Shaping: Understanding the Benefits of Reward Engineering on Sample Complexity [114.88145406445483]
Reinforcement learning provides an automated framework for learning behaviors from high-level reward specifications.
In practice, the choice of reward function can be crucial for good results.
arXiv Detail & Related papers (2022-10-18T04:21:25Z)
- Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation [69.1524391595912]
Myopic exploration policies such as epsilon-greedy, softmax, or Gaussian noise fail to explore efficiently in some reinforcement learning tasks.
This paper presents a theoretical analysis of such policies and provides the first regret and sample-complexity bounds for reinforcement learning with myopic exploration.
arXiv Detail & Related papers (2022-06-19T14:44:40Z)
- Imitating Past Successes can be Very Suboptimal [145.70788608016755]
We show that existing outcome-conditioned imitation learning methods do not necessarily improve the policy.
We show that a simple modification results in a method that does guarantee policy improvement.
Our aim is not to develop an entirely new method, but rather to explain how a variant of outcome-conditioned imitation learning can be used to maximize rewards.
arXiv Detail & Related papers (2022-06-07T15:13:43Z)
- Learning Robust Feedback Policies from Demonstrations [9.34612743192798]
We propose and analyze a new framework to learn feedback control policies that exhibit provable guarantees on the closed-loop performance and robustness to bounded (adversarial) perturbations.
These policies are learned from expert demonstrations without any prior knowledge of the task, its cost function, and system dynamics.
arXiv Detail & Related papers (2021-03-30T19:11:05Z)
- Off-Policy Interval Estimation with Lipschitz Value Iteration [29.232245317776723]
We propose a provably correct method for obtaining interval bounds for off-policy evaluation in a general continuous setting.
We go on to introduce a Lipschitz value iteration method to monotonically tighten the interval, which is simple yet efficient and provably convergent.
arXiv Detail & Related papers (2020-10-29T07:25:56Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how, under a more standard notion of low inherent Bellman error, typically employed in least-squares value-iteration-style algorithms, one can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
- Reinforcement Learning with Trajectory Feedback [76.94405309609552]
In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as trajectory feedback.
Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory.
We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret.
arXiv Detail & Related papers (2020-08-13T17:49:18Z)
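The "Reinforcement Learning with Trajectory Feedback" entry above relies on least-squares estimation of the unknown reward from trajectory-level scores. Below is a minimal, hypothetical sketch of that estimation step, assuming rewards linear in known state-action features (a standard assumption for such estimators); the data, names, and noiseless setup are purely illustrative and not taken from the paper.

```python
import numpy as np

# Trajectory feedback: we only observe, per trajectory, the sum of the
# (unknown) per-step rewards. If r(s, a) = phi(s, a) @ w for known features
# phi, the trajectory score is linear in the summed features, so the reward
# parameters w can be recovered by ordinary least squares.

rng = np.random.default_rng(0)
num_trajectories, horizon, feat_dim = 200, 20, 5
true_w = rng.normal(size=feat_dim)            # unknown reward parameters

# phi[i, t] holds the feature vector of the t-th step of trajectory i.
phi = rng.normal(size=(num_trajectories, horizon, feat_dim))
summed_features = phi.sum(axis=1)             # (num_trajectories, feat_dim)
trajectory_scores = summed_features @ true_w  # the only feedback we observe

# Least-squares estimate of the reward parameters from trajectory scores.
w_hat, *_ = np.linalg.lstsq(summed_features, trajectory_scores, rcond=None)
print(np.allclose(w_hat, true_w))             # True in this noiseless toy case
```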
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.