Tight Performance Guarantees of Imitator Policies with Continuous
Actions
- URL: http://arxiv.org/abs/2212.03922v1
- Date: Wed, 7 Dec 2022 19:32:11 GMT
- Title: Tight Performance Guarantees of Imitator Policies with Continuous
Actions
- Authors: Davide Maran, Alberto Maria Metelli, Marcello Restelli
- Abstract summary: We provide theoretical guarantees on the performance of the imitator policy in the case of continuous actions.
We analyze noise injection, a common practice in which the expert action is executed in the environment after the application of a noise kernel.
- Score: 45.3190496371625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Behavioral Cloning (BC) aims at learning a policy that mimics the behavior
demonstrated by an expert. The current theoretical understanding of BC is
limited to the case of finite actions. In this paper, we study BC with the goal
of providing theoretical guarantees on the performance of the imitator policy
in the case of continuous actions. We start by deriving a novel bound on the
performance gap based on Wasserstein distance, applicable for continuous-action
experts, holding under the assumption that the value function is Lipschitz
continuous. Since this latter condition is hardly fulfilled in practice, even
for Lipschitz Markov Decision Processes and policies, we propose a relaxed
setting, proving that the value function is always Hölder continuous. This result
is of independent interest and allows us to obtain a general BC bound on the
performance of the imitator policy. Finally, we analyze noise injection, a
common practice in which the expert action is executed in the environment after
the application of a noise kernel. We show that this practice allows deriving
stronger performance guarantees, at the price of a bias due to the noise
addition.
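For intuition only (not the paper's exact statement), a Wasserstein-based bound of this flavor can be sketched as follows. Assuming the imitator's action-value function $Q^{\pi_I}(s,\cdot)$ is $L_Q$-Lipschitz in the action, the performance-difference lemma combined with Kantorovich-Rubinstein duality gives

$$ J(\pi_E) - J(\pi_I) \;\le\; \frac{L_Q}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi_E}}\!\left[\mathcal{W}_1\big(\pi_E(\cdot \mid s),\, \pi_I(\cdot \mid s)\big)\right], $$

where $\mathcal{W}_1$ is the Wasserstein-1 distance between the expert and imitator action distributions and $d^{\pi_E}$ is the expert's discounted state-occupancy measure. The constants, the occupancy measure, and the exact Lipschitz (or Hölder) assumption used in the paper may differ from this sketch.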
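To make the noise-injection setting concrete, here is a minimal data-collection sketch in Python. It assumes a Gaussian noise kernel, a classic Gym-style `env` interface (reset returns an observation, step returns four values), and the common convention of storing the expert's noise-free action as the cloning label; none of these choices are prescribed by the paper.

```python
import numpy as np

def collect_with_noise_injection(env, expert_policy, noise_std=0.1, episodes=10):
    """Collect BC data while executing noise-perturbed expert actions.

    Hypothetical sketch: the expert action is passed through a Gaussian
    noise kernel before execution, so the dataset also covers states
    slightly off the expert's nominal trajectory.
    """
    dataset = []  # (state, expert_action) pairs used as supervision
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            expert_action = np.asarray(expert_policy(state))   # noise-free label
            executed = expert_action + np.random.normal(
                scale=noise_std, size=expert_action.shape)     # noise kernel
            dataset.append((state, expert_action))
            state, _, done, _ = env.step(executed)
    return dataset
```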
Related papers
- Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization [39.740287682191884]
In robust Markov decision processes (RMDPs), the reward and the transition dynamics are assumed to lie in a given uncertainty set, typically structured independently across states.
This so-called rectangularity condition is solely motivated by computational concerns.
We introduce a policy-gradient method and prove its convergence.
arXiv Detail & Related papers (2023-09-03T07:34:26Z) - Wasserstein Actor-Critic: Directed Exploration via Optimism for
Continuous-Actions Control [41.7453231409493]
Wasserstein Actor-Critic (WAC) is an actor-critic architecture inspired by Wasserstein Q-Learning (WQL).
WAC enforces exploration in a principled way by guiding the policy learning process with the optimization of an upper bound of the Q-value estimates.
arXiv Detail & Related papers (2023-03-04T10:52:20Z) - Hallucinated Adversarial Control for Conservative Offline Policy
Evaluation [64.94009515033984]
We study the problem of conservative off-policy evaluation (COPE) where given an offline dataset of environment interactions, we seek to obtain a (tight) lower bound on a policy's performance.
We introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics.
We prove that the resulting COPE estimates are valid lower bounds, and, under regularity conditions, show their convergence to the true expected return.
arXiv Detail & Related papers (2023-03-02T08:57:35Z) - Kernel Conditional Moment Constraints for Confounding Robust Inference [22.816690686310714]
We study policy evaluation of offline contextual bandits subject to unobserved confounders.
We propose a general estimator that provides a sharp lower bound of the policy value.
arXiv Detail & Related papers (2023-02-26T16:44:13Z) - Anytime-valid off-policy inference for contextual bandits [34.721189269616175]
Contextual bandit algorithms map observed contexts $X_t$ to actions $A_t$ over time.
It is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data.
We present a comprehensive framework for OPE inference that relaxes unnecessary conditions made in some past works.
arXiv Detail & Related papers (2022-10-19T17:57:53Z) - Robust and Adaptive Temporal-Difference Learning Using An Ensemble of
Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Learning Robust Feedback Policies from Demonstrations [9.34612743192798]
We propose and analyze a new framework to learn feedback control policies that exhibit provable guarantees on the closed-loop performance and robustness to bounded (adversarial) perturbations.
These policies are learned from expert demonstrations without any prior knowledge of the task, its cost function, and system dynamics.
arXiv Detail & Related papers (2021-03-30T19:11:05Z) - Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial
Imitation Learning [52.50288418639075]
We consider the case of off-policy generative adversarial imitation learning.
We show that forcing the learned reward function to be locally Lipschitz-continuous is a sine qua non condition for the method to perform well.
arXiv Detail & Related papers (2020-06-28T20:55:31Z) - Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic
Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)