Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies
- URL: http://arxiv.org/abs/2006.03900v1
- Date: Sat, 6 Jun 2020 15:52:05 GMT
- Title: Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies
- Authors: Nathan Kallus, Masatoshi Uehara
- Abstract summary: We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
- Score: 80.42316902296832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning, wherein one uses off-policy data logged by a
fixed behavior policy to evaluate and learn new policies, is crucial in
applications where experimentation is limited, such as medicine. We study the
estimation of policy value and gradient of a deterministic policy from
off-policy data when actions are continuous. Targeting deterministic policies,
for which action is a deterministic function of state, is crucial since optimal
policies are always deterministic (up to ties). In this setting, standard
importance sampling and doubly robust estimators for policy value and gradient
fail because the density ratio does not exist. To circumvent this issue, we
propose several new doubly robust estimators based on different kernelization
approaches. We analyze the asymptotic mean-squared error of each of these under
mild rate conditions for nuisance estimators. Specifically, we demonstrate how
to obtain a rate that is independent of the horizon length.
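As a concrete illustration of the kernelization idea, here is a minimal single-step (contextual-bandit) sketch of a kernelized doubly robust value estimator; the nuisance callables `q_hat` and `behavior_density`, the Gaussian kernel, and the bandwidth `h` are illustrative assumptions, not the paper's exact estimators.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kernelized_dr_value(s, a, r, pi, q_hat, behavior_density, h):
    """Kernelized doubly robust value estimate for a deterministic policy pi.

    Because pi(s) is a point mass, the usual density ratio
    pi(a|s) / pi_b(a|s) does not exist; the indicator a == pi(s)
    is relaxed to a kernel weight K((a - pi(s)) / h) / h.

    s, a, r : logged states, continuous actions, rewards (1-D arrays)
    pi      : deterministic target policy, pi(s) -> action
    q_hat   : estimated outcome model, q_hat(s, a) -> expected reward
    behavior_density : estimated behavior density pi_b(a | s)
    h       : kernel bandwidth (bias/variance trade-off)
    """
    target_a = pi(s)
    # Direct-method term: plug the target action into the outcome model.
    dm = q_hat(s, target_a)
    # Kernel-smoothed importance weight replacing the missing density ratio.
    w = gaussian_kernel((a - target_a) / h) / (h * behavior_density(s, a))
    # Doubly robust correction: reweighted outcome-model residual.
    return np.mean(dm + w * (r - q_hat(s, a)))
```

Shrinking h reduces the smoothing bias but inflates the variance; the paper's analysis balances the two under mild rate conditions on the nuisance estimators.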
Related papers
- Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies [24.706986328622193]
We consider off-policy evaluation of deterministic target policies for reinforcement learning.
We learn the kernel metrics that minimize the overall mean squared error of the estimated temporal difference update vector of an action value function.
We derive the bias and variance of the estimation error due to this relaxation and provide analytic solutions for the optimal kernel metric.
arXiv Detail & Related papers (2024-05-29T06:17:33Z)
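To make the metric-learning idea in the entry above concrete, here is a hedged sketch of a kernel weight under a learned Mahalanobis metric; the positive-definite matrix `A` would in practice be chosen to minimize the estimated MSE of the TD update, a step not shown here.

```python
import numpy as np

def mahalanobis_kernel_weight(a, target_a, A, h):
    """Kernel weight for action a relative to pi(s) under a learned metric A.

    Distance is measured as sqrt((a - pi(s))^T A (a - pi(s))) instead of
    the Euclidean norm, so the smoothing can stretch or shrink along
    action dimensions that matter more or less for the value estimate.
    """
    d = a - target_a
    dist2 = d @ A @ d  # squared distance under the learned metric
    return np.exp(-0.5 * dist2 / h ** 2)
```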
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
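For the entry above, a minimal split-conformal sketch under an exchangeability assumption; off-policy data breaks exchangeability, and the paper's construction addresses this, which this generic illustration omits.

```python
import numpy as np

def split_conformal_interval(pred_cal, y_cal, pred_new, alpha=0.1):
    """Split conformal prediction interval.

    pred_cal, y_cal : model predictions and true outcomes on a
                      held-out calibration set
    pred_new        : prediction for the new point (e.g., the target
                      policy's estimated reward)
    alpha           : miscoverage level; under exchangeability the
                      interval covers the truth with prob. >= 1 - alpha
    """
    scores = np.abs(y_cal - pred_cal)          # nonconformity scores
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))    # conformal quantile index
    if k > n:                                  # too few calibration points
        return -np.inf, np.inf
    q = np.sort(scores)[k - 1]
    return pred_new - q, pred_new + q
```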
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or more logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
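The summary above is high-level; as a generic illustration of confidence-aware off-policy estimates (not the paper's specific framework), here is a percentile-bootstrap lower bound on a plain importance-sampling estimate.

```python
import numpy as np

def bootstrap_is_lower_bound(weights, rewards, alpha=0.05, n_boot=2000, seed=0):
    """Percentile-bootstrap lower confidence bound for an IS estimate.

    weights : per-trajectory importance weights pi(tau) / pi_b(tau)
    rewards : per-trajectory cumulative rewards
    Returns an approximate (1 - alpha) lower bound on the policy value.
    """
    rng = np.random.default_rng(seed)
    vals = weights * rewards                   # per-trajectory IS terms
    n = len(vals)
    boot = np.array([
        rng.choice(vals, size=n, replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.quantile(boot, alpha)
```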
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch-data reinforcement learning with function approximation.
It considers the off-policy evaluation problem: estimating the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
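For the linear-function-approximation setting above, here is a standard least-squares temporal-difference (LSTD-style) evaluation step; the feature arrays are assumptions, and this is the textbook construction rather than the paper's minimax-optimal estimator.

```python
import numpy as np

def lstd_ope(phi_s, phi_next, rewards, gamma, reg=1e-6):
    """Least-squares TD estimate of Q under linear features.

    phi_s    : features of logged (state, action) pairs, shape (n, d)
    phi_next : features of (next_state, target_action) pairs, shape (n, d),
               where target_action = pi(next_state)
    Solves A w = b with A = Phi^T (Phi - gamma * Phi'), b = Phi^T r.
    """
    d = phi_s.shape[1]
    A = phi_s.T @ (phi_s - gamma * phi_next) + reg * np.eye(d)
    b = phi_s.T @ rewards
    w = np.linalg.solve(A, b)
    return w  # estimated value at (s, a) is phi(s, a) @ w
```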
- Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning settings such as education and healthcare.
We develop an approach that estimates bounds on the value of a given policy when the logged data may be confounded.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)
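One common construction for such bounds, sketched here in the style of marginal sensitivity models (a single-step illustration, not necessarily the paper's infinite-horizon estimator): each nominal importance weight may be off by a factor of at most Gamma, and the value bound optimizes a self-normalized weighted mean over that band.

```python
import numpy as np

def confounded_value_bound(weights, rewards, gamma_sens, lower=True):
    """Sharp bound on a self-normalized weighted mean when each weight
    can vary in [w / Gamma, w * Gamma].

    The optimizer has a threshold form: for the lower bound, rewards
    below a cutoff get the largest admissible weight and rewards above
    it the smallest; we scan all cutoffs.
    """
    order = np.argsort(rewards)
    r = rewards[order]
    lo = (weights / gamma_sens)[order]
    hi = (weights * gamma_sens)[order]
    n = len(r)
    best = np.inf if lower else -np.inf
    for k in range(n + 1):  # first k (sorted) points get the high weight
        if lower:
            w = np.concatenate([hi[:k], lo[k:]])
            best = min(best, (w @ r) / w.sum())
        else:
            w = np.concatenate([lo[:k], hi[k:]])
            best = max(best, (w @ r) / w.sum())
    return best
```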
- Statistically Efficient Off-Policy Policy Gradients [80.42316902296832]
We consider the statistically efficient estimation of policy gradients from off-policy data.
We propose a meta-algorithm that achieves the lower bound without any parametric assumptions.
We establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient.
arXiv Detail & Related papers (2020-02-10T18:41:25Z)
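A schematic gradient-ascent loop consuming an estimated off-policy gradient; `estimate_policy_gradient` is a placeholder for the paper's efficient estimator, and the constant step size is an assumption.

```python
import numpy as np

def off_policy_gradient_ascent(theta0, estimate_policy_gradient,
                               step_size=0.1, n_iters=100):
    """Ascend an estimated off-policy policy gradient.

    theta0                   : initial policy parameters
    estimate_policy_gradient : callable theta -> estimated gradient of
                               the policy value built from logged data
    The paper's guarantees concern how fast such iterates approach a
    stationary point of the true policy value.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        g = np.asarray(estimate_policy_gradient(theta))
        theta = theta + step_size * g
    return theta
```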