Minimax Value Interval for Off-Policy Evaluation and Policy Optimization
- URL: http://arxiv.org/abs/2002.02081v6
- Date: Wed, 4 Nov 2020 23:43:32 GMT
- Title: Minimax Value Interval for Off-Policy Evaluation and Policy Optimization
- Authors: Nan Jiang, Jiawei Huang
- Abstract summary: We study minimax methods for off-policy evaluation using value functions and marginalized importance weights.
Although they hold promise for overcoming the exponential variance of traditional importance sampling, several key problems remain.
For the sake of trustworthy OPE, is there any way to quantify the biases?
- Score: 28.085288472120705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study minimax methods for off-policy evaluation (OPE) using value
functions and marginalized importance weights. Although they hold promise for
overcoming the exponential variance of traditional importance sampling, several
key problems remain:
(1) They require function approximation and are generally biased. For the
sake of trustworthy OPE, is there any way to quantify the biases?
(2) They are split into two styles ("weight-learning" vs "value-learning").
Can we unify them?
In this paper we answer both questions positively. By slightly altering the
derivation of previous methods (one from each style; Uehara et al., 2020), we
unify them into a single value interval that comes with a special type of
double robustness: when either the value-function or the importance-weight
class is well specified, the interval is valid and its length quantifies the
misspecification of the other class. Our interval also provides a unified view
of, and new insights into, some recent methods, and we further explore the
implications of our results for exploration and exploitation in off-policy
policy optimization with insufficient data coverage.
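The abstract's description can be made more concrete with a schematic sketch. The following is a simplified rendering (not a verbatim reproduction of the paper's definitions) of the Lagrangian that unifies weight-learning and value-learning: $\mathcal{Q}$ is a candidate class of Q-functions, $\mathcal{W}$ a candidate class of marginalized importance weights, $\mu$ the logged-data distribution over transitions $(s, a, r, s')$, $d_0$ the initial-state distribution, and $J(\pi) = (1-\gamma)\,\mathbb{E}[\sum_t \gamma^t r_t]$ the normalized value of the target policy $\pi$.

```latex
% Schematic Lagrangian behind the minimax value interval (notation simplified).
\[
  L(q, w) \;=\; (1-\gamma)\,\mathbb{E}_{s_0 \sim d_0}\!\big[q(s_0, \pi)\big]
  \;+\; \mathbb{E}_{(s,a,r,s') \sim \mu}\!\big[\, w(s,a)\,\big(r + \gamma\, q(s', \pi) - q(s, a)\big) \big],
  \qquad q(s, \pi) := \textstyle\sum_a \pi(a \mid s)\, q(s, a).
\]
% Two identities make L useful: if q = Q^pi, the Bellman residual has zero mean,
% so L(Q^pi, w) = J(pi) for every w; if w = w^pi := d^pi / mu, a telescoping
% argument gives L(q, w^pi) = J(pi) for every q.  Consequently the two scalars
\[
  L^{(1)} := \max_{q \in \mathcal{Q}} \, \min_{w \in \mathcal{W}} L(q, w),
  \qquad
  L^{(2)} := \min_{q \in \mathcal{Q}} \, \max_{w \in \mathcal{W}} L(q, w)
\]
% bracket J(pi) whenever Q^pi is in Q or w^pi is in W (which endpoint is the lower
% one depends on which class is realizable), and the gap between them reflects the
% misspecification of the other class -- the "double robustness" described above.
```

As a further illustration, here is a hypothetical, self-contained sketch of how those two quantities could be estimated from logged transitions when $\mathcal{Q}$ and $\mathcal{W}$ are small finite classes in a tabular MDP; the function and variable names (`empirical_lagrangian`, `value_interval`, etc.) are invented for this example and are not from the paper or any released code.

```python
# Illustrative sketch only (tabular MDP, small finite candidate classes).
import numpy as np

def empirical_lagrangian(q, w, transitions, pi, d0, gamma):
    """Empirical L(q, w): (1 - gamma) * E_{d0}[q(s0, pi)] plus the sample mean of
    w(s, a) * (r + gamma * q(s', pi) - q(s, a)) over logged transitions.

    q, w : (S, A) arrays; pi : (S, A) target-policy action probabilities;
    d0 : (S,) initial-state distribution; transitions : list of (s, a, r, s_next).
    """
    v = (q * pi).sum(axis=1)                       # q(s, pi) for every state s
    init_term = (1.0 - gamma) * float(d0 @ v)
    residuals = [w[s, a] * (r + gamma * v[s_next] - q[s, a])
                 for (s, a, r, s_next) in transitions]
    return init_term + float(np.mean(residuals))

def value_interval(Q_class, W_class, transitions, pi, d0, gamma):
    """Return the two Lagrangian values (max_q min_w, min_q max_w) as a sorted pair;
    the target policy's value should fall between them, up to sampling error, when
    either candidate class contains its true function."""
    L = np.array([[empirical_lagrangian(q, w, transitions, pi, d0, gamma)
                   for w in W_class] for q in Q_class])
    maxmin = L.min(axis=1).max()                   # max over q of (min over w)
    minmax = L.max(axis=1).min()                   # min over q of (max over w)
    return min(maxmin, minmax), max(maxmin, minmax)
```

With richer function classes the inner optimizations become genuine estimation problems rather than enumerations, which is where the procedures of the paper and its predecessors come in.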
Related papers
- Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z)
- Anytime-valid off-policy inference for contextual bandits [34.721189269616175]
Contextual bandit algorithms map observed contexts $X_t$ to actions $A_t$ over time.
It is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data.
We present a comprehensive framework for OPE inference that relaxes unnecessary conditions imposed in past works.
arXiv Detail & Related papers (2022-10-19T17:57:53Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Universal Off-Policy Evaluation [64.02853483874334]
We take the first steps towards a universal off-policy estimator (UnO).
We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns.
arXiv Detail & Related papers (2021-04-26T18:54:31Z)
- Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds [21.520045697447372]
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies.
This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation.
We develop a practical algorithm through a primal-dual optimization-based approach.
arXiv Detail & Related papers (2021-03-09T22:31:20Z)
- Minimax Off-Policy Evaluation for Multi-Armed Bandits [58.7013651350436]
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards.
We develop minimax rate-optimal procedures under three settings.
arXiv Detail & Related papers (2021-01-19T18:55:29Z)
- Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders [62.54431888432302]
We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders.
We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data.
arXiv Detail & Related papers (2020-07-27T22:19:01Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.