Off-Policy Interval Estimation with Lipschitz Value Iteration
- URL: http://arxiv.org/abs/2010.15392v1
- Date: Thu, 29 Oct 2020 07:25:56 GMT
- Title: Off-Policy Interval Estimation with Lipschitz Value Iteration
- Authors: Ziyang Tang, Yihao Feng, Na Zhang, Jian Peng, Qiang Liu
- Abstract summary: We propose a provably correct method for obtaining interval bounds for off-policy evaluation in a general continuous setting.
We go on to introduce a Lipschitz value iteration method to monotonically tighten the interval, which is simple yet efficient and provably convergent.
- Score: 29.232245317776723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy evaluation provides an essential tool for evaluating the effects
of different policies or treatments using only observed data. When applied to
high-stakes scenarios such as medical diagnosis or financial decision-making,
it is crucial to provide provably correct upper and lower bounds of the
expected reward, not just a classical single point estimate, to the end-users,
as executing a poor policy can be very costly. In this work, we propose a
provably correct method for obtaining interval bounds for off-policy evaluation
in a general continuous setting. The idea is to search for the maximum and
minimum values of the expected reward among all the Lipschitz Q-functions that
are consistent with the observations, which amounts to solving a constrained
optimization problem on a Lipschitz function space. We go on to introduce a
Lipschitz value iteration method to monotonically tighten the interval, which
is simple yet efficient and provably convergent. We demonstrate the practical
efficiency of our method on a range of benchmarks.
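The following is a minimal sketch of the style of bound the abstract describes: Lipschitz envelopes over observed transitions, tightened by a Bellman-style fixed-point update. It assumes deterministic transitions, a Euclidean metric on state-action embeddings, and a known Lipschitz constant; all function names, the toy data, and the simplified update rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lipschitz_envelopes(points, q_up, q_lo, L):
    """Upper/lower envelopes of any L-Lipschitz function that stays inside
    [q_lo[i], q_up[i]] at each anchor point points[i] (shape [n, d])."""
    def upper(x):
        d = np.linalg.norm(points - x, axis=1)
        return np.min(q_up + L * d)   # tightest admissible upper bound at x
    def lower(x):
        d = np.linalg.norm(points - x, axis=1)
        return np.max(q_lo - L * d)   # tightest admissible lower bound at x
    return upper, lower

def tighten_interval(points, rewards, next_points, gamma, L, iters=50):
    """Monotonically tighten per-transition bounds on Q via the Bellman
    consistency Q(s, a) = r + gamma * Q(s', pi(s'))."""
    q_up = np.full(len(points), rewards.max() / (1.0 - gamma))  # loose start
    q_lo = np.full(len(points), rewards.min() / (1.0 - gamma))
    for _ in range(iters):
        upper, lower = lipschitz_envelopes(points, q_up, q_lo, L)
        # Bellman backup through the envelopes, intersected with the current
        # interval so the bounds can only shrink (monotone tightening).
        q_up = np.minimum(q_up, rewards + gamma * np.array([upper(x) for x in next_points]))
        q_lo = np.maximum(q_lo, rewards + gamma * np.array([lower(x) for x in next_points]))
    return q_lo, q_up

# Toy usage on a 1-D state-action embedding (purely illustrative):
rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, size=(20, 1))   # observed (s, a) embeddings
nxt = rng.uniform(0, 1, size=(20, 1))   # (s', pi(s')) embeddings
rew = np.cos(3 * pts[:, 0])             # observed rewards
lo, up = tighten_interval(pts, rew, nxt, gamma=0.9, L=3.0)
upper, lower = lipschitz_envelopes(pts, up, lo, L=3.0)
x0 = np.array([0.5])                    # target policy's initial (s, a)
print(f"value interval at x0: [{lower(x0):.3f}, {upper(x0):.3f}]")
```

Evaluating the final envelopes at the target policy's initial state-action pairs yields an interval on the expected reward; in this sketch the interval is valid only under the stated Lipschitz and deterministic-transition assumptions.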
Related papers
- Every Call is Precious: Global Optimization of Black-Box Functions with Unknown Lipschitz Constants [72.47160195925056]
We introduce Every Call is Precious (ECP), a novel approach for global optimization of Lipschitz continuous functions.
ECP eliminates the need to estimate the Lipschitz constant, thereby minimizing additional function evaluations.
ECP guarantees no-regret performance for any evaluation budget.
arXiv Detail & Related papers (2025-02-06T18:34:40Z) - Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models [56.92178753201331]
We tackle average-reward infinite-horizon POMDPs with an unknown transition model.
We present a novel and simple estimator that overcomes this barrier.
arXiv Detail & Related papers (2025-01-30T22:29:41Z) - Kernel Conditional Moment Constraints for Confounding Robust Inference [22.816690686310714]
We study policy evaluation of offline contextual bandits subject to unobserved confounders.
We propose a general estimator that provides a sharp lower bound of the policy value.
arXiv Detail & Related papers (2023-02-26T16:44:13Z) - Minimax Off-Policy Evaluation for Multi-Armed Bandits [58.7013651350436]
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards.
We develop minimax rate-optimal procedures under three settings.
arXiv Detail & Related papers (2021-01-19T18:55:29Z) - Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation.
When there is a large number of candidate features, our result sheds light on the fact that sparsity-aware methods can make batch RL more sample efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z) - CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z) - Accountable Off-Policy Evaluation With Kernel Bellman Statistics [29.14119984573459]
We consider off-policy evaluation (OPE), which evaluates the performance of a new policy from observed data collected from previous experiments.
Due to the limited information from off-policy data, it is highly desirable to construct rigorous confidence intervals, not just point estimation.
We propose a new variational framework that reduces the problem of calculating tight confidence bounds in OPE to an optimization over a feasible set that contains the true value function with high probability.
arXiv Detail & Related papers (2020-08-15T07:24:38Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
We consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.