Off-Policy Interval Estimation with Lipschitz Value Iteration
- URL: http://arxiv.org/abs/2010.15392v1
- Date: Thu, 29 Oct 2020 07:25:56 GMT
- Title: Off-Policy Interval Estimation with Lipschitz Value Iteration
- Authors: Ziyang Tang, Yihao Feng, Na Zhang, Jian Peng, Qiang Liu
- Abstract summary: We propose a provably correct method for obtaining interval bounds for off-policy evaluation in a general continuous setting.
We go on to introduce a Lipschitz value iteration method to monotonically tighten the interval, which is simple yet efficient and provably convergent.
- Score: 29.232245317776723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy evaluation provides an essential tool for evaluating the effects
of different policies or treatments using only observed data. When applied to
high-stakes scenarios such as medical diagnosis or financial decision-making,
it is crucial to provide end-users with provably correct upper and lower
bounds on the expected reward, not just a single point estimate, since
executing a poor policy can be very costly. In this work, we propose a
provably correct method for obtaining interval bounds for off-policy evaluation
in a general continuous setting. The idea is to search for the maximum and
minimum values of the expected reward among all the Lipschitz Q-functions that
are consistent with the observations, which amounts to solving a constrained
optimization problem on a Lipschitz function space. We go on to introduce a
Lipschitz value iteration method to monotonically tighten the interval, which
is simple yet efficient and provably convergent. We demonstrate the practical
efficiency of our method on a range of benchmarks.
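To make the idea more concrete, below is a minimal illustrative sketch of searching over Lipschitz Q-functions consistent with observed transitions and tightening the bounds by a value-iteration-style update. It is not the authors' algorithm: it assumes a known Lipschitz constant L under the Euclidean metric on state-action features, observed transitions (x_i, r_i, x'_i) with the next action already drawn from the target policy, and alternates a Lipschitz (McShane/Whitney-style) extension step with a Bellman-style backup. All function names and the initialization are our own.
```python
# A minimal, illustrative sketch of the interval idea described in the abstract,
# NOT the authors' algorithm. Assumptions: a known Lipschitz constant L for Q under
# the Euclidean metric on state-action features, observed transitions
# (x_i, r_i, x'_i) with the next action already drawn from the target policy,
# and rewards bounded by the largest observed |r_i|.
import numpy as np


def lipschitz_interval_iteration(X, R, X_next, gamma, L, n_iters=100):
    """Iteratively tighten per-sample upper/lower bounds on Q at the observed points.

    X      : (n, d) state-action features x_i = (s_i, a_i)
    R      : (n,)   observed rewards r_i
    X_next : (n, d) next state-action features (s'_i, a'_i), with a'_i ~ target policy
    gamma  : discount factor in (0, 1)
    L      : assumed Lipschitz constant of the Q-function
    """
    n = len(R)
    r_max = np.max(np.abs(R))
    # Trivial initial bounds |Q| <= r_max / (1 - gamma), assuming bounded rewards.
    q_up = np.full(n, r_max / (1.0 - gamma))
    q_lo = np.full(n, -r_max / (1.0 - gamma))

    # Pairwise distances between next points and observed points, used in the
    # Lipschitz (McShane/Whitney-style) extension step below.
    D = np.linalg.norm(X_next[:, None, :] - X[None, :, :], axis=-1)  # (n, n)

    for _ in range(n_iters):
        # Lipschitz extension: bound Q at the next points from bounds at observed points.
        q_up_next = np.min(q_up[None, :] + L * D, axis=1)
        q_lo_next = np.max(q_lo[None, :] - L * D, axis=1)
        # Bellman-style backup at the observed transitions, intersected with the
        # current bounds so the interval only shrinks (monotone tightening).
        q_up = np.minimum(q_up, R + gamma * q_up_next)
        q_lo = np.maximum(q_lo, R + gamma * q_lo_next)
    return q_lo, q_up


if __name__ == "__main__":
    # Purely synthetic usage example.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    X_next = X + 0.1 * rng.normal(size=(50, 3))
    R = rng.uniform(-1.0, 1.0, size=50)
    q_lo, q_up = lipschitz_interval_iteration(X, R, X_next, gamma=0.95, L=2.0)
    print("mean interval width:", np.mean(q_up - q_lo))
```
Under these simplifying assumptions, interval bounds on the expected reward would then be obtained by Lipschitz-extending q_lo and q_up to initial state-action pairs drawn from the target policy and averaging; the paper develops the provably correct and provably convergent version of this construction.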
Related papers
- Kernel Conditional Moment Constraints for Confounding Robust Inference [22.816690686310714]
We study policy evaluation of offline contextual bandits subject to unobserved confounders.
We propose a general estimator that provides a sharp lower bound of the policy value.
arXiv Detail & Related papers (2023-02-26T16:44:13Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Minimax Off-Policy Evaluation for Multi-Armed Bandits [58.7013651350436]
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards.
We develop minimax rate-optimal procedures under three settings.
arXiv Detail & Related papers (2021-01-19T18:55:29Z)
- Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch reinforcement learning (RL) using sparse linear function approximation.
When there is a large number of candidate features, our result shows that sparsity-aware methods can make batch RL more sample-efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z)
- CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how value iteration under a more standard notion of low inherent Bellman error, typically employed in least-squares value-iteration-style algorithms, can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near-optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
- Accountable Off-Policy Evaluation With Kernel Bellman Statistics [29.14119984573459]
We consider off-policy evaluation (OPE), which evaluates the performance of a new policy using observed data collected from previous experiments.
Due to the limited information in off-policy data, it is highly desirable to construct rigorous confidence intervals, not just point estimates.
We propose a new variational framework which reduces the problem of calculating tight confidence bounds in OPE to an optimization problem over a feasible set of value functions.
arXiv Detail & Related papers (2020-08-15T07:24:38Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)