Off-Policy Interval Estimation with Lipschitz Value Iteration
- URL: http://arxiv.org/abs/2010.15392v1
- Date: Thu, 29 Oct 2020 07:25:56 GMT
- Title: Off-Policy Interval Estimation with Lipschitz Value Iteration
- Authors: Ziyang Tang, Yihao Feng, Na Zhang, Jian Peng, Qiang Liu
- Abstract summary: We propose a provably correct method for obtaining interval bounds for off-policy evaluation in a general continuous setting.
We go on to introduce a Lipschitz value iteration method to monotonically tighten the interval, which is simple yet efficient and provably convergent.
- Score: 29.232245317776723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy evaluation provides an essential tool for evaluating the effects
of different policies or treatments using only observed data. When applied to
high-stakes scenarios such as medical diagnosis or financial decision-making,
it is crucial to provide provably correct upper and lower bounds of the
expected reward, not just a classical single point estimate, to the end-users,
as executing a poor policy can be very costly. In this work, we propose a
provably correct method for obtaining interval bounds for off-policy evaluation
in a general continuous setting. The idea is to search for the maximum and
minimum values of the expected reward among all the Lipschitz Q-functions that
are consistent with the observations, which amounts to solving a constrained
optimization problem on a Lipschitz function space. We go on to introduce a
Lipschitz value iteration method to monotonically tighten the interval, which
is simple yet efficient and provably convergent. We demonstrate the practical
efficiency of our method on a range of benchmarks.
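The following is a minimal sketch of the style of bound the abstract describes: Lipschitz envelopes over observed transitions, tightened by a Bellman-style fixed-point update. It assumes deterministic transitions, a Euclidean metric on state-action embeddings, and a known Lipschitz constant; all function names, the toy data, and the simplified update rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lipschitz_envelopes(points, q_up, q_lo, L):
    """Upper/lower envelopes of any L-Lipschitz function that stays inside
    [q_lo[i], q_up[i]] at each anchor point points[i] (shape [n, d])."""
    def upper(x):
        d = np.linalg.norm(points - x, axis=1)
        return np.min(q_up + L * d)   # tightest admissible upper bound at x
    def lower(x):
        d = np.linalg.norm(points - x, axis=1)
        return np.max(q_lo - L * d)   # tightest admissible lower bound at x
    return upper, lower

def tighten_interval(points, rewards, next_points, gamma, L, iters=50):
    """Monotonically tighten per-transition bounds on Q via the Bellman
    consistency Q(s, a) = r + gamma * Q(s', pi(s'))."""
    q_up = np.full(len(points), rewards.max() / (1.0 - gamma))  # loose start
    q_lo = np.full(len(points), rewards.min() / (1.0 - gamma))
    for _ in range(iters):
        upper, lower = lipschitz_envelopes(points, q_up, q_lo, L)
        # Bellman backup through the envelopes, intersected with the current
        # interval so the bounds can only shrink (monotone tightening).
        q_up = np.minimum(q_up, rewards + gamma * np.array([upper(x) for x in next_points]))
        q_lo = np.maximum(q_lo, rewards + gamma * np.array([lower(x) for x in next_points]))
    return q_lo, q_up

# Toy usage on a 1-D state-action embedding (purely illustrative):
rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, size=(20, 1))   # observed (s, a) embeddings
nxt = rng.uniform(0, 1, size=(20, 1))   # (s', pi(s')) embeddings
rew = np.cos(3 * pts[:, 0])             # observed rewards
lo, up = tighten_interval(pts, rew, nxt, gamma=0.9, L=3.0)
upper, lower = lipschitz_envelopes(pts, up, lo, L=3.0)
x0 = np.array([0.5])                    # target policy's initial (s, a)
print(f"value interval at x0: [{lower(x0):.3f}, {upper(x0):.3f}]")
```

Evaluating the final envelopes at the target policy's initial state-action pairs yields an interval on the expected reward; in this sketch the interval is valid only under the stated Lipschitz and deterministic-transition assumptions.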
Related papers
- Every Call is Precious: Global Optimization of Black-Box Functions with Unknown Lipschitz Constants [72.47160195925056]
We introduce Every Call is Precious (ECP), a novel approach for global optimization of Lipschitz continuous functions.
ECP eliminates the need to estimate the Lipschitz constant, thereby minimizing additional function evaluations.
ECP guarantees no-regret performance for any evaluation budget.
arXiv Detail & Related papers (2025-02-06T18:34:40Z) - Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models [56.92178753201331]
We tackle average-reward infinite-horizon POMDPs with an unknown transition model.
We present a novel and simple estimator that overcomes this barrier.
arXiv Detail & Related papers (2025-01-30T22:29:41Z) - Kernel Conditional Moment Constraints for Confounding Robust Inference [22.816690686310714]
We study policy evaluation of offline contextual bandits subject to unobserved confounders.
We propose a general estimator that provides a sharp lower bound of the policy value.
arXiv Detail & Related papers (2023-02-26T16:44:13Z) - Minimax Off-Policy Evaluation for Multi-Armed Bandits [58.7013651350436]
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards.
We develop minimax rate-optimal procedures under three settings.
arXiv Detail & Related papers (2021-01-19T18:55:29Z) - Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation.
When there is a large number of candidate features, our result sheds light on the fact that sparsity-aware methods can make batch RL more sample efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z) - CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z) - Accountable Off-Policy Evaluation With Kernel Bellman Statistics [29.14119984573459]
We consider off-policy evaluation (OPE), which evaluates the performance of a new policy from observed data collected from previous experiments.
Due to the limited information from off-policy data, it is highly desirable to construct rigorous confidence intervals, not just point estimation.
We propose a new variational framework that reduces the problem of calculating tight confidence bounds in OPE to an optimization over a feasible set that contains the true value function with high probability.
arXiv Detail & Related papers (2020-08-15T07:24:38Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
We consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.