Off-policy Confidence Sequences
- URL: http://arxiv.org/abs/2102.09540v1
- Date: Thu, 18 Feb 2021 18:40:30 GMT
- Title: Off-policy Confidence Sequences
- Authors: Nikos Karampatziakis, Paul Mineiro, Aaditya Ramdas
- Abstract summary: We develop confidence bounds that hold uniformly over time for off-policy evaluation in the contextual bandit setting.
We provide algorithms for computing these confidence sequences that strike a good balance between computational and statistical efficiency.
- Score: 33.749904615295485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop confidence bounds that hold uniformly over time for off-policy
evaluation in the contextual bandit setting. These confidence sequences are
based on recent ideas from martingale analysis and are non-asymptotic,
non-parametric, and valid at arbitrary stopping times. We provide algorithms
for computing these confidence sequences that strike a good balance between
computational and statistical efficiency. We empirically demonstrate the
tightness of our approach in terms of failure probability and width and apply
it to the "gated deployment" problem of safely upgrading a production
contextual bandit system.
Related papers
- Statistical Inference for Temporal Difference Learning with Linear Function Approximation [62.69448336714418]
Temporal Difference (TD) learning, arguably the most widely used for policy evaluation, serves as a natural framework for this purpose.
In this paper, we study the consistency properties of TD learning with Polyak-Ruppert averaging and linear function approximation, and obtain three significant improvements over existing results.
arXiv Detail & Related papers (2024-10-21T15:34:44Z) - Tighter Confidence Bounds for Sequential Kernel Regression [3.683202928838613]
We use martingale tail inequalities to establish new confidence bounds for sequential kernel regression.
Our confidence bounds can be computed by solving a conic program, although this bare version quickly becomes impractical.
We find that when our confidence bounds replace existing ones, the KernelUCB algorithm has better empirical performance, a matching worst-case performance guarantee and comparable computational cost.
arXiv Detail & Related papers (2024-03-19T13:47:35Z) - High Confidence Level Inference is Almost Free using Parallel Stochastic
Optimization [16.38026811561888]
This paper introduces a novel inference method focused on constructing confidence intervals with efficient computation and fast convergence to the nominal level.
Our method requires minimal additional computation and memory beyond the standard updating of estimates, making the inference process almost cost-free.
arXiv Detail & Related papers (2024-01-17T17:11:45Z) - Likelihood Ratio Confidence Sets for Sequential Decision Making [51.66638486226482]
We revisit the likelihood-based inference principle and propose to use likelihood ratios to construct valid confidence sequences.
Our method is especially suitable for problems with well-specified likelihoods.
We show how to provably choose the best sequence of estimators and shed light on connections to online convex optimization.
arXiv Detail & Related papers (2023-11-08T00:10:21Z) - Improved Algorithms for Stochastic Linear Bandits Using Tail Bounds for
Martingale Mixtures [26.683757807252675]
We present improved algorithms with worst-case regret guarantees for the linear bandit problem.
We show that our confidence sequences are tighter than competitors, both empirically and theoretically.
arXiv Detail & Related papers (2023-09-25T17:13:46Z) - Huber-Robust Confidence Sequences [37.16361789841549]
Confidence sequences are confidence intervals that can be sequentially tracked, and are valid at arbitrary data-dependent stopping times.
We show that the resulting confidence sequences attain the optimal width achieved in the nonsequential setting.
Since confidence sequences are a common tool used within A/B/n testing and bandits, these results open the door to sequential experimentation that is robust to outliers and adversarial corruptions.
arXiv Detail & Related papers (2023-01-23T17:29:26Z) - An evaluation of word-level confidence estimation for end-to-end
automatic speech recognition [70.61280174637913]
We investigate confidence estimation for end-to-end automatic speech recognition (ASR)
We provide an extensive benchmark of popular confidence methods on four well-known speech datasets.
Our results suggest a strong baseline can be obtained by scaling the logits by a learnt temperature.
arXiv Detail & Related papers (2021-01-14T09:51:59Z) - Towards Safe Policy Improvement for Non-Stationary MDPs [48.9966576179679]
Many real-world problems of interest exhibit non-stationarity, and when stakes are high, the cost associated with a false stationarity assumption may be unacceptable.
We take the first steps towards ensuring safety, with high confidence, for smoothly-varying non-stationary decision problems.
Our proposed method extends a type of safe algorithm, called a Seldonian algorithm, through a synthesis of model-free reinforcement learning with time-series analysis.
arXiv Detail & Related papers (2020-10-23T20:13:51Z) - CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z) - Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the mean.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.