Related papers: Avoiding the Price of Adaptivity: Inference in Linear Contextual Bandits via Stability

Avoiding the Price of Adaptivity: Inference in Linear Contextual Bandits via Stability

URL: http://arxiv.org/abs/2512.20368v1
Date: Tue, 23 Dec 2025 13:53:53 GMT
Title: Avoiding the Price of Adaptivity: Inference in Linear Contextual Bandits via Stability
Authors: Samya Praharaj, Koulik Khamaru,
Abstract summary: We argue that stability and statistical efficiency can coexist within a single contextual bandit method.<n>We show that our algorithm achieves regret guarantees that are minimax optimal up to logarithmic factors.
Score: 2.5782420501870296
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Statistical inference in contextual bandits is complicated by the adaptive, non-i.i.d. nature of the data. A growing body of work has shown that classical least-squares inference may fail under adaptive sampling, and that constructing valid confidence intervals for linear functionals of the model parameter typically requires paying an unavoidable inflation of order $\sqrt{d \log T}$. This phenomenon -- often referred to as the price of adaptivity -- highlights the inherent difficulty of reliable inference under general contextual bandit policies. A key structural property that circumvents this limitation is the \emph{stability} condition of Lai and Wei, which requires the empirical feature covariance to concentrate around a deterministic limit. When stability holds, the ordinary least-squares estimator satisfies a central limit theorem, and classical Wald-type confidence intervals -- designed for i.i.d. data -- become asymptotically valid even under adaptation, \emph{without} incurring the $\sqrt{d \log T}$ price of adaptivity. In this paper, we propose and analyze a penalized EXP4 algorithm for linear contextual bandits. Our first main result shows that this procedure satisfies the Lai--Wei stability condition and therefore admits valid Wald-type confidence intervals for linear functionals. Our second result establishes that the same algorithm achieves regret guarantees that are minimax optimal up to logarithmic factors, demonstrating that stability and statistical efficiency can coexist within a single contextual bandit method. Finally, we complement our theory with simulations illustrating the empirical normality of the resulting estimators and the sharpness of the corresponding confidence intervals.

Related papers

Statistical Inference under Adaptive Sampling with LinUCB [15.167069362020426]
We show that the linear upper confidence bound (LinUCB) algorithm for linear bandits satisfies a property called stability.<n>We establish a central limit theorem for the LinUCB algorithm, establishing normality for the limiting distribution of the estimation error.
arXiv Detail & Related papers (2025-11-28T21:48:18Z)
Optimal Regularization Under Uncertainty: Distributional Robustness and Convexity Constraints [9.77322868877488]
We introduce a framework for distributionally robust optimal regularization.<n>We show how the resulting robust regularizers interpolate between computation of the training distribution and uniform priors.
arXiv Detail & Related papers (2025-10-03T19:35:38Z)
Statistical Inference for Misspecified Contextual Bandits [6.178061357164435]
Contextual bandit algorithms have transformed modern experimentation by enabling real-time adaptation for personalized treatment.<n>Yet these advantages create challenges for statistical inference due to adaptivity.<n> Convergence ensures replicability of adaptive experiments and stability of online algorithms.
arXiv Detail & Related papers (2025-09-08T02:19:37Z)
Pointwise confidence estimation in the non-linear $\ell^2$-regularized least squares [12.352761060862072]
We consider a high-probability non-asymptotic confidence estimation in the $ell2$-regularized non-linear least-squares setting with fixed design.<n>We show a pointwise confidence bound, meaning that it holds for the prediction on any given fixed test input $x$.
arXiv Detail & Related papers (2025-06-08T11:23:49Z)
Statistical Inference for Temporal Difference Learning with Linear Function Approximation [55.80276145563105]
We investigate the statistical properties of Temporal Difference learning with Polyak-Ruppert averaging.<n>We make three theoretical contributions that improve upon the current state-of-the-art results.
arXiv Detail & Related papers (2024-10-21T15:34:44Z)
Likelihood Ratio Confidence Sets for Sequential Decision Making [51.66638486226482]
We revisit the likelihood-based inference principle and propose to use likelihood ratios to construct valid confidence sequences. Our method is especially suitable for problems with well-specified likelihoods. We show how to provably choose the best sequence of estimators and shed light on connections to online convex optimization.
arXiv Detail & Related papers (2023-11-08T00:10:21Z)
When Does Confidence-Based Cascade Deferral Suffice? [69.28314307469381]
Cascades are a classical strategy to enable inference cost to vary adaptively across samples. A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction. Despite being oblivious to the structure of the cascade, confidence-based deferral often works remarkably well in practice.
arXiv Detail & Related papers (2023-07-06T04:13:57Z)
Inference on Strongly Identified Functionals of Weakly Identified Functions [71.42652863687117]
We study a novel condition for the functional to be strongly identified even when the nuisance function is not. We propose penalized minimax estimators for both the primary and debiasing nuisance functions.
arXiv Detail & Related papers (2022-08-17T13:38:31Z)
Near-optimal inference in adaptive linear regression [60.08422051718195]
Even simple methods like least squares can exhibit non-normal behavior when data is collected in an adaptive manner. We propose a family of online debiasing estimators to correct these distributional anomalies in at least squares estimation. We demonstrate the usefulness of our theory via applications to multi-armed bandit, autoregressive time series estimation, and active learning with exploration.
arXiv Detail & Related papers (2021-07-05T21:05:11Z)
Post-Contextual-Bandit Inference [57.88785630755165]
Contextual bandit algorithms are increasingly replacing non-adaptive A/B tests in e-commerce, healthcare, and policymaking. They can both improve outcomes for study participants and increase the chance of identifying good or even best policies. To support credible inference on novel interventions at the end of the study, we still want to construct valid confidence intervals on average treatment effects, subgroup effects, or value of new policies.
arXiv Detail & Related papers (2021-06-01T12:01:51Z)
CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning. We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
Optimal Change-Point Detection with Training Sequences in the Large and Moderate Deviations Regimes [72.68201611113673]
This paper investigates a novel offline change-point detection problem from an information-theoretic perspective. We assume that the knowledge of the underlying pre- and post-change distributions are not known and can only be learned from the training sequences which are available.
arXiv Detail & Related papers (2020-03-13T23:39:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.