Estimating Optimal Policy Value in General Linear Contextual Bandits
- URL: http://arxiv.org/abs/2302.09451v1
- Date: Sun, 19 Feb 2023 01:09:24 GMT
- Title: Estimating Optimal Policy Value in General Linear Contextual Bandits
- Authors: Jonathan N. Lee, Weihao Kong, Aldo Pacchiano, Vidya Muthukumar, Emma
Brunskill
- Abstract summary: In many bandit problems, the maximal reward achievable by a policy is often unknown in advance.
We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable.
We present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$.
- Score: 50.008542459050155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many bandit problems, the maximal reward achievable by a policy is often
unknown in advance. We consider the problem of estimating the optimal policy
value in the sublinear data regime before the optimal policy is even learnable.
We refer to this as $V^*$ estimation. It was recently shown that fast $V^*$
estimation is possible but only in disjoint linear bandits with Gaussian
covariates. Whether this is possible for more realistic context distributions
has remained an open and important question for tasks such as model selection.
In this paper, we first provide lower bounds showing that this general problem
is hard. However, under stronger assumptions, we give an algorithm and analysis
proving that $\widetilde{\mathcal{O}}(\sqrt{d})$ sublinear estimation of $V^*$
is indeed information-theoretically possible, where $d$ is the dimension. We
then present a more practical, computationally efficient algorithm that
estimates a problem-dependent upper bound on $V^*$ that holds for general
distributions and is tight when the context distribution is Gaussian. We prove
our algorithm requires only $\widetilde{\mathcal{O}}(\sqrt{d})$ samples to
estimate the upper bound. We use this upper bound and the estimator to obtain
novel and improved guarantees for several applications in bandit model
selection and testing for treatment effects.
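
In the disjoint linear model the optimal policy value is $V^* = \mathbb{E}_x[\max_a \langle x, \theta_a \rangle]$, the expected reward of playing the best arm for each context. To make the target quantity concrete, the following is a minimal illustrative sketch of a naive plug-in Monte Carlo estimate of $V^*$ from per-arm ridge regressions under uniform exploration; the arm count, dimension, sample size, noise level, and Gaussian context distribution are illustrative assumptions, and this is not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n = 20, 5, 400             # dimension, number of arms, exploration budget (assumed)

# Ground-truth disjoint linear model: reward(x, a) = <x, theta[a]> + noise
theta = rng.normal(size=(K, d)) / np.sqrt(d)

def sample_contexts(m):
    return rng.normal(size=(m, d))           # Gaussian contexts (the paper's tight case)

# Uniform exploration: observe a context, pull a uniformly random arm
X = sample_contexts(n)
A = rng.integers(K, size=n)
y = np.einsum("ij,ij->i", X, theta[A]) + 0.1 * rng.normal(size=n)

# Per-arm ridge regression gives plug-in estimates of each theta_a
theta_hat = np.zeros((K, d))
for a in range(K):
    Xa, ya = X[A == a], y[A == a]
    theta_hat[a] = np.linalg.solve(Xa.T @ Xa + 1e-3 * np.eye(d), Xa.T @ ya)

# Plug-in Monte Carlo estimate of V* = E_x[ max_a <x, theta_a> ]
X_eval = sample_contexts(100_000)
V_hat = float(np.mean(np.max(X_eval @ theta_hat.T, axis=1)))
V_star_mc = float(np.mean(np.max(X_eval @ theta.T, axis=1)))
print(f"plug-in estimate of V*: {V_hat:.3f}   (Monte Carlo V*: {V_star_mc:.3f})")
```

In the paper's sublinear regime the per-arm regressions above are not informative, which is why the paper instead targets direct $\widetilde{\mathcal{O}}(\sqrt{d})$-sample estimation of $V^*$ and of a problem-dependent upper bound on it; the sketch only illustrates the quantity being estimated.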
Related papers
- Model Predictive Control is Almost Optimal for Restless Bandit [2.295863158976069]
We consider the discrete-time infinite-horizon average-reward restless Markovian bandit (RMAB) problem.
We propose a model predictive control based non-stationary policy with a rolling computational horizon $\tau$.
We show that its sub-optimality gap is $O(1/\sqrt{N})$ in general, and $\exp(-\Omega(N))$ under a local-stability condition.
arXiv Detail & Related papers (2024-10-08T19:34:23Z)
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted OFUL$^+$ and Restarted SAVE$^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z)
- Robust Sparse Estimation for Gaussians with Optimal Error under Huber Contamination [42.526664955704746]
We study sparse estimation tasks in Huber's contamination model with a focus on mean estimation, PCA, and linear regression.
For each of these tasks, we give the first sample and computationally efficient robust estimators with optimal error guarantees.
At the technical level, we develop a novel multidimensional filtering method in the sparse regime that may find other applications.
arXiv Detail & Related papers (2024-03-15T15:51:27Z)
- Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery [45.601316850669406]
We present efficient algorithms for policy evaluation, best policy identification and regret minimization.
For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal.
All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix.
arXiv Detail & Related papers (2024-02-24T06:36:08Z)
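
The entry above describes a first phase that estimates the left and right singular subspaces of the low-rank reward matrix via spectral methods. A minimal illustrative sketch of such a spectral step follows; the matrix sizes, rank, and entrywise Gaussian noise model are assumptions, and the paper's sampling scheme and two-to-infinity analysis are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows, n_cols, r = 60, 40, 3            # assumed matrix sizes and rank

# Ground-truth rank-r reward matrix and a noisy empirical version of it
M = rng.normal(size=(n_rows, r)) @ rng.normal(size=(r, n_cols))
M_hat = M + 0.5 * rng.normal(size=(n_rows, n_cols))     # entrywise reward noise

# Spectral phase: top-r singular subspaces of the noisy matrix
U, s, Vt = np.linalg.svd(M_hat, full_matrices=False)
U_r, V_r = U[:, :r], Vt[:r, :].T          # estimated left / right singular subspaces

# Compare with the true left subspace via the sin-theta (largest principal angle) distance
U_true = np.linalg.svd(M, full_matrices=False)[0][:, :r]
sin_theta = np.linalg.norm(U_r @ U_r.T - U_true @ U_true.T, ord=2)
print(f"left singular subspace sin-theta error: {sin_theta:.3f}")
```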
- Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
We study the generalized low-rank matrix bandit problem, proposed in Lu et al. (2021), under the Generalized Linear Model (GLM) framework.
To overcome the computational infeasibility and theoretical restraint of existing algorithms, we first propose the G-ESTT framework.
We show that G-ESTT can achieve a regret bound of $\widetilde{O}(\sqrt{(d_1+d_2)^{3/2} M r^{3/2} T})$, while G-ESTS can achieve the $\widetilde{O}(\dots)$
arXiv Detail & Related papers (2024-01-14T14:14:19Z)
- Optimality in Mean Estimation: Beyond Worst-Case, Beyond Sub-Gaussian, and Beyond $1+\alpha$ Moments [10.889739958035536]
We introduce a new definitional framework to analyze the fine-grained optimality of algorithms.
We show that median-of-means is neighborhood optimal, up to constant factors.
It remains an open problem to find a neighborhood-separated estimator without constant-factor slackness.
arXiv Detail & Related papers (2023-11-21T18:50:38Z)
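
The mean-estimation entry above revolves around the median-of-means estimator. A minimal sketch of that classical estimator follows; the group count and the heavy-tailed test distribution are illustrative assumptions, and the paper's neighborhood-optimality framework is not reproduced.

```python
import numpy as np

def median_of_means(x: np.ndarray, k: int, seed: int = 0) -> float:
    """Randomly split the sample into k groups, average each group,
    and return the median of the k group means."""
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(x), k)
    return float(np.median([g.mean() for g in groups]))

# Heavy-tailed data: the median of group means is much less sensitive
# to extreme observations than the plain empirical mean.
rng = np.random.default_rng(3)
sample = rng.standard_t(df=2, size=10_000)    # symmetric around 0, heavy tails
print("empirical mean :", sample.mean())
print("median of means:", median_of_means(sample, k=50))
```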
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed-confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
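
The constrained-bandit entry above proposes an upper-confidence-bound algorithm (OPLB). Its specific constraint handling is not reproduced here; the sketch below only shows the standard ridge-regression optimistic index that linear-bandit UCB algorithms of this kind build on, with the dimension and confidence width chosen arbitrarily for illustration.

```python
import numpy as np

class LinUCBIndex:
    """Standard optimistic index for a linear bandit: ridge estimate plus an
    exploration bonus from the regularized design matrix (not OPLB itself)."""

    def __init__(self, d: int, lam: float = 1.0, beta: float = 1.0):
        self.A = lam * np.eye(d)      # regularized design matrix
        self.b = np.zeros(d)          # running sum of reward-weighted features
        self.beta = beta              # confidence width (held constant for simplicity)

    def update(self, x: np.ndarray, reward: float) -> None:
        self.A += np.outer(x, x)
        self.b += reward * x

    def ucb(self, x: np.ndarray) -> float:
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b
        return float(x @ theta_hat + self.beta * np.sqrt(x @ A_inv @ x))

# Usage: play the action whose features maximize the optimistic index.
index = LinUCBIndex(d=5)
actions = np.eye(5)                   # toy action feature vectors
chosen = max(range(len(actions)), key=lambda a: index.ucb(actions[a]))
index.update(actions[chosen], reward=1.0)
```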
This list is automatically generated from the titles and abstracts of the papers on this site.