Stochastic Bandits with Linear Constraints
- URL: http://arxiv.org/abs/2006.10185v1
- Date: Wed, 17 Jun 2020 22:32:19 GMT
- Title: Stochastic Bandits with Linear Constraints
- Authors: Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett, Heinrich Jiang
- Abstract summary: We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
- Score: 69.757694218456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a constrained contextual linear bandit setting, where the goal of
the agent is to produce a sequence of policies, whose expected cumulative
reward over the course of $T$ rounds is maximum, and each has an expected cost
below a certain threshold $\tau$. We propose an upper-confidence bound
algorithm for this problem, called optimistic pessimistic linear bandit (OPLB),
and prove an $\widetilde{\mathcal{O}}(\frac{d\sqrt{T}}{\tau-c_0})$ bound on its
$T$-round regret, where the denominator is the difference between the
constraint threshold and the cost of a known feasible action. We further
specialize our results to multi-armed bandits and propose a computationally
efficient algorithm for this setting. We prove a regret bound of
$\widetilde{\mathcal{O}}(\frac{\sqrt{KT}}{\tau - c_0})$ for this algorithm in
$K$-armed bandits, which is a $\sqrt{K}$ improvement over the regret bound we
obtain by simply casting multi-armed bandits as an instance of contextual
linear bandits and using the regret bound of OPLB. We also prove a lower-bound
for the problem studied in the paper and provide simulations to validate our
theoretical results.
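As a rough illustration of the optimism-for-reward / pessimism-for-cost idea behind OPLB, here is a minimal Python sketch of one selection round over a finite action set, assuming ridge-regression statistics for the reward and cost parameters. This is only a sketch, not the authors' implementation: the paper's OPLB produces policies (distributions over actions) rather than single actions, and the function names, confidence radii `beta_r` and `beta_c`, and the toy actions below are illustrative assumptions.

```python
import numpy as np

def confidence_bounds(x, A_inv, theta_hat, beta):
    """Lower/upper confidence bounds for x^T theta under an ellipsoidal confidence set."""
    mean = float(x @ theta_hat)
    width = beta * float(np.sqrt(x @ A_inv @ x))
    return mean - width, mean + width

def optimistic_pessimistic_select(actions, A_r, b_r, A_c, b_c, tau, beta_r, beta_c):
    """One round: among actions whose pessimistic (upper) cost bound is <= tau,
    pick the one with the largest optimistic (upper) reward bound."""
    A_r_inv, A_c_inv = np.linalg.inv(A_r), np.linalg.inv(A_c)
    theta_r, theta_c = A_r_inv @ b_r, A_c_inv @ b_c   # ridge estimates of reward/cost parameters
    best, best_ucb = None, -np.inf
    for x in actions:
        _, reward_ucb = confidence_bounds(x, A_r_inv, theta_r, beta_r)
        _, cost_ucb = confidence_bounds(x, A_c_inv, theta_c, beta_c)
        if cost_ucb <= tau and reward_ucb > best_ucb:
            best, best_ucb = x, reward_ucb
    return best  # None means no action currently looks safely feasible

# Toy usage with d = 2 features and ridge statistics initialized to the identity.
d = 2
A_r, A_c = np.eye(d), np.eye(d)
b_r, b_c = np.zeros(d), np.zeros(d)
actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(optimistic_pessimistic_select(actions, A_r, b_r, A_c, b_c,
                                    tau=0.5, beta_r=0.1, beta_c=0.1))
```

Setting the features to one-hot indicator vectors recovers the $K$-armed case discussed in the abstract, although the paper's dedicated multi-armed algorithm improves on the bound obtained this way by a factor of $\sqrt{K}$.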
Related papers
- Tangential Randomization in Linear Bandits (TRAiL): Guaranteed Inference and Regret Bounds [1.03590082373586]
We propose and analyze TRAiL, a regret-optimal forced exploration algorithm for linear bandits.
TRAiL ensures $\Omega(\sqrt{T})$ growth in the inference quality, measured via the minimum eigenvalue of the design (regressor) matrix (a minimal numerical sketch of this quantity appears after the list below).
We characterize a minimax lower bound of $\Omega(\sqrt{T})$ on the expected regret of any algorithm.
arXiv Detail & Related papers (2024-11-19T01:08:13Z)
- Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
We study the generalized low-rank matrix bandit problem, proposed in \cite{lu2021low} under the Generalized Linear Model (GLM) framework.
To overcome the computational infeasibility and theoretical restrictions of existing algorithms, we first propose the G-ESTT framework.
We show that G-ESTT can achieve a regret bound of $\tilde{O}(\sqrt{(d_1+d_2)^{3/2} M r^{3/2} T})$, while G-ESTS can achieve the $\tilde{O}$ …
arXiv Detail & Related papers (2024-01-14T14:14:19Z)
- Bypassing the Simulator: Near-Optimal Adversarial Linear Contextual Bandits [30.337826346496385]
We consider the adversarial linear contextual bandit problem, where the loss vectors are selected fully adversarially and the per-round action set is drawn from a fixed distribution.
Existing methods for this problem either require access to a simulator to generate free i.i.d. contexts, achieve a sub-optimal regret, or are computationally inefficient.
We greatly improve these results by achieving a regret of $\widetilde{O}(\sqrt{T})$ without a simulator, maintaining computational efficiency when the action set in each round is small.
arXiv Detail & Related papers (2023-09-02T03:49:05Z)
- Communication-Constrained Bandits under Additive Gaussian Noise [111.06688156723018]
We study a distributed multi-armed bandit where a client supplies the learner with communication-constrained feedback.
We propose a multi-phase bandit algorithm, $\mathtt{UE\text{-}UCB++}$, which matches the corresponding lower bound up to a minor additive factor.
arXiv Detail & Related papers (2023-04-25T09:31:20Z)
- Squeeze All: Novel Estimator and Self-Normalized Bound for Linear Contextual Bandits [18.971564419292893]
We propose a linear contextual bandit algorithm with an $O(\sqrt{dT\log T})$ regret bound, where $d$ is the dimension of contexts and $T$ is the time horizon.
Our proposed algorithm is equipped with a novel estimator in which exploration is embedded through explicit randomization.
arXiv Detail & Related papers (2022-06-11T02:43:17Z)
- Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well-established notion for measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the complete policy regret.
We provide an algorithm that, with high probability, achieves a complete policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the $\tilde{\mathcal{O}}$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z)
- Nearly Minimax Algorithms for Linear Bandits with Shared Representation [86.79657561369397]
We consider the setting where we play $M$ linear bandits with dimension $d$, each for $T$ rounds, and these $M$ bandit tasks share a common $k$ ($\ll d$) dimensional linear representation.
We come up with novel algorithms that achieve $\widetilde{O}\left(d\sqrt{kMT} + kM\sqrt{T}\right)$ regret bounds, which match the known minimax regret lower bound up to logarithmic factors.
arXiv Detail & Related papers (2022-03-29T15:27:13Z)
- Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z)
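The TRAiL entry above measures inference quality through the minimum eigenvalue of the design (regressor) matrix. Below is a minimal, self-contained sketch of that quantity; the dimensions and random actions are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def design_matrix_min_eig(X):
    """Minimum eigenvalue of the design (regressor) matrix sum_t x_t x_t^T.

    Larger values mean the played actions span all directions well,
    i.e. least-squares parameter estimates are well conditioned."""
    V = X.T @ X                             # d x d design matrix
    return float(np.linalg.eigvalsh(V)[0])  # eigvalsh returns eigenvalues in ascending order

# Toy usage: 100 random 3-dimensional actions (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
print(design_matrix_min_eig(X))
```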