Communication-Constrained Bandits under Additive Gaussian Noise
- URL: http://arxiv.org/abs/2304.12680v2
- Date: Tue, 6 Jun 2023 10:22:56 GMT
- Title: Communication-Constrained Bandits under Additive Gaussian Noise
- Authors: Prathamesh Mayekar, Jonathan Scarlett, and Vincent Y.F. Tan
- Abstract summary: We study a distributed multi-armed bandit where a client supplies the learner with communication-constrained feedback.
We propose a multi-phase bandit algorithm, $\mathtt{UE\text{-}UCB++}$, which matches this lower bound up to a minor additive factor.
- Score: 111.06688156723018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study a distributed stochastic multi-armed bandit where a client supplies
the learner with communication-constrained feedback based on the rewards for
the corresponding arm pulls. In our setup, the client must encode the rewards
such that the second moment of the encoded rewards is no more than $P$, and
this encoded reward is further corrupted by additive Gaussian noise of variance
$\sigma^2$; the learner only has access to this corrupted reward. For this
setting, we derive an information-theoretic lower bound of
$\Omega\left(\sqrt{\frac{KT}{\mathtt{SNR} \wedge1}} \right)$ on the minimax
regret of any scheme, where $ \mathtt{SNR} := \frac{P}{\sigma^2}$, and $K$ and
$T$ are the number of arms and time horizon, respectively. Furthermore, we
propose a multi-phase bandit algorithm, $\mathtt{UE\text{-}UCB++}$, which
matches this lower bound up to a minor additive factor. $\mathtt{UE\text{-}UCB++}$
performs uniform exploration in its initial phases and then utilizes the {\em
upper confidence bound }(UCB) bandit algorithm in its final phase. An
interesting feature of $\mathtt{UE\text{-}UCB++}$ is that the coarser estimates
of the mean rewards formed during a uniform exploration phase help to refine
the encoding protocol in the next phase, leading to more accurate mean
estimates of the rewards in the subsequent phase. This positive reinforcement
cycle is critical to reducing the number of uniform exploration rounds and
closely matching our lower bound.
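A minimal simulation can make the roles of $P$, $\sigma^2$, and the phase structure concrete. The sketch below is illustrative only and rests on assumptions not stated in the abstract: rewards lie in $[0,1]$, the client uses a simple clip-and-scale encoder so the transmitted value's second moment stays below $P$, the phase schedule and window widths are fixed constants, and standard UCB confidence widths are used. The helper names (send, receive, pull) and all numbers are hypothetical; this is not the authors' UE-UCB++ protocol, only the idea that coarse mean estimates from uniform exploration let the encoder be recentred and narrowed, raising the effective SNR in later phases.

    # Illustrative simulation of a communication-constrained bandit (assumptions above).
    import numpy as np

    rng = np.random.default_rng(0)
    K, T = 5, 20000              # arms and horizon (hypothetical values)
    P, sigma = 1.0, 1.0          # power constraint and channel noise std; SNR = P / sigma^2
    mu = rng.uniform(0.2, 0.8, size=K)   # true mean rewards, unknown to the learner

    def send(reward, centre, width):
        # Client: centre and scale the reward so E[x^2] <= P, then the channel adds noise.
        x = np.sqrt(P) * np.clip((reward - centre) / width, -1.0, 1.0)
        return x + rng.normal(0.0, sigma)

    def receive(y, centre, width):
        # Learner: invert the agreed-upon encoding to recover a noisy reward estimate.
        return centre + width * y / np.sqrt(P)

    counts, means = np.zeros(K), np.zeros(K)
    centre, width = np.full(K, 0.5), np.full(K, 0.5)   # phase-1 encoder covers all of [0, 1]

    def pull(arm):
        r = np.clip(mu[arm] + 0.1 * rng.normal(), 0.0, 1.0)   # reward for this pull
        est = receive(send(r, centre[arm], width[arm]), centre[arm], width[arm])
        counts[arm] += 1
        means[arm] += (est - means[arm]) / counts[arm]

    # Uniform-exploration phase with the coarse encoder.
    n_explore = 200
    for arm in range(K):
        for _ in range(n_explore):
            pull(arm)

    # Refine the encoder around the coarse estimates: a narrower window wastes less of
    # the power budget, so later estimates are more accurate (the reinforcement cycle).
    # Clipping can bias estimates; a real scheme would set the window from confidence bounds.
    centre, width = means.copy(), np.full(K, 0.2)

    # Final phase: standard UCB on the decoded rewards.
    for t in range(K * n_explore, T):
        ucb = means + np.sqrt(2.0 * np.log(t + 1) / counts)
        pull(int(np.argmax(ucb)))

    regret = T * mu.max() - float(counts @ mu)
    print(f"pseudo-regret over T = {T}: {regret:.1f}")

Running the sketch with a larger $\sigma$ (or smaller $P$) visibly inflates the regret, mirroring the $\mathtt{SNR}$ dependence in the lower bound above.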
Related papers
- Stochastic Bandits Robust to Adversarial Attacks [33.278131584647745]
This paper investigates multi-armed bandit algorithms that are robust to adversarial attacks.
We study two cases of this model, with or without the knowledge of an attack budget $C$.
We devise two types of algorithms whose regret bounds have additive or multiplicative dependence on $C$.
arXiv Detail & Related papers (2024-08-16T17:41:35Z)
- Fast Rates for Bandit PAC Multiclass Classification [73.17969992976501]
We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct.
Our main contribution is a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem.
arXiv Detail & Related papers (2024-06-18T08:54:04Z)
- Nearly Minimax Optimal Regret for Multinomial Logistic Bandit [8.087699764574788]
We study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information.
There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$.
We propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{T/K})$.
arXiv Detail & Related papers (2024-05-16T06:07:31Z)
- Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery [45.601316850669406]
We present efficient algorithms for policy evaluation, best policy identification and regret minimization.
For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal.
All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix.
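As an aside, the spectral step described above can be illustrated with a plain SVD of a noisily observed low-rank matrix. This is a minimal sketch under simplifying assumptions (known rank, uniform entry-wise sampling, a spectral-norm projector error rather than the paper's two-to-infinity analysis); the dimensions and noise level are hypothetical.

    # Illustrative spectral step: estimate the singular subspaces of a low-rank reward matrix.
    import numpy as np

    rng = np.random.default_rng(1)
    m, n, r = 30, 40, 2                              # matrix size and (assumed known) rank
    U = np.linalg.qr(rng.normal(size=(m, r)))[0]     # true left singular subspace
    V = np.linalg.qr(rng.normal(size=(n, r)))[0]     # true right singular subspace
    M = U @ np.diag([3.0, 2.0]) @ V.T                # low-rank mean-reward matrix

    # Average many noisy reward samples per entry to form an empirical matrix.
    samples_per_entry = 50
    M_hat = M + rng.normal(scale=1.0 / np.sqrt(samples_per_entry), size=(m, n))

    # Spectral step: top-r singular vectors of the empirical matrix.
    U_hat, _, Vt_hat = np.linalg.svd(M_hat)
    U_r, V_r = U_hat[:, :r], Vt_hat[:r, :].T

    # Projector-distance error for each estimated subspace (spectral norm).
    err_left = np.linalg.norm(U @ U.T - U_r @ U_r.T, 2)
    err_right = np.linalg.norm(V @ V.T - V_r @ V_r.T, 2)
    print(f"left-subspace error {err_left:.3f}, right-subspace error {err_right:.3f}")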
arXiv Detail & Related papers (2024-02-24T06:36:08Z)
- Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
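For context, CVaR at tolerance $\tau$ is the expected return over the worst $\tau$-fraction of outcomes. Below is a minimal empirical estimator of that quantity; the simple tail-average form is an assumption made here for illustration, not the paper's algorithm.

    # Empirical CVaR at risk tolerance tau: average of the worst ceil(tau * n) returns.
    import numpy as np

    def empirical_cvar(returns, tau):
        x = np.sort(np.asarray(returns, dtype=float))   # ascending: worst outcomes first
        k = max(1, int(np.ceil(tau * len(x))))
        return x[:k].mean()

    rng = np.random.default_rng(2)
    returns = rng.normal(loc=1.0, scale=1.0, size=10000)
    print(empirical_cvar(returns, tau=0.05))   # markedly below the mean return of 1.0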
arXiv Detail & Related papers (2023-02-07T02:22:31Z)
- Causal Bandits for Linear Structural Equation Models [58.2875460517691]
This paper studies the problem of designing an optimal sequence of interventions in a causal graphical model.
It is assumed that the graph's structure is known and has $N$ nodes.
Two algorithms are proposed for the frequentist (UCB-based) and Bayesian settings.
arXiv Detail & Related papers (2022-08-26T16:21:31Z)
- Bandits Corrupted by Nature: Lower Bounds on Regret and Robust Optimistic Algorithm [14.214707836697823]
We study the corrupted bandit problem, i.e. a multi-armed bandit problem with $k$ unknown reward distributions.
We propose a novel UCB-type algorithm for corrupted bandits, namely HubUCB, that builds on Huber's estimator for robust mean estimation.
We experimentally illustrate the efficiency of HubUCB and SeqHubUCB in solving corrupted bandits for different reward distributions and different levels of corruptions.
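Huber's M-estimator of the mean, which HubUCB is stated to build on, can be computed by iteratively reweighted averaging; a minimal sketch follows. The threshold delta, iteration count, and example data are illustrative assumptions, not values from the paper.

    # Huber M-estimate of a mean via iteratively reweighted averaging.
    import numpy as np

    def huber_mean(x, delta=0.3, iters=50):
        # Points further than `delta` from the current estimate are down-weighted,
        # limiting the influence of heavy tails or corrupted observations.
        x = np.asarray(x, dtype=float)
        mu = float(np.median(x))                 # robust starting point
        for _ in range(iters):
            r = np.abs(x - mu)
            w = np.minimum(1.0, delta / np.maximum(r, 1e-12))
            mu = float(np.sum(w * x) / np.sum(w))
        return mu

    rng = np.random.default_rng(3)
    clean = rng.normal(0.5, 0.1, size=200)
    data = np.concatenate([clean, np.full(20, 10.0)])   # ~9% gross corruptions
    print(np.mean(data), huber_mean(data))              # sample mean is pulled up; Huber mean is not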
arXiv Detail & Related papers (2022-03-07T07:44:05Z)
- Combinatorial Bandits without Total Order for Arms [52.93972547896022]
We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2\sqrt{n T \log T}\right)$ gap-independent regret bound.
arXiv Detail & Related papers (2021-03-03T23:08:59Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)