Communication-Constrained Bandits under Additive Gaussian Noise
- URL: http://arxiv.org/abs/2304.12680v2
- Date: Tue, 6 Jun 2023 10:22:56 GMT
- Title: Communication-Constrained Bandits under Additive Gaussian Noise
- Authors: Prathamesh Mayekar, Jonathan Scarlett, and Vincent Y.F. Tan
- Abstract summary: We study a distributed multi-armed bandit where a client supplies the learner with communication-constrained feedback.
We propose a multi-phase bandit algorithm, $\mathtt{UE\text{-}UCB++}$, which matches this lower bound up to a minor additive factor.
- Score: 111.06688156723018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study a distributed stochastic multi-armed bandit where a client supplies
the learner with communication-constrained feedback based on the rewards for
the corresponding arm pulls. In our setup, the client must encode the rewards
such that the second moment of the encoded rewards is no more than $P$, and
this encoded reward is further corrupted by additive Gaussian noise of variance
$\sigma^2$; the learner only has access to this corrupted reward. For this
setting, we derive an information-theoretic lower bound of
$\Omega\left(\sqrt{\frac{KT}{\mathtt{SNR} \wedge1}} \right)$ on the minimax
regret of any scheme, where $ \mathtt{SNR} := \frac{P}{\sigma^2}$, and $K$ and
$T$ are the number of arms and time horizon, respectively. Furthermore, we
propose a multi-phase bandit algorithm, $\mathtt{UE\text{-}UCB++}$, which
matches this lower bound up to a minor additive factor. $\mathtt{UE\text{-}UCB++}$
performs uniform exploration in its initial phases and then utilizes the {\em
upper confidence bound }(UCB) bandit algorithm in its final phase. An
interesting feature of $\mathtt{UE\text{-}UCB++}$ is that the coarser estimates
of the mean rewards formed during a uniform exploration phase help to refine
the encoding protocol in the next phase, leading to more accurate mean
estimates of the rewards in the subsequent phase. This positive reinforcement
cycle is critical to reducing the number of uniform exploration rounds and
closely matching our lower bound.
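A minimal simulation can make the roles of $P$, $\sigma^2$, and the phase structure concrete. The sketch below is illustrative only and rests on assumptions not stated in the abstract: rewards lie in $[0,1]$, the client uses a simple clip-and-scale encoder so the transmitted value's second moment stays below $P$, the phase schedule and window widths are fixed constants, and standard UCB confidence widths are used. The helper names (send, receive, pull) and all numbers are hypothetical; this is not the authors' UE-UCB++ protocol, only the idea that coarse mean estimates from uniform exploration let the encoder be recentred and narrowed, raising the effective SNR in later phases.

    # Illustrative simulation of a communication-constrained bandit (assumptions above).
    import numpy as np

    rng = np.random.default_rng(0)
    K, T = 5, 20000              # arms and horizon (hypothetical values)
    P, sigma = 1.0, 1.0          # power constraint and channel noise std; SNR = P / sigma^2
    mu = rng.uniform(0.2, 0.8, size=K)   # true mean rewards, unknown to the learner

    def send(reward, centre, width):
        # Client: centre and scale the reward so E[x^2] <= P, then the channel adds noise.
        x = np.sqrt(P) * np.clip((reward - centre) / width, -1.0, 1.0)
        return x + rng.normal(0.0, sigma)

    def receive(y, centre, width):
        # Learner: invert the agreed-upon encoding to recover a noisy reward estimate.
        return centre + width * y / np.sqrt(P)

    counts, means = np.zeros(K), np.zeros(K)
    centre, width = np.full(K, 0.5), np.full(K, 0.5)   # phase-1 encoder covers all of [0, 1]

    def pull(arm):
        r = np.clip(mu[arm] + 0.1 * rng.normal(), 0.0, 1.0)   # reward for this pull
        est = receive(send(r, centre[arm], width[arm]), centre[arm], width[arm])
        counts[arm] += 1
        means[arm] += (est - means[arm]) / counts[arm]

    # Uniform-exploration phase with the coarse encoder.
    n_explore = 200
    for arm in range(K):
        for _ in range(n_explore):
            pull(arm)

    # Refine the encoder around the coarse estimates: a narrower window wastes less of
    # the power budget, so later estimates are more accurate (the reinforcement cycle).
    # Clipping can bias estimates; a real scheme would set the window from confidence bounds.
    centre, width = means.copy(), np.full(K, 0.2)

    # Final phase: standard UCB on the decoded rewards.
    for t in range(K * n_explore, T):
        ucb = means + np.sqrt(2.0 * np.log(t + 1) / counts)
        pull(int(np.argmax(ucb)))

    regret = T * mu.max() - float(counts @ mu)
    print(f"pseudo-regret over T = {T}: {regret:.1f}")

Running the sketch with a larger $\sigma$ (or smaller $P$) visibly inflates the regret, mirroring the $\mathtt{SNR}$ dependence in the lower bound above.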
Related papers
- Stochastic Bandits Robust to Adversarial Attacks [33.278131584647745]
This paper investigates multi-armed bandit algorithms that are robust to adversarial attacks.
We study two cases of this model, with or without the knowledge of an attack budget $C$.
We devise two types of algorithms whose regret bounds have additive or multiplicative dependence on $C$.
arXiv Detail & Related papers (2024-08-16T17:41:35Z)
- Fast Rates for Bandit PAC Multiclass Classification [73.17969992976501]
We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct.
Our main contribution is a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem.
arXiv Detail & Related papers (2024-06-18T08:54:04Z)
- Nearly Minimax Optimal Regret for Multinomial Logistic Bandit [8.087699764574788]
We study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information.
There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$.
We propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{T/K})$.
arXiv Detail & Related papers (2024-05-16T06:07:31Z)
- Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery [45.601316850669406]
We present efficient algorithms for policy evaluation, best policy identification and regret minimization.
For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal.
All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix.
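As an aside, the spectral step described above can be illustrated with a plain SVD of a noisily observed low-rank matrix. This is a minimal sketch under simplifying assumptions (known rank, uniform entry-wise sampling, a spectral-norm projector error rather than the paper's two-to-infinity analysis); the dimensions and noise level are hypothetical.

    # Illustrative spectral step: estimate the singular subspaces of a low-rank reward matrix.
    import numpy as np

    rng = np.random.default_rng(1)
    m, n, r = 30, 40, 2                              # matrix size and (assumed known) rank
    U = np.linalg.qr(rng.normal(size=(m, r)))[0]     # true left singular subspace
    V = np.linalg.qr(rng.normal(size=(n, r)))[0]     # true right singular subspace
    M = U @ np.diag([3.0, 2.0]) @ V.T                # low-rank mean-reward matrix

    # Average many noisy reward samples per entry to form an empirical matrix.
    samples_per_entry = 50
    M_hat = M + rng.normal(scale=1.0 / np.sqrt(samples_per_entry), size=(m, n))

    # Spectral step: top-r singular vectors of the empirical matrix.
    U_hat, _, Vt_hat = np.linalg.svd(M_hat)
    U_r, V_r = U_hat[:, :r], Vt_hat[:r, :].T

    # Projector-distance error for each estimated subspace (spectral norm).
    err_left = np.linalg.norm(U @ U.T - U_r @ U_r.T, 2)
    err_right = np.linalg.norm(V @ V.T - V_r @ V_r.T, 2)
    print(f"left-subspace error {err_left:.3f}, right-subspace error {err_right:.3f}")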
arXiv Detail & Related papers (2024-02-24T06:36:08Z)
- Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
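For context, CVaR at tolerance $\tau$ is the expected return over the worst $\tau$-fraction of outcomes. Below is a minimal empirical estimator of that quantity; the simple tail-average form is an assumption made here for illustration, not the paper's algorithm.

    # Empirical CVaR at risk tolerance tau: average of the worst ceil(tau * n) returns.
    import numpy as np

    def empirical_cvar(returns, tau):
        x = np.sort(np.asarray(returns, dtype=float))   # ascending: worst outcomes first
        k = max(1, int(np.ceil(tau * len(x))))
        return x[:k].mean()

    rng = np.random.default_rng(2)
    returns = rng.normal(loc=1.0, scale=1.0, size=10000)
    print(empirical_cvar(returns, tau=0.05))   # markedly below the mean return of 1.0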
arXiv Detail & Related papers (2023-02-07T02:22:31Z)
- Causal Bandits for Linear Structural Equation Models [58.2875460517691]
This paper studies the problem of designing an optimal sequence of interventions in a causal graphical model.
It is assumed that the graph's structure is known and has $N$ nodes.
Two algorithms are proposed for the frequentist (UCB-based) and Bayesian settings.
arXiv Detail & Related papers (2022-08-26T16:21:31Z)
- Bandits Corrupted by Nature: Lower Bounds on Regret and Robust Optimistic Algorithm [14.214707836697823]
We study the corrupted bandit problem, i.e. a multi-armed bandit problem with $k$ unknown reward distributions.
We propose a novel UCB-type algorithm for corrupted bandits, namely HubUCB, that builds on Huber's estimator for robust mean estimation.
We experimentally illustrate the efficiency of HubUCB and SeqHubUCB in solving corrupted bandits for different reward distributions and different levels of corruptions.
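Huber's M-estimator of the mean, which HubUCB is stated to build on, can be computed by iteratively reweighted averaging; a minimal sketch follows. The threshold delta, iteration count, and example data are illustrative assumptions, not values from the paper.

    # Huber M-estimate of a mean via iteratively reweighted averaging.
    import numpy as np

    def huber_mean(x, delta=0.3, iters=50):
        # Points further than `delta` from the current estimate are down-weighted,
        # limiting the influence of heavy tails or corrupted observations.
        x = np.asarray(x, dtype=float)
        mu = float(np.median(x))                 # robust starting point
        for _ in range(iters):
            r = np.abs(x - mu)
            w = np.minimum(1.0, delta / np.maximum(r, 1e-12))
            mu = float(np.sum(w * x) / np.sum(w))
        return mu

    rng = np.random.default_rng(3)
    clean = rng.normal(0.5, 0.1, size=200)
    data = np.concatenate([clean, np.full(20, 10.0)])   # ~9% gross corruptions
    print(np.mean(data), huber_mean(data))              # sample mean is pulled up; Huber mean is not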
arXiv Detail & Related papers (2022-03-07T07:44:05Z)
- Combinatorial Bandits without Total Order for Arms [52.93972547896022]
We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2\sqrt{n T \log T}\right)$ gap-independent regret bound.
arXiv Detail & Related papers (2021-03-03T23:08:59Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)