Nash Regret Guarantees for Linear Bandits
- URL: http://arxiv.org/abs/2310.02023v1
- Date: Tue, 3 Oct 2023 12:58:10 GMT
- Title: Nash Regret Guarantees for Linear Bandits
- Authors: Ayush Sawarni, Soumyabrata Pal, and Siddharth Barman
- Abstract summary: We develop an algorithm that achieves a Nash regret of $O\left( \sqrt{\frac{d\nu}{T}} \log( T |X|)\right)$.
In addition, addressing linear bandit instances in which the set of arms $X$ is not necessarily finite, we obtain a Nash regret upper bound of $O\left( \frac{d^{\frac{5}{4}}\nu^{\frac{1}{2}}}{\sqrt{T}} \log(T)\right)$.
- Score: 12.494184403263338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We obtain essentially tight upper bounds for a strengthened notion of regret
in the stochastic linear bandits framework. The strengthening -- referred to as
Nash regret -- is defined as the difference between the (a priori unknown)
optimum and the geometric mean of expected rewards accumulated by the linear
bandit algorithm. Since the geometric mean corresponds to the well-studied Nash
social welfare (NSW) function, this formulation quantifies the performance of a
bandit algorithm as the collective welfare it generates across rounds. NSW is
known to satisfy fairness axioms and, hence, an upper bound on Nash regret
provides a principled fairness guarantee.
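For concreteness, the Nash regret described above can be written as a single expression; the notation here ($x_t$ for the arm played in round $t$, $\theta^{*}$ for the unknown parameter vector, and $\langle x, \theta^{*} \rangle$ for the expected reward of arm $x$) is assumed for illustration rather than quoted from the paper:
$$
\mathrm{NR}_T \;=\; \max_{x \in X} \langle x, \theta^{*} \rangle \;-\; \left( \prod_{t=1}^{T} \mathbb{E}\big[ \langle x_t, \theta^{*} \rangle \big] \right)^{1/T}.
$$
Since the geometric mean never exceeds the arithmetic mean (AM-GM), an upper bound on Nash regret also bounds the usual average regret, which is the sense in which this notion is a strengthening.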
We consider the stochastic linear bandits problem over a horizon of $T$
rounds and with a set of arms ${X}$ in ambient dimension $d$. Furthermore, we
focus on settings in which the stochastic reward -- associated with each arm in
${X}$ -- is a non-negative, $\nu$-sub-Poisson random variable. For this
setting, we develop an algorithm that achieves a Nash regret of $O\left(
\sqrt{\frac{d\nu}{T}} \log( T |X|)\right)$. In addition, addressing linear
bandit instances in which the set of arms ${X}$ is not necessarily finite, we
obtain a Nash regret upper bound of $O\left(
\frac{d^\frac{5}{4}\nu^{\frac{1}{2}}}{\sqrt{T}} \log(T)\right)$. Since bounded
random variables are sub-Poisson, these results hold for bounded, positive
rewards. Our linear bandit algorithm is built upon the successive elimination
method with novel technical insights, including tailored concentration bounds
and the use of sampling via John ellipsoid in conjunction with the
Kiefer-Wolfowitz optimal design.
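As a rough illustration of the recipe named above (successive elimination driven by an optimal design), the sketch below implements a generic phased-elimination loop over a finite arm set, with the Kiefer-Wolfowitz (G-optimal) design approximated by Frank-Wolfe iterations. It is a minimal sketch, not the paper's algorithm: the Gaussian noise model, phase schedule, elimination radius, and function names are assumptions, and the John-ellipsoid sampling and sub-Poisson concentration bounds are not reproduced.

```python
# Minimal sketch (assumptions noted), NOT the paper's algorithm: generic phased
# elimination for stochastic linear bandits over a finite arm set, where each
# phase samples arms from a Kiefer-Wolfowitz (G-optimal) design approximated by
# Frank-Wolfe. Gaussian noise, the phase schedule, and the elimination radius
# are illustrative choices.
import numpy as np


def g_optimal_design(arms, iters=300):
    """Approximate Kiefer-Wolfowitz optimal design weights via Frank-Wolfe."""
    n, d = arms.shape
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        A = arms.T @ (arms * pi[:, None])                # sum_i pi_i x_i x_i^T
        A_inv = np.linalg.pinv(A)
        g = np.einsum("ij,jk,ik->i", arms, A_inv, arms)  # x^T A^{-1} x per arm
        i = int(np.argmax(g))
        # Exact line-search step for the D-optimal objective (guarded for d = 1).
        gamma = (g[i] / d - 1.0) / max(g[i] - 1.0, 1e-12)
        pi = (1.0 - gamma) * pi
        pi[i] += gamma
    return pi


def phased_elimination(arms, theta_star, T, noise=0.1, seed=0):
    """Run successive elimination guided by the optimal design; return rewards."""
    rng = np.random.default_rng(seed)
    d = arms.shape[1]
    active = np.arange(len(arms))
    rewards, t, phase = [], 0, 1
    while t < T and len(active) > 0:
        A_act = arms[active]
        pi = g_optimal_design(A_act)
        n_phase = int(min(2 ** phase * d, T - t))        # assumed phase schedule
        counts = rng.multinomial(n_phase, pi / pi.sum())  # sample arms from design
        V, b = np.zeros((d, d)), np.zeros(d)
        for x, c in zip(A_act, counts):
            if c == 0:
                continue
            r = x @ theta_star + noise * rng.standard_normal(c)
            V += c * np.outer(x, x)
            b += x * r.sum()
            rewards.extend(r.tolist())
        theta_hat = np.linalg.pinv(V) @ b                # least-squares estimate
        est = A_act @ theta_hat
        radius = 2.0 * noise * np.sqrt(d / n_phase)      # illustrative threshold
        active = active[est >= est.max() - radius]       # drop clearly bad arms
        t += n_phase
        phase += 1
    return np.array(rewards[:T])


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((20, 5))
    theta = rng.standard_normal(5)
    rs = phased_elimination(X, theta, T=2000)
    print("mean reward:", rs.mean(), "best arm value:", (X @ theta).max())
```

The design weights matter because, at the Kiefer-Wolfowitz optimum, the worst-case prediction variance $\max_{x} x^{\top} A(\pi)^{-1} x$ over the active arms equals $d$, so after $n$ design-weighted pulls every active arm's estimate has comparable accuracy of order $\sqrt{d/n}$; that uniform control is what justifies the elimination threshold in the sketch.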
Related papers
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as the $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^{+}$ and Restarted $\text{SAVE}^{+}$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z)
- Communication-Constrained Bandits under Additive Gaussian Noise [111.06688156723018]
We study a distributed multi-armed bandit where a client supplies the learner with communication-constrained feedback.
We propose a multi-phase bandit algorithm, $\mathtt{UE\text{-}UCB++}$, which matches this lower bound to a minor additive factor.
arXiv Detail & Related papers (2023-04-25T09:31:20Z)
- Fairness and Welfare Quantification for Regret in Multi-Armed Bandits [13.094997642327371]
We extend the notion of regret with a welfarist perspective.
We study Nash regret, defined as the difference between the -- a priori unknown -- optimal mean (among the arms) and the algorithm's performance.
arXiv Detail & Related papers (2022-05-27T12:12:56Z)
- Nearly Minimax Algorithms for Linear Bandits with Shared Representation [86.79657561369397]
We consider the setting where we play $M$ linear bandits with dimension $d$, each for $T$ rounds, and these $M$ bandit tasks share a common $k$ $(\ll d)$ dimensional linear representation.
We come up with novel algorithms that achieve $\widetilde{O}\left(d\sqrt{kMT} + kM\sqrt{T}\right)$ regret bounds, which matches the known minimax regret lower bound up to logarithmic factors.
arXiv Detail & Related papers (2022-03-29T15:27:13Z)
- Adversarial Combinatorial Bandits with General Non-linear Reward Functions [29.775097827244853]
We study the adversarial bandit with a known non-linear reward function, extending existing work on adversarial linear bandit.
We show that with $N$ arms and subsets of $K$ arms being chosen at each of $T$ time periods, the minimax optimal regret is $\widetilde{\Theta}_d(\sqrt{N^d T})$ if the reward function is a $d$-degree polynomial with $d < K$, and $\Theta_K(\sqrt{N^K T})$ if the reward function is not a low-degree polynomial.
arXiv Detail & Related papers (2021-01-05T00:56:27Z)
- Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z)
- Stochastic Linear Bandits Robust to Adversarial Attacks [117.665995707568]
We provide two variants of a Robust Phased Elimination algorithm, one that knows $C$ and one that does not.
We show that both variants attain near-optimal regret in the non-corrupted case $C = 0$, while incurring additional additive terms respectively.
In a contextual setting, we show that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing $C$.
arXiv Detail & Related papers (2020-07-07T09:00:57Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- Regret and Belief Complexity Trade-off in Gaussian Process Bandits via Information Thresholding [42.669970064867556]
We show how to characterize the trade-off between regret bounds of GP bandit algorithms and complexity of the posterior distributions.
We observe state-of-the-art accuracy and complexity trade-offs for GP bandit algorithms applied to global optimization.
arXiv Detail & Related papers (2020-03-23T21:05:15Z)