A Regret bound for Non-stationary Multi-Armed Bandits with Fairness
Constraints
- URL: http://arxiv.org/abs/2012.13380v1
- Date: Thu, 24 Dec 2020 18:12:01 GMT
- Title: A Regret bound for Non-stationary Multi-Armed Bandits with Fairness
Constraints
- Authors: Shaarad A. R and Ambedkar Dukkipati
- Abstract summary: We present a new algorithm called Fair Upper Confidence Bound with Exploration (Fair-UCBe) for solving a slowly varying $k$-armed bandit problem.
We show that the performance of our algorithm in the non-stationary case approaches that of its stationary counterpart as the variation in the environment tends to zero.
- Score: 7.716156977428555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The multi-armed bandit framework is the most common platform to study
strategies for sequential decision-making problems. Recently, the notion of
fairness has attracted a lot of attention in the machine learning community.
One can impose the fairness condition that at any given point of time, even
during the learning phase, a poorly performing candidate should not be
preferred over a better candidate. This fairness constraint is known to be one
of the most stringent and has been studied in the stochastic multi-armed
bandit framework in a stationary setting, for which regret bounds have been
established. The main aim of this paper is to study this problem in a
non-stationary setting. We present a new algorithm called Fair Upper Confidence
Bound with Exploration (Fair-UCBe) for solving a slowly varying
stochastic $k$-armed bandit problem. With this, we present two results: (i)
Fair-UCBe indeed satisfies the above-mentioned fairness condition, and (ii) it
achieves a regret bound of $O\left(k^{\frac{3}{2}} T^{1 - \frac{\alpha}{2}}
\sqrt{\log T}\right)$, for some suitable $\alpha \in (0, 1)$, where $T$ is the
time horizon. This is the first fair algorithm with a sublinear regret bound
applicable to non-stationary bandits to the best of our knowledge. We show that
the performance of our algorithm in the non-stationary case approaches that of
its stationary counterpart as the variation in the environment tends to zero.
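The abstract does not spell out the selection rule, so the following is only a rough Python sketch of how a fair, sliding-window UCB step in the spirit of Fair-UCBe could look. The window length, the confidence radius, and the rule "play uniformly at random among all arms that are not confidently dominated" are illustrative assumptions, not the authors' exact construction.

import math
import random
from collections import deque

def fair_ucb_step(history, k, t, delta=0.05):
    # One round of a fair, sliding-window UCB selection rule (illustrative).
    # history[a] is a bounded deque of recent rewards (in [0, 1]) for arm a.
    ucb, lcb = [], []
    for a in range(k):
        n = len(history[a])
        if n == 0:
            # An unobserved arm keeps the widest possible interval, which
            # forces exploration by keeping it in the candidate set.
            ucb.append(1.0)
            lcb.append(0.0)
            continue
        mean = sum(history[a]) / n
        radius = math.sqrt(math.log(2.0 * k * t / delta) / (2.0 * n))
        ucb.append(min(1.0, mean + radius))
        lcb.append(max(0.0, mean - radius))
    # Fairness: an arm is passed over only if some other arm is better with
    # high confidence, i.e. its upper bound falls below the best lower bound;
    # all remaining candidates are played with equal probability.
    best_lcb = max(lcb)
    candidates = [a for a in range(k) if ucb[a] >= best_lcb]
    return random.choice(candidates)

# Usage on a slowly drifting Bernoulli bandit (purely illustrative).
k, horizon, window = 5, 10000, 500
history = [deque(maxlen=window) for _ in range(k)]
means = [random.random() for _ in range(k)]
for t in range(1, horizon + 1):
    arm = fair_ucb_step(history, k, t)
    history[arm].append(1.0 if random.random() < means[arm] else 0.0)
    # Slow drift of the arm means, mimicking the slowly varying environment.
    means = [min(1.0, max(0.0, m + random.gauss(0.0, 1e-3))) for m in means]

Here the sliding window is what discards stale observations; the actual Fair-UCBe exploration schedule and its dependence on the parameter $\alpha$ in the regret bound follow the analysis in the paper.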
Related papers
- Feel-Good Thompson Sampling for Contextual Dueling Bandits [49.450050682705026]
We propose a Thompson sampling algorithm, named FGTS.CDB, for linear contextual dueling bandits.
At the core of our algorithm is a new Feel-Good exploration term specifically tailored for dueling bandits.
Our algorithm achieves nearly minimax-optimal regret, i.e., $\tilde{\mathcal{O}}(d\sqrt{T})$, where $d$ is the model dimension and $T$ is the time horizon.
arXiv Detail & Related papers (2024-04-09T04:45:18Z) - Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted $\text{OFUL}^+$ and Restarted $\text{SAVE}^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z) - Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T} \sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation that, in scenarios where the comparison is deterministic, the algorithm only suffers from an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z) - Non-Stationary Dueling Bandits [8.298716599039501]
We study the non-stationary dueling bandits problem with $K$ arms, where the time horizon $T$ consists of $M$ stationary segments.
To minimize the accumulated regret, the learner needs to pick the Condorcet winner of each stationary segment as often as possible.
We show a regret bound for the non-stationary case, without requiring knowledge of $M$ or $T$.
arXiv Detail & Related papers (2022-02-02T09:57:35Z) - Breaking the Moments Condition Barrier: No-Regret Algorithm for Bandits
with Super Heavy-Tailed Payoffs [27.636407641546914]
We propose a novel robust statistical estimator, mean of medians, which estimates a random variable by computing the empirical mean of a sequence of empirical medians (a small illustrative sketch of this estimator appears after this list).
We show that the regret bound is near-optimal even with very heavy-tailed noise.
arXiv Detail & Related papers (2021-10-26T17:30:44Z) - On Slowly-varying Non-stationary Bandits [25.305949034527202]
We consider dynamic regret in non-stationary bandits with a slowly varying property.
We establish the first instance-dependent regret upper bound for slowly varying non-stationary bandits.
We show that our algorithm is essentially minimax optimal.
arXiv Detail & Related papers (2021-10-25T12:56:19Z) - Upper Confidence Bounds for Combining Stochastic Bandits [52.10197476419621]
We provide a simple method to combine bandit algorithms.
Our approach is based on a "meta-UCB" procedure that treats each of $N$ individual bandit algorithms as arms in a higher-level $N$-armed bandit problem.
arXiv Detail & Related papers (2020-12-24T05:36:29Z) - Stochastic Linear Bandits Robust to Adversarial Attacks [117.665995707568]
We provide two variants of a Robust Phased Elimination algorithm, one that knows $C$ and one that does not.
We show that both variants attain near-optimal regret in the non-corrupted case $C = 0$, while incurring additional additive terms respectively.
In a contextual setting, we show that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing $C$.
arXiv Detail & Related papers (2020-07-07T09:00:57Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - The Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms [10.662105162882526]
We investigate a $k$-armed bandit problem in the many-armed regime.
Our findings suggest a new form of free exploration beneficial to greedy algorithms in the many-armed context.
arXiv Detail & Related papers (2020-02-24T08:59:34Z) - Regret Minimization in Stochastic Contextual Dueling Bandits [40.17224226373741]
We consider the problem of $K$-armed dueling bandit in the contextual setting.
We present two algorithms for the setup with respective regret guarantees.
arXiv Detail & Related papers (2020-02-20T06:36:19Z)
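As an aside, the mean-of-medians estimator mentioned under "Breaking the Moments Condition Barrier" above is simple enough to sketch from the abstract alone: partition the samples into groups, take the empirical median of each group, and average those medians. The group size and the heavy-tailed example below are illustrative choices, not the paper's tuning.

import random
import statistics

def mean_of_medians(samples, group_size):
    # Partition the samples into consecutive groups of size group_size, take
    # the empirical median of each full group, then average those medians.
    groups = [samples[i:i + group_size]
              for i in range(0, len(samples) - group_size + 1, group_size)]
    if not groups:
        raise ValueError("need at least one full group of samples")
    medians = [statistics.median(g) for g in groups]
    return sum(medians) / len(medians)

# Applying the estimator to heavy-tailed draws (illustrative only).
samples = [random.paretovariate(1.1) for _ in range(10000)]
print(mean_of_medians(samples, group_size=25))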