Top $K$ Ranking for Multi-Armed Bandit with Noisy Evaluations
- URL: http://arxiv.org/abs/2112.06517v2
- Date: Tue, 14 Dec 2021 16:35:18 GMT
- Title: Top $K$ Ranking for Multi-Armed Bandit with Noisy Evaluations
- Authors: Evrard Garcelon and Vashist Avadhanula and Alessandro Lazaric and
Matteo Pirotta
- Abstract summary: We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy, independent evaluations of the true reward of each arm.
We derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated.
- Score: 102.32996053572144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider a multi-armed bandit setting where, at the beginning of each
round, the learner receives noisy, independent, and possibly biased
\emph{evaluations} of the true reward of each arm and it selects $K$ arms with
the objective of accumulating as much reward as possible over $T$ rounds. Under
the assumption that at each round the true reward of each arm is drawn from a
fixed distribution, we derive different algorithmic approaches and theoretical
guarantees depending on how the evaluations are generated. First, we show a
$\widetilde{O}(T^{2/3})$ regret in the general case when the observation
functions are generalized linear functions of the true rewards. On the other
hand, we show that an improved $\widetilde{O}(\sqrt{T})$ regret can be derived
when the observation functions are noisy linear functions of the true rewards.
Finally, we report an empirical validation that confirms our theoretical
findings, provides a thorough comparison to alternative approaches, and further
supports the interest of this setting in practice.
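As a minimal illustration of this setting (a sketch under assumed parameters, not the paper's algorithm or experiments), the following Python snippet simulates one round-by-round instance with noisy, biased linear evaluations and a naive baseline that simply selects the $K$ arms with the largest evaluations; the arm count, horizon, and observation model below are illustrative assumptions.

import numpy as np

# Illustrative simulation of the setting: at each round, true rewards are drawn
# from fixed per-arm distributions, the learner only sees noisy and biased
# linear evaluations of them, and must pick K arms. The "naive" learner below
# just trusts the evaluations; it is a baseline, not the paper's algorithm.
rng = np.random.default_rng(0)
n_arms, K, T = 20, 3, 10_000                         # assumed sizes, for illustration only

mu = rng.uniform(0.0, 1.0, size=n_arms)              # fixed (unknown) mean reward per arm
slope = 0.8                                          # assumed observation slope
bias = rng.uniform(-0.5, 0.5, size=n_arms)           # per-arm bias ("possibly biased" evaluations)

naive_total, oracle_total = 0.0, 0.0
for _ in range(T):
    rewards = mu + rng.normal(0.0, 1.0, size=n_arms)                    # true rewards this round
    evals = slope * rewards + bias + rng.normal(0.0, 0.5, size=n_arms)  # noisy evaluations

    naive_pick = np.argsort(evals)[-K:]      # trust the evaluations as-is
    oracle_pick = np.argsort(rewards)[-K:]   # per-round oracle, for comparison only

    naive_total += rewards[naive_pick].sum()
    oracle_total += rewards[oracle_pick].sum()

print(f"regret of the naive baseline over {T} rounds: {oracle_total - naive_total:.1f}")

In this simulated instance the persistent per-arm bias is what can make ranking by raw evaluations suboptimal, which is what motivates exploiting the structure of the observation functions (generalized linear or noisy linear) as described in the abstract above.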
Related papers
- Sparsity-Agnostic Linear Bandits with Adaptive Adversaries [19.84322270472381]
We study linear bandits where, in each round, the learner receives a set of actions (i.e., feature vectors) from which it chooses an element and obtains a reward.
The expected reward is a fixed but unknown linear function of the chosen action.
We study sparse regret bounds that depend on the number $S$ of non-zero coefficients in the linear reward function.
arXiv Detail & Related papers (2024-06-03T10:54:58Z) - Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as the variation budget $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z) - Multi-Armed Bandits with Interference [39.578572123342944]
We show that switchback policies achieve an optimal expected regret of $\widetilde{O}(\sqrt{T})$ against the best fixed-arm policy.
We propose a cluster randomization policy whose regret is optimal in expectation.
arXiv Detail & Related papers (2024-02-02T19:02:47Z) - Combinatorial Bandits for Maximum Value Reward Function under Max
Value-Index Feedback [9.771002043127728]
We consider a multi-armed bandit problem for maximum value reward function under maximum value and index feedback.
We propose an algorithm and provide a regret bound for problem instances with arm outcomes according to arbitrary distributions with finite supports.
Our algorithm achieves an $O((k/\Delta)\log(T))$ distribution-dependent and a $\widetilde{O}(\sqrt{T})$ distribution-independent regret.
arXiv Detail & Related papers (2023-05-25T14:02:12Z) - Randomized Greedy Learning for Non-monotone Stochastic Submodular
Maximization Under Full-bandit Feedback [98.29086113546045]
We investigate the problem of unconstrained multi-armed bandits with full-bandit feedback and submodular rewards.
We show that RGL (Randomized Greedy Learning) empirically outperforms other full-bandit variants in submodular and non-submodular settings.
arXiv Detail & Related papers (2023-02-02T18:52:14Z) - Beyond the Best: Estimating Distribution Functionals in Infinite-Armed
Bandits [40.71199236098642]
In the infinite-armed bandit problem, each arm's average reward is sampled from an unknown distribution.
We consider a general class of distribution functionals beyond the maximum, and propose unified meta algorithms for both the offline and online settings.
arXiv Detail & Related papers (2022-11-01T18:20:10Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits [11.918230810566945]
We study the non-stationary multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning.
We propose a method that achieves, in $K$-armed bandit problems, a near-optimal $\widetilde{O}(\sqrt{K N(S+1)})$ dynamic regret.
arXiv Detail & Related papers (2022-01-17T17:23:56Z) - Combinatorial Bandits without Total Order for Arms [52.93972547896022]
We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2\sqrt{n T \log T}\right)$ gap-independent regret bound.
arXiv Detail & Related papers (2021-03-03T23:08:59Z) - Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy (a minimal sketch of this strategy appears after this list).
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log (|\mathcal{F}|T)})$.
arXiv Detail & Related papers (2021-02-15T19:10:52Z)
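For reference, below is a minimal Python sketch of the Inverse Gap Weighting idea mentioned in the last entry above; it shows the standard single-action form of the scheme under assumed inputs (the predicted rewards and the exploration parameter gamma are made up), not the cited paper's top-$k$ extension with arm hierarchies.

import numpy as np

# Generic Inverse Gap Weighting (IGW) sketch: arms with a larger predicted gap
# to the empirically best arm are sampled with smaller probability, and the
# remaining mass goes to the best arm. gamma controls how greedy the scheme is.
def inverse_gap_weighting(predicted_rewards, gamma):
    A = len(predicted_rewards)
    best = int(np.argmax(predicted_rewards))
    gaps = predicted_rewards[best] - predicted_rewards
    probs = 1.0 / (A + gamma * gaps)     # non-best arms get probability 1 / (A + gamma * gap)
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()      # leftover probability mass on the best arm
    return probs

# Example with made-up predicted rewards: larger gamma concentrates the
# sampling distribution on the empirically best arm.
print(inverse_gap_weighting(np.array([0.9, 0.7, 0.4, 0.1]), gamma=50.0))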
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.