Rotting Infinitely Many-armed Bandits
- URL: http://arxiv.org/abs/2201.12975v3
- Date: Sun, 17 Dec 2023 10:16:48 GMT
- Title: Rotting Infinitely Many-armed Bandits
- Authors: Jung-hun Kim, Milan Vojnovic, Se-Young Yun
- Abstract summary: We show an $\Omega(\max\{\varrho^{1/3}T,\sqrt{T}\})$ worst-case regret lower bound, where $T$ is the horizon time.
An $\tilde{O}(\max\{\varrho^{1/3}T,T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.
- Score: 33.26370721280417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the infinitely many-armed bandit problem with rotting rewards,
where the mean reward of an arm decreases at each pull of the arm according to
an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this
learning problem has an $\Omega(\max\{\varrho^{1/3}T,\sqrt{T}\})$ worst-case
regret lower bound where $T$ is the horizon time. We show that a matching upper
bound $\tilde{O}(\max\{\varrho^{1/3}T,\sqrt{T}\})$, up to a poly-logarithmic
factor, can be achieved by an algorithm that uses a UCB index for each arm and
a threshold value to decide whether to continue pulling an arm or remove the
arm from further consideration, when the algorithm knows the value of the
maximum rotting rate $\varrho$. We also show that an
$\tilde{O}(\max\{\varrho^{1/3}T,T^{3/4}\})$ regret upper bound can be achieved
by an algorithm that does not know the value of $\varrho$, by using an adaptive
UCB index along with an adaptive threshold value.
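Below is a minimal sketch of the known-$\varrho$ strategy the abstract describes (a UCB index per arm plus a removal threshold over an infinite arm reservoir). All names, the recent-reward window, and the concrete threshold choice are illustrative assumptions, not the authors' exact specification; rewards are assumed to lie in $[0, 1]$.

```python
import math
from collections import deque

def rotting_ucb_sketch(T, varrho, draw_new_arm, pull):
    """Sketch: keep pulling the current arm while a UCB index over its
    recent rewards stays above a threshold; once the index drops below,
    remove the arm from consideration and draw a fresh one.

    Assumptions (illustrative): draw_new_arm() samples an arm handle from
    the infinite reservoir, and pull(arm) returns a noisy reward in [0, 1]
    whose mean decays by at most `varrho` per pull.
    """
    # Window sized so the drift over one window (~ varrho * w) matches the
    # statistical error (~ 1/sqrt(w)), i.e. w ~ varrho^{-2/3}; both terms
    # are then ~ varrho^{1/3}, matching the threshold gap below.
    window = max(1, round(varrho ** (-2.0 / 3.0))) if varrho > 0 else T
    # Retire arms whose optimistic index falls this far below the top of
    # the reward range; an illustrative choice matching the regret scaling.
    threshold = 1.0 - max(varrho ** (1.0 / 3.0), 1.0 / math.sqrt(T))
    total = 0.0
    arm, recent = draw_new_arm(), deque(maxlen=window)
    for _ in range(T):
        r = pull(arm)
        total += r
        recent.append(r)
        n = len(recent)
        ucb = sum(recent) / n + math.sqrt(2.0 * math.log(T) / n)  # optimism bonus
        if ucb < threshold:          # index below threshold: remove the arm
            arm, recent = draw_new_arm(), deque(maxlen=window)
    return total
```

For the unknown-$\varrho$ case, the abstract's adaptive variant would replace the fixed `window` and `threshold` with quantities that adapt over time, at the cost of the weaker $\tilde{O}(\max\{\varrho^{1/3}T,T^{3/4}\})$ guarantee.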
Related papers
- Adversarial Combinatorial Bandits with Switching Costs [55.2480439325792]
We study the adversarial combinatorial bandit problem with a switching cost $\lambda$ incurred for each change of a selected arm between rounds.
We derive lower bounds for the minimax regret and design algorithms to approach them.
arXiv Detail & Related papers (2024-04-02T12:15:37Z)
- Combinatorial Bandits for Maximum Value Reward Function under Max Value-Index Feedback [9.771002043127728]
We consider a multi-armed bandit problem with the maximum value reward function under maximum value and index feedback.
We propose an algorithm and provide a regret bound for problem instances with arm outcomes according to arbitrary distributions with finite supports.
Our algorithm achieves a $O((k/\Delta)\log(T))$ distribution-dependent and a $\tilde{O}(\sqrt{T})$ distribution-independent regret.
arXiv Detail & Related papers (2023-05-25T14:02:12Z)
- Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
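As context for the bounds above, CVaR with risk tolerance $\tau$ is the expected return over the worst $\tau$-fraction of outcomes. A minimal empirical estimator follows; the function name and the lower-tail convention are our illustrative assumptions, not from the paper.

```python
import numpy as np

def empirical_cvar(returns, tau):
    """Empirical CVaR_tau: the mean of the worst tau-fraction of sampled
    returns (lower-tail convention, risk tolerance tau in (0, 1])."""
    x = np.sort(np.asarray(returns, dtype=float))  # ascending: worst outcomes first
    k = max(1, int(np.ceil(tau * len(x))))         # size of the tau-tail
    return x[:k].mean()

# e.g. empirical_cvar([0.9, 0.2, 1.0, 0.5], tau=0.5) averages the two worst returns -> 0.35
```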
arXiv Detail & Related papers (2023-02-07T02:22:31Z)
- Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms [30.024167992890916]
We study a decision problem where a learner faces a sequence of $K$-armed bandit tasks.
The adversary is constrained to choose the optimal arm of each task from a smaller (but unknown) subset of $M$ arms.
The task boundaries might be known (the bandit meta-learning setting) or unknown (the non-stationary bandit setting).
arXiv Detail & Related papers (2022-02-25T22:28:01Z)
- Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays [21.94728545221709]
We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays.
\texttt{SFBanker} achieves $\mathcal{O}(\sqrt{K(D+T)}L) \cdot \mathrm{polylog}(T, L)$ total regret, where $T$ is the total number of steps and $D$ is the total feedback delay.
arXiv Detail & Related papers (2021-10-26T04:06:51Z)
- Bandits with many optimal arms [68.17472536610859]
We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
arXiv Detail & Related papers (2021-03-23T11:02:31Z)
- Combinatorial Bandits without Total Order for Arms [52.93972547896022]
We present a reward model that captures a set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2\sqrt{n T \log T}\right)$ gap-independent regret bound.
arXiv Detail & Related papers (2021-03-03T23:08:59Z)
- Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy.
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log(|\mathcal{F}|T)})$.
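The Inverse Gap Weighting strategy named in this summary is a standard scheme for turning regression estimates into an exploration distribution. A minimal single-action sketch follows; the learning rate $\gamma$ and all names are illustrative, and the paper's top-$k$ variant selects sets of arms rather than one.

```python
import numpy as np

def inverse_gap_weighting(predicted_rewards, gamma):
    """Inverse Gap Weighting: actions with larger estimated gaps to the
    empirical best receive proportionally less exploration probability."""
    f = np.asarray(predicted_rewards, dtype=float)
    A = len(f)
    best = int(np.argmax(f))
    p = 1.0 / (A + gamma * (f[best] - f))  # gap-inverse weight per action
    p[best] = 0.0
    p[best] = 1.0 - p.sum()                # leftover mass on the empirical best
    return p

# Sampling one action: np.random.choice(len(f_hat), p=inverse_gap_weighting(f_hat, gamma=100.0))
```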
arXiv Detail & Related papers (2021-02-15T19:10:52Z)
- Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)