Complete Policy Regret Bounds for Tallying Bandits
- URL: http://arxiv.org/abs/2204.11174v1
- Date: Sun, 24 Apr 2022 03:10:27 GMT
- Title: Complete Policy Regret Bounds for Tallying Bandits
- Authors: Dhruv Malik, Yuanzhi Li, Aarti Singh
- Abstract summary: Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the \emph{complete policy regret}.
We provide an algorithm that w.h.p. achieves a complete policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the $\tilde{\mathcal{O}}$ notation hides only logarithmic factors.
- Score: 51.039677652803675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy regret is a well established notion of measuring the performance of an
online learning algorithm against an adaptive adversary. We study restrictions
on the adversary that enable efficient minimization of the \emph{complete
policy regret}, which is the strongest possible version of policy regret. We
identify a gap in the current theoretical understanding of what sorts of
restrictions permit tractability in this challenging setting. To resolve this
gap, we consider a generalization of the stochastic multi-armed bandit, which
we call the \emph{tallying bandit}. This is an online learning setting with an
$m$-memory bounded adversary, where the average loss for playing an action is
an unknown function of the number (or tally) of times that the action was
played in the last $m$ timesteps. For tallying bandit problems with $K$ actions
and time horizon $T$, we provide an algorithm that w.h.p. achieves a complete
policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the
$\tilde{\mathcal{O}}$ notation hides only logarithmic factors. We additionally
prove an $\tilde\Omega(\sqrt{m K T})$ lower bound on the expected complete
policy regret of any tallying bandit algorithm, demonstrating the near
optimality of our method.
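To make the tallying bandit feedback model concrete, the following is a minimal sketch of such an environment: each arm's expected loss depends only on its tally, the number of times it was played within the last $m$ timesteps. The class name, the specific loss functions, the Gaussian noise, and the convention that the tally includes the current play are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch of a tallying bandit environment (not the authors' code).
# Each arm k has an unknown mean-loss function g_k(tally), where the tally is the
# number of times arm k was played in the last m timesteps.
from collections import deque
import random


class TallyingBandit:
    def __init__(self, loss_fns, m, noise_std=0.05, seed=0):
        self.loss_fns = loss_fns        # loss_fns[k](tally) -> expected loss of arm k
        self.m = m                      # memory of the m-memory bounded adversary
        self.window = deque(maxlen=m)   # sliding window of the last m actions
        self.noise_std = noise_std
        self.rng = random.Random(seed)

    def play(self, k):
        self.window.append(k)                          # assumption: tally counts the current play
        tally = sum(1 for a in self.window if a == k)  # plays of arm k in the last m steps
        noise = self.rng.gauss(0.0, self.noise_std)
        return self.loss_fns[k](tally) + noise         # noisy loss observed by the learner


if __name__ == "__main__":
    # Two arms: arm 0 fatigues with repeated recent play, arm 1 has a constant loss.
    env = TallyingBandit([lambda t: min(1.0, 0.1 * t), lambda t: 0.45], m=5)
    losses = [env.play(t % 2) for t in range(1000)]    # naive alternating policy
    print(sum(losses) / len(losses))
```

A learner interacts with the environment only through such noisy plays, so any algorithm with a complete policy regret guarantee has to account for how its own recent action sequence shifts each arm's tally.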
Related papers
- Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z)
- Weighted Tallying Bandits: Overcoming Intractability via Repeated Exposure Optimality [46.445321251991096]
In recommender system or crowdsourcing applications, a human's preferences or abilities are often a function of the algorithm's recent actions.
We introduce the Weighted Tallying Bandit (WTB), which generalizes this setting by requiring that an action's loss is a function of a \emph{weighted} summation of the number of times that arm was played in the last $m$ timesteps.
We show that for problems with $K$ actions and horizon $T$, a simple modification of the successive elimination algorithm has $O\left(\sqrt{KT} + m + \cdots\right.$
arXiv Detail & Related papers (2023-05-04T15:59:43Z)
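For a concrete reading of the weighted-tally feedback in the WTB entry above, here is a small sketch of a weighted count over the last $m$ actions; the geometric weights and window length are assumptions made for illustration, not the paper's choices.

```python
# Illustrative computation of a weighted tally: a weighted count of how often an
# arm appeared among the last m actions. The geometric weights are an assumption.
from collections import deque


def weighted_tally(window, arm, weights):
    """Sum of weights[i] over positions i (0 = most recent) where `arm` was played."""
    return sum(w for a, w in zip(reversed(window), weights) if a == arm)


m = 4
weights = [0.5 ** i for i in range(m)]    # assumed: most recent play weighted highest
window = deque([0, 1, 0, 0], maxlen=m)    # last m actions, oldest first
print(weighted_tally(window, arm=0, weights=weights))   # 1.0 + 0.5 + 0.125 = 1.625
```

Setting every weight to 1 recovers the plain tally of the tallying bandit setting above.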
- Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near
arXiv Detail & Related papers (2023-02-07T02:22:31Z)
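As a quick reference for the CVaR objective in the entry above: CVaR at tolerance $\tau$ is the expected value over the worst $\tau$-fraction of outcomes. The sample-based estimator below (with a simple ceiling rule for the cutoff) is a common approximation used here for illustration, not the paper's construction.

```python
# Simple empirical estimator of CVaR_tau for returns (higher returns are better):
# the average of the worst tau-fraction of samples. This is a standard approximate
# form, not code from the paper.
import numpy as np


def empirical_cvar(returns, tau):
    sorted_returns = np.sort(np.asarray(returns, dtype=float))  # ascending: worst first
    k = max(1, int(np.ceil(tau * len(sorted_returns))))         # size of the worst tau-fraction
    return sorted_returns[:k].mean()


rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.5, size=10_000)           # toy return distribution
print(empirical_cvar(samples, tau=0.1))                         # noticeably below the mean of 1.0
```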
- Bounded Memory Adversarial Bandits with Composite Anonymous Delayed Feedback [10.179145161000315]
We study the adversarial bandit problem with composite anonymous delayed feedback.
In this setting, losses of an action are split into $d$ components, spreading over consecutive rounds after the action is chosen.
We show that the non-oblivious setting incurs $\Omega(T)$ pseudo regret even when the loss sequence is bounded memory.
arXiv Detail & Related papers (2022-04-27T08:32:35Z)
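A minimal sketch of the composite anonymous delayed feedback model from the entry above: the loss of the action chosen at round $t$ is split into $d$ components that arrive over the following rounds, and the learner only observes each round's aggregate with no attribution to individual actions. The even split and the toy per-action losses are illustrative assumptions.

```python
# Illustrative sketch of composite anonymous delayed feedback (not the paper's code).
# The loss of the action chosen at round t is split into d components arriving at
# rounds t, t+1, ..., t+d-1; the learner only sees each round's aggregate.
from collections import defaultdict


def play_round(t, action, loss_of, d, pending):
    total = loss_of(action)
    for j in range(d):
        pending[t + j] += total / d      # assumption: the loss is split evenly over d rounds
    return pending.pop(t)                # observed feedback: sum of all components due now


pending = defaultdict(float)
loss_of = lambda a: 0.1 * (a + 1)        # toy per-action loss (assumption)
observed = [play_round(t, t % 3, loss_of, d=3, pending=pending) for t in range(6)]
print(observed)                          # each aggregate mixes components from several past actions
```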
- Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{dST})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only \emph{logarithmic} dependence on $M$ as long
arXiv Detail & Related papers (2022-02-12T21:55:44Z)
- Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z)
- A Short Note on Soft-max and Policy Gradients in Bandits Problems [0.0]
We give a short argument that gives a regret bound for the soft-max ordinary differential equation for bandit problems.
We derive a similar result for a different policy gradient algorithm, again for bandit problems.
arXiv Detail & Related papers (2020-07-20T17:30:27Z)
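For orientation on the entry above, here is a minimal soft-max (REINFORCE-style) policy gradient loop on a stochastic bandit; the Bernoulli reward model, constant step size, and absence of a baseline are illustrative choices, not the paper's exact setup.

```python
# Illustrative soft-max policy gradient on a 3-armed Bernoulli bandit.
# The arm means, step size, and horizon are assumptions made for this demo.
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.2, 0.5, 0.8])   # assumed true mean rewards of the K = 3 arms
theta = np.zeros(3)                 # soft-max logits
eta = 0.1                           # constant step size (assumption)


def softmax(z):
    p = np.exp(z - z.max())
    return p / p.sum()


for _ in range(5000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)              # sample an arm from the soft-max policy
    r = rng.binomial(1, means[a])        # Bernoulli reward
    grad = (np.eye(3)[a] - pi) * r       # REINFORCE estimate: (e_a - pi) * reward
    theta += eta * grad                  # ascend the estimated reward gradient

print(softmax(theta))   # typically places most of its mass on the best arm (index 2)
```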
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)