Weighted Tallying Bandits: Overcoming Intractability via Repeated
Exposure Optimality
- URL: http://arxiv.org/abs/2305.02955v1
- Date: Thu, 4 May 2023 15:59:43 GMT
- Title: Weighted Tallying Bandits: Overcoming Intractability via Repeated
Exposure Optimality
- Authors: Dhruv Malik, Conor Igoe, Yuanzhi Li, Aarti Singh
- Abstract summary: In recommender system or crowdsourcing applications, a human's preferences or abilities are often a function of the algorithm's recent actions.
We introduce the Weighted Tallying Bandit (WTB), which generalizes this setting by requiring that an action's loss is a function of a \emph{weighted} summation of the number of times that arm was played in the last $m$ timesteps.
We show that for problems with $K$ actions and horizon $T$, a simple modification of the successive elimination algorithm has $O\left(\sqrt{KT} + (m+M)K\right)$ complete policy regret, where $M$ is a known upper bound on $m$.
- Score: 46.445321251991096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recommender system or crowdsourcing applications of online learning, a
human's preferences or abilities are often a function of the algorithm's recent
actions. Motivated by this, a significant line of work has formalized settings
where an action's loss is a function of the number of times that action was
recently played in the prior $m$ timesteps, where $m$ corresponds to a bound on
human memory capacity. To more faithfully capture decay of human memory with
time, we introduce the Weighted Tallying Bandit (WTB), which generalizes this
setting by requiring that an action's loss is a function of a \emph{weighted}
summation of the number of times that arm was played in the last $m$ timesteps.
This WTB setting is intractable without further assumption. So we study it
under Repeated Exposure Optimality (REO), a condition motivated by the
literature on human physiology, which requires the existence of an action that
when repetitively played will eventually yield smaller loss than any other
sequence of actions. We study the minimization of the complete policy regret
(CPR), which is the strongest notion of regret, in WTB under REO. Since $m$ is
typically unknown, we assume we only have access to an upper bound $M$ on $m$.
We show that for problems with $K$ actions and horizon $T$, a simple
modification of the successive elimination algorithm has $O \left( \sqrt{KT} +
(m+M)K \right)$ CPR. Interestingly, up to an additive (in lieu of
multiplicative) factor in $(m+M)K$, this recovers the classical guarantee for
the simpler stochastic multi-armed bandit with traditional regret. We
additionally show that in our setting, any algorithm will suffer additive CPR
of $\Omega \left( mK + M \right)$, demonstrating our result is nearly optimal.
Our algorithm is computationally efficient, and we experimentally demonstrate
its practicality and superiority over natural baselines.
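To make the loss model concrete, the following Python snippet is a minimal sketch of the WTB feedback structure described in the abstract. It is an illustration under assumed interfaces, not the authors' implementation: the weight vector `weights`, the per-arm loss functions `loss_fns`, and the noise level are hypothetical choices introduced only for exposition.

```python
import numpy as np

def weighted_tally(history, arm, m, weights):
    """Weighted count of how often `arm` was played over the last m timesteps.

    weights[j] scales a play made j+1 steps ago; a decaying choice of
    weights models the decay of human memory with time.
    """
    tally = 0.0
    for j, past_arm in enumerate(reversed(history[-m:])):
        if past_arm == arm:
            tally += weights[j]
    return tally

def simulate_wtb(policy, loss_fns, T, m, weights, rng):
    """Roll out `policy` in a Weighted Tallying Bandit instance.

    loss_fns[a] maps the weighted tally of arm a to its expected loss
    (a hypothetical interface used only for this sketch).
    """
    history, losses = [], []
    for t in range(T):
        arm = policy(t, history, losses)
        tally = weighted_tally(history, arm, m, weights)
        loss = loss_fns[arm](tally) + 0.1 * rng.standard_normal()
        history.append(arm)
        losses.append(loss)
    return history, losses

# Example: 3 arms, memory length m = 5, geometrically decaying weights.
rng = np.random.default_rng(0)
m, K = 5, 3
weights = [0.5 ** j for j in range(m)]
loss_fns = [lambda s, a=a: 1.0 / (1 + a + s) for a in range(K)]  # toy losses
round_robin = lambda t, hist, losses: t % K                      # baseline policy
simulate_wtb(round_robin, loss_fns, T=20, m=m, weights=weights, rng=rng)
```

Under REO, repeatedly playing some arm eventually yields smaller loss than any other action sequence, so one natural modification of successive elimination (a sketch, not necessarily the paper's exact procedure) is to play each surviving arm in long blocks, discard roughly the first $M$ samples of each block while that arm's weighted tally stabilizes, and eliminate arms whose stabilized block means are statistically dominated; the additive $(m+M)K$ term in the CPR bound can plausibly be read as the total price of these burn-in periods.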
Related papers
- Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the \emph{complete policy regret}.
We provide an algorithm that w.h.p. achieves a complete policy regret guarantee of $\widetilde{\mathcal{O}}(mK\sqrt{T})$, where the $\widetilde{\mathcal{O}}$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z) - Improved Regret Bound and Experience Replay in Regularized Policy
Iteration [22.621710838468097]
We study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation.
We first show that the regret analysis of the Politex algorithm can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions.
Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting.
arXiv Detail & Related papers (2021-02-25T00:55:07Z) - Minimax Regret for Stochastic Shortest Path with Adversarial Costs and
Known Transition [37.6975819766632]
We study the shortest path problem with adversarial costs and known transition.
We show that the minimax regret is $\widetilde{O}(\sqrt{DT_\star K})$ and $\widetilde{O}(\sqrt{DT_\star SAK})$ for the full-information setting and the bandit feedback setting, respectively.
arXiv Detail & Related papers (2020-12-07T20:55:28Z) - Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
arXiv Detail & Related papers (2020-08-10T08:30:52Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Simultaneously Learning Stochastic and Adversarial Episodic MDPs with
Known Transition [38.28925339231888]
We develop the first algorithm with a "best-of-both-worlds" guarantee.
It achieves $\mathcal{O}(\log T)$ regret when the losses are stochastic, and $\widetilde{\mathcal{O}}(\sqrt{T})$ regret when they are adversarial.
More generally, it achieves $\widetilde{\mathcal{O}}(\sqrt{C})$ regret in an intermediate setting.
arXiv Detail & Related papers (2020-06-10T01:59:34Z) - Learning Near Optimal Policies with Low Inherent Bellman Error [115.16037976819331]
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning.
We show that exploration is possible using only \emph{batch} assumptions with an algorithm that achieves the optimal statistical rate for the setting we consider.
arXiv Detail & Related papers (2020-02-29T02:02:40Z)