Congested Bandits: Optimal Routing via Short-term Resets
- URL: http://arxiv.org/abs/2301.09251v1
- Date: Mon, 23 Jan 2023 03:11:06 GMT
- Title: Congested Bandits: Optimal Routing via Short-term Resets
- Authors: Pranjal Awasthi, Kush Bhatia, Sreenivas Gollapudi, Kostas Kollias
- Abstract summary: We study the problem of Congested Bandits where each arm's reward is allowed to depend on the number of times it was played in the past.
We propose a UCB-style algorithm and show that its policy regret scales as $\tilde{O}(\sqrt{K \Delta T})$.
For the linear contextual bandit setup, our algorithm, based on an iterative least squares planner, achieves policy regret $\tilde{O}(\sqrt{dT} + \Delta)$.
- Score: 30.892724364965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For traffic routing platforms, the choice of which route to recommend to a
user depends on the congestion on these routes -- indeed, an individual's
utility depends on the number of people using the recommended route at that
instant. Motivated by this, we introduce the problem of Congested Bandits
where each arm's reward is allowed to depend on the number of times it was
played in the past $\Delta$ timesteps. This dependence on past history of
actions leads to a dynamical system where an algorithm's present choices also
affect its future pay-offs, and requires an algorithm to plan for this. We
study the congestion aware formulation in the multi-armed bandit (MAB) setup
and in the contextual bandit setup with linear rewards. For the multi-armed
setup, we propose a UCB-style algorithm and show that its policy regret scales
as $\tilde{O}(\sqrt{K \Delta T})$. For the linear contextual bandit setup, our
algorithm, based on an iterative least squares planner, achieves policy regret
$\tilde{O}(\sqrt{dT} + \Delta)$. From an experimental standpoint, we
corroborate the no-regret properties of our algorithms via a simulation study.
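As a rough illustration of this feedback model (not the authors' algorithm), the Python sketch below simulates a toy congested bandit in which each arm's mean reward decreases with the number of times it was played over the last $\Delta$ steps, and runs a simple congestion-aware UCB learner that keeps a separate estimate per (arm, congestion-level) pair. The reward model, penalty values, and noise level are made-up assumptions for illustration only.

```python
import numpy as np

# Toy congested-bandit environment: an arm's reward depends on how many times
# it was played in the last Delta timesteps (the abstract's feedback model).
# The learner keeps UCB estimates per (arm, congestion) pair. This is an
# illustrative congestion-aware baseline, NOT the paper's algorithm.

rng = np.random.default_rng(0)
K, Delta, T = 4, 5, 20_000

base = rng.uniform(0.4, 0.9, size=K)       # per-arm base quality (assumed)
penalty = rng.uniform(0.05, 0.15, size=K)  # per-unit congestion penalty (assumed)

def mean_reward(arm, congestion):
    # congestion = number of plays of `arm` in the last Delta steps
    return max(0.0, base[arm] - penalty[arm] * congestion)

counts = np.zeros((K, Delta + 1))  # pull counts per (arm, congestion) pair
sums = np.zeros((K, Delta + 1))    # reward sums per (arm, congestion) pair
history = []                       # sliding window of the last Delta actions
total = 0.0

for t in range(1, T + 1):
    cong = np.array([history.count(a) for a in range(K)])  # current congestion
    n = counts[np.arange(K), cong]
    mu = np.where(n > 0, sums[np.arange(K), cong] / np.maximum(n, 1), np.inf)
    ucb = mu + np.sqrt(2 * np.log(t) / np.maximum(n, 1))   # optimistic index
    arm = int(np.argmax(ucb))

    r = mean_reward(arm, cong[arm]) + 0.05 * rng.standard_normal()
    counts[arm, cong[arm]] += 1
    sums[arm, cong[arm]] += r
    total += r

    history.append(arm)
    if len(history) > Delta:
        history.pop(0)

print(f"average reward over {T} steps: {total / T:.3f}")
```

Note that this per-step UCB rule ignores how today's choice shapes future congestion; handling that planning aspect of the induced dynamical system is exactly what separates the policy-regret guarantees above from standard bandit regret.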
Related papers
- Feel-Good Thompson Sampling for Contextual Dueling Bandits [49.450050682705026]
We propose a Thompson sampling algorithm, named FGTS.CDB, for linear contextual dueling bandits.
At the core of our algorithm is a new Feel-Good exploration term specifically tailored for dueling bandits.
Our algorithm achieves nearly minimax-optimal regret, i.e., $\tilde{\mathcal{O}}(d\sqrt{T})$, where $d$ is the model dimension and $T$ is the time horizon.
arXiv Detail & Related papers (2024-04-09T04:45:18Z) - Stochastic Rising Bandits [40.32303434592863]
We study a particular case of the rested and restless bandits in which the arms' expected payoff is monotonically non-decreasing.
This characteristic allows designing specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds.
We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset.
arXiv Detail & Related papers (2022-12-07T17:30:45Z) - Non-Stationary Bandits under Recharging Payoffs: Improved Planning with Sublinear Regret [34.44347218903429]
The multi-armed bandit setting has been recently studied in the non-stationary regime.
The mean payoff of each action is a non-decreasing function of the number of rounds that have passed since it was last played.
We show how our algorithm can be transformed into a bandit algorithm with sublinear regret.
arXiv Detail & Related papers (2022-05-29T23:55:36Z) - Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the complete policy regret.
We provide an algorithm that, with high probability, achieves a complete policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the $\tilde{\mathcal{O}}$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z) - Modeling Attrition in Recommender Systems with Departing Bandits [84.85560764274399]
We propose a novel multi-armed bandit setup that captures policy-dependent horizons.
We first address the case where all users share the same type, demonstrating that a recent UCB-based algorithm is optimal.
We then move forward to the more challenging case, where users are divided among two types.
arXiv Detail & Related papers (2022-03-25T02:30:54Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde{O}(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards.
To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z) - Bandit algorithms to emulate human decision making using probabilistic distortions [20.422725678982726]
We formulate two sample multi-armed bandit problems with distorted probabilities on the reward distributions.
We consider the aforementioned problems in the regret minimization as well as best arm identification framework for multi-armed bandits.
arXiv Detail & Related papers (2016-11-30T17:37:51Z)