Restless-UCB, an Efficient and Low-complexity Algorithm for Online
Restless Bandits
- URL: http://arxiv.org/abs/2011.02664v2
- Date: Fri, 6 Nov 2020 08:00:22 GMT
- Title: Restless-UCB, an Efficient and Low-complexity Algorithm for Online
Restless Bandits
- Authors: Siwei Wang, Longbo Huang, John C.S. Lui
- Abstract summary: We study the online restless bandit problem, where the state of each arm evolves according to a Markov chain.
We propose Restless-UCB, a learning policy that follows the explore-then-commit framework.
- Score: 61.490254407420906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the online restless bandit problem, where the state of each arm
evolves according to a Markov chain, and the reward of pulling an arm depends
on both the pulled arm and the current state of the corresponding Markov chain.
In this paper, we propose Restless-UCB, a learning policy that follows the
explore-then-commit framework. In Restless-UCB, we present a novel method to
construct offline instances, which requires only $O(N)$ time complexity ($N$ is
the number of arms) and is exponentially better than the complexity of existing
learning policies. We also prove that Restless-UCB achieves a regret upper bound
of $\tilde{O}((N+M^3)T^{2\over 3})$, where $M$ is the Markov chain state space
size and $T$ is the time horizon. Compared to existing algorithms, our result
eliminates the exponential factor (in $M,N$) in the regret upper bound, due to
a novel exploitation of the sparsity in transitions in general restless bandit
problems. As a result, our analysis technique can also be adopted to tighten
the regret bounds of existing algorithms. Finally, we conduct experiments based
on a real-world dataset to compare the Restless-UCB policy with state-of-the-art
benchmarks. Our results show that Restless-UCB outperforms existing algorithms
in regret, and significantly reduces the running time.
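To make the explore-then-commit framework described above concrete, here is a minimal, hypothetical Python sketch: it explores round-robin for roughly $T^{2/3}$ rounds, estimates each arm's transition matrix and per-state mean reward, and then commits to a single arm. All names are illustrative, and committing to the arm with the best estimated stationary reward is only a crude stand-in for the paper's $O(N)$ offline-instance construction and offline oracle, not the authors' actual procedure.

```python
import numpy as np


def restless_etc(arms, T, M=2):
    """Explore-then-commit sketch for an online restless bandit (illustrative only).

    arms: list of callables; arms[i]() simulates pulling arm i and returns
          (observed_state, reward), with states in {0, ..., M-1}.
    The T**(2/3) exploration length mirrors the T^(2/3) scaling in the
    abstract's regret bound; the commit step is a stand-in for an offline oracle.
    """
    N = len(arms)
    explore_len = min(T, int(T ** (2 / 3)))

    # --- Explore: pull arms round-robin and record empirical transitions.
    trans = np.ones((N, M, M))        # Laplace-smoothed transition counts
    reward_sum = np.zeros((N, M))
    visits = np.ones((N, M))
    last_state = [None] * N
    for t in range(explore_len):
        i = t % N
        s, r = arms[i]()
        if last_state[i] is not None:
            trans[i, last_state[i], s] += 1
        reward_sum[i, s] += r
        visits[i, s] += 1
        last_state[i] = s

    # --- Build offline estimates from the exploration data.
    P_hat = trans / trans.sum(axis=2, keepdims=True)   # per-arm transition matrices
    r_hat = reward_sum / visits                        # per-arm, per-state mean reward

    # --- Commit: play the arm with the largest estimated stationary reward.
    def stationary_reward(P, r):
        evals, evecs = np.linalg.eig(P.T)
        pi = np.real(evecs[:, np.argmax(np.real(evals))])
        pi = np.abs(pi) / np.abs(pi).sum()
        return float(pi @ r)

    best = max(range(N), key=lambda i: stationary_reward(P_hat[i], r_hat[i]))
    total = 0.0
    for _ in range(T - explore_len):
        _, r = arms[best]()
        total += r
    return best, total
```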
Related papers
- Indexed Minimum Empirical Divergence-Based Algorithms for Linear Bandits [55.938644481736446]
Indexed Minimum Empirical Divergence (IMED) is a highly effective approach to the multi-armed bandit problem.
It has been observed to empirically outperform UCB-based algorithms and Thompson Sampling.
We present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms.
arXiv Detail & Related papers (2024-05-24T04:11:58Z) - Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM)
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation that in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z) - Weighted Tallying Bandits: Overcoming Intractability via Repeated
Exposure Optimality [46.445321251991096]
In recommender system or crowdsourcing applications, a human's preferences or abilities are often a function of the algorithm's recent actions.
We introduce the Weighted Tallying Bandit (WTB), which generalizes this setting by requiring that an action's loss is a function of a summation of the number of times that arm was played in the last $m$ timesteps.
We show that for problems with $K$ actions and horizon $T$, a simple modification of the successive elimination algorithm has $O\left(\sqrt{KT} + m + \ldots\right)$ regret.
arXiv Detail & Related papers (2023-05-04T15:59:43Z) - Stochastic Rising Bandits [40.32303434592863]
We study a particular case of the rested and restless bandits in which the arms' expected payoff is monotonically non-decreasing.
This characteristic allows designing specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds.
We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset.
arXiv Detail & Related papers (2022-12-07T17:30:45Z) - Achieving the Pareto Frontier of Regret Minimization and Best Arm
Identification in Multi-Armed Bandits [91.8283876874947]
We design and analyze the BoBW-lil'UCB$(\gamma)$ algorithm.
We show that (i) no algorithm can simultaneously perform optimally for both the RM and BAI objectives.
We also show that BoBW-lil'UCB$(\gamma)$ outperforms a competitor in terms of the time complexity and the regret.
arXiv Detail & Related papers (2021-10-16T17:52:32Z) - Reinforcement Learning for Finite-Horizon Restless Multi-Armed
Multi-Action Bandits [8.136957953239254]
We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)$^2$B.
The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken.
Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy.
arXiv Detail & Related papers (2021-09-20T21:40:12Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Decentralized Multi-Agent Linear Bandits with Safety Constraints [31.67685495996986]
We study decentralized linear bandits, where a network of $N$ agents acts cooperatively to solve a linear bandit-optimization problem.
We propose DLUCB: a fully decentralized algorithm that minimizes the cumulative regret over the entire network.
We show that our ideas extend naturally to the emerging, albeit more challenging, setting of safe bandits.
arXiv Detail & Related papers (2020-12-01T07:33:00Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - The Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms [10.662105162882526]
We investigate a $k$-armed bandit problem in the many-armed regime.
Our findings suggest a new form of free exploration beneficial to greedy algorithms in the many-armed context.
arXiv Detail & Related papers (2020-02-24T08:59:34Z)