Restless-UCB, an Efficient and Low-complexity Algorithm for Online
Restless Bandits
- URL: http://arxiv.org/abs/2011.02664v2
- Date: Fri, 6 Nov 2020 08:00:22 GMT
- Title: Restless-UCB, an Efficient and Low-complexity Algorithm for Online
Restless Bandits
- Authors: Siwei Wang, Longbo Huang, John C.S. Lui
- Abstract summary: We study the online restless bandit problem, where the state of each arm evolves according to a Markov chain.
We propose Restless-UCB, a learning policy that follows the explore-then-commit framework.
- Score: 61.490254407420906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the online restless bandit problem, where the state of each arm
evolves according to a Markov chain, and the reward of pulling an arm depends
on both the pulled arm and the current state of the corresponding Markov chain.
In this paper, we propose Restless-UCB, a learning policy that follows the
explore-then-commit framework. In Restless-UCB, we present a novel method to
construct offline instances, which requires only $O(N)$ time complexity ($N$ is
the number of arms) and is exponentially better than the complexity of existing
learning policies. We also prove that Restless-UCB achieves a regret upper bound
of $\tilde{O}((N+M^3)T^{2\over 3})$, where $M$ is the Markov chain state space
size and $T$ is the time horizon. Compared to existing algorithms, our result
eliminates the exponential factor (in $M,N$) in the regret upper bound, due to
a novel exploitation of the sparsity in transitions in general restless bandit
problems. As a result, our analysis technique can also be adopted to tighten
the regret bounds of existing algorithms. Finally, we conduct experiments based
on a real-world dataset to compare the Restless-UCB policy with state-of-the-art
benchmarks. Our results show that Restless-UCB outperforms existing algorithms
in regret, and significantly reduces the running time.
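The abstract describes the policy only at a high level: an exploration phase that estimates each arm's Markov chain, followed by a commit phase that runs a policy computed for an offline instance built from the estimates. Below is a minimal Python sketch of that explore-then-commit skeleton; the arms[i].pull() interface, the offline_oracle placeholder, and the exploration budget are illustrative assumptions and do not reproduce the paper's $O(N)$ offline-instance construction or its exact confidence-bound perturbation.

```python
import numpy as np

def explore_then_commit_restless(arms, T, M, explore_rounds_per_arm, offline_oracle):
    """Explore-then-commit skeleton for an online restless bandit.

    Hypothetical interfaces (not from the paper): arms[i].pull() plays arm i
    for one round and returns (reward, observed_state) with states in
    {0, ..., M-1}; offline_oracle(trans_hat, reward_hat) returns a policy,
    i.e. a function mapping the commit-phase history to the next arm index.
    """
    N = len(arms)
    trans_counts = np.ones((N, M, M))   # Laplace-smoothed transition counts
    reward_sums = np.zeros((N, M))      # summed rewards seen in each (arm, state)
    state_visits = np.ones((N, M))      # visit counts for each (arm, state)
    total_reward, t = 0.0, 0

    # Exploration phase: play each arm in one long consecutive block so that
    # consecutive observations reveal that arm's state transitions.
    for i in range(N):
        prev_state = None
        for _ in range(explore_rounds_per_arm):
            if t >= T:
                break
            reward, state = arms[i].pull()
            total_reward += reward
            reward_sums[i, state] += reward
            state_visits[i, state] += 1
            if prev_state is not None:
                trans_counts[i, prev_state, state] += 1
            prev_state = state
            t += 1

    # Empirical model estimates. (The paper additionally perturbs the estimates
    # with UCB-style confidence terms before building the offline instance;
    # that construction is omitted here.)
    trans_hat = trans_counts / trans_counts.sum(axis=2, keepdims=True)
    reward_hat = reward_sums / state_visits

    # Commit phase: follow the policy computed for the estimated instance.
    policy = offline_oracle(trans_hat, reward_hat)
    history = []
    while t < T:
        i = policy(history)
        reward, state = arms[i].pull()
        history.append((i, state, reward))
        total_reward += reward
        t += 1
    return total_reward
```

Explore-then-commit schemes of this shape typically devote on the order of $T^{2/3}$ rounds to exploration, which is consistent with the $T^{2/3}$ dependence in the regret bound stated above.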
Related papers
- Variance-Aware Linear UCB with Deep Representation for Neural Contextual Bandits [9.877915844066338]
A neural upper confidence bound (UCB) algorithm has shown success in contextual bandits.
We propose a variance-aware algorithm that utilizes $\sigma_t^2$, i.e., an upper bound of the reward noise variance at round $t$.
We provide an oracle version of our algorithm characterized by an oracle variance upper bound $\sigma_t^2$ and a practical version with a novel estimation for this variance bound.
arXiv Detail & Related papers (2024-11-08T21:24:14Z)
- Indexed Minimum Empirical Divergence-Based Algorithms for Linear Bandits [55.938644481736446]
Indexed Minimum Empirical Divergence (IMED) is a highly effective approach to the multi-armed bandit problem.
It has been observed to empirically outperform UCB-based algorithms and Thompson Sampling.
We present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms.
arXiv Detail & Related papers (2024-05-24T04:11:58Z)
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound of $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation that, in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
- Contextual Combinatorial Bandits with Probabilistically Triggered Arms [55.9237004478033]
We study contextual combinatorial bandits with probabilistically triggered arms (C$^2$MAB-T) under a variety of smoothness conditions.
Under the triggering probability modulated (TPM) condition, we devise the C$^2$-UCB-T algorithm and derive a regret bound of $\tilde{O}(d\sqrt{T})$.
arXiv Detail & Related papers (2023-03-30T02:51:00Z)
- Stochastic Rising Bandits [40.32303434592863]
We study a particular case of the rested and restless bandits in which the arms' expected payoff is monotonically non-decreasing.
This characteristic allows designing specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds.
We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset.
arXiv Detail & Related papers (2022-12-07T17:30:45Z)
- Achieving the Pareto Frontier of Regret Minimization and Best Arm Identification in Multi-Armed Bandits [91.8283876874947]
We design and analyze the BoBW-lil'UCB$(\gamma)$ algorithm.
We show that no algorithm can simultaneously perform optimally for both the regret minimization (RM) and best arm identification (BAI) objectives.
We also show that BoBW-lil'UCB$(\gamma)$ outperforms a competitor in terms of time complexity and regret.
arXiv Detail & Related papers (2021-10-16T17:52:32Z)
- Reinforcement Learning for Finite-Horizon Restless Multi-Armed Multi-Action Bandits [8.136957953239254]
We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)$^2$B.
The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken.
Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy.
arXiv Detail & Related papers (2021-09-20T21:40:12Z)
- Decentralized Multi-Agent Linear Bandits with Safety Constraints [31.67685495996986]
We study decentralized linear bandits, where a network of $N$ agents acts cooperatively to solve a linear bandit-optimization problem.
We propose DLUCB: a fully decentralized algorithm that minimizes the cumulative regret over the entire network.
We show that our ideas extend naturally to the emerging, albeit more challenging, setting of safe bandits.
arXiv Detail & Related papers (2020-12-01T07:33:00Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- The Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms [10.662105162882526]
We investigate a $k$-armed bandit problem in the many-armed regime.
Our findings suggest a new form of free exploration beneficial to greedy algorithms in the many-armed context (a minimal sketch of a subsampled greedy strategy appears after this list).
arXiv Detail & Related papers (2020-02-24T08:59:34Z)
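To make the "greedy benefits from free exploration in the many-armed regime" idea above concrete, here is a minimal sketch of a subsampled greedy strategy, assuming Bernoulli rewards; the function name, subsample size, and reward model are illustrative choices and not the exact variant or tuning analyzed in the paper.

```python
import numpy as np

def subsampled_greedy(arm_means, T, subsample_size, seed=0):
    """Subsampled greedy for a k-armed Bernoulli bandit in the many-armed regime.

    Draw a random subset of the k arms, pull each once, then always pull the
    empirically best arm in the subset. With many arms the subset is likely to
    contain a near-optimal arm, and greedy enjoys "free exploration": an
    unlucky streak on the current leader makes it switch to another arm.
    """
    rng = np.random.default_rng(seed)
    k = len(arm_means)
    chosen = rng.choice(k, size=min(subsample_size, k), replace=False)

    pulls = np.zeros(len(chosen))
    successes = np.zeros(len(chosen))
    total_reward = 0.0

    for t in range(T):
        if t < len(chosen):
            j = t                                   # one initial pull per subsampled arm
        else:
            j = int(np.argmax(successes / pulls))   # greedy: highest empirical mean
        reward = float(rng.random() < arm_means[chosen[j]])
        pulls[j] += 1
        successes[j] += reward
        total_reward += reward
    return total_reward

# Illustrative run: 10,000 arms with uniform means, horizon 5,000,
# subsample size chosen arbitrarily for the example.
means = np.random.default_rng(1).uniform(size=10_000)
print(subsampled_greedy(means, T=5_000, subsample_size=200))
```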