Reinforcement Learning for Finite-Horizon Restless Multi-Armed
Multi-Action Bandits
- URL: http://arxiv.org/abs/2109.09855v1
- Date: Mon, 20 Sep 2021 21:40:12 GMT
- Title: Reinforcement Learning for Finite-Horizon Restless Multi-Armed
Multi-Action Bandits
- Authors: Guojun Xiong, Jian Li, Rahul Singh
- Abstract summary: We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)^2B.
The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken.
Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy.
- Score: 8.136957953239254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a finite-horizon restless multi-armed bandit problem with multiple
actions, dubbed R(MA)^2B. The state of each arm evolves according to a
controlled Markov decision process (MDP), and the reward of pulling an arm
depends on both the current state of the corresponding MDP and the action
taken. The goal is to sequentially choose actions for arms so as to maximize
the expected value of the cumulative rewards collected. Since finding the
optimal policy is typically intractable, we propose a computationally appealing
index policy which we call Occupancy-Measured-Reward Index Policy. Our policy
is well-defined even if the underlying MDPs are not indexable. We prove that it
is asymptotically optimal when the activation budget and number of arms are
scaled up, while keeping their ratio as a constant. For the case when the
system parameters are unknown, we develop a learning algorithm. Our learning
algorithm uses the principle of optimism in the face of uncertainty and further
uses a generative model in order to fully exploit the structure of
Occupancy-Measured-Reward Index Policy. We call it the R(MA)^2B-UCB algorithm.
As compared with the existing algorithms, R(MA)^2B-UCB performs close to an
offline optimum policy, and also achieves a sub-linear regret with a low
computational complexity. Experimental results show that R(MA)^2B-UCB
outperforms the existing algorithms in both regret and run time.
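To make the selection step of such a budget-constrained index policy concrete, here is a minimal illustrative sketch. It assumes that per-arm, per-action index values (which, in the paper, would come from the occupancy-measure linear-programming relaxation) are already available, that action 0 is the passive action, and that at most `budget` arms may be activated per step; the function name and the greedy tie-breaking are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def select_actions(indices, budget):
    """Illustrative selection step of a budget-constrained index policy for a
    restless multi-armed, multi-action bandit (sketch, not the paper's code).

    indices : array of shape (num_arms, num_actions); indices[n, a] is the
              index value of taking action a on arm n in its current state.
              Action 0 is assumed to be the passive (free) action.
    budget  : maximum number of arms allowed to take a non-passive action.
    """
    num_arms, _ = indices.shape
    actions = np.zeros(num_arms, dtype=int)            # default: stay passive
    best_active = indices[:, 1:].argmax(axis=1) + 1    # best non-passive action per arm
    gain = indices[np.arange(num_arms), best_active] - indices[:, 0]
    # Activate the arms whose best active action most improves on staying
    # passive, as long as the gain is positive and the budget permits.
    for n in np.argsort(-gain)[:budget]:
        if gain[n] > 0:
            actions[n] = best_active[n]
    return actions
```

In the learning setting, R(MA)^2B-UCB would first form optimistic estimates of the unknown rewards and transition kernels (e.g., from samples drawn through a generative model) and then run this kind of index computation and budgeted selection on the estimated model.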
Related papers
- GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits [16.054685587034836]
GINO-Q is a three-timescale approximation algorithm designed to learn an asymptotically optimal index policy for restless multi-armed bandits (RMABs).
GINO-Q does not require RMABs to be indexable, enhancing its flexibility and applicability.
Our experimental results demonstrate that GINO-Q consistently learns near-optimal policies, even for non-indexable RMABs.
arXiv Detail & Related papers (2024-08-19T10:50:45Z)
- Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs [9.58750210024265]
We consider (stochastic) softmax policy gradient (PG) methods for bandits and Markov decision processes (MDPs).
We show that the proposed algorithm offers similar theoretical guarantees as the state-of-the art results, but does not require the knowledge of oracle-like quantities.
For the multi-armed bandit setting, our techniques result in a theoretically-principled PG algorithm that does not require explicit exploration, the knowledge of the reward gap, the reward distributions, or the noise.
arXiv Detail & Related papers (2024-05-21T18:12:39Z)
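As a point of reference for the entry above, the sketch below shows a generic REINFORCE-style softmax policy-gradient loop for a multi-armed bandit with a running-average baseline. It only illustrates the general method; the paper's specific step sizes, exploration-free guarantees, and analysis are not reproduced, and the function and parameter names are assumptions.

```python
import numpy as np

def softmax_pg_bandit(pull, num_arms, horizon, step_size=0.1, rng=None):
    """Generic stochastic softmax policy-gradient loop for a bandit (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(num_arms)                 # logits of the softmax policy
    baseline = 0.0                             # running average reward
    for t in range(1, horizon + 1):
        probs = np.exp(theta - theta.max())
        probs /= probs.sum()
        arm = rng.choice(num_arms, p=probs)
        reward = pull(arm)                     # environment call supplied by the caller
        baseline += (reward - baseline) / t
        grad = -(reward - baseline) * probs    # REINFORCE gradient of log-softmax
        grad[arm] += reward - baseline
        theta += step_size * grad              # gradient ascent on expected reward
    return theta
```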
- Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z)
- Distributed Consensus Algorithm for Decision-Making in Multi-agent Multi-armed Bandit [7.708904950194129]
We study a structured multi-agent multi-armed bandit (MAMAB) problem in a dynamic environment.
A graph reflects the information-sharing structure among agents, and the arms' reward distributions are piecewise-stationary with several unknown change points.
The goal is to develop a decision-making policy for the agents that minimizes the regret, which is the expected total loss of not playing the optimal arm at each time step.
arXiv Detail & Related papers (2023-06-09T16:10:26Z)
- Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d\sqrt{\sum_{k=1}^{K}\sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z)
- Indexability is Not Enough for Whittle: Improved, Near-Optimal Algorithms for Restless Bandits [30.532795983761314]
We study the problem of planning restless multi-armed bandits (RMABs) with multiple actions.
We first show that Whittle index policies can fail in simple and practically relevant settings.
We then propose an alternate planning algorithm based on the mean-field method.
arXiv Detail & Related papers (2022-10-31T19:35:15Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Nearly Optimal Latent State Decoding in Block MDPs [74.51224067640717]
In episodic Block MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states.
We are first interested in estimating the latent state decoding function based on data generated under a fixed behavior policy.
We then study the problem of learning near-optimal policies in the reward-free framework.
arXiv Detail & Related papers (2022-08-17T18:49:53Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- Robust Reinforcement Learning using Least Squares Policy Iteration with Provable Performance Guarantees [3.8073142980733]
This paper addresses the problem of model-free reinforcement learning for Robust Markov Decision Process (RMDP) with large state spaces.
We first propose the Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation.
We then propose Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy.
arXiv Detail & Related papers (2020-06-20T16:26:50Z)
- Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
arXiv Detail & Related papers (2020-02-01T15:33:50Z)
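To make the last entry concrete, here is a minimal Thompson-sampling-style loop for a mean-variance objective (score = sampled mean minus rho times the empirical variance). It is only an illustration under a Gaussian posterior approximation, not the algorithms analyzed in that paper; the risk-tolerance parameter `rho` and all names are assumptions.

```python
import numpy as np

def mean_variance_thompson(pull, num_arms, horizon, rho=1.0, rng=None):
    """Illustrative Thompson-sampling loop for a mean-variance bandit (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.zeros(num_arms)
    sums = np.zeros(num_arms)
    sq_sums = np.zeros(num_arms)
    for t in range(horizon):
        if t < num_arms:                       # pull each arm once to initialize
            arm = t
        else:
            means = sums / counts
            variances = np.maximum(sq_sums / counts - means ** 2, 1e-12)
            # Sample a plausible mean per arm (Gaussian posterior approximation),
            # then penalize it by the empirical variance, scaled by risk tolerance.
            sampled_means = rng.normal(means, np.sqrt(variances / counts))
            arm = int(np.argmax(sampled_means - rho * variances))
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        sq_sums[arm] += reward ** 2
    return sums.sum()                          # total reward collected
```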