Simulation Based Algorithms for Markov Decision Processes and
Multi-Action Restless Bandits
- URL: http://arxiv.org/abs/2007.12933v1
- Date: Sat, 25 Jul 2020 13:50:08 GMT
- Title: Simulation Based Algorithms for Markov Decision Processes and
Multi-Action Restless Bandits
- Authors: Rahul Meshram and Kesav Kaza
- Abstract summary: We consider a restless multi-armed bandit (RMAB) with a multi-dimensional state space and a multi-action bandit model.
We first analyze a standard indexable RMAB (two-action model) and discuss an index based policy approach.
We present an approximate index computation algorithm using a Monte-Carlo rollout policy.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider multi-dimensional Markov decision processes and
formulate a long-term discounted reward optimization problem. Two
simulation-based algorithms, the Monte Carlo rollout policy and the parallel
rollout policy, are studied, and various properties of these policies are
discussed. We next consider a restless multi-armed bandit (RMAB) with a
multi-dimensional state space and a multi-action bandit model. A standard
RMAB has two actions for each arm, whereas a multi-action RMAB has more than
two actions for each arm. A popular approach to RMABs is the Whittle index
based heuristic policy. Indexability is an important requirement for using an
index based policy, and on this basis an RMAB is classified as indexable or
non-indexable. Our interest is in the study of the Monte-Carlo rollout policy
for both indexable and non-indexable restless bandits. We first analyze a
standard indexable RMAB (two-action model) and discuss an index based policy
approach. We present an approximate index computation algorithm using the
Monte-Carlo rollout policy, and show its convergence using a two-timescale
stochastic approximation scheme. Later, we analyze multi-action indexable
RMABs and discuss the index based policy approach. We also study
non-indexable RMABs for both standard and multi-action bandits using the
Monte-Carlo rollout policy.
Related papers
- GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits [16.054685587034836]
GINO-Q is a three-timescale approximation algorithm designed to learn an asymptotically optimal index policy for restless multi-armed bandits (RMABs).
GINO-Q does not require RMABs to be indexable, enhancing its flexibility and applicability.
Our experimental results demonstrate that GINO-Q consistently learns near-optimal policies, even for non-indexable RMABs.
arXiv Detail & Related papers (2024-08-19T10:50:45Z)
- Global Rewards in Restless Multi-Armed Bandits [37.918982196934216]
Restless multi-armed bandits (RMABs) extend multi-armed bandits so that pulling an arm impacts future states.
Despite the success of RMABs, a key limiting assumption is the separability of rewards into a sum across arms.
We propose the restless multi-armed bandit with global rewards (RMAB-G), a generalization of RMABs to global non-separable rewards.
arXiv Detail & Related papers (2024-06-02T13:13:46Z)
- Thompson Exploration with Best Challenger Rule in Best Arm Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z)
- An Optimal Algorithm for the Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit [65.268245109828]
We study the real-valued combinatorial pure exploration problem in the multi-armed bandit (R-CPE-MAB).
Existing methods for the R-CPE-MAB can be seen as special cases of so-called transductive linear bandits.
We propose an algorithm named the gap-based exploration (CombGapE) algorithm, whose sample complexity matches the lower bound.
arXiv Detail & Related papers (2023-06-15T15:37:31Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- Indexability of Finite State Restless Multi-Armed Bandit and Rollout Policy [5.64327489637232]
We consider the finite state restless multi-armed bandit problem.
The classical approach to restless bandits is the Whittle index policy.
We propose an alternative approach to verify the indexability criterion for a single-armed bandit model.
arXiv Detail & Related papers (2023-04-30T06:53:44Z)
- Efficient Policy Iteration for Robust Markov Decision Processes via Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $s$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z)
- Reinforcement Learning for Finite-Horizon Restless Multi-Armed Multi-Action Bandits [8.136957953239254]
We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)^2B.
The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken.
Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy.
arXiv Detail & Related papers (2021-09-20T21:40:12Z)
- Q-Learning Lagrange Policies for Multi-Action Restless Bandits [35.022322303796216]
Multi-action restless multi-armed bandits (RMABs) are a powerful framework for constrained resource allocation in which $N$ independent processes are managed.
We design the first algorithms for learning good policies for multi-action RMABs online, using combinations of Lagrangian relaxation and Q-learning.
arXiv Detail & Related papers (2021-06-22T19:20:09Z)
- Upper Confidence Bounds for Combining Stochastic Bandits [52.10197476419621]
We provide a simple method to combine bandit algorithms.
Our approach is based on a "meta-UCB" procedure that treats each of $N$ individual bandit algorithms as arms in a higher-level $N$-armed bandit problem (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2020-12-24T05:36:29Z)
- Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for the mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
arXiv Detail & Related papers (2020-02-01T15:33:50Z)
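As promised above, here is a minimal sketch of the meta-UCB idea from the
combining-bandits entry: each of N base bandit algorithms is treated as an
arm of a higher-level UCB1 procedure that routes every round to one of them.
The epsilon-greedy base learners, the Bernoulli arms, and all constants are
assumptions made for this illustration, not the paper's algorithm or
analysis.

```python
# Toy sketch of the meta-UCB idea: treat each of N base bandit algorithms as
# an arm of a higher-level UCB1 procedure. The epsilon-greedy base learners,
# Bernoulli arms, and all constants are assumptions made for illustration.
import math
import random

class EpsGreedy:
    """A simple base bandit algorithm over k Bernoulli arms."""
    def __init__(self, k, eps):
        self.k, self.eps = k, eps
        self.counts, self.means = [0] * k, [0.0] * k

    def select(self):
        if random.random() < self.eps or 0 in self.counts:
            return random.randrange(self.k)                      # explore
        return max(range(self.k), key=lambda a: self.means[a])   # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

def meta_ucb(bases, pull, horizon):
    """Choose a base algorithm by UCB1 each round, let it pick the real arm,
    and feed the observed reward back to both levels."""
    n, mu = [0] * len(bases), [0.0] * len(bases)
    for t in range(1, horizon + 1):
        if 0 in n:                      # play every base algorithm once first
            i = n.index(0)
        else:
            i = max(range(len(bases)),
                    key=lambda j: mu[j] + math.sqrt(2 * math.log(t) / n[j]))
        arm = bases[i].select()
        r = pull(arm)
        bases[i].update(arm, r)
        n[i] += 1
        mu[i] += (r - mu[i]) / n[i]     # incremental mean at the meta level
    return mu

random.seed(0)
probs = [0.2, 0.5, 0.8]                # assumed Bernoulli arm means
pull = lambda a: 1.0 if random.random() < probs[a] else 0.0
bases = [EpsGreedy(len(probs), eps) for eps in (0.01, 0.1, 0.3)]
print([round(m, 3) for m in meta_ucb(bases, pull, 5000)])
```

Because the meta level only observes scalar rewards, any base learner
exposing a select/update interface can be slotted in unchanged, which is the
sense in which such a procedure combines bandit algorithms.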