Q-Learning Lagrange Policies for Multi-Action Restless Bandits
- URL: http://arxiv.org/abs/2106.12024v1
- Date: Tue, 22 Jun 2021 19:20:09 GMT
- Title: Q-Learning Lagrange Policies for Multi-Action Restless Bandits
- Authors: Jackson A. Killian, Arpita Biswas, Sanket Shah, Milind Tambe
- Abstract summary: Multi-action restless multi-armed bandits (RMABs) are a powerful framework for constrained resource allocation in which $N$ independent processes are managed.
We design the first algorithms for learning good policies for Multi-action RMABs online using combinations of Lagrangian relaxation and Q-learning.
- Score: 35.022322303796216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-action restless multi-armed bandits (RMABs) are a powerful framework
for constrained resource allocation in which $N$ independent processes are
managed. However, previous work studies only the offline setting, in which problem
dynamics are known. We address this restrictive assumption by designing the first
algorithms for learning good policies for Multi-action RMABs online using
combinations of Lagrangian relaxation and Q-learning. Our first approach,
MAIQL, extends a method for Q-learning the Whittle index in binary-action RMABs
to the multi-action setting. We derive a generalized update rule and
convergence proof and establish that, under standard assumptions, MAIQL
converges to the asymptotically optimal multi-action RMAB policy as
$t\rightarrow{}\infty$. However, MAIQL relies on learning Q-functions and
indexes on two timescales, which leads to slow convergence and requires problem
structure to perform well. Thus, we design a second algorithm, LPQL, which
learns the well-performing and more general Lagrange policy for multi-action
RMABs by learning to minimize the Lagrange bound through a variant of
Q-learning. To ensure fast convergence, we take an approximation strategy that
enables learning on a single timescale, then give a guarantee relating the
approximation's precision to an upper bound of LPQL's return as
$t\rightarrow{}\infty$. Finally, we show that our approaches always outperform
baselines across multiple settings, including one derived from real-world
medication adherence data.
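As a rough illustration of the Lagrangian-relaxation idea the paper builds on, the sketch below runs tabular Q-learning on per-arm, cost-adjusted rewards and then selects a budget-feasible set of actions greedily. It is a minimal toy under stated assumptions, not the authors' MAIQL/LPQL implementation: the fixed multiplier, the random dynamics, and the greedy allocation are simplifications (LPQL additionally learns the multiplier itself, and MAIQL instead learns per-state indices).

```python
# Minimal toy: Lagrange-relaxed tabular Q-learning for a multi-action RMAB.
# NOT the authors' MAIQL/LPQL code. The fixed multiplier LAMBDA, the random
# dynamics, and the greedy budget-feasible allocation are assumptions made
# purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

N, S, A = 5, 3, 3                      # arms, states per arm, actions per arm
COSTS = np.array([0.0, 1.0, 2.0])      # resource cost of each action (assumed)
BUDGET = 4.0                           # per-round budget shared by all arms (assumed)
GAMMA, ALPHA, LAMBDA = 0.95, 0.1, 0.5  # discount, learning rate, Lagrange multiplier

# Random per-arm dynamics: P[i, s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(S), size=(N, S, A))
R = rng.random((N, S))                 # reward depends only on the arm's state (assumed)

Q = np.zeros((N, S, A))                # one Q-table per arm (arms decouple under the relaxation)

def budget_feasible_actions(state, eps=0.1):
    """Epsilon-greedy per arm, skipping actions the remaining budget cannot pay for."""
    actions = np.zeros(N, dtype=int)
    remaining = BUDGET
    for i in range(N):
        prefs = rng.permutation(A) if rng.random() < eps else np.argsort(-Q[i, state[i]])
        for a in prefs:
            if COSTS[a] <= remaining:
                actions[i], remaining = a, remaining - COSTS[a]
                break
    return actions

state = rng.integers(S, size=N)
for t in range(20_000):
    actions = budget_feasible_actions(state)
    nxt = np.array([rng.choice(S, p=P[i, state[i], actions[i]]) for i in range(N)])
    for i in range(N):
        # Lagrange-adjusted reward: the action's cost is charged at price LAMBDA,
        # which decouples the arms so each can run its own Q-learning update.
        r_adj = R[i, state[i]] - LAMBDA * COSTS[actions[i]]
        td_target = r_adj + GAMMA * Q[i, nxt[i]].max()
        Q[i, state[i], actions[i]] += ALPHA * (td_target - Q[i, state[i], actions[i]])
    state = nxt

print("Learned Q-values for arm 0:\n", Q[0])
```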
Related papers
- GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits [16.054685587034836]
GINO-Q is a three-timescale approximation algorithm designed to learn an asymptotically optimal index policy for restless multi-armed bandits (RMABs).
GINO-Q does not require RMABs to be indexable, enhancing its flexibility and applicability.
Our experimental results demonstrate that GINO-Q consistently learns near-optimal policies, even for non-indexable RMABs.
arXiv Detail & Related papers (2024-08-19T10:50:45Z)
- UCB-driven Utility Function Search for Multi-objective Reinforcement Learning [75.11267478778295]
In Multi-objective Reinforcement Learning (MORL), agents are tasked with optimising decision-making behaviours that trade off between multiple, possibly conflicting, objectives.
We focus on the case of linear utility functions parameterised by weight vectors $w$.
We introduce a method based on Upper Confidence Bound to efficiently search for the most promising weight vectors during different stages of the learning process.
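One generic way to realise such a search, sketched below under heavy simplifications, is to treat each candidate weight vector as an arm of a bandit and pick the next one to evaluate by a UCB1 score. This is not the paper's algorithm: the candidate set, the hypothetical evaluate_policy stand-in, and all constants are assumptions for illustration.

```python
# Generic UCB1-over-candidate-weight-vectors sketch; not the paper's method.
# evaluate_policy() is a hypothetical stand-in for training/evaluating a policy
# under the scalarised reward w . r; candidates and constants are assumed.
import numpy as np

rng = np.random.default_rng(1)
W = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])   # candidate weight vectors (assumed)
HIDDEN_UTILITY = np.array([0.4, 0.7, 0.55])          # toy "true" value of each candidate

def evaluate_policy(idx):
    """Noisy estimate of the scalarised return obtained when training under W[idx]."""
    return HIDDEN_UTILITY[idx] + 0.1 * rng.standard_normal()

counts, means = np.zeros(len(W)), np.zeros(len(W))
for t in range(1, 501):
    bonus = np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1.0))
    ucb = np.where(counts > 0, means + bonus, np.inf)  # unseen candidates are tried first
    i = int(np.argmax(ucb))
    reward = evaluate_policy(i)
    counts[i] += 1
    means[i] += (reward - means[i]) / counts[i]

print("Most promising weight vector:", W[int(np.argmax(means))])
```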
arXiv Detail & Related papers (2024-05-01T09:34:42Z)
- Provably Efficient Information-Directed Sampling Algorithms for Multi-Agent Reinforcement Learning [50.92957910121088]
This work designs and analyzes a novel set of algorithms for multi-agent reinforcement learning (MARL) based on the principle of information-directed sampling (IDS).
For episodic two-player zero-sum MGs, we present three sample-efficient algorithms for learning Nash equilibrium.
We extend one of them, Reg-MAIDS, to multi-player general-sum MGs and prove that it can learn either the Nash equilibrium or a coarse correlated equilibrium in a sample-efficient manner.
arXiv Detail & Related papers (2024-04-30T06:48:56Z)
- DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning [66.85379279041128]
In this study, we introduce a framework that leverages Dual Queries and Low-rank approximation Re-ranking to automatically select exemplars for in-context learning.
DQ-LoRe significantly outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4, enhancing performance from 92.5% to 94.2%.
arXiv Detail & Related papers (2023-10-04T16:44:37Z)
- Breaking the Curse of Multiagency: Provably Efficient Decentralized Multi-Agent RL with Function Approximation [44.051717720483595]
This paper presents the first line of MARL algorithms that provably resolve the curse of multiagency under function approximation.
In exchange for learning a weaker version of CCEs, this algorithm applies to a wider range of problems under generic function approximation.
Our algorithm always outputs Markov CCEs (coarse correlated equilibria) and achieves an optimal rate of $\widetilde{\mathcal{O}}(\epsilon^{-2})$ for finding $\epsilon$-optimal solutions.
arXiv Detail & Related papers (2023-02-13T18:59:25Z)
- Addressing the issue of stochastic environments and local decision-making in multi-objective reinforcement learning [0.0]
Multi-objective reinforcement learning (MORL) is a relatively new field that builds on conventional Reinforcement Learning (RL).
This thesis focuses on what factors influence the frequency with which value-based MORL Q-learning algorithms learn the optimal policy for an environment.
arXiv Detail & Related papers (2022-11-16T04:56:42Z)
- Robust Restless Bandits: Tackling Interval Uncertainty with Deep Reinforcement Learning [31.515757763077065]
We introduce Robust Restless Bandits, a generalization of restless multi-armed bandits (RMABs).
We develop solutions for a minimax regret objective when transitions are given by interval uncertainties.
We introduce RMABPPO, a novel deep reinforcement learning algorithm for solving RMABs.
arXiv Detail & Related papers (2021-07-04T17:21:26Z)
- Provable Multi-Objective Reinforcement Learning with Generative Models [98.19879408649848]
We study the problem of single-policy MORL, which learns an optimal policy given the preference over objectives.
Existing methods require strong assumptions such as exact knowledge of the multi-objective decision process.
We propose a new algorithm called model-based envelope value iteration (EVI), which generalizes the enveloped multi-objective $Q$-learning algorithm.
arXiv Detail & Related papers (2020-11-19T22:35:31Z)
- SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
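A minimal tabular paraphrase of those two ingredients is sketched below; the authors' method uses deep Q-ensembles, and the sigmoid weighting form and constants here are assumptions for illustration only.

```python
# Tabular paraphrase of SUNRISE's two ingredients; the weighting form and
# constants are assumptions, and the original method operates on deep Q-ensembles.
import numpy as np

rng = np.random.default_rng(2)
K, S, A = 5, 4, 3                  # ensemble size, number of states, number of actions
Q_ens = rng.random((K, S, A))      # K independently trained Q-estimates (toy values here)
GAMMA, TEMP, LAMB = 0.99, 10.0, 1.0

def weighted_bellman_backup(r, s_next):
    """(a) Bellman target plus a weight that shrinks when the ensemble disagrees.
    In a learning loss, the squared TD error for this transition would be scaled by w."""
    q_next = Q_ens[:, s_next, :].max(axis=1)       # each member's bootstrap value of s_next
    w = 1.0 / (1.0 + np.exp(TEMP * q_next.std()))  # lower weight for high-uncertainty targets
    target = r + GAMMA * q_next.mean()
    return target, w

def ucb_action(s):
    """(b) Optimistic exploration: maximise ensemble mean plus LAMB times ensemble std."""
    mean = Q_ens[:, s, :].mean(axis=0)
    std = Q_ens[:, s, :].std(axis=0)
    return int(np.argmax(mean + LAMB * std))

print(ucb_action(0), weighted_bellman_backup(1.0, s_next=2))
```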
arXiv Detail & Related papers (2020-07-09T17:08:44Z)
- Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning [63.64636047748605]
We develop a new theoretical framework to provide convergence guarantee for the general multi-step MAML algorithm.
In particular, our results suggest that the inner-stage step size needs to be chosen inversely proportional to the number $N$ of inner-stage steps in order for $N$-step MAML to have guaranteed convergence.
arXiv Detail & Related papers (2020-02-18T19:17:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.