Q-Learning Lagrange Policies for Multi-Action Restless Bandits
- URL: http://arxiv.org/abs/2106.12024v1
- Date: Tue, 22 Jun 2021 19:20:09 GMT
- Title: Q-Learning Lagrange Policies for Multi-Action Restless Bandits
- Authors: Jackson A. Killian, Arpita Biswas, Sanket Shah, Milind Tambe
- Abstract summary: Multi-action restless multi-armed bandits (RMABs) are a powerful framework for constrained resource allocation in which $N$ independent processes are managed.
We design the first algorithms for learning good policies for Multi-action RMABs online using combinations of Lagrangian relaxation and Q-learning.
- Score: 35.022322303796216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-action restless multi-armed bandits (RMABs) are a powerful framework
for constrained resource allocation in which $N$ independent processes are
managed. However, previous work studies only the offline setting, in which problem
dynamics are known. We address this restrictive assumption by designing the first
algorithms for learning good policies for Multi-action RMABs online using
combinations of Lagrangian relaxation and Q-learning. Our first approach,
MAIQL, extends a method for Q-learning the Whittle index in binary-action RMABs
to the multi-action setting. We derive a generalized update rule and
convergence proof and establish that, under standard assumptions, MAIQL
converges to the asymptotically optimal multi-action RMAB policy as
$t\rightarrow{}\infty$. However, MAIQL relies on learning Q-functions and
indexes on two timescales, which leads to slow convergence and requires problem
structure to perform well. Thus, we design a second algorithm, LPQL, which
learns the well-performing and more general Lagrange policy for multi-action
RMABs by learning to minimize the Lagrange bound through a variant of
Q-learning. To ensure fast convergence, we take an approximation strategy that
enables learning on a single timescale, then give a guarantee relating the
approximation's precision to an upper bound of LPQL's return as
$t\rightarrow{}\infty$. Finally, we show that our approaches always outperform
baselines across multiple settings, including one derived from real-world
medication adherence data.
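As a rough illustration of the Lagrangian-relaxation idea the paper builds on, the sketch below runs tabular Q-learning on per-arm, cost-adjusted rewards and then selects a budget-feasible set of actions greedily. It is a minimal toy under stated assumptions, not the authors' MAIQL/LPQL implementation: the fixed multiplier, the random dynamics, and the greedy allocation are simplifications (LPQL additionally learns the multiplier itself, and MAIQL instead learns per-state indices).

```python
# Minimal toy: Lagrange-relaxed tabular Q-learning for a multi-action RMAB.
# NOT the authors' MAIQL/LPQL code. The fixed multiplier LAMBDA, the random
# dynamics, and the greedy budget-feasible allocation are assumptions made
# purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

N, S, A = 5, 3, 3                      # arms, states per arm, actions per arm
COSTS = np.array([0.0, 1.0, 2.0])      # resource cost of each action (assumed)
BUDGET = 4.0                           # per-round budget shared by all arms (assumed)
GAMMA, ALPHA, LAMBDA = 0.95, 0.1, 0.5  # discount, learning rate, Lagrange multiplier

# Random per-arm dynamics: P[i, s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(S), size=(N, S, A))
R = rng.random((N, S))                 # reward depends only on the arm's state (assumed)

Q = np.zeros((N, S, A))                # one Q-table per arm (arms decouple under the relaxation)

def budget_feasible_actions(state, eps=0.1):
    """Epsilon-greedy per arm, skipping actions the remaining budget cannot pay for."""
    actions = np.zeros(N, dtype=int)
    remaining = BUDGET
    for i in range(N):
        prefs = rng.permutation(A) if rng.random() < eps else np.argsort(-Q[i, state[i]])
        for a in prefs:
            if COSTS[a] <= remaining:
                actions[i], remaining = a, remaining - COSTS[a]
                break
    return actions

state = rng.integers(S, size=N)
for t in range(20_000):
    actions = budget_feasible_actions(state)
    nxt = np.array([rng.choice(S, p=P[i, state[i], actions[i]]) for i in range(N)])
    for i in range(N):
        # Lagrange-adjusted reward: the action's cost is charged at price LAMBDA,
        # which decouples the arms so each can run its own Q-learning update.
        r_adj = R[i, state[i]] - LAMBDA * COSTS[actions[i]]
        td_target = r_adj + GAMMA * Q[i, nxt[i]].max()
        Q[i, state[i], actions[i]] += ALPHA * (td_target - Q[i, state[i], actions[i]])
    state = nxt

print("Learned Q-values for arm 0:\n", Q[0])
```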
Related papers
- GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits [16.054685587034836]
GINO-Q is a three-timescale approximation algorithm designed to learn an asymptotically optimal index policy for restless multi-armed bandits (RMABs).
GINO-Q does not require RMABs to be indexable, enhancing its flexibility and applicability.
Our experimental results demonstrate that GINO-Q consistently learns near-optimal policies, even for non-indexable RMABs.
arXiv Detail & Related papers (2024-08-19T10:50:45Z)
- UCB-driven Utility Function Search for Multi-objective Reinforcement Learning [75.11267478778295]
In Multi-objective Reinforcement Learning (MORL), agents are tasked with optimising decision-making behaviours that trade off between multiple, possibly conflicting, objectives.
We focus on the case of linear utility functions parameterised by weight vectors $w$.
We introduce a method based on Upper Confidence Bound to efficiently search for the most promising weight vectors during different stages of the learning process.
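One generic way to realise such a search, sketched below under heavy simplifications, is to treat each candidate weight vector as an arm of a bandit and pick the next one to evaluate by a UCB1 score. This is not the paper's algorithm: the candidate set, the hypothetical evaluate_policy stand-in, and all constants are assumptions for illustration.

```python
# Generic UCB1-over-candidate-weight-vectors sketch; not the paper's method.
# evaluate_policy() is a hypothetical stand-in for training/evaluating a policy
# under the scalarised reward w . r; candidates and constants are assumed.
import numpy as np

rng = np.random.default_rng(1)
W = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])   # candidate weight vectors (assumed)
HIDDEN_UTILITY = np.array([0.4, 0.7, 0.55])          # toy "true" value of each candidate

def evaluate_policy(idx):
    """Noisy estimate of the scalarised return obtained when training under W[idx]."""
    return HIDDEN_UTILITY[idx] + 0.1 * rng.standard_normal()

counts, means = np.zeros(len(W)), np.zeros(len(W))
for t in range(1, 501):
    bonus = np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1.0))
    ucb = np.where(counts > 0, means + bonus, np.inf)  # unseen candidates are tried first
    i = int(np.argmax(ucb))
    reward = evaluate_policy(i)
    counts[i] += 1
    means[i] += (reward - means[i]) / counts[i]

print("Most promising weight vector:", W[int(np.argmax(means))])
```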
arXiv Detail & Related papers (2024-05-01T09:34:42Z)
- Provably Efficient Information-Directed Sampling Algorithms for Multi-Agent Reinforcement Learning [50.92957910121088]
This work designs and analyzes a novel set of algorithms for multi-agent reinforcement learning (MARL) based on the principle of information-directed sampling (IDS).
For episodic two-player zero-sum MGs, we present three sample-efficient algorithms for learning Nash equilibrium.
We extend one of them, Reg-MAIDS, to multi-player general-sum MGs and prove that it can learn either the Nash equilibrium or a coarse correlated equilibrium in a sample-efficient manner.
arXiv Detail & Related papers (2024-04-30T06:48:56Z)
- DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning [66.85379279041128]
In this study, we introduce a framework that leverages Dual Queries and Low-rank approximation Re-ranking to automatically select exemplars for in-context learning.
DQ-LoRe significantly outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4, enhancing performance from 92.5% to 94.2%.
arXiv Detail & Related papers (2023-10-04T16:44:37Z)
- Breaking the Curse of Multiagency: Provably Efficient Decentralized Multi-Agent RL with Function Approximation [44.051717720483595]
This paper presents the first line of MARL algorithms that provably resolve the curse of multiagency under function approximation.
In exchange for learning a weaker version of CCEs, this algorithm applies to a wider range of problems under generic function approximation.
Our algorithm always outputs Markov CCEs (coarse correlated equilibria) and achieves an optimal rate of $\widetilde{\mathcal{O}}(\epsilon^{-2})$ for finding $\epsilon$-optimal solutions.
arXiv Detail & Related papers (2023-02-13T18:59:25Z)
- Addressing the issue of stochastic environments and local decision-making in multi-objective reinforcement learning [0.0]
Multi-objective reinforcement learning (MORL) is a relatively new field that builds on conventional Reinforcement Learning (RL).
This thesis focuses on what factors influence the frequency with which value-based MORL Q-learning algorithms learn the optimal policy for an environment.
arXiv Detail & Related papers (2022-11-16T04:56:42Z)
- Robust Restless Bandits: Tackling Interval Uncertainty with Deep Reinforcement Learning [31.515757763077065]
We introduce Robust Restless Bandits, a generalization of restless multi-armed bandits (RMABs).
We develop solutions for a minimax regret objective when transitions are given by interval uncertainties.
We introduce RMABPPO, a novel deep reinforcement learning algorithm for solving RMABs.
arXiv Detail & Related papers (2021-07-04T17:21:26Z)
- Provable Multi-Objective Reinforcement Learning with Generative Models [98.19879408649848]
We study the problem of single-policy MORL, which learns an optimal policy given the preference over objectives.
Existing methods require strong assumptions such as exact knowledge of the multi-objective decision process.
We propose a new algorithm called model-based envelope value iteration (EVI), which generalizes the enveloped multi-objective $Q$-learning algorithm.
arXiv Detail & Related papers (2020-11-19T22:35:31Z)
- SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
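A minimal tabular paraphrase of those two ingredients is sketched below; the authors' method uses deep Q-ensembles, and the sigmoid weighting form and constants here are assumptions for illustration only.

```python
# Tabular paraphrase of SUNRISE's two ingredients; the weighting form and
# constants are assumptions, and the original method operates on deep Q-ensembles.
import numpy as np

rng = np.random.default_rng(2)
K, S, A = 5, 4, 3                  # ensemble size, number of states, number of actions
Q_ens = rng.random((K, S, A))      # K independently trained Q-estimates (toy values here)
GAMMA, TEMP, LAMB = 0.99, 10.0, 1.0

def weighted_bellman_backup(r, s_next):
    """(a) Bellman target plus a weight that shrinks when the ensemble disagrees.
    In a learning loss, the squared TD error for this transition would be scaled by w."""
    q_next = Q_ens[:, s_next, :].max(axis=1)       # each member's bootstrap value of s_next
    w = 1.0 / (1.0 + np.exp(TEMP * q_next.std()))  # lower weight for high-uncertainty targets
    target = r + GAMMA * q_next.mean()
    return target, w

def ucb_action(s):
    """(b) Optimistic exploration: maximise ensemble mean plus LAMB times ensemble std."""
    mean = Q_ens[:, s, :].mean(axis=0)
    std = Q_ens[:, s, :].std(axis=0)
    return int(np.argmax(mean + LAMB * std))

print(ucb_action(0), weighted_bellman_backup(1.0, s_next=2))
```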
arXiv Detail & Related papers (2020-07-09T17:08:44Z)
- Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning [63.64636047748605]
We develop a new theoretical framework to provide convergence guarantee for the general multi-step MAML algorithm.
In particular, our results suggest that the inner-stage step size needs to be chosen inversely proportional to the number $N$ of inner-stage steps in order for $N$-step MAML to have guaranteed convergence.
arXiv Detail & Related papers (2020-02-18T19:17:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.