Global Rewards in Restless Multi-Armed Bandits
- URL: http://arxiv.org/abs/2406.00738v2
- Date: Fri, 7 Jun 2024 20:38:51 GMT
- Title: Global Rewards in Restless Multi-Armed Bandits
- Authors: Naveen Raman, Zheyuan Ryan Shi, Fei Fang
- Abstract summary: Restless multi-armed bandits (RMAB) extend multi-armed bandits so pulling an arm impacts future states.
Despite the success of RMABs, a key limiting assumption is the separability of rewards into a sum across arms.
We propose the restless multi-armed bandit with global rewards (RMAB-G), a generalization of RMABs to global, non-separable rewards.
- Score: 37.918982196934216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Restless multi-armed bandits (RMAB) extend multi-armed bandits so that pulling an arm impacts future states. Despite the success of RMABs, a key limiting assumption is the separability of rewards into a sum across arms. We address this deficiency by proposing the restless multi-armed bandit with global rewards (RMAB-G), a generalization of RMABs to global, non-separable rewards. To solve RMAB-G, we develop the Linear- and Shapley-Whittle indices, which extend Whittle indices from RMABs to RMAB-Gs. We prove approximation bounds for these indices, but also point out how they can fail when reward functions are highly non-linear. To overcome this, we propose two sets of adaptive policies: the first computes indices iteratively, and the second combines indices with Monte-Carlo Tree Search (MCTS). Empirically, we demonstrate that our proposed policies outperform baselines and index-based policies on both synthetic data and real-world data from food rescue.
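The abstract describes the adaptive policies only at a high level. Purely as an illustration of the iterative, index-based idea, here is a minimal sketch of greedy arm selection under a non-separable global reward, where each candidate arm is scored by a precomputed Whittle-style index plus its marginal gain in the global reward. All names here (`whittle_idx`, `marginal_reward`) are our assumptions for illustration, not the paper's API.

```python
import numpy as np

def greedy_global_pull(whittle_idx, marginal_reward, budget):
    """Greedily choose `budget` arms. Each candidate is scored by a
    precomputed Whittle-style index plus its marginal gain under the
    global (non-separable) reward -- a sketch of the iterative index
    idea, not the paper's exact policy.
    """
    chosen = []
    for _ in range(budget):
        best_arm, best_score = None, -np.inf
        for arm in range(len(whittle_idx)):
            if arm in chosen:
                continue
            score = whittle_idx[arm] + marginal_reward(chosen, arm)
            if score > best_score:
                best_arm, best_score = arm, score
        chosen.append(best_arm)
    return chosen

# Toy usage: global reward R(S) = sqrt(sum of per-arm values), a simple
# non-separable objective; indices are set to zero so selection is
# driven purely by marginal global reward.
values = np.array([0.9, 0.5, 0.4, 0.8])
R = lambda S: float(np.sqrt(sum(values[a] for a in S)))
marginal = lambda S, a: R(S + [a]) - R(S)
print(greedy_global_pull(np.zeros(4), marginal, budget=2))  # -> [0, 3]
```

The square root in the toy objective makes arms' contributions interact, which is exactly the kind of non-separability that a fixed per-arm index alone cannot capture.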
Related papers
- Stochastic Bandits for Egalitarian Assignment [58.33714486693828]
We study EgalMAB, an egalitarian assignment problem in the context of multi-armed bandits.
We design and analyze a UCB-based policy EgalUCB and establish upper bounds on the cumulative regret.
arXiv Detail & Related papers (2024-10-08T09:49:47Z)
- Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting [67.1631453378926]
Graph-Triggered Bandits is a framework to generalize rested and restless bandits.
In this work, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs.
arXiv Detail & Related papers (2024-09-09T18:23:07Z)
- GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits [16.054685587034836]
GINO-Q is a three-timescale approximation algorithm designed to learn an asymptotically optimal index policy for restless multi-armed bandits (RMABs).
GINO-Q does not require RMABs to be indexable, enhancing its flexibility and applicability.
Our experimental results demonstrate that GINO-Q consistently learns near-optimal policies, even for non-indexable RMABs.
arXiv Detail & Related papers (2024-08-19T10:50:45Z)
- Fairness of Exposure in Online Restless Multi-armed Bandits [8.071147275221973]
Restless multi-armed bandits (RMABs) generalize multi-armed bandits: each arm exhibits Markovian behavior and transitions according to its own dynamics.
We show that our algorithm achieves sublinear fairness regret of $O(\sqrt{T}\ln T)$ in the single-pull case, where $T$ is the total number of episodes. Empirically, we show that our algorithm performs well in the multi-pull scenario as well.
arXiv Detail & Related papers (2024-02-09T11:53:27Z)
- Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints [17.403031677689427]
We introduce a new RMAB model with "long-term fairness constraints" (RMAB-F).
In the online RMAB-F setting, the underlying MDPs associated with each arm are unknown to the decision maker (DM).
We prove that Fair-UCRL ensures probabilistic sublinear bounds on both the reward regret and the fairness violation regret.
arXiv Detail & Related papers (2023-12-16T03:35:56Z)
- Limited Resource Allocation in a Non-Markovian World: The Case of Maternal and Child Healthcare [27.812174610119452]
We consider the problem of scheduling interventions in low resource settings to increase adherence and/or engagement.
Past works have successfully developed several classes of Restless Multi-armed Bandit (RMAB) based solutions for this problem.
We demonstrate significant deviations from the Markov assumption on real-world data on a maternal health awareness program from our partner NGO, ARMMAN.
To tackle the generalised non-Markovian RMAB setting we (i) model each participant's trajectory as a time series, (ii) leverage the power of time-series forecasting models to predict future states, and (iii) propose the Time...
arXiv Detail & Related papers (2023-05-22T02:26:29Z)
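Steps (i) and (ii) in the entry above are concrete enough to sketch. Below is a rough illustration that treats each participant's engagement trajectory as a time series, forecasts the next engagement state with a deliberately simple exponentially weighted model, and spends the intervention budget on the highest-risk participants. The forecaster choice and all function names are our assumptions, not ARMMAN's or the paper's.

```python
import numpy as np

def forecast_engagement(history, alpha=0.5):
    """Exponentially weighted forecast of next-step engagement from a
    binary trajectory (1 = engaged, 0 = not). A deliberately simple
    stand-in for the paper's time-series forecasting models.
    """
    p = float(history[0])
    for x in history[1:]:
        p = alpha * x + (1 - alpha) * p
    return p

def rank_arms_by_risk(trajectories, budget):
    """Spend the intervention budget on the participants whose
    predicted engagement is lowest (highest risk of dropping out)."""
    risk = np.array([1.0 - forecast_engagement(h) for h in trajectories])
    return [int(i) for i in np.argsort(-risk)[:budget]]

# Toy usage: three participants with different engagement histories.
histories = [[1, 1, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
print(rank_arms_by_risk(histories, budget=1))  # -> [0]
```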
- Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z)
- Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts [53.579515853222986]
We study a novel bandit setting, namely the Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB).
This setting is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite time span after the pull.
We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW.
arXiv Detail & Related papers (2022-06-01T15:56:59Z)
- DORB: Dynamically Optimizing Multiple Rewards with Bandits [101.68525259222164]
Policy-based reinforcement learning has proven to be a promising approach for optimizing non-differentiable evaluation metrics for language generation tasks.
We use the Exp3 algorithm for bandits and formulate two approaches for bandit rewards: (1) Single Multi-reward Bandit (SM-Bandit) and (2) Hierarchical Multi-reward Bandit (HM-Bandit).
We empirically show the effectiveness of our approaches via various automatic metrics and human evaluation on two important NLG tasks.
arXiv Detail & Related papers (2020-11-15T21:57:47Z)
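The entry above names Exp3 but does not detail the SM-Bandit and HM-Bandit constructions, so here is a minimal sketch of plain Exp3 itself, the algorithm both approaches build on. Treating each candidate reward metric as an arm is our reading of the setup, not a statement of DORB's exact design.

```python
import numpy as np

def exp3(num_arms, num_rounds, get_reward, gamma=0.1, seed=0):
    """Minimal Exp3: exponential weights plus uniform exploration.

    `get_reward(arm, t)` must return a reward in [0, 1]. In DORB's
    setting the arms would correspond to candidate reward metrics --
    that mapping is an assumption, not the paper's API.
    """
    rng = np.random.default_rng(seed)
    weights = np.ones(num_arms)
    for t in range(num_rounds):
        # Mix the exponential-weights distribution with uniform exploration.
        probs = (1 - gamma) * weights / weights.sum() + gamma / num_arms
        arm = rng.choice(num_arms, p=probs)
        reward = get_reward(arm, t)
        # Importance weighting keeps the reward estimate unbiased.
        weights[arm] *= np.exp(gamma * (reward / probs[arm]) / num_arms)
        weights /= weights.max()  # rescale to avoid overflow on long runs
    return weights / weights.sum()  # final (normalized) weight profile
```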
- Simulation Based Algorithms for Markov Decision Processes and Multi-Action Restless Bandits [0.0]
We consider a restless multi-armed bandit (RMAB) model with a multi-dimensional state space and multiple actions per arm.
We first analyze a standard indexable RMAB (two-action model) and discuss an index-based policy approach.
We present an approximate index algorithm using a Monte-Carlo rollout policy.
arXiv Detail & Related papers (2020-07-25T13:50:08Z)