Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints
- URL: http://arxiv.org/abs/2312.10303v2
- Date: Fri, 22 Dec 2023 01:40:28 GMT
- Title: Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints
- Authors: Shufan Wang, Guojun Xiong, Jian Li
- Abstract summary: We introduce RMAB-F, a new RMAB model with "long-term fairness constraints".
For the online RMAB-F setting, the underlying MDPs associated with each arm are unknown to the DM.
We prove that Fair-UCRL ensures probabilistic sublinear bounds on both the reward regret and the fairness violation regret.
- Score: 17.403031677689427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Restless multi-armed bandits (RMAB) have been widely used to model sequential
decision making problems with constraints. The decision maker (DM) aims to
maximize the expected total reward over an infinite horizon under an
"instantaneous activation constraint" that at most B arms can be activated at
any decision epoch, where the state of each arm evolves stochastically
according to a Markov decision process (MDP). However, this basic model fails
to provide any fairness guarantee among arms. In this paper, we introduce
RMAB-F, a new RMAB model with "long-term fairness constraints", where the
objective is to maximize the long-term reward while satisfying a minimum
long-term activation fraction for each arm. For the online RMAB-F
setting (i.e., the underlying MDPs associated with each arm are unknown to the
DM), we develop a novel reinforcement learning (RL) algorithm named Fair-UCRL.
We prove that Fair-UCRL ensures probabilistic sublinear bounds on both the
reward regret and the fairness violation regret. Compared with off-the-shelf RL
methods, our Fair-UCRL is much more computationally efficient since it contains
a novel exploitation step that leverages a low-complexity index policy for making
decisions. Experimental results further demonstrate the effectiveness of our
Fair-UCRL.
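To make the two constraints concrete, below is a minimal sketch of how an index policy could respect both the instantaneous activation constraint (at most B arms per decision epoch) and the long-term fairness constraint (a minimum activation fraction per arm). This is not the authors' Fair-UCRL algorithm: the function name, the shortfall-first priority rule, and the assumption that per-arm index values are already available are illustrative assumptions.

```python
# Illustrative sketch only (not Fair-UCRL): activate at most B arms per epoch
# while steering each arm's empirical activation fraction toward its target eta[i].
import numpy as np

def select_arms(indices, activation_counts, t, B, eta):
    """Pick at most B arms to activate at decision epoch t (1-indexed).

    indices           : per-arm index values (e.g. Whittle-style indices), assumed given
    activation_counts : number of times each arm has been activated so far
    eta               : required minimum long-term activation fraction per arm
    """
    N = len(indices)
    # Arms whose empirical activation fraction lags their fairness target.
    behind = [i for i in range(N) if activation_counts[i] / t < eta[i]]
    # Serve lagging arms first, largest shortfall first.
    behind.sort(key=lambda i: activation_counts[i] / t - eta[i])
    chosen = behind[:B]
    # Fill any remaining budget with the highest-index arms.
    if len(chosen) < B:
        rest = [i for i in np.argsort(-np.asarray(indices)) if i not in chosen]
        chosen += rest[: B - len(chosen)]
    return chosen

# Example: 4 arms, budget B = 2, each arm must be activated at least 20% of the time.
chosen = select_arms(indices=np.array([0.9, 0.1, 0.5, 0.3]),
                     activation_counts=np.array([5, 0, 3, 2]),
                     t=10, B=2, eta=np.array([0.2, 0.2, 0.2, 0.2]))
```

Fair-UCRL additionally maintains confidence sets over the unknown per-arm MDPs and comes with reward-regret and fairness-violation-regret guarantees; the greedy rule above only heuristically keeps the empirical activation fractions near their targets.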
Related papers
- Best Arm Identification with Minimal Regret [55.831935724659175]
This problem elegantly amalgamates regret minimization and best arm identification (BAI).
The agent's goal is to identify the best arm with a prescribed confidence level.
The Double KL-UCB algorithm achieves optimality as the confidence level tends to zero.
arXiv Detail & Related papers (2024-09-27T16:46:02Z)
- Combinatorial Multivariant Multi-Armed Bandits with Applications to Episodic Reinforcement Learning and Beyond [58.39457881271146]
We introduce a novel framework of combinatorial multi-armed bandits (CMAB) with multivariant and probabilistically triggering arms (CMAB-MT).
Compared with existing CMAB works, CMAB-MT not only enhances the modeling power but also allows improved results by leveraging distinct statistical properties for multivariant random variables.
Our framework can include many important problems as applications, such as episodic reinforcement learning (RL) and probabilistic maximum coverage for goods distribution.
arXiv Detail & Related papers (2024-06-03T14:48:53Z)
- More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning [58.626683114119906]
We show that Distributional Reinforcement Learning (DistRL) can obtain second-order bounds in both online and offline RL.
Our results are the first second-order bounds for low-rank MDPs and for offline RL.
arXiv Detail & Related papers (2024-02-11T13:25:53Z)
- Fairness of Exposure in Online Restless Multi-armed Bandits [8.071147275221973]
Restless multi-armed bandits (RMABs) generalize multi-armed bandits: each arm exhibits Markovian behavior and evolves according to its own transition dynamics.
We show that our algorithm achieves sublinear fairness regret of $O(\sqrt{T \ln T})$ in the single-pull case, with $T$ being the total number of episodes. Empirically, we show that our algorithm performs well in the multi-pull scenario as well.
arXiv Detail & Related papers (2024-02-09T11:53:27Z)
- AlberDICE: Addressing Out-Of-Distribution Joint Actions in Offline Multi-Agent RL via Alternating Stationary Distribution Correction Estimation [65.4532392602682]
One of the main challenges in offline Reinforcement Learning (RL) is the distribution shift that arises from the learned policy deviating from the data collection policy.
This is often addressed by avoiding out-of-distribution (OOD) actions during policy improvement as their presence can lead to substantial performance degradation.
We introduce AlberDICE, an offline MARL algorithm that performs centralized training of individual agents based on stationary distribution optimization.
arXiv Detail & Related papers (2023-11-03T18:56:48Z)
- Towards Soft Fairness in Restless Multi-Armed Bandits [8.140037969280716]
Restless multi-armed bandits (RMAB) provide a framework for allocating limited resources under uncertainty.
To avoid starvation of interventions across individuals/regions/communities, we first provide a soft fairness constraint.
We then provide an approach to enforce the soft fairness constraint in RMABs.
arXiv Detail & Related papers (2022-07-27T07:56:32Z)
- Efficient Resource Allocation with Fairness Constraints in Restless Multi-Armed Bandits [8.140037969280716]
Restless Multi-Armed Bandits (RMAB) are an apt model to represent decision-making problems in public health interventions.
In this paper, we are interested in ensuring that RMAB decision making is also fair to different arms while maximizing expected value.
arXiv Detail & Related papers (2022-06-08T13:28:29Z)
- Learning in Restless Bandits under Exogenous Global Markov Process [13.836565669337057]
We consider an extension to the restless multi-armed bandit (RMAB) problem with unknown arm dynamics.
Under each global state, the reward process of each arm evolves according to an unknown Markovian rule.
We develop the Learning under Exogenous Markov Process (LEMP) algorithm to minimize regret.
arXiv Detail & Related papers (2021-12-17T12:47:30Z)
- Achieving the Pareto Frontier of Regret Minimization and Best Arm Identification in Multi-Armed Bandits [91.8283876874947]
We design and analyze the BoBW-lil'UCB$(\gamma)$ algorithm.
We show that no algorithm can simultaneously perform optimally for both the regret minimization (RM) and BAI objectives.
We also show that BoBW-lil'UCB$(\gamma)$ outperforms a competing algorithm in terms of time complexity and regret.
arXiv Detail & Related papers (2021-10-16T17:52:32Z)
- Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits [61.490254407420906]
We study the online restless bandit problem, where the state of each arm evolves according to a Markov chain.
We propose Restless-UCB, a learning policy that follows the explore-then-commit framework; a generic sketch of that framework appears after this list.
arXiv Detail & Related papers (2020-11-05T05:16:04Z)
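As referenced in the Restless-UCB entry above, below is a minimal sketch of the generic explore-then-commit framework in its simplest stochastic-bandit form. It is not the Restless-UCB algorithm itself, which operates on restless Markovian arms; the function and parameter names are illustrative assumptions.

```python
# Generic explore-then-commit sketch for a stochastic bandit (illustrative only).
import numpy as np

def explore_then_commit(pull, n_arms, explore_pulls_per_arm, horizon):
    """pull(i) returns a stochastic reward for arm i; all names are illustrative."""
    sums = np.zeros(n_arms)
    counts = np.zeros(n_arms)
    total_reward = 0.0
    t = 0
    # Exploration phase: pull each arm a fixed number of times, round-robin.
    while t < explore_pulls_per_arm * n_arms and t < horizon:
        i = t % n_arms
        r = pull(i)
        sums[i] += r
        counts[i] += 1
        total_reward += r
        t += 1
    # Commit phase: play the empirically best arm for the rest of the horizon.
    best = int(np.argmax(sums / np.maximum(counts, 1)))
    while t < horizon:
        total_reward += pull(best)
        t += 1
    return best, total_reward
```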