Lagrangian Relaxation for Multi-Action Partially Observable Restless Bandits: Heuristic Policies and Indexability
- URL: http://arxiv.org/abs/2509.00415v1
- Date: Sat, 30 Aug 2025 08:47:33 GMT
- Title: Lagrangian Relaxation for Multi-Action Partially Observable Restless Bandits: Heuristic Policies and Indexability
- Authors: Rahul Meshram, Kesav Kaza
- Abstract summary: We study multi-action partially observable restless multi-armed bandits. The state of a bandit is not observable, but one of finitely many feedback signals is observable. The computation of optimal value functions for finite-state, finite-action POMDPs is non-trivial.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Partially observable restless multi-armed bandits have found numerous applications, including recommendation systems, communication systems, public-healthcare outreach systems, and operations research. We study multi-action partially observable restless multi-armed bandits, a generalization of the classical restless multi-armed bandit problem in which 1) each bandit has finitely many states and the current state is not observable, and 2) each bandit has finitely many actions; in particular, we assume that more than two actions are available for each bandit. We motivate our problem with the application of public-health intervention planning. We describe the model and formulate a long-term discounted optimization problem, where the state of each bandit evolves according to a Markov process and this evolution is action dependent. The state of a bandit is not observable, but one of finitely many feedback signals is observable. Each bandit yields a reward based on the action taken on that bandit. The agent is assumed to have a budget constraint. The bandits are assumed to be independent; however, they are weakly coupled at the agent through the budget constraint. We first analyze the Lagrangian bound method for our partially observable restless bandits. The computation of optimal value functions for finite-state, finite-action POMDPs is non-trivial; hence, the computation of Lagrangian bounds is also challenging. We describe approximations for the computation of Lagrangian bounds using point-based value iteration (PBVI) and an online rollout policy. We further present various properties of the value functions and provide theoretical insights on PBVI and the online rollout policy. We study heuristic policies for multi-action PORMAB. Finally, we discuss Whittle index policies and their limitations in our model.
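The core step the abstract describes is the Lagrangian relaxation: the hard per-step budget is replaced by a price $\lambda$ on budget consumption, which decouples the weakly coupled arms into independent single-arm belief-state problems whose values, plus the discounted value of the relaxed budget, give an upper bound. Below is a minimal Python sketch of that decomposition for 2-state arms, assuming a uniform belief grid and a prediction-only belief update that ignores the observation channel; the paper instead approximates the per-arm values with PBVI and an online rollout policy over the full observation-driven beliefs, so every name, dynamic, and parameter here is illustrative.

```python
import numpy as np

def arm_value(P, r, c, lam, beta=0.95, grid=200, iters=500, tol=1e-8):
    """Per-arm value function of the relaxed problem: each action a is
    charged lam * c[a] for the budget it consumes, decoupling the arms."""
    A = len(P)
    b = np.linspace(0.0, 1.0, grid)      # discretized beliefs P(state = 1)
    V = np.zeros(grid)
    for _ in range(iters):
        Q = np.empty((A, grid))
        for a in range(A):
            # expected immediate reward minus the Lagrangian budget charge
            rew = (1 - b) * r[0, a] + b * r[1, a] - lam * c[a]
            # predicted next belief under action a (this sketch skips the
            # observation correction that the paper's PBVI handles)
            nb = (1 - b) * P[a][0, 1] + b * P[a][1, 1]
            Q[a] = rew + beta * np.interp(nb, b, V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return b, V_new
        V = V_new
    return b, V

def lagrangian_bound(arms, lam, B, beta=0.95, b0=0.5):
    """Lagrangian upper bound: sum of decoupled arm values plus the
    discounted value of the relaxed per-step budget B."""
    total = 0.0
    for (P, r, c) in arms:
        bg, V = arm_value(P, r, c, lam, beta)
        total += np.interp(b0, bg, V)
    return total + lam * B / (1 - beta)

# Toy arm: 2 hidden states, 3 actions (e.g., no call / automated / live call)
P = [np.array([[0.9, 0.1], [0.3, 0.7]]),    # passive dynamics
     np.array([[0.7, 0.3], [0.2, 0.8]]),    # light intervention
     np.array([[0.5, 0.5], [0.1, 0.9]])]    # strong intervention
r = np.array([[0.0, 0.1, 0.2],              # reward r[state, action]
              [1.0, 1.1, 1.2]])
c = np.array([0.0, 1.0, 2.0])               # per-action budget consumption
print(lagrangian_bound([(P, r, c)] * 5, lam=0.4, B=3.0))
```

Minimizing this bound over $\lambda \ge 0$ is a one-dimensional search and tightens the bound; heuristic policies of this flavor typically reuse the decoupled per-arm solutions to rank arms against the true budget.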
Related papers
- Quick-Draw Bandits: Quickly Optimizing in Nonstationary Environments with Extremely Many Arms [80.05851139852311]
We propose a novel policy that learns reward environments over a continuous space using Gaussian models. We show that our method efficiently learns continuous Lipschitz reward functions with $\mathcal{O}^*(\sqrt{T})$ cumulative regret. Furthermore, our method naturally extends to non-stationary problems with a simple modification.
arXiv Detail & Related papers (2025-05-30T15:15:18Z)
- Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting [67.1631453378926]
Graph-Triggered Bandits is a framework that generalizes rested and restless bandits.
In this work, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs.
arXiv Detail & Related papers (2024-09-09T18:23:07Z)
- Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback [11.262167746929332]
Restless multi-armed bandits (RMAB) play a central role in modeling sequential decision-making problems.
In this paper, we consider the task of learning in episodic RMAB with unknown transition functions and adversarial rewards.
We develop a novel reinforcement learning algorithm with two key components.
arXiv Detail & Related papers (2024-05-02T02:20:19Z)
- Thompson Exploration with Best Challenger Rule in Best Arm Identification [59.02170783023547]
We study the fixed-confidence best arm identification problem in the bandit framework. We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
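A schematic top-two instantiation of this idea for unit-variance Gaussian arms: a posterior sample picks the leader, an empirical separation index picks the "best challenger", and the less-sampled of the two is pulled. The selection rules and stopping threshold below are illustrative stand-ins, not the cited paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def ts_best_challenger(means, delta=0.05, horizon=20000):
    """Fixed-confidence BAI sketch: Thompson sampling picks a leader,
    the challenger is the arm hardest to separate from it."""
    K = len(means)
    n = np.ones(K)                                 # pull counts (one pull each)
    s = np.array([rng.normal(m) for m in means])   # reward sums
    for t in range(K, horizon):
        mu = s / n
        # leader: argmax of a posterior sample (the Thompson step)
        theta = rng.normal(mu, 1.0 / np.sqrt(n))
        leader = int(np.argmax(theta))
        # best challenger: arm minimizing the empirical separation index
        gap2 = (mu[leader] - mu) ** 2 / (2 * (1 / n[leader] + 1 / n))
        gap2[leader] = np.inf
        challenger = int(np.argmin(gap2))
        # play the less-explored of the pair to balance the two candidates
        a = leader if n[leader] <= n[challenger] else challenger
        s[a] += rng.normal(means[a]); n[a] += 1
        # naive GLR-style stopping rule (illustrative threshold only)
        if gap2[challenger] > np.log(t / delta):
            return int(np.argmax(mu)), t
    return int(np.argmax(s / n)), horizon

print(ts_best_challenger([0.2, 0.5, 0.45, 0.9]))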
arXiv Detail & Related papers (2023-10-01T01:37:02Z)
- Indexability of Finite State Restless Multi-Armed Bandit and Rollout Policy [5.64327489637232]
We consider the finite-state restless multi-armed bandit problem.
The classical approach to restless bandits is the Whittle index policy.
We propose an alternative approach to verifying the indexability criteria for a single-armed bandit model; the basic construction both approaches build on is sketched below.
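For context: the Whittle index of a state is the passivity subsidy $\lambda$ at which the passive and active actions become equally attractive, and under indexability it can be located by bisection on $\lambda$. A minimal Python sketch for a fully observable finite-state arm follows; the rollout-based verification in the paper above, and the belief-state dynamics of the main paper, are layered on top of this basic construction, and the toy chain here is illustrative.

```python
import numpy as np

def value_iter(P0, P1, r0, r1, lam, beta=0.95, iters=2000):
    """Value iteration for one arm with passivity subsidy lam.
    Returns Q-values for passive (0) and active (1) in every state."""
    V = np.zeros(len(r0))
    for _ in range(iters):
        Q0 = r0 + lam + beta * P0 @ V   # passive action earns the subsidy
        Q1 = r1 + beta * P1 @ V
        V = np.maximum(Q0, Q1)
    return Q0, Q1

def whittle_index(P0, P1, r0, r1, s, lo=-10.0, hi=10.0, tol=1e-6):
    """Bisection on the subsidy: the Whittle index of state s is the lam
    at which passive and active tie (assumes the arm is indexable)."""
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        Q0, Q1 = value_iter(P0, P1, r0, r1, lam)
        if Q1[s] > Q0[s]:   # active still preferred: subsidy too small
            lo = lam
        else:
            hi = lam
    return 0.5 * (lo + hi)

# Toy 3-state arm: the active action pushes the chain toward good state 2
P0 = np.array([[0.9, 0.1, 0.0], [0.4, 0.5, 0.1], [0.1, 0.4, 0.5]])
P1 = np.array([[0.5, 0.4, 0.1], [0.1, 0.5, 0.4], [0.0, 0.2, 0.8]])
r0 = np.array([0.0, 0.3, 0.6]); r1 = np.array([0.0, 0.4, 0.9])
print([round(whittle_index(P0, P1, r0, r1, s), 3) for s in range(3)])
```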
arXiv Detail & Related papers (2023-04-30T06:53:44Z)
- Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well-established notion for measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the complete policy regret.
We provide an algorithm that w.h.p. achieves a complete policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the $\tilde{\mathcal{O}}$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z)
- Combinatorial Blocking Bandits with Stochastic Delays [33.65025386998747]
Recent work has considered natural variations of the multi-armed bandit problem, where the reward of each arm is a special function of the time passed since its last pulling.
In this work, we extend the above model in two directions: (i) we consider the general setting where more than one arm can be played at each round, subject to feasibility constraints, and (ii) we allow the blocking durations to be stochastic.
We provide a tight analysis of the approximation guarantee of a natural greedy heuristic that always plays the maximum-expected-reward feasible set among the available (non-blocked) arms, as sketched below.
When the arms' expected rewards are unknown, we adapt the above algorithm into a bandit variant.
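A minimal sketch of one such greedy step, with the feasibility constraint simplified to a cardinality budget; the paper handles general feasibility constraints and stochastic delays, so names and parameters here are illustrative.

```python
def greedy_feasible(expected, blocked_until, t, budget):
    """Greedy step at round t: scan arms in decreasing expected reward and
    keep every non-blocked arm that still fits the feasibility budget.
    blocked_until[i] is the first round at which arm i is playable again."""
    order = sorted(range(len(expected)), key=lambda i: -expected[i])
    chosen = []
    for i in order:
        if blocked_until[i] <= t and len(chosen) < budget:
            chosen.append(i)
    return chosen

# Arm 1 (reward 0.9) is blocked until round 5, so at t = 2 the greedy
# falls through to arms 3 and 0.
expected = [0.5, 0.9, 0.2, 0.7]
blocked_until = [0, 5, 0, 0]
print(greedy_feasible(expected, blocked_until, t=2, budget=2))  # [3, 0]
```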
arXiv Detail & Related papers (2021-05-22T02:46:04Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- Contextual Blocking Bandits [35.235375147227124]
We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes an independently sampled context that determines the arms' mean rewards.
Playing an arm blocks it (across all contexts) for a fixed and known number of future time steps.
We propose a UCB-based variant of the full-information algorithm that guarantees an $\mathcal{O}(\log T)$-regret w.r.t. an $\alpha$-regret strategy in $T$ time steps, matching the $\Omega(\log T)$ lower bound.
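A minimal sketch of the blocking mechanism grafted onto vanilla per-context UCB: pulled arms are excluded from the argmax until their known delay expires. Per the summary, the paper's algorithm adapts a full-information procedure; everything below is an illustrative simplification.

```python
import numpy as np

rng = np.random.default_rng(1)

def ucb_with_blocking(mu, delays, T=5000):
    """Sketch: classic UCB indices per (context, arm), but pulling arm a
    blocks it across all contexts for a known delays[a] rounds; blocked
    arms are simply excluded from the argmax."""
    C, K = len(mu), len(mu[0])
    n = np.zeros((C, K)); s = np.zeros((C, K))
    free_at = np.zeros(K)
    total = 0.0
    for t in range(T):
        c = rng.integers(C)                    # i.i.d. context draw
        avail = np.flatnonzero(free_at <= t)   # currently playable arms
        if avail.size == 0:
            continue                           # every arm blocked: skip round
        counts = np.maximum(n[c, avail], 1)
        idx = s[c, avail] / counts + np.sqrt(2 * np.log(t + 2) / counts)
        idx[n[c, avail] == 0] = np.inf         # force one pull of each arm
        a = int(avail[np.argmax(idx)])
        x = rng.normal(mu[c][a], 0.1)          # stochastic reward
        s[c, a] += x; n[c, a] += 1; total += x
        free_at[a] = t + delays[a]             # arm a blocked until then
    return total

mu = [[0.9, 0.2, 0.5], [0.3, 0.8, 0.4]]       # mean reward per (context, arm)
print(ucb_with_blocking(mu, delays=[3, 2, 1]))
```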
arXiv Detail & Related papers (2020-03-06T20:34:42Z)
- Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the means.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z)