Neural Index Policies for Restless Multi-Action Bandits with Heterogeneous Budgets
- URL: http://arxiv.org/abs/2510.22069v1
- Date: Fri, 24 Oct 2025 23:08:36 GMT
- Title: Neural Index Policies for Restless Multi-Action Bandits with Heterogeneous Budgets
- Authors: Himadri S. Pandey, Kai Wang, Gian-Gabriel P. Garcia
- Abstract summary: We introduce a Neural Index Policy (NIP) for multi-action RMABs with heterogeneous budget constraints. NIP unifies index prediction and constrained optimization in a single end-to-end differentiable framework. Empirically, NIP achieves near-optimal performance within 5% of the oracle occupancy-measure policy.
- Score: 2.9059410824803655
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Restless multi-armed bandits (RMABs) provide a scalable framework for sequential decision-making under uncertainty, but classical formulations assume binary actions and a single global budget. Real-world settings, such as healthcare, often involve multiple interventions with heterogeneous costs and constraints, where such assumptions break down. We introduce a Neural Index Policy (NIP) for multi-action RMABs with heterogeneous budget constraints. Our approach learns to assign budget-aware indices to arm-action pairs using a neural network, and converts them into feasible allocations via a differentiable knapsack layer formulated as an entropy-regularized optimal transport (OT) problem. The resulting model unifies index prediction and constrained optimization in a single end-to-end differentiable framework, enabling gradient-based training directly on decision quality. The network is optimized to align its induced occupancy measure with the theoretical upper bound from a linear programming relaxation, bridging asymptotic RMAB theory with practical learning. Empirically, NIP achieves near-optimal performance within 5% of the oracle occupancy-measure policy while strictly enforcing heterogeneous budgets and scaling to hundreds of arms. This work establishes a general, theoretically grounded, and scalable framework for learning index-based policies in complex resource-constrained environments.
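The abstract appeals to "the theoretical upper bound from a linear programming relaxation" without spelling it out. A standard occupancy-measure relaxation for multi-action RMABs with per-action costs $c_k(a)$ and budgets $B_k$ takes the following form (the notation is assumed for illustration, not taken from the paper):

```latex
% Occupancy-measure LP relaxation (stationary form); notation assumed, not from the paper.
\begin{aligned}
\max_{\mu \ge 0}\quad & \sum_{i=1}^{N} \sum_{s,a} \mu_i(s,a)\, r_i(s,a) \\
\text{s.t.}\quad & \sum_{a'} \mu_i(s',a') = \sum_{s,a} P_i(s' \mid s,a)\, \mu_i(s,a) \quad \forall i,\, s', \\
& \sum_{s,a} \mu_i(s,a) = 1 \quad \forall i, \\
& \sum_{i=1}^{N} \sum_{s,a} c_k(a)\, \mu_i(s,a) \le B_k \quad \forall k .
\end{aligned}
```

Because the hard per-step budgets are relaxed to hold only in expectation, the LP value upper-bounds any feasible policy, which is what makes it a natural target for aligning the induced occupancy measure.

The differentiable knapsack layer is likewise described only at a high level. A minimal sketch of how an entropy-regularized OT (Sinkhorn) projection of predicted indices onto a budget-feasible soft allocation could look is given below; the function name `sinkhorn_knapsack_layer`, the capacity parameterization (budget divided by per-action cost, with the no-op action absorbing the remaining slack), and the temperature `eps` are assumptions for the example, not the authors' implementation.

```python
# Illustrative sketch only: a Sinkhorn-based differentiable "knapsack" layer that
# turns predicted arm-action indices into a soft allocation respecting per-action
# budget capacities. Names and the capacity parameterization are assumptions.
import torch


def sinkhorn_knapsack_layer(scores, capacities, eps=0.1, n_iters=50):
    """Project (N, A) index scores onto a soft feasible allocation via
    entropy-regularized optimal transport (Sinkhorn scaling).

    scores:     (N, A) indices for each arm-action pair (higher = more desirable).
    capacities: (A,) per-action capacities, e.g. budget_a / cost_a, with the
                no-op action absorbing the slack so that capacities.sum() == N.
    Returns a transport plan whose rows sum to ~1 (one action distribution per
    arm) and whose columns sum to ~capacities (budgets met in expectation).
    """
    N, A = scores.shape
    row_marginal = torch.ones(N)            # each arm carries unit mass
    col_marginal = capacities               # per-action budget capacities
    K = torch.exp(scores / eps)             # Gibbs kernel built from the indices
    u, v = torch.ones(N), torch.ones(A)
    for _ in range(n_iters):                # standard Sinkhorn iterations
        u = row_marginal / (K @ v + 1e-9)
        v = col_marginal / (K.t() @ u + 1e-9)
    return u.unsqueeze(1) * K * v.unsqueeze(0)   # diag(u) K diag(v)


if __name__ == "__main__":
    torch.manual_seed(0)
    N, A = 8, 3                             # 8 arms, 3 actions (action 0 = no-op)
    scores = torch.randn(N, A, requires_grad=True)
    # hypothetical heterogeneous budgets: 2 arms may get action 1, 1 arm action 2
    capacities = torch.tensor([N - 3.0, 2.0, 1.0])
    plan = sinkhorn_knapsack_layer(scores, capacities)
    print(plan.sum(dim=1))                  # ~1 for every arm
    print(plan.sum(dim=0))                  # ~[5, 2, 1], the capacities
    rewards = torch.rand(N, A)              # stand-in for a decision-quality signal
    loss = -(plan * rewards).sum()
    loss.backward()                         # gradients reach the index scores
    print(scores.grad.shape)                # torch.Size([8, 3])
```

Because every step is differentiable, a decision-quality loss (expected reward under the soft plan, or a discrepancy from the LP-relaxation occupancy measure) can be backpropagated through the allocation into the index network, matching the end-to-end training described in the abstract.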
Related papers
- Multi-Task Vehicle Routing Solver via Mixture of Specialized Experts under State-Decomposable MDP [57.28979643999352]
We propose a framework that enables unified solvers to perceive the shared-component nature across VRP variants. We introduce a State-Decomposable MDP (SDMDP) that reformulates VRPs by expressing the state space as the Cartesian product of basis state spaces. A Latent Space-based SDMDP extension is developed by incorporating both the optimal basis policies and a learnable mixture function.
arXiv Detail & Related papers (2025-10-24T13:31:31Z) - NDCG-Consistent Softmax Approximation with Accelerated Convergence [67.10365329542365]
We propose novel loss formulations that align directly with ranking metrics. We integrate the proposed RG losses with the highly efficient Alternating Least Squares (ALS) optimization method. Empirical evaluations on real-world datasets demonstrate that our approach achieves comparable or superior ranking performance.
arXiv Detail & Related papers (2025-06-11T06:59:17Z) - Asymptotically Optimal Linear Best Feasible Arm Identification with Fixed Budget [55.938644481736446]
We introduce a novel algorithm for best feasible arm identification that guarantees an exponential decay in the error probability. We validate our algorithm through comprehensive empirical evaluations across various problem instances with different levels of complexity.
arXiv Detail & Related papers (2025-06-03T02:56:26Z) - Beyond Non-Degeneracy: Revisiting Certainty Equivalent Heuristic for Online Linear Programming [18.371947752008744]
We show that the Certainty Equivalent (CE) heuristic achieves uniformly near-optimal regret under mild assumptions on the underlying distribution. Our result implies that, contrary to prior belief, CE effectively beats the curse of degeneracy for a wide range of problem instances. These techniques may find potential applications in broader online decision-making contexts.
arXiv Detail & Related papers (2025-01-03T09:21:27Z) - Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization [10.465789490644031]
We propose a novel framework, the robust regularized Markov decision process ($d$-RRMDP). For the offline RL setting, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI).
arXiv Detail & Related papers (2024-11-27T18:57:03Z) - Achieving $\tilde{\mathcal{O}}(1/N)$ Optimality Gap in Restless Bandits through Gaussian Approximation [21.34216861973257]
We study the finite-horizon Restless Multi-Armed Bandit (RMAB) problem with $N$ homogeneous arms. Our approach is based on the construction of a Gaussian system that captures not only the mean but also the variance of the RMAB dynamics. This is the first result to establish an $\tilde{\mathcal{O}}(1/N)$ optimality gap for degenerate RMABs.
arXiv Detail & Related papers (2024-10-19T06:29:18Z) - GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits [16.054685587034836]
GINO-Q is a three-timescale stochastic approximation algorithm designed to learn an asymptotically optimal index policy for restless multi-armed bandits (RMABs). GINO-Q does not require RMABs to be indexable, enhancing its flexibility and applicability. Our experimental results demonstrate that GINO-Q consistently learns near-optimal policies, even for non-indexable RMABs.
arXiv Detail & Related papers (2024-08-19T10:50:45Z) - Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems [53.03951222945921]
We analyze smoothed (perturbed) policies, adding controlled random perturbations to the direction used by the linear oracle. Our main contribution is a generalization bound that decomposes the excess risk into perturbation bias, statistical estimation error, and optimization error. We illustrate the scope of the results on applications such as vehicle scheduling, highlighting how smoothing enables both tractable training and controlled generalization.
arXiv Detail & Related papers (2024-07-24T12:00:30Z) - Implicit Generative Prior for Bayesian Neural Networks [8.013264410621357]
We propose a novel neural adaptive empirical Bayes (NA-EB) framework for complex data structures.
The proposed NA-EB framework combines variational inference with a gradient ascent algorithm.
We demonstrate the practical applications of our framework through extensive evaluations on a variety of tasks.
arXiv Detail & Related papers (2024-04-27T21:00:38Z) - Beyond Reverse KL: Generalizing Direct Preference Optimization with
Diverse Divergence Constraints [26.274786600234876]
The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but amplify safety concerns.
RLHF has emerged as a promising pathway towards AI alignment but brings forth challenges due to its complexity and dependence on a separate reward model.
DPO has been proposed as an alternative, and it remains equivalent to RLHF under the reverse KL regularization constraint.
We show that under certain $f$-divergences, including Jensen-Shannon divergence, forward KL divergence, and $\alpha$-divergences, the complex relationship between the reward and optimal policy can also be simplified.
arXiv Detail & Related papers (2023-09-28T08:29:44Z) - Quantization for decentralized learning under subspace constraints [61.59416703323886]
We consider decentralized optimization problems where agents have individual cost functions to minimize subject to subspace constraints.
We propose and study an adaptive decentralized strategy where the agents employ differential randomized quantizers to compress their estimates.
The analysis shows that, under some general conditions on the quantization noise, the strategy is stable both in terms of mean-square error and average bit rate.
arXiv Detail & Related papers (2022-09-16T09:38:38Z) - Neural-Progressive Hedging: Enforcing Constraints in Reinforcement Learning with Stochastic Programming [8.942831966541231]
We propose a framework that leverages stochastic programming during the online phase of executing a reinforcement learning (RL) policy.
The goal is to ensure feasibility with respect to constraints and risk-based objectives such as conditional value-at-risk (CVaR).
We show that the neural-progressive hedging (NP) framework produces policies that are better than deep RL and other baseline approaches.
arXiv Detail & Related papers (2022-02-27T19:39:19Z) - Resource Allocation via Model-Free Deep Learning in Free Space Optical Communications [119.81868223344173]
The paper investigates the general problem of resource allocation for mitigating channel fading effects in Free Space Optical (FSO) communications.
Under this framework, we propose two algorithms that solve FSO resource allocation problems.
arXiv Detail & Related papers (2020-07-27T17:38:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content and is not responsible for any consequences of its use.