ID policy (with reassignment) is asymptotically optimal for heterogeneous weakly-coupled MDPs
- URL: http://arxiv.org/abs/2502.06072v2
- Date: Thu, 20 Feb 2025 04:25:14 GMT
- Title: ID policy (with reassignment) is asymptotically optimal for heterogeneous weakly-coupled MDPs
- Authors: Xiangcheng Zhang, Yige Hong, Weina Wang
- Abstract summary: We study the fully heterogeneous setting of weakly-coupled Markov decision processes (WCMDPs).
We show that, under mild assumptions, a natural adaptation of the ID policy, although originally proposed for a homogeneous special case of WCMDPs, achieves an $O(1/\sqrt{N})$ optimality gap in long-run average reward per arm for fully heterogeneous WCMDPs as $N$ becomes large.
- Score: 3.850666668546735
- Abstract: Heterogeneity poses a fundamental challenge for many real-world large-scale decision-making problems but remains largely understudied. In this paper, we study the fully heterogeneous setting of a prominent class of such problems, known as weakly-coupled Markov decision processes (WCMDPs). Each WCMDP consists of $N$ arms (or subproblems), which have distinct model parameters in the fully heterogeneous setting, leading to the curse of dimensionality when $N$ is large. We show that, under mild assumptions, a natural adaptation of the ID policy, although originally proposed for a homogeneous special case of WCMDPs, in fact achieves an $O(1/\sqrt{N})$ optimality gap in long-run average reward per arm for fully heterogeneous WCMDPs as $N$ becomes large. This is the first asymptotic optimality result for fully heterogeneous average-reward WCMDPs. Our techniques highlight the construction of a novel projection-based Lyapunov function, which witnesses the convergence of rewards and costs to an optimal region in the presence of heterogeneity.
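The ID policy referenced here is simple to state: use a linear-programming relaxation of the WCMDP to obtain a target action for each arm, then at every step grant those targets to arms in increasing ID order for as long as the shared budget permits, with the remaining arms falling back to a passive action. The sketch below illustrates that priority rule for a single budget constraint; the function name, the `lp_targets` input, and the assumption that the passive action consumes no budget are illustrative simplifications, not the paper's actual implementation (which also reassigns IDs to handle heterogeneity).

```python
import numpy as np

def id_policy_step(lp_targets, costs, budget, passive_action=0):
    """One step of an ID-style priority rule (illustrative sketch only).

    lp_targets: length-N array of desired actions per arm, e.g. sampled
                from the per-arm randomized policies of an LP relaxation.
    costs:      costs[i][a] = resource consumed if arm i takes action a;
                the passive action is assumed to consume none.
    budget:     total resource available at this step (one constraint).
    """
    N = len(lp_targets)
    actions = np.full(N, passive_action)
    remaining = budget
    for i in range(N):  # lowest IDs get priority
        a = lp_targets[i]
        if costs[i][a] <= remaining:
            actions[i] = a
            remaining -= costs[i][a]
    return actions
```

Roughly, arms early in the ID order almost always receive their LP-target actions; the projection-based Lyapunov function mentioned in the abstract is what controls how many late-ID arms must deviate, yielding the $O(1/\sqrt{N})$ per-arm gap.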
Related papers
- Mechanism design with multi-armed bandit [8.013444110633223]
A popular approach to automated mechanism design is to formulate a linear program (LP) whose solution gives a mechanism with desired properties.
We analytically derive a class of optimal solutions for such an LP that gives mechanisms achieving the standard properties of efficiency, incentive compatibility, strong budget balance (SBB), and individual rationality (IR).
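As a concrete (if generic) illustration of the LP-based approach, the sketch below solves a small linear program with `scipy.optimize.linprog`; the objective and constraints are placeholders standing in for the mechanism-design constraints (IC, IR, SBB), not the LP derived in the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Toy stand-in for an automated-mechanism-design LP: choose outcome
# probabilities x >= 0 maximizing welfare c @ x subject to linear
# constraints (here, the probabilities summing to one). Placeholder
# data only, not the paper's LP.
c = np.array([3.0, 1.0, 2.0])        # per-outcome welfare (to maximize)
A_eq = np.array([[1.0, 1.0, 1.0]])   # probabilities sum to one
b_eq = np.array([1.0])

res = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 3)
print(res.x)  # an optimal outcome distribution
```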
arXiv Detail & Related papers (2024-11-30T03:59:36Z)
- Achieving O(1/N) Optimality Gap in Restless Bandits through Diffusion Approximation [21.34216861973257]
We study the finite horizon Restless Multi-Armed Bandit (RMAB) problem with $N$ homogeneous arms.
Our novel diffusion-resolving policy achieves an optimality gap of $O(1/N)$ relative to the true optimal value, rather than the LP upper bound.
These insights pave the way for constructing policies that surpass the $O(1/\sqrt{N})$ optimality gap for any RMAB, whether degenerate or not.
arXiv Detail & Related papers (2024-10-19T06:29:18Z)
- Sample-Efficient Constrained Reinforcement Learning with General Parameterization [35.22742439337603]
We consider a constrained Markov Decision Problem (CMDP) where the goal of an agent is to maximize the expected discounted sum of rewards over an infinite horizon.
We develop the Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm that ensures an $\epsilon$ global optimality gap and $\epsilon$ constraint violation.
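The primal-dual template behind such methods alternates a policy-gradient step on the Lagrangian with a projected step on the multiplier; written generically for a constraint $V_c(\theta) \ge b$ (this is the standard scheme, not PD-ANPG's exact accelerated update):

```latex
% Generic primal-dual updates for max_theta V_r(theta)
% subject to V_c(theta) >= b, with Lagrangian
% L(theta, lambda) = V_r(theta) + lambda (V_c(theta) - b):
\begin{aligned}
\theta_{t+1} &= \theta_t + \eta_\theta\,
  \widehat{\nabla}_\theta\bigl[V_r(\theta_t) + \lambda_t V_c(\theta_t)\bigr],\\
\lambda_{t+1} &= \bigl[\lambda_t - \eta_\lambda\,(V_c(\theta_t) - b)\bigr]_{+}.
\end{aligned}
```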
arXiv Detail & Related papers (2024-05-17T08:39:05Z)
- Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps between the rewards of various arms.
arXiv Detail & Related papers (2023-12-19T13:17:43Z)
- First-order Policy Optimization for Robust Markov Decision Process [40.2022466644885]
We consider the problem of solving robust Markov decision processes (MDPs).
A robust MDP involves a set of discounted, finite-state, finite-action-space MDPs with uncertain transition kernels.
For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective.
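For orientation, with an $(\mathbf{s},\mathbf{a})$-rectangular set $\mathcal{P}_{s,a}$ of plausible kernels at each state-action pair, the robust value function satisfies the standard robust Bellman equation, in which nature picks the worst-case kernel per $(s,a)$ (a textbook identity, not a result specific to this paper):

```latex
% Robust Bellman equation for (s,a)-rectangular uncertainty sets:
v^\star(s) = \max_{a \in \mathcal{A}} \;
  \min_{P \in \mathcal{P}_{s,a}}
  \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, v^\star(s') \Big\}.
```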
arXiv Detail & Related papers (2022-09-21T18:10:28Z)
- DeepHAM: A Global Solution Method for Heterogeneous Agent Models with Aggregate Shocks [9.088303226909277]
We propose an efficient, reliable, and interpretable global solution method, the Deep learning-based algorithm for Heterogeneous Agent Models (DeepHAM).
arXiv Detail & Related papers (2021-12-29T03:09:19Z)
- Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process [56.55075925645864]
The problem of the constrained Markov decision process (CMDP) is investigated, where an agent aims to maximize the expected accumulated discounted reward subject to multiple constraints.
A new utilities-dual convex approach is proposed with a novel integration of three ingredients: a regularized policy, a dual regularizer, and Nesterov's gradient descent on the dual.
This is the first demonstration that nonconcave CMDP problems can attain the $\mathcal{O}(1/\epsilon)$ complexity lower bound for convex optimization subject to convex constraints.
arXiv Detail & Related papers (2021-10-20T02:57:21Z)
- Result Diversification by Multi-objective Evolutionary Algorithms with Theoretical Guarantees [94.72461292387146]
We propose to reformulate the result diversification problem as a bi-objective search problem and solve it with a multi-objective evolutionary algorithm (EA), the GSEMO.
We theoretically prove that the GSEMO can achieve the optimal polynomial-time approximation ratio, $1/2$.
When the objective function changes dynamically, the GSEMO can maintain this approximation ratio in polynomial running time, addressing the open question proposed by Borodin et al.
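GSEMO itself is compact: maintain an archive of mutually non-dominated solutions, mutate a uniformly chosen member by independent bit flips, and insert the offspring unless an archived point weakly dominates it. Below is a minimal sketch for generic bi-objective maximization over bitstrings; the function `f` and all names are placeholders, not the paper's diversity objectives.

```python
import random

def dominates(u, v):
    """True if u weakly dominates v and beats it in some objective."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def gsemo(f, n, iterations, seed=0):
    """GSEMO for bi-objective maximization over {0, 1}^n (sketch).

    f maps a bitstring (tuple of 0/1) to a tuple of objective values.
    """
    rng = random.Random(seed)
    x = tuple(rng.randint(0, 1) for _ in range(n))
    archive = {x: f(x)}
    for _ in range(iterations):
        parent = rng.choice(list(archive))
        # Standard bit-wise mutation: flip each bit with probability 1/n.
        child = tuple(b ^ (rng.random() < 1.0 / n) for b in parent)
        fc = f(child)
        # Reject the child if some archived point weakly dominates it.
        if any(all(a >= c for a, c in zip(fv, fc)) for fv in archive.values()):
            continue
        # Otherwise drop everything the child dominates and insert it.
        archive = {y: fy for y, fy in archive.items() if not dominates(fc, fy)}
        archive[child] = fc
    return archive  # maps solutions to their objective vectors
```

In the paper's setting, `f` would return the bi-objective values of the reformulated diversification problem, and the approximate solution is read off the final archive.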
arXiv Detail & Related papers (2021-10-18T14:00:22Z)
- Permutation Invariant Policy Optimization for Mean-Field Multi-Agent Reinforcement Learning: A Principled Approach [128.62787284435007]
We propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture (see the sketch below).
We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence.
In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors.
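Permutation invariance here can be illustrated with a DeepSets-style critic: each agent's observation passes through a shared embedding, mean pooling collapses the agent dimension, and a head scores the pooled summary, so reordering agents cannot change the value. The class name and layer sizes below are placeholders, not MF-PPO's actual architecture.

```python
import torch
import torch.nn as nn

class PermutationInvariantCritic(nn.Module):
    """V(obs_1..obs_N) that is invariant to the ordering of agents."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, num_agents, obs_dim); shared embedding per agent,
        # then mean pooling removes any dependence on agent order.
        pooled = self.embed(obs).mean(dim=1)
        return self.head(pooled).squeeze(-1)

critic = PermutationInvariantCritic(obs_dim=8)
obs = torch.randn(4, 10, 8)
perm = obs[:, torch.randperm(10), :]  # shuffle the agent dimension
assert torch.allclose(critic(obs), critic(perm), atol=1e-6)
```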
arXiv Detail & Related papers (2021-05-18T04:35:41Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov decision processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
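For context, the minimax sample complexity in this generative-model setting is known up to logarithmic factors, and the paper's contribution is attaining it for the full range of accuracy levels $\varepsilon \in (0, (1-\gamma)^{-1}]$:

```latex
% Minimax number of samples to learn an eps-optimal policy
% from a generative model (logarithmic factors suppressed):
\widetilde{\Theta}\!\left(
  \frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\gamma)^{3}\,\varepsilon^{2}}
\right),
\qquad \varepsilon \in \bigl(0,\ (1-\gamma)^{-1}\bigr].
```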
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.