Model-Free Learning and Optimal Policy Design in Multi-Agent MDPs Under Probabilistic Agent Dropout
- URL: http://arxiv.org/abs/2304.12458v2
- Date: Sun, 22 Sep 2024 17:10:12 GMT
- Title: Model-Free Learning and Optimal Policy Design in Multi-Agent MDPs Under Probabilistic Agent Dropout
- Authors: Carmel Fiscko, Soummya Kar, Bruno Sinopoli,
- Abstract summary: This work studies a multi-agent Markov decision process (MDP) that can undergo agent dropout and the computation of policies for the post-dropout system.
We first show that the value of the expected post-dropout system can be represented by a single MDP.
More significantly, in a model-free context, it is shown that the robust MDP value can be estimated with samples generated by the pre-dropout system.
- Score: 4.949881799107061
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work studies a multi-agent Markov decision process (MDP) that can undergo agent dropout and the computation of policies for the post-dropout system based on control and sampling of the pre-dropout system. The central planner's objective is to find an optimal policy that maximizes the value of the expected system given a priori knowledge of the agents' dropout probabilities. For MDPs with a certain transition independence and reward separability structure, we assume that removing agents from the system forms a new MDP comprised of the remaining agents with new state and action spaces, transition dynamics that marginalize the removed agents, and rewards that are independent of the removed agents. We first show that under these assumptions, the value of the expected post-dropout system can be represented by a single MDP; this "robust MDP" eliminates the need to evaluate all $2^N$ realizations of the system, where N denotes the number of agents. More significantly, in a model-free context, it is shown that the robust MDP value can be estimated with samples generated by the pre-dropout system, meaning that robust policies can be found before dropout occurs. This fact is used to propose a policy importance sampling (IS) routine that performs policy evaluation for dropout scenarios while controlling the existing system with good pre-dropout policies. The policy IS routine produces value estimates for both the robust MDP and specific post-dropout system realizations and is justified with exponential confidence bounds. Finally, the utility of this approach is verified in simulation, showing how structural properties of agent dropout can help a controller find good post-dropout policies before dropout occurs.
Related papers
- Deontically Constrained Policy Improvement in Reinforcement Learning Agents [0.0]
Markov Decision Processes (MDPs) are the most common model for decision making under uncertainty in the Machine Learning community.<n>An MDP captures non-determinism, probabilistic uncertainty, and an explicit model of action.<n>A Reinforcement Learning (RL) agent learns to act in an MDP by maximizing a utility function.
arXiv Detail & Related papers (2025-06-08T01:01:06Z) - Simulation-Based Optimistic Policy Iteration For Multi-Agent MDPs with Kullback-Leibler Control Cost [3.9052860539161918]
This paper proposes an agent-based optimistic policy (OPI) scheme for learning stationary optimal policies in Markov Decision Processes (MDPs)
The proposed scheme consists of a greedy policy improvement step followed by an m-step temporal difference (TD) policy evaluation step.
We show that both the synchronous (entire state space evaluation) and asynchronous (a uniformly sampled set of substates) versions of the OPI scheme converge to the optimal value function and an optimal joint policy rollout.
arXiv Detail & Related papers (2024-10-19T17:00:23Z) - Conformal Off-Policy Prediction for Multi-Agent Systems [6.32674891108819]
Off-Policy Prediction (OPP) is a paramount problem in data-driven analysis of safety-critical systems.
We introduce MA-COPP, the first conformal prediction method to solve OPP problems involving multi-agent systems.
A key contribution of MA-COPP is to avoid enumeration or exhaustive search of the output space of agent trajectories.
arXiv Detail & Related papers (2024-03-25T15:37:43Z) - Off-Policy Evaluation in Markov Decision Processes under Weak
Distributional Overlap [5.0401589279256065]
We re-visit the task of off-policy evaluation in Markov decision processes (MDPs) under a weaker notion of distributional overlap.
We introduce a class of truncated doubly robust (TDR) estimators which we find to perform well in this setting.
arXiv Detail & Related papers (2024-02-13T03:55:56Z) - Predictable MDP Abstraction for Unsupervised Model-Based RL [93.91375268580806]
We propose predictable MDP abstraction (PMA)
Instead of training a predictive model on the original MDP, we train a model on a transformed MDP with a learned action space.
We theoretically analyze PMA and empirically demonstrate that PMA leads to significant improvements over prior unsupervised model-based RL approaches.
arXiv Detail & Related papers (2023-02-08T07:37:51Z) - Efficient Policy Iteration for Robust Markov Decision Processes via
Regularization [49.05403412954533]
Robust decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between texttts rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z) - Solving the non-preemptive two queue polling model with generally
distributed service and switch-over durations and Poisson arrivals as a
Semi-Markov Decision Process [0.0]
The polling system with switch-over durations is a useful model with several practical applications.
It is classified as a Discrete Event Dynamic System (DEDS) for which no one agreed upon modelling approach exists.
This paper presents a Semi-Markov Decision Process (SMDP) formulation of the polling system as to introduce additional modelling power.
arXiv Detail & Related papers (2021-12-13T11:40:55Z) - Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in
Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP)
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z) - Modular Deep Reinforcement Learning for Continuous Motion Planning with
Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDP)
The novelty is to design an embedded product MDP (EP-MDP) between the LDGBA and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states.
arXiv Detail & Related papers (2021-02-24T01:11:25Z) - RL for Latent MDPs: Regret Guarantees and a Lower Bound [74.41782017817808]
We consider the regret problem for reinforcement learning in latent Markov Decision Processes (LMDP)
In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent.
We show that the key link is a notion of separation between the MDP system dynamics.
arXiv Detail & Related papers (2021-02-09T16:49:58Z) - Exploiting Submodular Value Functions For Scaling Up Active Perception [60.81276437097671]
In active perception tasks, agent aims to select sensory actions that reduce uncertainty about one or more hidden variables.
Partially observable Markov decision processes (POMDPs) provide a natural model for such problems.
As the number of sensors available to the agent grows, the computational cost of POMDP planning grows exponentially.
arXiv Detail & Related papers (2020-09-21T09:11:36Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
implicit distributional actor-critic (IDAC) built on two deep generator networks (DGNs)
Semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.