HSVI-based Online Minimax Strategies for Partially Observable Stochastic Games with Neural Perception Mechanisms
- URL: http://arxiv.org/abs/2404.10679v1
- Date: Tue, 16 Apr 2024 15:58:20 GMT
- Title: HSVI-based Online Minimax Strategies for Partially Observable Stochastic Games with Neural Perception Mechanisms
- Authors: Rui Yan, Gabriel Santos, Gethin Norman, David Parker, Marta Kwiatkowska,
- Abstract summary: We consider a variant of continuous-state partially-observable games with neural perception mechanisms and an asymmetric information structure.
One agent has partial information, while the other agent is assumed to have full knowledge of the state.
We present an efficient online method to compute an $varepsilon$-minimax strategy profile for each agent.
- Score: 31.51588071503617
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider a variant of continuous-state partially-observable stochastic games with neural perception mechanisms and an asymmetric information structure. One agent has partial information, with the observation function implemented as a neural network, while the other agent is assumed to have full knowledge of the state. We present, for the first time, an efficient online method to compute an $\varepsilon$-minimax strategy profile, which requires only one linear program to be solved for each agent at every stage, instead of a complex estimation of opponent counterfactual values. For the partially-informed agent, we propose a continual resolving approach which uses lower bounds, pre-computed offline with heuristic search value iteration (HSVI), instead of opponent counterfactual values. This inherits the soundness of continual resolving at the cost of pre-computing the bound. For the fully-informed agent, we propose an inferred-belief strategy, where the agent maintains an inferred belief about the belief of the partially-informed agent based on (offline) upper bounds from HSVI, guaranteeing $\varepsilon$-distance to the value of the game at the initial belief known to both agents.
Related papers
- Stochastic Principal-Agent Problems: Efficient Computation and Learning [25.637633553882985]
A principal and an agent interact in a environment, each privy to observations about the state not available to the other.
The model encompasses as special cases extensive-form games (EFGs) and approaches games of Markov decision processes (POMDPs)
We show an efficient algorithm for an episodic reinforcement learning setting where transition probabilities are unknown.
arXiv Detail & Related papers (2023-06-06T16:20:44Z) - Cooperative Thresholded Lasso for Sparse Linear Bandit [6.52540785559241]
We present a novel approach to address the multi-agent sparse contextual linear bandit problem.
It is first algorithm that tackles row-wise distributed data in sparse linear bandits.
It is widely applicable to high-dimensional multi-agent problems where efficient feature extraction is critical for minimizing regret.
arXiv Detail & Related papers (2023-05-30T16:05:44Z) - Provably Efficient Neural Offline Reinforcement Learning via Perturbed
Rewards [33.88533898709351]
VIPeR amalgamates the randomized value function idea with the pessimism principle.
It implicitly obtains pessimism by simply perturbing the offline data multiple times.
It is both provably and computationally efficient in general Markov decision processes (MDPs) with neural network function approximation.
arXiv Detail & Related papers (2023-02-24T17:52:12Z) - Provably Efficient Fictitious Play Policy Optimization for Zero-Sum
Markov Games with Structured Transitions [145.54544979467872]
We propose and analyze new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions.
We prove tight $widetildemathcalO(sqrtK)$ regret bounds after $K$ episodes in a two-agent competitive game scenario.
Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization.
arXiv Detail & Related papers (2022-07-25T18:29:16Z) - Approximating Discontinuous Nash Equilibrial Values of Two-Player
General-Sum Differential Games [21.291449080239673]
This paper extends from previous SOTA on zero-sum games with continuous values to general-sum games with discontinuous values, where the discontinuity is caused by that of the players' losses.
We show that due to its lack of convergence proof and generalization analysis on discontinuous losses, the existing self-supervised learning technique fails to generalize and raises safety concerns in an autonomous driving application.
Our solution is to first pre-train the value network on supervised Nash equilibria, and then refine it by minimizing a loss that combines the supervised data with the PDE and boundary conditions.
arXiv Detail & Related papers (2022-07-05T02:22:05Z) - Multi-Agent Neural Rewriter for Vehicle Routing with Limited Disclosure
of Costs [65.23158435596518]
Solving the multi-vehicle routing problem as a team Markov game with partially observable costs.
Our multi-agent reinforcement learning approach, the so-called multi-agent Neural Rewriter, builds on the single-agent Neural Rewriter to solve the problem by iteratively rewriting solutions.
arXiv Detail & Related papers (2022-06-13T09:17:40Z) - Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret
Learning in Markov Games [95.10091348976779]
We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents.
We propose a new algorithm, underlineDecentralized underlineOptimistic hypeunderlineRpolicy munderlineIrror deunderlineScent (DORIS)
DORIS achieves $sqrtK$-regret in the context of general function approximation, where $K$ is the number of episodes.
arXiv Detail & Related papers (2022-06-03T14:18:05Z) - Byzantine-Robust Online and Offline Distributed Reinforcement Learning [60.970950468309056]
We consider a distributed reinforcement learning setting where multiple agents explore the environment and communicate their experiences through a central server.
$alpha$-fraction of agents are adversarial and can report arbitrary fake information.
We seek to identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents.
arXiv Detail & Related papers (2022-06-01T00:44:53Z) - Bandit Linear Optimization for Sequential Decision Making and
Extensive-Form Games [102.23975166536326]
Tree-form sequential decision making (TFSDM) extends classical one-shot decision making by modeling tree-form interactions between an agent and a potentially adversarial environment.
It captures the online decision-making problems that each player faces in an extensive-form game, as well as Markov decision processes and partially-observable Markov decision processes where the agent conditions on observed history.
In this paper, we give the first algorithm for the bandit linear optimization problem for dilatedDM that offers both (i) linear-time losses and (ii) $O(sqrtT)$ cumulative regret in
arXiv Detail & Related papers (2021-03-08T05:00:13Z) - Maximizing Information Gain in Partially Observable Environments via
Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.