Smoothing Policy Iteration for Zero-sum Markov Games
- URL: http://arxiv.org/abs/2212.01623v1
- Date: Sat, 3 Dec 2022 14:39:06 GMT
- Title: Smoothing Policy Iteration for Zero-sum Markov Games
- Authors: Yangang Ren, Yao Lyu, Wenxuan Wang, Shengbo Eben Li, Zeyang Li,
Jingliang Duan
- Abstract summary: We propose the smoothing policy iteration (SPI) algorithm to solve zero-sum MGs approximately.
Specifically, the adversarial policy serves as the weight function to enable efficient sampling over action spaces.
We also propose a model-based algorithm called Smooth adversarial Actor-critic (SaAC) by extending SPI with function approximation.
- Score: 9.158672246275348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero-sum Markov Games (MGs) have been an effective framework for multi-agent
systems and robust control, wherein a minimax problem is constructed to solve
for the equilibrium policies. At present, this formulation is well studied under
tabular settings, wherein the maximum operator is solved exactly to calculate
the worst-case value function. However, it is non-trivial to extend such
methods to complex tasks, as finding the maximum over large-scale action
spaces is usually cumbersome. In this paper, we propose the smoothing policy
iteration (SPI) algorithm to solve zero-sum MGs approximately, where the
maximum operator is replaced by the weighted LogSumExp (WLSE) function to
obtain nearly optimal equilibrium policies. Specifically, the adversarial
policy serves as the weight function to enable efficient sampling over action
spaces. We also prove the convergence of SPI and analyze its approximation
error in the $\infty$-norm based on the contraction mapping theorem. Besides, we
propose a model-based algorithm called Smooth adversarial Actor-critic (SaAC)
by extending SPI with function approximation. The target value related to the
WLSE function is evaluated from sampled trajectories, a mean-square error loss
is constructed to optimize the value function, and gradient-ascent-descent
methods are adopted to optimize the protagonist and adversarial policies
jointly. In addition, we incorporate the reparameterization technique in
model-based gradient back-propagation to prevent gradient vanishing caused by
sampling from the stochastic policies. We verify our algorithm in both tabular
and function approximation settings. Results show that SPI can approximate the
worst-case value function with high accuracy, and that SaAC can stabilize the
training process and improve adversarial robustness by a large margin.
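The abstract does not restate the exact form of the WLSE operator, so the following is only a minimal sketch of the general idea: a weighted LogSumExp with the adversarial policy as the weight, which smoothly approximates the maximum over adversarial actions and can be estimated from actions sampled under that policy instead of enumerating the action space. The temperature `beta`, the function names, and the Monte-Carlo estimator are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def wlse(q_values, weights, beta=10.0):
    """Weighted LogSumExp over a finite action set (tabular case).

    Assumes `weights` is a probability distribution (e.g. the adversarial
    policy pi_adv(.|s)). As beta grows, the result approaches max(q_values).
    """
    q_max = np.max(q_values)  # shift for numerical stability
    return q_max + np.log(np.sum(weights * np.exp(beta * (q_values - q_max)))) / beta

def wlse_monte_carlo(q_fn, sample_adv_action, n_samples=64, beta=10.0):
    """Monte-Carlo estimate of the same quantity using actions drawn from the
    adversarial policy, so no sweep over a large action space is needed."""
    q_samples = np.array([q_fn(sample_adv_action()) for _ in range(n_samples)])
    q_max = q_samples.max()
    # (1/beta) * log E_{u ~ pi_adv}[exp(beta * Q(s, u))], estimated by the sample mean.
    return q_max + np.log(np.mean(np.exp(beta * (q_samples - q_max)))) / beta

# Tabular example: 5 adversarial actions under a uniform adversarial policy.
q = np.array([1.0, 0.2, -0.5, 0.9, 0.3])
pi_adv = np.full(5, 0.2)
print(wlse(q, pi_adv, beta=20.0))  # close to max(q) = 1.0 for large beta
```

In this sketch the smoothed value is differentiable in the Q-values and lower-bounds the hard maximum when the weights sum to one, which is what allows it to stand in for the worst-case target in a value-iteration-style update.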