Fugu-MT paper translation (abstract): Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games

Paper overview: Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games

arxiv url: http://arxiv.org/abs/2308.08858v2
Date: Wed, 5 Jun 2024 21:24:33 GMT
Status: translation complete
Last updated in system: 2024-06-08 00:49:21.102608
Title: Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games
Title (reference translation): Improving the Sample Efficiency of Model-Free Algorithms in Zero-Sum Markov Games
Authors: Songtao Feng, Ming Yin, Yu-Xiang Wang, Jing Yang, Yingbin Liang
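Abstract summary: We show that a model-free stage-based Q-learning algorithm can enjoy the same optimality in the $H$ dependence as model-based algorithms. The algorithm features a key novel design that updates the reference value functions as a pair of optimistic and pessimistic value functions.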
Reference score (independently computed attention score): 66.2085181793014
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The problem of two-player zero-sum Markov games has recently attracted increasing interest in theoretical studies of multi-agent reinforcement learning (RL). In particular, for finite-horizon episodic Markov decision processes (MDPs), it has been shown that model-based algorithms can find an $\epsilon$-optimal Nash equilibrium (NE) with a sample complexity of $O(H^3SAB/\epsilon^2)$, which is optimal in its dependence on the horizon $H$ and the number of states $S$ (where $A$ and $B$ denote the numbers of actions of the two players, respectively). However, none of the existing model-free algorithms achieves such optimality. In this work, we propose a model-free stage-based Q-learning algorithm and show that it achieves the same sample complexity as the best model-based algorithm, and hence demonstrate for the first time that model-free algorithms can enjoy the same optimality in the $H$ dependence as model-based algorithms. The main improvement in the dependence on $H$ arises from leveraging the popular variance reduction technique based on the reference-advantage decomposition, previously used only for single-agent RL. However, this technique relies on a critical monotonicity property of the value function, which does not hold in Markov games because the policy is updated via the coarse correlated equilibrium (CCE) oracle. Thus, to extend the technique to Markov games, our algorithm features a key novel design: the reference value functions are updated as the pair of optimistic and pessimistic value functions whose value difference is the smallest in the history, in order to achieve the desired improvement in sample efficiency.
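Abstract (reference translation): In recent years, the problem of two-player zero-sum Markov games has drawn attention in theoretical research on multi-agent reinforcement learning (RL). In particular, for finite-horizon episodic Markov decision processes (MDPs), model-based algorithms can find an $\epsilon$-optimal Nash equilibrium (NE) with a sample complexity of $O(H^3SAB/\epsilon^2)$, which is optimal in its dependence on the horizon $H$ and the number of states $S$ ($A$ and $B$ denote the numbers of actions of the two players, respectively). However, no existing model-free algorithm achieves such optimality. This work proposes a model-free stage-based Q-learning algorithm, shows that it attains the same sample complexity as the best model-based algorithm, and thereby shows for the first time that model-free algorithms can enjoy the same optimality in the $H$ dependence as model-based algorithms. The main improvement in the dependence on $H$ comes from leveraging the popular variance reduction technique based on the reference-advantage decomposition, which had been used only in single-agent RL. This technique, however, relies on a critical monotonicity property of the value function, which does not hold in Markov games because the policy is updated via the coarse correlated equilibrium (CCE) oracle. To extend the technique to Markov games, the proposed algorithm features a key design that updates the reference value functions as the pair of optimistic and pessimistic value functions whose value difference is the smallest in the history, in order to achieve the desired improvement in sample efficiency.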
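
The reference update rule described in the abstract (keep the optimistic/pessimistic pair whose value difference is the smallest in the history) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the names `ReferencePair` and `update_reference` are hypothetical, the sketch tracks a single scalar pair rather than a value function indexed by step and state, and it omits the stage-based schedule, the CCE oracle, and all bonus terms. It only shows why a min-gap rule still yields a well-defined reference when the value estimates themselves are not monotone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReferencePair:
    """Hypothetical container for one reference pair (not from the paper)."""
    upper: float  # optimistic value estimate
    lower: float  # pessimistic value estimate

    @property
    def gap(self) -> float:
        # Value difference between the optimistic and pessimistic estimates.
        return self.upper - self.lower


def update_reference(current: Optional[ReferencePair],
                     optimistic: float,
                     pessimistic: float) -> ReferencePair:
    """Return the pair with the smallest gap seen so far.

    Assumption: the first observation always becomes the reference, and
    ties keep the older pair.
    """
    candidate = ReferencePair(optimistic, pessimistic)
    if current is None or candidate.gap < current.gap:
        return candidate
    return current


# Illustrative, made-up estimates whose gap is not monotone over time.
history = [(10.0, 0.0), (8.0, 3.0), (9.0, 2.0), (7.5, 4.0)]
reference = None
for optimistic, pessimistic in history:
    reference = update_reference(reference, optimistic, pessimistic)
```

In this toy history the third estimate has a larger gap than the second, so the reference is left unchanged at that point; it is replaced only when a strictly smaller gap appears.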

Related papers