Fugu-MT Paper Translation (Abstract): Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium

Paper summary: Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium

arxiv url: http://arxiv.org/abs/2002.07066v3
Date: Tue, 23 Jun 2020 21:09:42 GMT
Status: Translation complete
Last updated in system: 2022-12-31 12:43:29.526687
Title: Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium
Title (reference translation): Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium
Authors: Qiaomin Xie, Yudong Chen, Zhaoran Wang, Zhuoran Yang
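Abstract summary: We develop efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games. In the offline setting, we control both players and seek the Nash equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and minimize the regret.
Reference score (attention score computed independently by this site): 116.56359444619441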
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves. To incorporate function approximation, we consider a family of Markov games where the reward function and transition kernel possess a linear structure. Both the offline and online settings of the problems are considered. In the offline setting, we control both players and aim to find the Nash Equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret. For both settings, we propose an optimistic variant of the least-squares minimax value iteration algorithm. We show that our algorithm is computationally efficient and provably achieves an $\tilde O(\sqrt{d^3 H^3 T} )$ upper bound on the duality gap and regret, where $d$ is the linear dimension, $H$ the horizon and $T$ the total number of timesteps. Our results do not require additional assumptions on the sampling model. Our setting requires overcoming several new challenges that are absent in Markov decision processes or turn-based Markov games. In particular, to achieve optimism with simultaneous moves, we construct both upper and lower confidence bounds of the value function, and then compute the optimistic policy by solving a general-sum matrix game with these bounds as the payoff matrices. As finding the Nash Equilibrium of a general-sum game is computationally hard, our algorithm instead solves for a Coarse Correlated Equilibrium (CCE), which can be obtained efficiently. To our best knowledge, such a CCE-based scheme for optimism has not appeared in the literature and might be of interest in its own right.
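Abstract (reference translation): We develop efficient reinforcement learning algorithms for zero-sum finite-horizon Markov games with simultaneous moves. To incorporate function approximation, we consider a family of Markov games in which the reward function and transition kernel have a linear structure. Both the offline and online settings of the problem are considered. In the offline setting, we control both players and seek the Nash equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and minimize the regret. For both settings, we propose an optimistic variant of the least-squares minimax value iteration algorithm. We show that the algorithm is computationally efficient and achieves an $\tilde O(\sqrt{d^3 H^3 T})$ upper bound on the duality gap and the regret, where $d$ is the linear dimension, $H$ is the horizon, and $T$ is the total number of timesteps. Our results require no additional assumptions on the sampling model. Our setting requires overcoming several new challenges that are absent in Markov decision processes and turn-based Markov games. In particular, to achieve optimism with simultaneous moves, we construct both upper and lower confidence bounds of the value function and compute the optimistic policy by solving a general-sum matrix game with these bounds as the payoff matrices. Because finding the Nash equilibrium of a general-sum game is computationally hard, our algorithm instead solves for a coarse correlated equilibrium (CCE), which can be obtained efficiently. To the best of our knowledge, such a CCE-based scheme for optimism has not appeared in the literature and may be of interest in its own right.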
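
The abstract states that the optimistic policy comes from a coarse correlated equilibrium (CCE) of a general-sum matrix game whose payoff matrices are the upper and lower confidence bounds. A joint distribution over action pairs is a CCE when neither player can gain in expectation by committing to a single fixed action. The Python sketch below illustrates that one step under stated assumptions and is not the paper's implementation: the 2x2 matrices `upper` and `lower` are made-up numbers, the mapping "maximizer is paid the upper bound, minimizer is paid one minus the lower bound" is an assumed reading of the abstract, and the CCE is approximated with Hedge (multiplicative weights) self-play, a standard no-regret method swapped in here because the abstract does not say which solver the authors use. The function names `cce_gap` and `hedge_cce` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cce_gap(joint, A, B):
    """Largest expected gain either player gets from a fixed-action deviation.

    joint[i][j] is the probability of the action pair (i, j); A and B are the
    row and column players' payoffs. A gap <= 0 means `joint` is an exact CCE.
    """
    n, m = len(A), len(A[0])
    val_row = sum(joint[i][j] * A[i][j] for i in range(n) for j in range(m))
    val_col = sum(joint[i][j] * B[i][j] for i in range(n) for j in range(m))
    row_marg = [sum(joint[i][j] for j in range(m)) for i in range(n)]
    col_marg = [sum(joint[i][j] for i in range(n)) for j in range(m)]
    best_row = max(sum(col_marg[j] * A[i][j] for j in range(m)) for i in range(n))
    best_col = max(sum(row_marg[i] * B[i][j] for i in range(n)) for j in range(m))
    return max(best_row - val_row, best_col - val_col)


def hedge_cce(A, B, rounds=2000):
    """Approximate a CCE by Hedge self-play; payoffs are assumed to lie in [0, 1].

    Returns the time-averaged joint distribution of the two players' mixed
    strategies. Its CCE gap equals the larger average regret, which shrinks
    on the order of 1 / sqrt(rounds).
    """
    n, m = len(A), len(A[0])
    eta_row = math.sqrt(8 * math.log(n) / rounds)
    eta_col = math.sqrt(8 * math.log(m) / rounds)
    cum_row = [0.0] * n  # cumulative expected payoff of each row action
    cum_col = [0.0] * m  # cumulative expected payoff of each column action
    joint = [[0.0] * m for _ in range(n)]
    for _ in range(rounds):
        x = softmax([eta_row * c for c in cum_row])
        y = softmax([eta_col * c for c in cum_col])
        for i in range(n):
            for j in range(m):
                joint[i][j] += x[i] * y[j] / rounds
        for i in range(n):
            cum_row[i] += sum(y[j] * A[i][j] for j in range(m))
        for j in range(m):
            cum_col[j] += sum(x[i] * B[i][j] for i in range(n))
    return joint


# Made-up upper and lower confidence bounds on the value at one state.
upper = [[0.9, 0.3], [0.2, 0.8]]
lower = [[0.7, 0.1], [0.0, 0.6]]
A = upper                                    # maximizer is optimistic via the upper bound
B = [[1.0 - v for v in row] for row in lower]  # minimizer prefers a small lower bound
joint = hedge_cce(A, B)
print("CCE gap:", cce_gap(joint, A, B))
```

Shifting the minimizer's payoff by a constant (the `1.0 - v` above) leaves the set of CCEs unchanged and keeps payoffs in [0, 1], which the chosen learning rate assumes. An exact CCE can also be found by linear programming, since the CCE conditions are linear in the joint distribution; the self-play version is used here only to keep the sketch dependency-free.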


Related papers