Fugu-MT Paper Translation (Abstract): Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

Paper overview: Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

arxiv url: http://arxiv.org/abs/2007.07461v3
Date: Tue, 8 Aug 2023 22:36:08 GMT
Status: translation complete
Last updated in system: 2023-08-10 18:38:23.627247
Title: Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity
Title (reference translation): Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity
Authors: Kaiqing Zhang, Sham M. Kakade, Tamer Ba\c{s}ar, Lin F. Yang
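Abstract summary: Model-based MARL achieves a sample complexity of $\tilde O(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value. This sample bound is minimax-optimal (up to logarithmic factors) when the algorithm is reward-agnostic, that is, when it queries state transition samples without reward knowledge.
Reference score (independently computed attention score): 67.02490430380415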
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously using samples. Though intuitive and widely used, the sample complexity of model-based MARL algorithms has not been fully investigated. In this paper, our goal is to address the fundamental question about its sample complexity. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of $\tilde O(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error, and the $\epsilon$-NE policies with a smooth planning oracle, where $\gamma$ is the discount factor, and $S,A,B$ denote the state space, and the action spaces for the two agents. We further show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge, by establishing a matching lower bound. This is in contrast to the usual reward-aware setting, with a $\tilde\Omega(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2})$ lower bound, where this model-based approach is near-optimal with only a gap on the $|A|,|B|$ dependence. Our results not only demonstrate the sample-efficiency of this basic model-based approach in MARL, but also elaborate on the fundamental tradeoff between its power (easily handling the more challenging reward-agnostic case) and limitation (less adaptive and suboptimal in $|A|,|B|$), which arises particularly in the multi-agent context.
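Abstract (reference translation): Model-based reinforcement learning (RL), which uses an empirical model, has long been recognized as one of the cornerstones of RL. It is particularly well suited to multi-agent RL (MARL), because it naturally decouples the learning and planning phases and avoids the non-stationarity problem that arises when all agents improve their policies simultaneously using samples. Although intuitive and widely used, the sample complexity of model-based MARL algorithms has not been fully investigated. This paper aims to address the fundamental question of its sample complexity. We study arguably the most basic MARL setting: two-player zero-sum Markov games with access only to a generative model. Model-based MARL achieves a sample complexity of $\tilde O(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value, and the $\epsilon$-NE policies with a smooth planning oracle, where $\gamma$ is the discount factor and $S,A,B$ denote the state space and the action spaces of the two agents. We further show that this sample bound is minimax-optimal (up to logarithmic factors) when the algorithm is reward-agnostic, that is, when it queries transition samples without reward knowledge, by establishing a matching lower bound. This contrasts with the usual reward-aware setting, which has a $\tilde\Omega(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2})$ lower bound. The results not only demonstrate the sample efficiency of the model-based approach in MARL but also detail the fundamental tradeoff between its power (easily handling the more challenging reward-agnostic case) and its limitation (less adaptive and suboptimal in $|A|,|B|$).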
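
To make the gap between the two bounds in the abstract concrete, the following minimal Python sketch evaluates the leading terms of the upper bound $|S||A||B|(1-\gamma)^{-3}\epsilon^{-2}$ and of the reward-aware lower bound $|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2}$. The constants and logarithmic factors hidden by $\tilde O$ and $\tilde\Omega$ are dropped, so the numbers indicate scaling only, not actual sample counts. The function names and the example problem sizes are illustrative assumptions and do not come from the paper.

```python
def _horizon_accuracy_factor(gamma, epsilon):
    """Shared factor (1 - gamma)^-3 * epsilon^-2 appearing in both bounds."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    return 1.0 / ((1.0 - gamma) ** 3 * epsilon ** 2)


def upper_bound_leading_term(num_states, num_actions_a, num_actions_b, gamma, epsilon):
    """Leading term |S||A||B|(1-gamma)^-3 eps^-2 (constants and log factors dropped)."""
    return num_states * num_actions_a * num_actions_b * _horizon_accuracy_factor(gamma, epsilon)


def reward_aware_lower_bound_leading_term(num_states, num_actions_a, num_actions_b, gamma, epsilon):
    """Leading term |S|(|A|+|B|)(1-gamma)^-3 eps^-2 (constants and log factors dropped)."""
    return num_states * (num_actions_a + num_actions_b) * _horizon_accuracy_factor(gamma, epsilon)


def action_gap_ratio(num_actions_a, num_actions_b):
    """Ratio of the two leading terms: |A||B| / (|A| + |B|).

    The state, discount, and accuracy factors cancel, so the gap depends
    only on the sizes of the two action spaces.
    """
    return (num_actions_a * num_actions_b) / (num_actions_a + num_actions_b)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    S, A, B, gamma, eps = 10, 4, 4, 0.9, 0.1
    print("upper bound leading term:", upper_bound_leading_term(S, A, B, gamma, eps))
    print("lower bound leading term:", reward_aware_lower_bound_leading_term(S, A, B, gamma, eps))
    print("gap in action dependence:", action_gap_ratio(A, B))
```

Because the factors in $|S|$, $\gamma$, and $\epsilon$ cancel, the ratio of the two leading terms is $|A||B|/(|A|+|B|)$, which grows with the action-space sizes. This is the "gap on the $|A|,|B|$ dependence" that the abstract mentions for the reward-aware setting.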

Related papers