Fugu-MT 論文翻訳(概要): When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently?

論文の概要: When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently?

arxiv url: http://arxiv.org/abs/2110.04184v1
Date: Fri, 8 Oct 2021 15:06:22 GMT
ステータス: 翻訳完了
システム内更新日: 2021-10-11 14:38:18.677733
Title: When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently?
Title（参考訳）: 多数のプレイヤーがサンプル効率のよい汎用的なマルコフゲームをいつ学べるのか?
Authors: Ziang Song, Song Mei, Yu Bai
Abstract要約: 本稿では,m$-player General-sum Markovゲームの設定において,学習目標がより優れたサンプル複雑性を許容するものについて検討する。まず,$epsilon$-Coarse Correlated Equilibrium (CCE)を$widetildemathcalO(H5Smax_ile m A_i / epsilon2)$ episodesで学習するアルゴリズムを設計する。第二に、マルコフポテンシャルゲームの重要な特別な場合を考え、$widet内で$epsilon$-approximate Nash平衡を学ぶアルゴリズムを設計する。
参考スコア（独自算出の注目度）: 10.397170312149067
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-agent reinforcement learning has made substantial empirical progresses in solving games with a large number of players. However, theoretically, the best known sample complexity for finding a Nash equilibrium in general-sum games scales exponentially in the number of players due to the size of the joint action space, and there is a matching exponential lower bound. This paper investigates what learning goals admit better sample complexities in the setting of $m$-player general-sum Markov games with $H$ steps, $S$ states, and $A_i$ actions per player. First, we design algorithms for learning an $\epsilon$-Coarse Correlated Equilibrium (CCE) in $\widetilde{\mathcal{O}}(H^5S\max_{i\le m} A_i / \epsilon^2)$ episodes, and an $\epsilon$-Correlated Equilibrium (CE) in $\widetilde{\mathcal{O}}(H^6S\max_{i\le m} A_i^2 / \epsilon^2)$ episodes. This is the first line of results for learning CCE and CE with sample complexities polynomial in $\max_{i\le m} A_i$. Our algorithm for learning CE integrates an adversarial bandit subroutine which minimizes a weighted swap regret, along with several novel designs in the outer loop. Second, we consider the important special case of Markov Potential Games, and design an algorithm that learns an $\epsilon$-approximate Nash equilibrium within $\widetilde{\mathcal{O}}(S\sum_{i\le m} A_i / \epsilon^3)$ episodes (when only highlighting the dependence on $S$, $A_i$, and $\epsilon$), which only depends linearly in $\sum_{i\le m} A_i$ and significantly improves over the best known algorithm in the $\epsilon$ dependence. Overall, our results shed light on what equilibria or structural assumptions on the game may enable sample-efficient learning with many players.
Abstract（参考訳）: マルチエージェント強化学習は,多数のプレーヤによるゲーム解決において,実証的な進歩を遂げている。しかし理論的には、一般的なサムゲームでナッシュ均衡を見つけるための最もよく知られたサンプル複雑性は、ジョイントアクション空間の大きさによってプレイヤーの数に指数関数的にスケールし、一致する指数的下界が存在する。本稿では,$m$-player general-sum Markov game with $H$ steps, $S$ states, $A_i$ action の設定において,学習目標がより複雑なサンプルを許容するかどうかを検討する。まず、$\widetilde{\mathcal{O}}(H^5S\max_{i\le m} A_i / \epsilon^2)$と$\epsilon$-Correlated Equilibrium(CE)$\widetilde{\mathcal{O}}(H^6S\max_{i\le m} A_i^2 / \epsilon^2)$を学習するためのアルゴリズムを設計する。これは CCE と CE を$\max_{i\le m} A_i$ の複素数多項式で学習する最初の行である。 CE学習アルゴリズムは, 重み付けされたスワップ後悔を最小限に抑える逆行包帯サブルーチンと, 外ループにおけるいくつかの新しいデザインを統合した。第二に、マルコフポテンシャルゲームの重要な特別な場合を検討し、$\widetilde{\mathcal{o}}(s\sum_{i\le m} a_i / \epsilon^3)$エピソード($s$, $a_i$, $\epsilon$のみに依存する場合)内で、$\epsilon$-approximate nash平衡を学習するアルゴリズムを設計する。全体として、ゲーム上の平衡や構造的仮定は、多くのプレイヤーによるサンプル効率の学習を可能にする可能性がある。

関連論文リスト

Instance-Dependent Regret Bounds for Learning Two-Player Zero-Sum Games with Bandit Feedback [60.610120215789976]
純粋な戦略ナッシュ均衡が存在するとき、$c$ は 0 となり、最適のインスタンス依存後悔境界となることを示す。また,本アルゴリズムは最終段階の収束性も享受し,ほぼ最適サンプルを用いて純粋な戦略ナッシュ均衡を同定することができる。
論文参考訳（メタデータ） (2025-02-24T20:20:06Z)
Corrupted Learning Dynamics in Games [62.73758165845971]
すべてのプレイヤーが楽観的な追従型リーダー(OFTRL)に従うと、平衡は$O(log T)$の速さで計算できる。本稿では,各プレイヤーが所定のアルゴリズムによって提案される戦略から逸脱する程度に依存する速度で,適応的に平衡を求める学習ダイナミクスを提案する。
論文参考訳（メタデータ） (2024-12-10T02:23:44Z)
$\widetilde{O}(T^{-1})$ Convergence to (Coarse) Correlated Equilibria in Full-Information General-Sum Markov Games [8.215874655947335]
楽観的フォロー・ザ・レギュラライズド・リーダー・アルゴリズムは,フル情報汎用マルコフゲームにおいて,$widetildeO(T-1)$-approximate iterationsを$T$内で見つけることができることを示す。
論文参考訳（メタデータ） (2024-02-02T20:40:27Z)
Representation Learning for General-sum Low-rank Markov Games [63.119870889883224]
非線形関数近似を用いたマルチエージェント汎用マルコフゲームについて検討する。遷移行列が未知の非線形表現の上に隠れた低ランク構造を持つ低ランクマルコフゲームに焦点を当てる。
論文参考訳（メタデータ） (2022-10-30T22:58:22Z)
Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a Generative Model [50.38446482252857]
2人プレイのゼロサムマルコフゲームは多エージェント強化学習においておそらく最も基本的な設定である。我々は,$$ widetildeObiggを用いて,$varepsilon$-approximate Markov NEポリシーを学習する学習アルゴリズムを開発した。我々は、分散型量の役割を明確にするFTRLに対する洗練された後悔境界を導出する。
論文参考訳（メタデータ） (2022-08-22T17:24:55Z)
Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games [18.832436856339587]
本稿では,オフラインデータから2プレイヤーゼロサムマルコフゲームにおけるナッシュ均衡の学習に向けて前進する。ベルンシュタイン型低信頼境界を持つ悲観的モデルベースアルゴリズム(VI-LCB-Game)を提案する。
論文参考訳（メタデータ） (2022-06-08T17:58:06Z)
Sample-Efficient Learning of Correlated Equilibria in Extensive-Form Games [22.94768429216664]
Imperfect-Information Extensive-Form Games (IIEFG) は、不完全情報とシーケンシャルプレイを含む現実世界のゲームにおいて一般的なモデルである。 Extensive-Form Correlated Equilibrium (EFCE) はマルチプレイヤー汎用IIEFGの自然解の概念として提案されている。本稿では,バンドフィードバックからEFCEを学習するための最初のサンプル効率アルゴリズムを提案する。
論文参考訳（メタデータ） (2022-05-15T08:55:28Z)
Near-Optimal Learning of Extensive-Form Games with Imperfect Information [54.55092907312749]
本稿では,2プレイヤーゼロサムゲームにおいて,$widetildemathcalO((XA+YB)/varepsilon2)$プレイのエピソードのみを必要とするアルゴリズムの最初の行を,$varepsilon$-approximate Nash平衡を求める。これにより$widetildemathcalO((X2A+Y2B)/varepsilon2)$が$widetildemathcalO(maxX,
論文参考訳（メタデータ） (2022-02-03T18:18:28Z)
Almost Optimal Algorithms for Two-player Markov Games with Linear Function Approximation [92.99933928528797]
同時動作による2プレイヤーゼロサムマルコフゲームの強化学習について検討した。我々は,「不確かさの最適性」に基づくアルゴリズムナッシュ-UCRL-VTRを提案する。我々は、Nash-UCRL-VTR が $tildeO(dHsqrtT)$ regret を確実に達成できることを示し、$d$ は線型関数次元である。
論文参考訳（メタデータ） (2021-02-15T09:09:16Z)
Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
我々は,直接の監督なしに自己対決で最適な政策を学習するセルフプレイアルゴリズムに焦点をあてる。本稿では,サンプル複雑性を$tildemathcalO(SAB)$,サンプル複雑性を$tildemathcalO(S(A+B)$とする新しいemphNash Vラーニングアルゴリズムを提案する。
論文参考訳（メタデータ） (2020-06-22T05:00:13Z)
Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity [59.34067736545355]
S$状態、$A$アクション、割引係数$gamma in (0,1)$、近似しきい値$epsilon > 0$の MDP が与えられた場合、$epsilon$-Optimal Policy を学ぶためのモデルなしアルゴリズムを提供する。十分小さな$epsilon$の場合、サンプルの複雑さで改良されたアルゴリズムを示す。
論文参考訳（メタデータ） (2020-06-06T13:34:41Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。