Fugu-MT 論文翻訳(概要): Model-Free Online Learning in Unknown Sequential Decision Making Problems and Games

論文の概要: Model-Free Online Learning in Unknown Sequential Decision Making Problems and Games

arxiv url: http://arxiv.org/abs/2103.04539v1
Date: Mon, 8 Mar 2021 04:03:24 GMT
ステータス: 翻訳完了
システム内更新日: 2021-03-10 21:51:27.583739
Title: Model-Free Online Learning in Unknown Sequential Decision Making Problems and Games
Title（参考訳）: 未知の順序決定問題とゲームにおけるモデルフリーオンライン学習
Authors: Gabriele Farina, Tuomas Sandholm
Abstract要約: 大規模な2人プレイのゼロサム情報ゲームでは、反事実後悔最小化(cfr)の現代的な拡張がnash均衡を計算するための実用的な技術である。私たちは、戦略空間がエージェントに知られていないオンライン学習設定を形式化します。エージェントが逆の環境に直面しても、その設定に高い確率で$O(T3/4)$後悔を達成する効率的なアルゴリズムを提供します。
参考スコア（独自算出の注目度）: 114.90723492840499
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Regret minimization has proved to be a versatile tool for tree-form sequential decision making and extensive-form games. In large two-player zero-sum imperfect-information games, modern extensions of counterfactual regret minimization (CFR) are currently the practical state of the art for computing a Nash equilibrium. Most regret-minimization algorithms for tree-form sequential decision making, including CFR, require (i) an exact model of the player's decision nodes, observation nodes, and how they are linked, and (ii) full knowledge, at all times t, about the payoffs -- even in parts of the decision space that are not encountered at time t. Recently, there has been growing interest towards relaxing some of those restrictions and making regret minimization applicable to settings for which reinforcement learning methods have traditionally been used -- for example, those in which only black-box access to the environment is available. We give the first, to our knowledge, regret-minimization algorithm that guarantees sublinear regret with high probability even when requirement (i) -- and thus also (ii) -- is dropped. We formalize an online learning setting in which the strategy space is not known to the agent and gets revealed incrementally whenever the agent encounters new decision points. We give an efficient algorithm that achieves $O(T^{3/4})$ regret with high probability for that setting, even when the agent faces an adversarial environment. Our experiments show it significantly outperforms the prior algorithms for the problem, which do not have such guarantees. It can be used in any application for which regret minimization is useful: approximating Nash equilibrium or quantal response equilibrium, approximating coarse correlated equilibrium in multi-player games, learning a best response, learning safe opponent exploitation, and online play against an unknown opponent/environment.
Abstract（参考訳）: 後悔の最小化は、ツリー形式のシーケンシャルな意思決定と広範囲なフォームゲームのための多用途なツールであることが証明されている。大規模な2人プレイのゼロサム情報ゲームでは、反事実後悔最小化(cfr)の現代的な拡張がnash均衡を計算するための実用的な技術である。 Most regret-minimization algorithms for tree-form sequential decision making, including CFR, require (i) an exact model of the player's decision nodes, observation nodes, and how they are linked, and (ii) full knowledge, at all times t, about the payoffs -- even in parts of the decision space that are not encountered at time t. Recently, there has been growing interest towards relaxing some of those restrictions and making regret minimization applicable to settings for which reinforcement learning methods have traditionally been used -- for example, those in which only black-box access to the environment is available. 私たちはまず,要件(i) -- そして(ii) -- が削除された場合でも,サブリニアな後悔を高い確率で保証する,後悔最小化アルゴリズムを与えます。我々は,戦略空間がエージェントに知られず,エージェントが新たな決定点に遭遇するたびに徐々に明らかにされるオンライン学習設定を定式化する。我々は,エージェントが敵の環境に直面する場合であっても,その設定に高い確率で,$O(T^{3/4})の後悔を達成できる効率的なアルゴリズムを与える。私たちの実験では、そのような保証がない問題の以前のアルゴリズムを大幅に上回ることが示されています。後悔の最小化が有用であるアプリケーションには、ナッシュ平衡または量子応答平衡の近似、マルチプレイヤーゲームにおける粗い相関平衡の近似、最良の反応の学習、安全な相手の搾取の学習、未知の相手/環境に対するオンラインプレイなどがある。

関連論文リスト

Meta-Learning in Self-Play Regret Minimization [10.843705580746397]
両プレイヤーゼロサムゲームにおけるナッシュ均衡を近似する多くのアルゴリズムにおいて,オンライン最適化に対する一般的なアプローチを提案する。これに基づいてフレームワークを、最先端の平衡近似アルゴリズムの基盤である、より困難なセルフプレイ設定に拡張する。私たちのメタ学習アルゴリズムは、他の最先端の後悔の最小化アルゴリズムよりもかなり優れています。
論文参考訳（メタデータ） (2025-04-26T13:27:24Z)
Barriers to Welfare Maximization with No-Regret Learning [68.66209476382213]
我々は、ほぼ最適の$T$-sparse CCEの計算限界を低く証明する。特に,最大傾斜角の不適応性は,時間内に非自明な間隔を達成できないことを示す。
論文参考訳（メタデータ） (2024-11-04T00:34:56Z)
Hardness of Independent Learning and Sparse Equilibrium Computation in Markov Games [70.19141208203227]
マルコフゲームにおける分散型マルチエージェント強化学習の問題点を考察する。我々は,全てのプレイヤーが独立に実行すると,一般のサムゲームにおいて,アルゴリズムが到達しないことを示す。我々は,全てのエージェントが集中型アルゴリズムによって制御されるような,一見簡単な設定であっても,下位境界が保持されていることを示す。
論文参考訳（メタデータ） (2023-03-22T03:28:12Z)
Oracle-Efficient Smoothed Online Learning for Piecewise Continuous Decision Making [73.48977854003697]
この研究は、複雑性という新しい概念、一般化ブラケット数を導入し、空間の大きさに対する敵の制約を結婚させる。次に、オンライン予測や断片的連続関数の計画など、関心のあるいくつかの問題で境界をインスタンス化する。
論文参考訳（メタデータ） (2023-02-10T18:45:52Z)
Decentralized model-free reinforcement learning in stochastic games with average-reward objective [1.9852463786440127]
本アルゴリズムは,次数$T3/4$のサブ線形高確率後悔と次数$T2/3$のサブ線形高確率後悔を実現する。本アルゴリズムは,従来の手法に比べて計算量が少なく,メモリスペースも少ない。
論文参考訳（メタデータ） (2023-01-13T15:59:53Z)
Regret Minimization and Convergence to Equilibria in General-sum Markov Games [57.568118148036376]
汎用マルコフゲームにおいて,全てのエージェントが実行した場合のサブ線形後悔保証を提供する学習アルゴリズムを初めて提示する。我々のアルゴリズムは分散化され、計算効率が良く、エージェント間の通信は不要である。
論文参考訳（メタデータ） (2022-07-28T16:27:59Z)
ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret [97.73233271730616]
超大型ゲームにおけるナッシュ均衡の近似手法 : ニューラルネットワークを用いて近似最適ポリシー(戦略)を学習する DREAMは,モンテカルロCFR(MCCFR)から受け継がれた重要なサンプリング項により,極めて高いばらつきを有すると推定された後悔のターゲット上で,ニューラルネットワークを訓練する。 ESCHERの深層学習バージョンは、DREAMとニューラル・フィクション・セルフプレイ(NFSP)の先行状態よりも優れており、ゲームサイズが大きくなるにつれて、その違いは劇的になる。
論文参考訳（メタデータ） (2022-06-08T18:43:45Z)
Online Multiobjective Minimax Optimization and Applications [14.699969822572308]
本稿では,適応的な対戦相手が新しいゲームを導入する,シンプルだが汎用的なオンライン学習フレームワークを提案する。学習者のゴールは、累積ベクトル値損失の最大座標を最小化することである。対戦相手がまず行動を発表しなければならない設定と競合する簡単なアルゴリズムを提供する。最適なアルゴリズムと境界を回復して、外部の後悔、内部の後悔、適応的な後悔、多集団の後悔、その後の後悔、睡眠専門家の設定における後悔の概念を最小化できます。
論文参考訳（メタデータ） (2021-08-09T06:52:08Z)
Hindsight and Sequential Rationality of Correlated Play [18.176128899338433]
私たちは、修正された振る舞いで達成できたことに対して、強いパフォーマンスを後見で保証するアルゴリズムを検討します。我々は,学習の隠れた枠組みを,逐次的な意思決定の場で開発し,提唱する。本稿では,それぞれの平衡の強さと弱さを文献に示す例を示す。
論文参考訳（メタデータ） (2020-12-10T18:30:21Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。