Fugu-MT 論文翻訳(概要): A Bayesian Learning Algorithm for Unknown Zero-sum Stochastic Games with an Arbitrary Opponent

論文の概要: A Bayesian Learning Algorithm for Unknown Zero-sum Stochastic Games with an Arbitrary Opponent

arxiv url: http://arxiv.org/abs/2109.03396v3
Date: Mon, 11 Mar 2024 10:06:17 GMT
ステータス: 翻訳完了
システム内更新日: 2024-03-13 18:16:08.522721
Title: A Bayesian Learning Algorithm for Unknown Zero-sum Stochastic Games with an Arbitrary Opponent
Title（参考訳）: 任意の対戦相手を持つ未知ゼロサム確率ゲームに対するベイズ学習アルゴリズム
Authors: Mehdi Jafarnia-Jahromi, Rahul Jain, Ashutosh Nayyar
Abstract要約: ゼロサムゲームのための後サンプリング強化学習(PSRL-ZSG) ゼロサムゲームのための後サンプリング強化学習(PSRL-ZSG)を提案する。
参考スコア（独自算出の注目度）: 9.094186120476174
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG), the first online learning algorithm that achieves Bayesian regret bound of $O(HS\sqrt{AT})$ in the infinite-horizon zero-sum stochastic games with average-reward criterion. Here $H$ is an upper bound on the span of the bias function, $S$ is the number of states, $A$ is the number of joint actions and $T$ is the horizon. We consider the online setting where the opponent can not be controlled and can take any arbitrary time-adaptive history-dependent strategy. Our regret bound improves on the best existing regret bound of $O(\sqrt[3]{DS^2AT^2})$ by Wei et al. (2017) under the same assumption and matches the theoretical lower bound in $T$.
Abstract（参考訳）: 本稿では,ゼロサム確率ゲームのための後方サンプリング強化学習(PSRL-ZSG)を提案する。これは,平均逆基準付き無限水平ゼロサム確率ゲームにおいて,ベイズ的残差を$O(HS\sqrt{AT})$とする最初のオンライン学習アルゴリズムである。ここで、$H$はバイアス関数の幅の上限、$S$は状態の数、$A$は共同アクションの数、$T$は地平線である。我々は、対戦相手を制御できず、任意の時間順応的履歴依存戦略を採れるオンライン環境を考える。我々の後悔境界は、同じ仮定の下でWei et al. (2017) による$O(\sqrt[3]{DS^2AT^2}) の最良の後悔境界を改善し、$T$ の理論的下界と一致する。

関連論文リスト

Instance-Dependent Regret Bounds for Learning Two-Player Zero-Sum Games with Bandit Feedback [60.610120215789976]
純粋な戦略ナッシュ均衡が存在するとき、$c$ は 0 となり、最適のインスタンス依存後悔境界となることを示す。また,本アルゴリズムは最終段階の収束性も享受し,ほぼ最適サンプルを用いて純粋な戦略ナッシュ均衡を同定することができる。
論文参考訳（メタデータ） (2025-02-24T20:20:06Z)
Best of Both Worlds: Regret Minimization versus Minimax Play [57.68976579579758]
この結果から,悪用可能な相手からOmega(T)$を得ることができながら,少なくともO(1)$損失のリスクを保証できることが分かる。
論文参考訳（メタデータ） (2025-02-17T11:04:01Z)
Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback [49.84060509296641]
オンライン有限水平マルコフ決定過程を逆向きに変化した損失と総括的帯域幅フィードバック(フルバンド幅)を用いて研究する。この種のフィードバックの下では、エージェントは、軌跡内の各中間段階における個々の損失よりも、軌跡全体に生じる総損失のみを観察する。この設定のための最初のポリシー最適化アルゴリズムを紹介します。
論文参考訳（メタデータ） (2025-02-06T12:03:24Z)
Model Selection for Average Reward RL with Application to Utility Maximization in Repeated Games [1.9161424760743742]
平均報酬RL設定のためのオンラインモデル選択アルゴリズムを提案する。モデル選択の追加コストは、モデルクラスの数である$M$でのみ線形にスケールすることを示す。また,学習者の後悔度を低く抑えることで,$m*$への指数的依存が避けられないことを示す。
論文参考訳（メタデータ） (2024-11-09T05:03:10Z)
Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
我々のアルゴリズムが $tildeObig((d+log (|mathcalS|2 |mathcalA|))sqrtKbig)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping is linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, $|mathcalS|$ and $|mathcalA|$ is the standardities of the state and action space。
論文参考訳（メタデータ） (2023-05-15T05:37:32Z)
Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
本研究は,リスク許容度が$tau$のCVaR(Conditional Value at Risk)の目的に着目し,リスクに敏感な強化学習(RL)について検討する。ミニマックスCVaRの後悔率は$Omega(sqrttau-1AK)$で、$A$はアクションの数、$K$はエピソード数である。我々は,このアルゴリズムが連続性仮定の下で$widetilde O(tau-1sqrtSAK)$の最適後悔を達成し,一般に近似することを示す。
論文参考訳（メタデータ） (2023-02-07T02:22:31Z)
Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a Generative Model [50.38446482252857]
2人プレイのゼロサムマルコフゲームは多エージェント強化学習においておそらく最も基本的な設定である。我々は,$$ widetildeObiggを用いて,$varepsilon$-approximate Markov NEポリシーを学習する学習アルゴリズムを開発した。我々は、分散型量の役割を明確にするFTRLに対する洗練された後悔境界を導出する。
論文参考訳（メタデータ） (2022-08-22T17:24:55Z)
On Submodular Contextual Bandits [92.45432756301231]
作用が基底集合の部分集合であり、平均報酬が未知の単調部分モジュラ函数によってモデル化されるような文脈的包帯の問題を考える。 Inverse Gap Weighting 戦略により,提案アルゴリズムは推定関数の局所的最適度を効率よくランダム化することを示す。
論文参考訳（メタデータ） (2021-12-03T21:42:33Z)
Fast Rates for Nonparametric Online Learning: From Realizability to Learning in Games [36.969021834291745]
本稿では,仮説クラスの逐次的脂肪散乱次元の観点から,ほぼ最適誤差を導出する固有学習アルゴリズムを提案する。この結果は、適切な学習者が準最適誤り境界を達成できるかどうかという疑問に答える。実数値(回帰)設定では、最適誤り境界は不適切な学習者にさえ知られていなかった。
論文参考訳（メタデータ） (2021-11-17T05:24:21Z)
Minimax Regret for Stochastic Shortest Path [63.45407095296692]
我々は、エージェントが最小の総予想コストで目標状態に達する必要がある最短パス(SSP)問題を研究します。この設定に対するminimaxの後悔は、$widetilde O(B_star sqrt|S| |A|K)$であり、$B_star$は任意の状態から最適なポリシーの予想コストに拘束されることを示しています。本アルゴリズムは, 有限水平MDPにおける強化学習の新たな削減を基礎として, エピソードごとのインタイム動作を行う。
論文参考訳（メタデータ） (2021-03-24T10:11:49Z)
Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
我々は,直接の監督なしに自己対決で最適な政策を学習するセルフプレイアルゴリズムに焦点をあてる。本稿では,サンプル複雑性を$tildemathcalO(SAB)$,サンプル複雑性を$tildemathcalO(S(A+B)$とする新しいemphNash Vラーニングアルゴリズムを提案する。
論文参考訳（メタデータ） (2020-06-22T05:00:13Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。