Fugu-MT 論文翻訳(概要): Logarithmic Regret for Matrix Games against an Adversary with Noisy Bandit Feedback

論文の概要: Logarithmic Regret for Matrix Games against an Adversary with Noisy Bandit Feedback

arxiv url: http://arxiv.org/abs/2306.13233v2
Date: Tue, 24 Oct 2023 22:51:29 GMT
ステータス: 翻訳完了
システム内更新日: 2023-10-26 20:23:02.119411
Title: Logarithmic Regret for Matrix Games against an Adversary with Noisy Bandit Feedback
Title（参考訳）: 雑音帯域フィードバックを持つ逆数に対する行列ゲームに対する対数レグレット
Authors: Arnab Maiti, Kevin Jamieson, Lillian J. Ratliff
Abstract要約: 本稿では,行プレーヤが行$i$を選択し,列プレーヤが列$j$を選択し,行プレーヤが平均$A_i,j$でノイズの多い報酬を受け取るゼロサム行列ゲームの変形について考察する。行プレーヤの目的は、敵列プレーヤに対してもできるだけ多くの報酬を蓄積することである。もし行プレーヤが、任意の報酬列に対して$sqrtT$ regretを得るアルゴリズムであるEXP3戦略を使用するなら、行プレーヤも$sqrtTを達成できるのは即時である。
参考スコア（独自算出の注目度）: 23.97611639885236
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper considers a variant of zero-sum matrix games where at each timestep the row player chooses row $i$, the column player chooses column $j$, and the row player receives a noisy reward with mean $A_{i,j}$. The objective of the row player is to accumulate as much reward as possible, even against an adversarial column player. If the row player uses the EXP3 strategy, an algorithm known for obtaining $\sqrt{T}$ regret against an arbitrary sequence of rewards, it is immediate that the row player also achieves $\sqrt{T}$ regret relative to the Nash equilibrium in this game setting. However, partly motivated by the fact that the EXP3 strategy is myopic to the structure of the game, O'Donoghue et al. (2021) proposed a UCB-style algorithm that leverages the game structure and demonstrated that this algorithm greatly outperforms EXP3 empirically. While they showed that this UCB-style algorithm achieved $\sqrt{T}$ regret, in this paper we ask if there exists an algorithm that provably achieves $\text{polylog}(T)$ regret against any adversary, analogous to results from stochastic bandits. We propose a novel algorithm that answers this question in the affirmative for the simple $2 \times 2$ setting, providing the first instance-dependent guarantees for games in the regret setting. Our algorithm overcomes two major hurdles: 1) obtaining logarithmic regret even though the Nash equilibrium is estimable only at a $1/\sqrt{T}$ rate, and 2) designing row-player strategies that guarantee that either the adversary provides information about the Nash equilibrium, or the row player incurs negative regret. Moreover, in the full information case we address the general $n \times m$ case where the first hurdle is still relevant. Finally, we show that EXP3 and the UCB-based algorithm necessarily cannot perform better than $\sqrt{T}$.
Abstract（参考訳）: 本稿では,列プレイヤーが行$i$を選択し,列プレイヤーが列$j$を選択し,列プレイヤーが平均$a_{i,j}$で騒がしい報酬を受け取る,ゼロサムマトリクスゲームの一変型について考察する。行プレイヤーの目的は、敵列プレイヤーに対してさえ、できるだけ多くの報酬を蓄積することである。もし行プレーヤが任意の報酬列に対して$\sqrt{T}$後悔を得るアルゴリズムであるEXP3戦略を使用すると、このゲーム設定におけるナッシュ平衡に対して$\sqrt{T}$後悔も達成される。しかしながら、EXP3戦略がゲームの構造のミオピックであるという事実から、O'Donoghue et al. (2021) はゲーム構造を活用する UCB スタイルのアルゴリズムを提案し、このアルゴリズムがEXP3を経験的に大きく上回ることを示した。彼らは、このucbスタイルのアルゴリズムが$\sqrt{t}$ regretを達成したことを示したが、本論文では、任意の敵に対して$\text{polylog}(t)$ regretを確実に達成するアルゴリズムが存在するかどうかを問う。単純な2 \times 2$設定を肯定する形で、この質問に答える新しいアルゴリズムを提案し、後悔の設定におけるゲームに対する最初のインスタンス依存保証を提供する。我々のアルゴリズムは2つの大きなハードルを克服します 1)nash平衡は1/\sqrt{t}$レートでしか推定できないが、対数的後悔を得る。 2) 敵がナッシュ均衡に関する情報を提供するか、または行プレイヤーが負の後悔をもたらすかを保証する行プレイヤー戦略を設計する。さらに、全情報の場合、最初のハードルがまだ関係している一般的な$n \times m$ケースに対処する。最後に、EXP3 と UCB ベースのアルゴリズムは、必ずしも $\sqrt{T}$ 以上の性能を発揮できないことを示す。

関連論文リスト

Instance-Dependent Regret Bounds for Learning Two-Player Zero-Sum Games with Bandit Feedback [60.610120215789976]
純粋な戦略ナッシュ均衡が存在するとき、$c$ は 0 となり、最適のインスタンス依存後悔境界となることを示す。また,本アルゴリズムは最終段階の収束性も享受し,ほぼ最適サンプルを用いて純粋な戦略ナッシュ均衡を同定することができる。
論文参考訳（メタデータ） (2025-02-24T20:20:06Z)
Best of Both Worlds: Regret Minimization versus Minimax Play [57.68976579579758]
この結果から,悪用可能な相手からOmega(T)$を得ることができながら,少なくともO(1)$損失のリスクを保証できることが分かる。
論文参考訳（メタデータ） (2025-02-17T11:04:01Z)
Blackwell's Approachability with Approximation Algorithms [15.266876140036052]
プレイヤーと相手のベクター値の繰り返しゲームを考える。プレイヤーのセットのみに近似アルゴリズムが組み込まれている場合、よりシンプルで効率的なアルゴリズムが提供される。
論文参考訳（メタデータ） (2025-02-06T09:54:00Z)
Context-lumpable stochastic bandits [49.024050919419366]
我々は、$S$コンテキストと$K$アクションによる文脈的盗賊問題を考える。我々は,最大$widetilde O(r (S +K )/epsilon2)$サンプルを用いて,$epsilon$-optimal Policyを出力するアルゴリズムを提案する。後悔の設定では、T$までの累積後悔を$widetilde O(sqrtr3(S+K)T)$で束縛するアルゴリズムを与える。
論文参考訳（メタデータ） (2023-06-22T17:20:30Z)
Near-Optimal Learning of Extensive-Form Games with Imperfect Information [54.55092907312749]
本稿では,2プレイヤーゼロサムゲームにおいて,$widetildemathcalO((XA+YB)/varepsilon2)$プレイのエピソードのみを必要とするアルゴリズムの最初の行を,$varepsilon$-approximate Nash平衡を求める。これにより$widetildemathcalO((X2A+Y2B)/varepsilon2)$が$widetildemathcalO(maxX,
論文参考訳（メタデータ） (2022-02-03T18:18:28Z)
Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback [29.553652241608997]
本研究では,テキストモオと強いモノトーンゲームの研究を行い,その学習方法について検討した。我々はまず,新しい帯域学習アルゴリズムを構築し,$tildeTheta(nsqrtT)$の単一エージェント最適後悔を実現することを示す。そこで我々は,このオープンな問題を解決し,広範にわたるバンディットゲーム理論学習に寄与した。
論文参考訳（メタデータ） (2021-12-06T08:27:54Z)
Almost Optimal Algorithms for Two-player Markov Games with Linear Function Approximation [92.99933928528797]
同時動作による2プレイヤーゼロサムマルコフゲームの強化学習について検討した。我々は,「不確かさの最適性」に基づくアルゴリズムナッシュ-UCRL-VTRを提案する。我々は、Nash-UCRL-VTR が $tildeO(dHsqrtT)$ regret を確実に達成できることを示し、$d$ は線型関数次元である。
論文参考訳（メタデータ） (2021-02-15T09:09:16Z)
Impact of Representation Learning in Linear Bandits [83.17684841392754]
本研究では,表現学習が帯域幅問題の効率性を向上させる方法について検討する。我々は,$widetildeO(TsqrtkN + sqrtdkNT)$ regretを達成する新しいアルゴリズムを提案する。
論文参考訳（メタデータ） (2020-10-13T16:35:30Z)
Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
我々は,直接の監督なしに自己対決で最適な政策を学習するセルフプレイアルゴリズムに焦点をあてる。本稿では,サンプル複雑性を$tildemathcalO(SAB)$,サンプル複雑性を$tildemathcalO(S(A+B)$とする新しいemphNash Vラーニングアルゴリズムを提案する。
論文参考訳（メタデータ） (2020-06-22T05:00:13Z)
Improved Sleeping Bandits with Stochastic Actions Sets and Adversarial Rewards [59.559028338399855]
我々は,行動セットと敵意の報酬を伴って睡眠中の盗賊の問題を考察する。本稿では,EXP3にインスパイアされた新しい計算効率のよいアルゴリズムを提案する。
論文参考訳（メタデータ） (2020-04-14T00:41:26Z)
Regret Minimization in Stochastic Contextual Dueling Bandits [40.17224226373741]
我々は、コンテキスト設定において、$K$武装デュエルバンディットの問題を考察する。提案手法は, それぞれ, 後悔の保証を施した2つのアルゴリズムを提案する。
論文参考訳（メタデータ） (2020-02-20T06:36:19Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。