Fugu-MT 論文翻訳(概要): Multiplayer Bandit Learning, from Competition to Cooperation

論文の概要: Multiplayer Bandit Learning, from Competition to Cooperation

arxiv url: http://arxiv.org/abs/1908.01135v4
Date: Fri, 12 Jan 2024 14:32:25 GMT
ステータス: 翻訳完了
システム内更新日: 2024-01-16 00:36:22.924649
Title: Multiplayer Bandit Learning, from Competition to Cooperation
Title（参考訳）: マルチプレイヤーバンド学習 : 競争から協力へ
Authors: Simina Br\^anzei and Yuval Peres
Abstract要約: コンペティションと協力が探究と搾取のトレードオフに与える影響について検討する。このモデルは、通常プレイヤーが互いの報酬を観察する戦略実験に関する経済学文献と関連している。
参考スコア（独自算出の注目度）: 3.7801191959442053
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The stochastic multi-armed bandit model captures the tradeoff between exploration and exploitation. We study the effects of competition and cooperation on this tradeoff. Suppose there are $k$ arms and two players, Alice and Bob. In every round, each player pulls an arm, receives the resulting reward, and observes the choice of the other player but not their reward. Alice's utility is $\Gamma_A + \lambda \Gamma_B$ (and similarly for Bob), where $\Gamma_A$ is Alice's total reward and $\lambda \in [-1, 1]$ is a cooperation parameter. At $\lambda = -1$ the players are competing in a zero-sum game, at $\lambda = 1$, they are fully cooperating, and at $\lambda = 0$, they are neutral: each player's utility is their own reward. The model is related to the economics literature on strategic experimentation, where usually players observe each other's rewards. With discount factor $\beta$, the Gittins index reduces the one-player problem to the comparison between a risky arm, with a prior $\mu$, and a predictable arm, with success probability $p$. The value of $p$ where the player is indifferent between the arms is the Gittins index $g = g(\mu,\beta) > m$, where $m$ is the mean of the risky arm. We show that competing players explore less than a single player: there is $p^* \in (m, g)$ so that for all $p > p^*$, the players stay at the predictable arm. However, the players are not myopic: they still explore for some $p > m$. On the other hand, cooperating players explore more than a single player. We also show that neutral players learn from each other, receiving strictly higher total rewards than they would playing alone, for all $ p\in (p^*, g)$, where $p^*$ is the threshold from the competing case. Finally, we show that competing and neutral players eventually settle on the same arm in every Nash equilibrium, while this can fail for cooperating players.
Abstract（参考訳）: 確率的多腕バンディットモデルは探索と搾取の間のトレードオフを捉えている。このトレードオフに対する競争と協力の効果について検討する。 k$の腕とアリスとボブの2人のプレーヤーがいるとしよう。各ラウンドにおいて、各プレイヤーは腕を引っ張り、その結果得られる報酬を受け取り、他のプレイヤーの選択を観察するが、報酬は与えない。 Aliceのユーティリティは$\Gamma_A + \lambda \Gamma_B$(Bobも同様)であり、$\Gamma_A$はAliceの総報酬であり、$\lambda \in [-1, 1]$は協力パラメータである。プレイヤーは$\lambda = -1$でゼロサムゲームに出場し、$\lambda = 1$で完全に協力し、$\lambda = 0$では中立である。このモデルは、通常プレイヤーが互いの報酬を観察する戦略実験に関する経済学文献と関連している。割引係数 $\beta$ で、Gittins インデックスはリスクのあるアームと予測可能なアーム、成功確率 $p$ の比較に1人のプレイヤー問題を還元する。プレイヤーが腕の間に無関心な$p$の値は、Gittins index $g = g(\mu,\beta) > m$である。競技者が単一のプレイヤーより少ない探索を行うことを示す:$p^* \in (m, g)$なので、すべての$p > p^*$に対して、プレイヤーは予測可能なアームに留まる。しかし、プレイヤーは目立たない:彼らはまだ約$p > m$を求めて探索している。一方、協力的なプレイヤーは1人以上のプレイヤーを探索する。また、中立プレイヤーは互いに学習し、単独でプレイするよりも厳密に高い報酬を受け取り、全ての$p\in (p^*, g)$に対して、$p^*$が競合するケースのしきい値であることを示す。最後に、競争相手と中立相手のプレイヤーは、nash平衡ごとに同じ腕に落ち着くが、これは協力するプレイヤーには失敗する可能性がある。

関連論文リスト

Instance-Dependent Regret Bounds for Learning Two-Player Zero-Sum Games with Bandit Feedback [60.610120215789976]
純粋な戦略ナッシュ均衡が存在するとき、$c$ は 0 となり、最適のインスタンス依存後悔境界となることを示す。また,本アルゴリズムは最終段階の収束性も享受し,ほぼ最適サンプルを用いて純粋な戦略ナッシュ均衡を同定することができる。
論文参考訳（メタデータ） (2025-02-24T20:20:06Z)
Best of Both Worlds: Regret Minimization versus Minimax Play [57.68976579579758]
この結果から,悪用可能な相手からOmega(T)$を得ることができながら,少なくともO(1)$損失のリスクを保証できることが分かる。
論文参考訳（メタデータ） (2025-02-17T11:04:01Z)
Heterogeneous Multi-Player Multi-Armed Bandits Robust To Adversarial Attacks [19.184883255588126]
システム内のプレイヤーが受ける報酬に負の影響を及ぼそうとする敵の存在下で、マルチプレイヤーバンディットの設定を考える。衝突(複数のプレイヤーが同じ腕を選択している)が発生した場合、すべての衝突したユーザーは報酬を受け取らない。敵は、プレイヤーが受ける報酬、すなわち敵が腕を攻撃した場合、その腕を選択するプレイヤーはゼロの報酬を受ける。
論文参考訳（メタデータ） (2025-01-21T08:51:23Z)
Corrupted Learning Dynamics in Games [62.73758165845971]
すべてのプレイヤーが楽観的な追従型リーダー(OFTRL)に従うと、平衡は$O(log T)$の速さで計算できる。本稿では,各プレイヤーが所定のアルゴリズムによって提案される戦略から逸脱する程度に依存する速度で,適応的に平衡を求める学習ダイナミクスを提案する。
論文参考訳（メタデータ） (2024-12-10T02:23:44Z)
Fast Last-Iterate Convergence of Learning in Games Requires Forgetful Algorithms [71.73971094342349]
オンライン学習によるセルフプレイは、大規模な2人プレイのゼロサムゲームを解くための重要な方法の1つだ。我々は,OMWUが支払行列のサイズに対数依存するなど,いくつかの利点があることを示した。我々は、過去のことをすぐに忘れない幅広い種類のアルゴリズムが、すべて同じ問題に悩まされていることを証明している。
論文参考訳（メタデータ） (2024-06-15T13:26:17Z)
Adversarial Combinatorial Bandits with Switching Costs [55.2480439325792]
本稿では,各ラウンドで選択したアームのスイッチに対して,切替コストが$lambda$の逆帯域問題について検討する。ミニマックスの後悔とそれにアプローチするための設計アルゴリズムの限界を低くする。
論文参考訳（メタデータ） (2024-04-02T12:15:37Z)
Optimal Cooperative Multiplayer Learning Bandits with Noisy Rewards and No Communication [0.0]
我々は,プレイヤーが事前に戦略に合意することのみを許される,協調的なマルチプレイヤーバンディット学習問題を考える。この問題では、各プレイヤーが同時にアクションを選択する。我々は,このアルゴリズムが対数的$O(fraclog TDelta_bma)$(gap依存)後悔および$O(sqrtTlog T)$(gap非依存)後悔を達成することを示す。
論文参考訳（メタデータ） (2023-11-10T17:55:44Z)
Competing for Shareable Arms in Multi-Player Multi-Armed Bandits [29.08799537067425]
本稿では,プレイヤーが自尊心を持ち,自己報酬を最大化することを目的とした,新しいマルチプレイヤーマルチアームバンディット(MPMAB)について検討する。本稿では, 平均アロケーション (SMAA) を用いた新たな自己中心型MPMABを提案する。我々は,一人の利己的なプレイヤーが,逸脱によって報酬を著しく増加させることはできず,また,他のプレイヤーの報酬に有害な影響も与えないことを確認した。
論文参考訳（メタデータ） (2023-05-30T15:59:56Z)
The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication [10.446001329147112]
マルチプレイヤーのマルチアームバンディット問題について検討する。この問題では、$m$プレーヤーは、合計報酬を$K > m$アームから最大化するために協力する。ここで$Delta$は$m$-thと$m+1$-stのベストアームのギャップである。
論文参考訳（メタデータ） (2022-02-19T18:19:36Z)
Near-Optimal Learning of Extensive-Form Games with Imperfect Information [54.55092907312749]
本稿では,2プレイヤーゼロサムゲームにおいて,$widetildemathcalO((XA+YB)/varepsilon2)$プレイのエピソードのみを必要とするアルゴリズムの最初の行を,$varepsilon$-approximate Nash平衡を求める。これにより$widetildemathcalO((X2A+Y2B)/varepsilon2)$が$widetildemathcalO(maxX,
論文参考訳（メタデータ） (2022-02-03T18:18:28Z)
Experimental test of Tsirelson's bound with a single photonic qubit [8.8709589922781]
Clauser-Horne-Shimony-Holt ゲームでは、Alice と Bob はそれぞれ古典的なビット $a$ と $b$ を割り当てられる。ゲームでは、プレイヤーが古典的な戦略を使用する場合、最適な成功確率は$w(textCHSH)=0.75$である。ポープスクとローリッヒは、完全成功確率1ドルは、符号なしの仮定に違反することなくより一般的な理論でも達成できると述べた。
論文参考訳（メタデータ） (2022-01-25T09:06:53Z)
When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently? [10.397170312149067]
本稿では,m$-player General-sum Markovゲームの設定において,学習目標がより優れたサンプル複雑性を許容するものについて検討する。まず,$epsilon$-Coarse Correlated Equilibrium (CCE)を$widetildemathcalO(H5Smax_ile m A_i / epsilon2)$ episodesで学習するアルゴリズムを設計する。第二に、マルコフポテンシャルゲームの重要な特別な場合を考え、$widet内で$epsilon$-approximate Nash平衡を学ぶアルゴリズムを設計する。
論文参考訳（メタデータ） (2021-10-08T15:06:22Z)
Combinatorial Bandits without Total Order for Arms [52.93972547896022]
セット依存報酬分布を捕捉し、武器の合計順序を仮定しない報酬モデルを提案する。我々は、新しい後悔分析を開発し、$Oleft(frack2 n log Tepsilonright)$ gap-dependent regret boundと$Oleft(k2sqrtn T log Tright)$ gap-dependent regret boundを示す。
論文参考訳（メタデータ） (2021-03-03T23:08:59Z)
Bandit Learning in Decentralized Matching Markets [82.39061186055775]
私たちは、一方の市場(プレーヤー)が他方の側(腕)の好みについて事前知識を持っていない両面マッチング市場を研究し、経験からその好みを学ぶ必要があります。このモデルは、標準のマルチアームバンディットフレームワークを競合する分散型マルチプレイヤー設定に拡張します。アームの選好が共有されるたびにアルゴリズムがインセンティブ互換であることが示されるが、選好が完全に一般的である場合には必ずしもそうではない。
論文参考訳（メタデータ） (2020-12-14T08:58:07Z)
Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
我々は,直接の監督なしに自己対決で最適な政策を学習するセルフプレイアルゴリズムに焦点をあてる。本稿では,サンプル複雑性を$tildemathcalO(SAB)$,サンプル複雑性を$tildemathcalO(S(A+B)$とする新しいemphNash Vラーニングアルゴリズムを提案する。
論文参考訳（メタデータ） (2020-06-22T05:00:13Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。