Fugu-MT Paper Translation (Abstract): Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions

Paper abstract: Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions

arxiv url: http://arxiv.org/abs/2605.09363v1
Date: Sun, 10 May 2026 06:23:19 GMT
Status: Translation complete
System update date: 2026-05-12 23:28:50.214294
Title: Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions
Title (reference translation): Near-Optimal Last-Iterate Convergence in Zero-Sum Games with Bandit Feedback and Opponent Actions
Authors: Soumita Hait, Ping Li, Haipeng Luo, Mengxiao Zhang
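Abstract summary: Last-iterate convergence of learning dynamics in games has attracted significant recent attention. We show that t^(-1/2) last-iterate convergence is achievable with high probability in zero-sum games with bandit feedback when the players also observe the opponent's action.
Reference score (independently computed attention level): 43.45624707071202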
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Last-iterate convergence of learning dynamics in games has attracted significant recent attention. In two-player zero-sum games with bandit feedback, where only the loss of the selected action pair is observed, Fiegel et al. (2025) show a separation between average-iterate and last-iterate convergence in duality gap: while the optimal t^(-1/2) rate after t rounds is achievable for the former via standard no-regret algorithms, the latter cannot converge faster than t^(-1/3) in expectation or t^(-1/4) with high probability. However, in many practical settings, such as preference learning, the players observe not only their loss but also the opponent's action. This raises a natural question: can such additional information enable faster last-iterate convergence? We answer this question affirmatively, showing that t^(-1/2) last-iterate convergence is achievable with high probability in this setting, via an efficient algorithm that updates its strategy infrequently by solving an estimated log-barrier-regularized game. We identify fundamental obstacles preventing standard analysis for multi-armed bandits, the single-player case, from generalizing to games, and develop a novel analysis to overcome them. Experiments confirm that our algorithm indeed converges faster than naive baselines and prior methods that do not exploit opponent-action feedback. Finally, we note that our results also improve those for dueling bandits, a special case with skew-symmetric game matrices.
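Abstract (reference translation): Last-iterate convergence of learning dynamics in games has attracted significant attention in recent years. For two-player zero-sum games with bandit feedback, in which only the loss of the selected action pair is observed, Fiegel et al. (2025) show a separation between average-iterate and last-iterate convergence in duality gap: the optimal t^(-1/2) rate after t rounds is achievable for the former via standard no-regret algorithms, whereas the latter cannot converge faster than t^(-1/3) in expectation or t^(-1/4) with high probability. In many practical settings, however, such as preference learning, the players observe not only their loss but also the opponent's action. Can such additional information enable faster last-iterate convergence? We answer this question affirmatively, showing that t^(-1/2) last-iterate convergence is achievable with high probability in this setting, via an efficient algorithm that updates its strategy infrequently by solving an estimated log-barrier-regularized game. We identify fundamental obstacles that prevent the standard analysis for multi-armed bandits, the single-player case, from generalizing to games, and develop a novel analysis to overcome them. Experiments confirm that our algorithm converges faster than naive baselines and prior methods that do not exploit opponent-action feedback. Finally, we note that our results also improve those for dueling bandits, a special case with skew-symmetric game matrices.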
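The convergence rates in the abstract are stated in terms of the duality gap. As a minimal sketch of that quantity only (not the paper's algorithm), the snippet below computes the duality gap of a strategy pair in a matrix game. The function name `duality_gap`, the plain-list representation, and the convention that the row player minimizes while the column player maximizes are assumptions made for illustration.

```python
def duality_gap(A, x, y):
    """Duality gap of the mixed-strategy pair (x, y) for loss matrix A.

    Assumed convention: the row player draws i ~ x and wants a small
    A[i][j]; the column player draws j ~ y and wants a large A[i][j].
    The gap is non-negative and equals zero exactly at a Nash equilibrium.
    """
    m, n = len(A), len(A[0])
    # Expected loss when the column player commits to pure action j against x.
    col_vals = [sum(x[i] * A[i][j] for i in range(m)) for j in range(n)]
    # Expected loss when the row player commits to pure action i against y.
    row_vals = [sum(A[i][j] * y[j] for j in range(n)) for i in range(m)]
    # Best response value of the maximizer minus best response value of the minimizer.
    return max(col_vals) - min(row_vals)


# Matching-pennies-style game: uniform play by both sides is the equilibrium.
A = [[1.0, 0.0], [0.0, 1.0]]
uniform = [0.5, 0.5]
print(duality_gap(A, uniform, uniform))     # 0.0
print(duality_gap(A, [1.0, 0.0], uniform))  # 0.5
```

Last-iterate convergence at rate t^(-1/2) means that this gap, evaluated at the strategy pair actually played in round t rather than at the time-averaged strategies, shrinks on the order of t^(-1/2).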


Related papers