Fugu-MT 論文翻訳(概要): Bandit Linear Optimization for Sequential Decision Making and Extensive-Form Games

論文の概要: Bandit Linear Optimization for Sequential Decision Making and Extensive-Form Games

arxiv url: http://arxiv.org/abs/2103.04546v1
Date: Mon, 8 Mar 2021 05:00:13 GMT
ステータス: 翻訳完了
システム内更新日: 2021-03-10 20:50:11.969606
Title: Bandit Linear Optimization for Sequential Decision Making and Extensive-Form Games
Title（参考訳）: 逐次決定と拡張形式ゲームのための帯域線形最適化
Authors: Gabriele Farina, Robin Schmucker, Tuomas Sandholm
Abstract要約: tree-form sequential decision making (tfsdm) は、エージェントと潜在的に敵対的な環境の間のツリー形式の相互作用をモデル化することで、古典的なワンショット意思決定を拡張する。これは、各プレイヤーが幅広い形式のゲームで直面するオンライン意思決定問題、およびマルコフ決定プロセス、およびエージェントが観測された履歴を条件とする部分観察可能なマルコフ決定プロセスをキャプチャする。本稿では, (i) 線形時間損失と (ii) $o(sqrtt)$ cumulative regret の両方を提供する拡張dmのバンディット線形最適化問題に対する最初のアルゴリズムを提案する。
参考スコア（独自算出の注目度）: 102.23975166536326
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tree-form sequential decision making (TFSDM) extends classical one-shot decision making by modeling tree-form interactions between an agent and a potentially adversarial environment. It captures the online decision-making problems that each player faces in an extensive-form game, as well as Markov decision processes and partially-observable Markov decision processes where the agent conditions on observed history. Over the past decade, there has been considerable effort into designing online optimization methods for TFSDM. Virtually all of that work has been in the full-feedback setting, where the agent has access to counterfactuals, that is, information on what would have happened had the agent chosen a different action at any decision node. Little is known about the bandit setting, where that assumption is reversed (no counterfactual information is available), despite this latter setting being well understood for almost 20 years in one-shot decision making. In this paper, we give the first algorithm for the bandit linear optimization problem for TFSDM that offers both (i) linear-time iterations (in the size of the decision tree) and (ii) $O(\sqrt{T})$ cumulative regret in expectation compared to any fixed strategy, at all times $T$. This is made possible by new results that we derive, which may have independent uses as well: 1) geometry of the dilated entropy regularizer, 2) autocorrelation matrix of the natural sampling scheme for sequence-form strategies, 3) construction of an unbiased estimator for linear losses for sequence-form strategies, and 4) a refined regret analysis for mirror descent when using the dilated entropy regularizer.
Abstract（参考訳）: tree-form sequential decision making (tfsdm) は、エージェントと潜在的に敵対的な環境の間のツリー形式の相互作用をモデル化することで、古典的なワンショット意思決定を拡張する。これは、各プレイヤーが幅広い形式のゲームで直面するオンライン意思決定問題、およびマルコフ決定プロセス、およびエージェントが観測された履歴を条件とする部分観察可能なマルコフ決定プロセスをキャプチャする。過去10年間で、TFSDMのオンライン最適化手法の設計に多大な努力が払われてきた。エージェントは反事実、すなわち、エージェントが任意の決定ノードで異なるアクションを選択した場合に何が起こったかに関する情報にアクセスすることができます。バンディット設定についてはほとんど知られていないが、この仮定は逆転し(偽情報はない)、後者の設定は1ショットの意思決定で約20年間よく理解されている。本稿では、(i)線形時間反復(決定木の大きさ)と(ii) $O(\sqrt{T})$ 累積後悔を、任意の固定戦略と比較して常に$T$で提供するTFSDMのバンドライト線形最適化問題のための最初のアルゴリズムを与える。 1) 拡張エントロピー正則化器の幾何構造, 2) シーケンス形式戦略のための自然サンプリングスキームの自己相関行列, 3) シーケンス形式正則化器を使用する場合の鏡面下降に対する不偏推定器の構成, 4) 拡張エントロピー正則化器を用いた鏡面下降に対する厳格な後悔解析などである。

論文の概要: Bandit Linear Optimization for Sequential Decision Making and Extensive-Form Games

関連論文リスト