Fugu-MT 論文翻訳(概要): Banker Online Mirror Descent: A Universal Approach for Delayed Online Bandit Learning

論文の概要: Banker Online Mirror Descent: A Universal Approach for Delayed Online Bandit Learning

arxiv url: http://arxiv.org/abs/2301.10500v1
Date: Wed, 25 Jan 2023 10:23:46 GMT
ステータス: 翻訳完了
システム内更新日: 2023-01-26 15:29:12.504035
Title: Banker Online Mirror Descent: A Universal Approach for Delayed Online Bandit Learning
Title（参考訳）: Banker Online Mirror Descent: 遅延オンラインバンド学習のためのユニバーサルアプローチ
Authors: Jiatai Huang, Yan Dai, Longbo Huang
Abstract要約: Banker-OMDは、オンライン学習文学における古典的オンラインミラー・ダイアンス(OMD)技法を一般化する新しいフレームワークである。遅延フィードバックを伴う2つの重要なバンディット学習シナリオに適用することで,textttBanker-OMDのパワーを実証する。 Banker-OMDは、$tildemathcal O(sqrtK(D+T)L)$ regretを達成し、$tildemathcal Oを達成した最初の遅延逆線形帯域アルゴリズムをもたらす。
参考スコア（独自算出の注目度）: 22.18577284185939
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: We propose `Banker-OMD`, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in the online learning literature. The `Banker-OMD` framework almost completely decouples feedback delay handling and the task-specific OMD algorithm design, thus allowing the easy design of new algorithms capable of easily and robustly handling feedback delays. Specifically, it offers a general methodology for achieving $\tilde{\mathcal O}(\sqrt{T} + \sqrt{D})$-style regret bounds in online bandit learning tasks with delayed feedback, where $T$ is the number of rounds and $D$ is the total feedback delay. We demonstrate the power of \texttt{Banker-OMD} by applications to two important bandit learning scenarios with delayed feedback, including delayed scale-free adversarial Multi-Armed Bandits (MAB) and delayed adversarial linear bandits. `Banker-OMD` leads to the first delayed scale-free adversarial MAB algorithm achieving $\tilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret and the first delayed adversarial linear bandit algorithm achieving $\tilde{\mathcal O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret. As a corollary, the first application also implies $\tilde{\mathcal O}(\sqrt{KT}L)$ regret for non-delayed scale-free adversarial MABs, which is the first to match the $\Omega(\sqrt{KT}L)$ lower bound up to logarithmic factors and can be of independent interest.
Abstract（参考訳）: 本稿では,オンライン学習文献における古典的オンラインミラー・ダイアンス(OMD)技法を一般化した新しいフレームワークであるBanker-OMDを提案する。 Banker-OMD"フレームワークは、フィードバック遅延処理とタスク固有のOMDアルゴリズム設計をほぼ完全に分離しているため、フィードバック遅延を容易にかつ堅牢に処理できる新しいアルゴリズムを簡単に設計できる。具体的には、フィードバックが遅れたオンラインバンディット学習タスクにおける$\tilde{\mathcal O}(\sqrt{T} + \sqrt{D})$-style regret boundsを達成するための一般的な方法論を提供する。遅延フィードバックを伴う2つの重要なバンディット学習シナリオに対して,mab (scale-free adversarial multi-armed bandit) と遅延線形バンディット (delayed adversarial linear bandit) を応用して, \texttt{banker-omd} のパワーを実証した。バンクーomd は、最初の遅延スケールフリーな逆向き mab アルゴリズムが $\tilde{\mathcal o}(\sqrt{k(d+t)}l)$ を達成し、$\tilde{\mathcal o}(\text{poly}(n)(\sqrt{t} + \sqrt{d})$ regret となる。結論として、最初の応用は、非遅延スケール自由逆数 MAB に対する $\tilde{\mathcal O}(\sqrt{KT}L)$ regret を意味し、これは $\Omega(\sqrt{KT}L)$ lower bound to logarithmic factors and can be independent interest である。

関連論文リスト

Capacity-Constrained Online Learning with Delays: Scheduling Frameworks and Regret Trade-offs [60.7808741738461]
我々は,遅延フィードバックのために,過去のラウンドを同時に追跡できる回数を制限する新しい「透明度」の下で,目立った損失の遅延を伴ってオンライン学習を研究する。我々のアルゴリズムは、全てのキャパシティレベルにおいて、最適以下のキャパシティの優雅な性能で、最小最適後悔を実現する。
論文参考訳（メタデータ） (2025-03-25T17:20:39Z)
Bandit and Delayed Feedback in Online Structured Prediction [25.637979150499874]
要求の少ないフィードバック、帯域幅、遅延フィードバックを扱うアルゴリズムを提案する。バンディットフィードバックを用いたオンライン分類において,提案アルゴリズムの性能を数値的に評価する。
論文参考訳（メタデータ） (2025-02-26T00:00:15Z)
Federated Combinatorial Multi-Agent Multi-Armed Bandits [79.1700188160944]
本稿では,Banditを用いたオンライン最適化に適したフェデレーション学習フレームワークを提案する。この設定では、エージェントのアームサブセットは、個々のアーム情報にアクセスせずにこれらのサブセットに対するノイズの多い報酬を観察し、特定の間隔で協力して情報を共有することができる。
論文参考訳（メタデータ） (2024-05-09T17:40:09Z)
Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
遅延フィードバックを伴うバンド凸最適化(BCO)。我々は,新しいアルゴリズムを開発し,一般にO(sqrtnT3/4+sqrtdT)$の後悔境界を満足していることを証明する。提案アルゴリズムは,強い凸関数に対して$O((nT)2/3log/3T+dlog T)$に制限された後悔を改善することができることを示す。
論文参考訳（メタデータ） (2024-02-14T13:08:26Z)
Online Learning in Contextual Second-Price Pay-Per-Click Auctions [47.06746975822902]
オンライン学習は、クリック単価のオークションで学習し、そこでは、各ラウンドのT$で、学習者がいくつかのコンテキストと広告を受信する。学習者のゴールは、彼女の後悔を最小限に抑えることであり、それは彼女の総収入と託宣戦略のギャップとして定義される。
論文参考訳（メタデータ） (2023-10-08T07:04:22Z)
Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
政策後悔は、適応的な敵に対してオンライン学習アルゴリズムのパフォーマンスを測定するという、よく確立された概念である。我々は,不完全な政策後悔を効果的に最小化できる敵の制限について検討する。我々は、$tildemathcalO(mKsqrtT)$の完全なポリシーを後悔するアルゴリズムを提供し、$tildemathcalO$表記は対数要素だけを隠す。
論文参考訳（メタデータ） (2022-04-24T03:10:27Z)
Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits [99.86860277006318]
本稿では,一組の逆アルゴリズムを組み合わせ,学習することの問題点について考察する。 Agarwal et al. の CORRAL はこの目標を、$widetildeO(sqrtd S T)$ の残酷なオーバーヘッドで達成している。この問題に触発されて、後悔のオーバーヘッドが百万ドルにしか依存しない大規模バンディットアルゴリズムのバンドを囲む新しいレシピを提案する。
論文参考訳（メタデータ） (2022-02-12T21:55:44Z)
Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays [21.94728545221709]
制限のないフィードバック遅延を伴うMAB(Scale-Free Adversarial Multi Armed Bandit)問題を考える。 textttSFBankerは$mathcal O(sqrtK(D+T)L)cdot rm polylog(T, L)$ total regret, where $T$ is the total number of steps, $D$ is the total feedback delay。
論文参考訳（メタデータ） (2021-10-26T04:06:51Z)
Banker Online Mirror Descent [25.927810369879175]
Banker-OMDは、オンライン学習アルゴリズム設計において古典的オンラインミラー・ディフレッシュ(OMD)技法を一般化する新しいフレームワークである。遅延フィードバックを伴う3つの重要なバンディットシナリオに適用可能なバンクアOMDを実演する。特に、最初の遅延逆線形バンドイットアルゴリズムは、$tildeO(textpoly(n)(sqrtT + sqrtD))$ regretとなる。
論文参考訳（メタデータ） (2021-06-16T16:57:05Z)
Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
本稿では,オンライン有限水平マルコフ決定過程の新たな変種について検討する。各エピソードにおいて、学習者は、エピソードの選択した方針によって実現された軌道に沿って蓄積された損失を被り、総括的盗聴フィードバックを観察する。我々の主な結果は計算効率のよいアルゴリズムで、$O(sqrtK)$ regret for this set, where $K$ is the number of episodes。
論文参考訳（メタデータ） (2021-01-31T16:49:07Z)
Adapting to Delays and Data in Adversarial Multi-Armed Bandits [7.310043452300736]
決定時に利用可能な情報のみを用いてステップサイズを調整するExp3アルゴリズムの変種を分析する。我々は、観測された(最悪の場合ではなく)遅延や損失のシーケンスに適応する後悔の保証を得る。
論文参考訳（メタデータ） (2020-10-12T20:53:52Z)
Stochastic Bandits with Linear Constraints [69.757694218456]
制約付き文脈線形帯域設定について検討し、エージェントの目標は一連のポリシーを作成することである。楽観的悲観的線形帯域(OPLB)と呼ばれる,この問題に対する高信頼束縛アルゴリズムを提案する。
論文参考訳（メタデータ） (2020-06-17T22:32:19Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。