Fugu-MT Paper Translation (Abstract): A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback

Paper abstract: A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback

arxiv url: http://arxiv.org/abs/2206.14906v1
Date: Wed, 29 Jun 2022 20:49:45 GMT
Status: Translation complete
Last updated in system: 2022-07-01 13:34:33.749158
Title: A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback
Title (reference translation): A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback
Authors: Saeed Masoudian, Julian Zimmert, Yevgeny Seldin
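Abstract summary: We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multi-armed bandits with delayed feedback. It simultaneously achieves a near-optimal regret guarantee in the stochastic setting with fixed delays. We also present an extension of the algorithm to the case of arbitrary delays.
Reference score (independently computed attention score): 25.68113242132723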
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback, which, in addition to the minimax optimal adversarial regret guarantee shown by Zimmert and Seldin, simultaneously achieves a near-optimal regret guarantee in the stochastic setting with fixed delays. Specifically, the adversarial regret guarantee is $\mathcal{O}(\sqrt{TK} + \sqrt{dT\log K})$, where $T$ is the time horizon, $K$ is the number of arms, and $d$ is the fixed delay, whereas the stochastic regret guarantee is $\mathcal{O}\left(\sum_{i \neq i^*}(\frac{1}{\Delta_i} \log(T) + \frac{d}{\Delta_{i}\log K}) + d K^{1/3}\log K\right)$, where $\Delta_i$ are the suboptimality gaps. We also present an extension of the algorithm to the case of arbitrary delays, which is based on an oracle knowledge of the maximal delay $d_{max}$ and achieves $\mathcal{O}(\sqrt{TK} + \sqrt{D\log K} + d_{max}K^{1/3} \log K)$ regret in the adversarial regime, where $D$ is the total delay, and $\mathcal{O}\left(\sum_{i \neq i^*}(\frac{1}{\Delta_i} \log(T) + \frac{\sigma_{max}}{\Delta_{i}\log K}) + d_{max}K^{1/3}\log K\right)$ regret in the stochastic regime, where $\sigma_{max}$ is the maximal number of outstanding observations. Finally, we present a lower bound that matches the regret upper bound achieved by the skipping technique of Zimmert and Seldin [2020] in the adversarial setting.
Abstract (reference translation): We propose a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multi-armed bandits with delayed feedback which, in addition to the minimax optimal adversarial regret guarantee shown by Zimmert and Seldin, simultaneously achieves a near-optimal regret guarantee in the stochastic setting with fixed delays. Specifically, the adversarial regret guarantee is $\mathcal{O}(\sqrt{TK} + \sqrt{dT\log K})$, where $T$ is the time horizon, $K$ is the number of arms, and $d$ is the fixed delay, whereas the stochastic regret guarantee is $\mathcal{O}\left(\sum_{i \neq i^*}(\frac{1}{\Delta_i} \log(T) + \frac{d}{\Delta_{i}\log K}) + dK^{1/3}\log K\right)$, where $\Delta_i$ are the suboptimality gaps. We also present an extension of the algorithm to the case of arbitrary delays, which is based on an oracle knowledge of the maximal delay $d_{max}$ and achieves $\mathcal{O}(\sqrt{TK} + \sqrt{D\log K} + d_{max}K^{1/3} \log K)$ regret in the adversarial regime, where $D$ is the total delay, and $\mathcal{O}\left(\sum_{i \neq i^*}(\frac{1}{\Delta_i} \log(T) + \frac{\sigma_{max}}{\Delta_{i}\log K}) + d_{max}K^{1/3}\log K\right)$ regret in the stochastic regime, where $\sigma_{max}$ is the maximal number of outstanding observations. Finally, we present a lower bound that matches the regret upper bound achieved by the skipping technique of Zimmert and Seldin [2020] in the adversarial setting.
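
To make the quantities in the abstract concrete, here is a minimal Python sketch. It is not the authors' algorithm; it only evaluates the terms inside the stated fixed-delay bounds and computes $D$, $d_{max}$, and $\sigma_{max}$ from a delay sequence. The function names are hypothetical, all constants hidden by the $\mathcal{O}$ notation are dropped, and the convention that feedback from round $s$ arrives at the end of round $s + d_s$ is an assumption rather than something stated in the abstract.

```python
import math
from typing import Dict, Sequence


def delay_statistics(delays: Sequence[int]) -> Dict[str, int]:
    """Compute D, d_max and sigma_max for delays d_1, ..., d_T.

    Assumed convention: feedback from round s arrives at the end of
    round s + d_s, so it is still outstanding at round t if s + d_s >= t.
    """
    T = len(delays)
    total_delay = sum(delays)           # D: total delay
    max_delay = max(delays, default=0)  # d_max: maximal delay
    sigma_max = 0                       # maximal number of outstanding observations
    for t in range(1, T + 1):
        outstanding = sum(1 for s in range(1, t) if s + delays[s - 1] >= t)
        sigma_max = max(sigma_max, outstanding)
    return {"D": total_delay, "d_max": max_delay, "sigma_max": sigma_max}


def adversarial_terms_fixed_delay(T: int, K: int, d: int) -> float:
    """sqrt(TK) + sqrt(d T log K), with the O(.) constants dropped."""
    if K < 2:
        raise ValueError("K must be at least 2 so that log K > 0")
    return math.sqrt(T * K) + math.sqrt(d * T * math.log(K))


def stochastic_terms_fixed_delay(
    T: int, K: int, d: int, gaps: Sequence[float]
) -> float:
    """Terms of the stochastic bound, with the O(.) constants dropped.

    `gaps` holds the suboptimality gaps Delta_i of the suboptimal arms
    only (i != i*), each strictly positive.
    """
    if K < 2:
        raise ValueError("K must be at least 2 so that log K > 0")
    per_arm = sum(math.log(T) / g + d / (g * math.log(K)) for g in gaps)
    return per_arm + d * K ** (1 / 3) * math.log(K)
```

Under the assumed convention, a constant delay sequence such as `[2, 2, 2, 2, 2]` gives `sigma_max == 2`, which is consistent with $\sigma_{max}$ taking the place of $d$ when the abstract moves from fixed to arbitrary delays.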

Related papers

Capacity-Constrained Online Learning with Delays: Scheduling Frameworks and Regret Trade-offs [60.7808741738461]
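We study online learning with oblivious losses and delays under a novel "capacity constraint" that limits how many past rounds can be tracked simultaneously for delayed feedback. Our algorithms achieve minimax-optimal regret across all capacity levels, with performance degrading gracefully under suboptimal capacity.
Paper reference translation (metadata) (2025-03-25T17:20:39Z)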
Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
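We study bandit convex optimization (BCO) with delayed feedback. We develop a novel algorithm and prove that it enjoys a regret bound of $O(\sqrt{n}T^{3/4} + \sqrt{dT})$ in general. We also show that the proposed algorithm can improve the regret bound to $O((nT)^{2/3}\log^{1/3} T + d\log T)$ for strongly convex functions.
Paper reference translation (metadata) (2024-02-14T13:08:26Z)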
Minimax Optimal Submodular Optimization with Bandit Feedback [13.805872311596739]
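We consider maximizing a monotone submodular set function $f: 2^{[n]} \rightarrow [0,1]$ under bandit feedback. Specifically, $f$ is unknown to the learner, but at each time $t=1,\dots,T$ the learner chooses a set $S_t \subset [n]$ with $|S_t| \leq k$ and receives $f(S_t) + \eta_t$, where $\eta_t$ is mean-zero sub-Gaussian noise.
Paper reference translation (metadata) (2023-10-27T20:19:03Z)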
Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
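We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov decision processes (MDPs) with a constraint on the number of batches. We design an algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$. Our technical contributions are two-fold: 1) a near-optimal design scheme for exploration
Paper reference translation (metadata) (2022-10-15T09:22:22Z)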
Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays [21.94728545221709]
We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays. SFBanker achieves $\mathcal{O}(\sqrt{K(D+T)}L)\cdot \mathrm{polylog}(T, L)$ total regret, where $T$ is the total number of steps and $D$ is the total feedback delay.
Paper reference translation (metadata) (2021-10-26T04:06:51Z)
Differentially Private Multi-Armed Bandits in the Shuffle Model [58.22098764071924]
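We give an $(\varepsilon,\delta)$-differentially private algorithm for the multi-armed bandit (MAB) problem in the shuffle model. Our upper bound almost matches the regret of the best known algorithms for the centralized model and significantly outperforms the best known algorithm in the local model.
Paper reference translation (metadata) (2021-06-05T14:11:01Z)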
Improved Analysis of Robustness of the Tsallis-INF Algorithm to Adversarial Corruptions in Stochastic Multiarmed Bandits [12.462608802359936]
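We improve the regret bound for the Tsallis-INF algorithm of Zimmert and Seldin (2021). In particular, for $C = \Theta\left(\frac{T}{\log T}\right)$, where $T$ is the time horizon, we achieve an improvement by a multiplicative factor. We also improve the dependence of the regret bound on the time horizon from $\log T$ to the logarithm of $(K-1)T$ divided by a term in $\sum_{i \neq i^*} \frac{1}{\Delta_i}$ …
Paper reference translation (metadata) (2021-03-23T12:26:39Z)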
Bandits with many optimal arms [68.17472536610859]
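We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms. We characterize the optimal learning rates in both the cumulative regret setting and the best-arm identification setting.
Paper reference translation (metadata) (2021-03-23T11:02:31Z)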
An Algorithm for Stochastic and Adversarial Bandits with Switching Costs [10.549307055348596]
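We propose an algorithm for multi-armed bandits with switching costs, where the algorithm pays a price $\lambda$ every time it switches arms. Our algorithm is based on an adaptation of the Tsallis-INF algorithm of Zimmert and Seldin (2021).
Paper reference translation (metadata) (2021-02-19T11:03:51Z)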
Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
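We study online learning with bandit feedback. The learner has access only to a zeroth-order oracle, where the cost/reward functions admit a "pseudo-1d" structure. We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds. We propose a new algorithm, sbcalg, that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
Paper reference translation (metadata) (2021-02-15T08:16:51Z)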
Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
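We study learning in contextual bandits with the help of loss predictors. The optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}T^{\frac{1}{3}}\})$.
Paper reference translation (metadata) (2020-03-04T07:36:38Z)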

The related papers list is generated automatically from the titles and abstracts of the papers on this site.
