Fugu-MT 論文翻訳(概要): Adversarial Bandits against Arbitrary Strategies

論文の概要: Adversarial Bandits against Arbitrary Strategies

arxiv url: http://arxiv.org/abs/2205.14839v5
Date: Thu, 10 Oct 2024 04:58:15 GMT
ステータス: 翻訳完了
システム内更新日: 2024-12-06 02:16:42.652223
Title: Adversarial Bandits against Arbitrary Strategies
Title（参考訳）: 任意戦略に対する対立帯域
Authors: Jung-hun Kim, Se-Young Yun,
Abstract要約: 本稿では,任意の戦略に対して,敵対的盗賊問題について検討する。オンラインミラー降下法(OMD)を用いたマスタベースフレームワークを採用する。我々は OMD の適応学習率を用いて $tildeO(minmathbbE[sqrtSKTrho_T(hdagger)],SsqrtKT)$, ここで $_T(hdagger)$ は損失推定器の分散項である。
参考スコア（独自算出の注目度）: 28.64435687288574
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the adversarial bandit problem against arbitrary strategies, in which $S$ is the parameter for the hardness of the problem and this parameter is not given to the agent. To handle this problem, we adopt the master-base framework using the online mirror descent method (OMD). We first provide a master-base algorithm with simple OMD, achieving $\tilde{O}(S^{1/2}K^{1/3}T^{2/3})$, in which $T^{2/3}$ comes from the variance of loss estimators. To mitigate the impact of the variance, we propose using adaptive learning rates for OMD and achieve $\tilde{O}(\min\{\mathbb{E}[\sqrt{SKT\rho_T(h^\dagger)}],S\sqrt{KT}\})$, where $\rho_T(h^\dagger)$ is a variance term for loss estimators.
Abstract（参考訳）: 本稿では, 任意の戦略に対して, S$が問題の硬さのパラメータであり, このパラメータがエージェントに与えられない, 逆帯域問題について検討する。この問題に対処するため,オンラインミラー降下法(OMD)を用いたマスタベースフレームワークを採用した。まず、単純な OMD を持つマスターベースアルゴリズムを提供し、損失推定器の分散から$T^{2/3}$ が生じるような$\tilde{O}(S^{1/2}K^{1/3}T^{2/3})$ を達成する。分散の影響を軽減するために, OMD の適応学習率を用いて $\tilde{O}(\min\{\mathbb{E}[\sqrt{SKT\rho_T(h^\dagger)}], S\sqrt{KT}\})$, ここで $\rho_T(h^\dagger)$ は損失推定器の分散項である。

関連論文リスト

Reinforcement Learning from Adversarial Preferences in Tabular MDPs [62.73758165845971]
我々は,敵対的嗜好を持つエピソードマルコフ決定プロセス(MDP)の新たな枠組みを導入する。 PbMDP では、標準的なエピソード MDP とは異なり、学習者は2つの候補アーム間の好みを観察する。我々は、既知遷移の下で、T2/3$という残差境界を達成するアルゴリズムを開発する。
論文参考訳（メタデータ） (2025-07-15T20:19:32Z)
Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback [49.84060509296641]
オンライン有限水平マルコフ決定過程を逆向きに変化した損失と総括的帯域幅フィードバック(フルバンド幅)を用いて研究する。この種のフィードバックの下では、エージェントは、軌跡内の各中間段階における個々の損失よりも、軌跡全体に生じる総損失のみを観察する。この設定のための最初のポリシー最適化アルゴリズムを紹介します。
論文参考訳（メタデータ） (2025-02-06T12:03:24Z)
Stochastic Bandits Robust to Adversarial Attacks [33.278131584647745]
本稿では,敵攻撃に対して頑健なマルチアームバンディットアルゴリズムについて検討する。我々は、攻撃予算の知識の有無に関わらず、このモデルの2つのケースを調査する。我々は、加法的あるいは乗法的な$C$依存項を持つ後悔境界を持つ2種類のアルゴリズムを考案する。
論文参考訳（メタデータ） (2024-08-16T17:41:35Z)
Accelerated Variance-Reduced Forward-Reflected Methods for Root-Finding Problems [8.0153031008486]
そこで本研究では,Nesterovの高速前方反射法と分散還元法を新たに提案し,根絶問題の解法を提案する。我々のアルゴリズムは単ループであり、ルートフィリング問題に特化して設計された非バイアス分散還元推定器の新たなファミリーを利用する。
論文参考訳（メタデータ） (2024-06-04T15:23:29Z)
Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
一般化線形モデル (GLM) フレームワークを用いて, citelu2021low で提案した一般化低ランク行列帯域問題について検討する。既存のアルゴリズムの計算不可能性と理論的制約を克服するため,まずG-ESTTフレームワークを提案する。 G-ESTT は $tildeO(sqrt(d_1+d_2)3/2Mr3/2T)$ bound of regret を達成でき、G-ESTS は $tildeO を達成できることを示す。
論文参考訳（メタデータ） (2024-01-14T14:14:19Z)
Best-of-Both-Worlds Linear Contextual Bandits [45.378265414553226]
本研究は, 対向汚職下での多武装盗賊問題の事例である$K$腕線形文脈盗賊の問題を考察する。我々は,理論的保証のもと,双方の敵環境に有効な戦略を開発する。両体制の理論的保証から,我々の戦略をBest-of-Both-Worlds (BoBW) RealFTRLと呼んでいる。
論文参考訳（メタデータ） (2023-12-27T09:32:18Z)
Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
両世界のベスト・オブ・ワールドズ・アルゴリズムを$K$武器付き線形文脈包帯に対して検討する。我々のアルゴリズムは、敵対的体制と敵対的体制の両方において、ほぼ最適の後悔の限界を提供する。
論文参考訳（メタデータ） (2023-12-24T08:27:30Z)
Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
我々のアルゴリズムが $tildeObig((d+log (|mathcalS|2 |mathcalA|))sqrtKbig)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping is linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, $|mathcalS|$ and $|mathcalA|$ is the standardities of the state and action space。
論文参考訳（メタデータ） (2023-05-15T05:37:32Z)
Revisiting Weighted Strategy for Non-stationary Parametric Bandits [82.1942459195896]
本稿では,非定常パラメトリックバンディットの重み付け戦略を再考する。より単純な重みに基づくアルゴリズムを生成する改良された分析フレームワークを提案する。我々の新しいフレームワークは、他のパラメトリックバンディットの後悔の限界を改善するのに使える。
論文参考訳（メタデータ） (2023-03-05T15:11:14Z)
Computationally Efficient Horizon-Free Reinforcement Learning for Linear Mixture MDPs [111.75736569611159]
線形混合MDPのための計算効率のよい初めての地平線フリーアルゴリズムを提案する。我々のアルゴリズムは、未知の遷移力学に対する重み付き最小二乗推定器に適応する。これにより、$sigma_k2$'sが知られているときに、この設定で最もよく知られたアルゴリズムも改善される。
論文参考訳（メタデータ） (2022-05-23T17:59:18Z)
Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays [21.94728545221709]
制限のないフィードバック遅延を伴うMAB(Scale-Free Adversarial Multi Armed Bandit)問題を考える。 textttSFBankerは$mathcal O(sqrtK(D+T)L)cdot rm polylog(T, L)$ total regret, where $T$ is the total number of steps, $D$ is the total feedback delay。
論文参考訳（メタデータ） (2021-10-26T04:06:51Z)
Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
線形バンドイットと線形混合決定プロセス(mdp)に対する分散認識信頼セットの構築方法を示す。線形バンドイットに対しては、$d を特徴次元とする$widetildeo(mathrmpoly(d)sqrt1 + sum_i=1ksigma_i2) が成り立つ。線形混合 MDP に対し、$widetildeO(mathrmpoly(d)sqrtK)$ regret bound を得る。
論文参考訳（メタデータ） (2021-01-29T18:57:52Z)
Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
我々は,損失予測の助けを借りて,文脈的包帯における学習を研究する。最適な後悔は$mathcalO(minsqrtT, sqrtmathcalETfrac13)$である。
論文参考訳（メタデータ） (2020-03-04T07:36:38Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。