Fugu-MT Paper Translation (Abstract): Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

Paper abstract: Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

arxiv url: http://arxiv.org/abs/2304.14989v4
Date: Thu, 11 Apr 2024 20:34:38 GMT
Status: translation complete
Last updated in system: 2024-04-15 20:15:54.656794
Title: Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards
Title (reference translation): Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards
Authors: Hao Qin, Kwang-Sung Jun, Chicheng Zhang
Abstract summary: Maillard sampling \cite{maillard13apprentissage} has been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting \cite{bian2022maillard}. We propose the Kullback-Leibler Maillard Sampling (KL-MS) algorithm.
Reference score (independently computed attention score): 24.487235945761913
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study $K$-armed bandit problems where the reward distributions of the arms are all supported on the $[0,1]$ interval. It has been a challenge to design regret-efficient randomized exploration algorithms in this setting. Maillard sampling \cite{maillard13apprentissage}, an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting \cite{bian2022maillard} while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we propose the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling for achieving KL-style gap-dependent regret bound. We show that KL-MS enjoys the asymptotic optimality when the rewards are Bernoulli and has a worst-case regret bound of the form $O(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K \ln T)$, where $\mu^*$ is the expected reward of the optimal arm, and $T$ is the time horizon length.
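Abstract (reference translation): We study $K$-armed bandit problems in which the reward distributions of the arms are all supported on the $[0,1]$ interval. Designing regret-efficient randomized exploration algorithms in this setting has been a challenge. Maillard sampling \cite{maillard13apprentissage}, an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting \cite{bian2022maillard} while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we propose the Kullback-Leibler Maillard Sampling (KL-MS) algorithm. We show that KL-MS is asymptotically optimal when the rewards are Bernoulli and has a worst-case regret bound of the form $O(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K \ln T)$.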
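The abstract states that KL-MS keeps closed-form action probabilities but does not spell out the sampling rule. The sketch below assumes the rule is $p_a \propto \exp(-N_a \cdot \mathrm{kl}(\hat{\mu}_a, \hat{\mu}_{\max}))$, where $N_a$ is the pull count of arm $a$, $\hat{\mu}_a$ its empirical mean, and $\mathrm{kl}$ the binary KL divergence. That form, the round-robin initialization, the Bernoulli reward simulation, and all function names (`bernoulli_kl`, `kl_ms_probabilities`, `run_kl_ms`) are illustrative assumptions, not details taken from this page.

```python
import math
import random


def bernoulli_kl(p, q, eps=1e-12):
    """Binary KL divergence kl(p, q) between Bernoulli(p) and Bernoulli(q)."""
    # Clip both arguments away from 0 and 1 to avoid log(0).
    # eps is an illustrative numerical safeguard, not part of any stated method.
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_ms_probabilities(counts, means):
    """Assumed KL-MS rule: p_a proportional to exp(-N_a * kl(mu_hat_a, mu_hat_max))."""
    mu_max = max(means)
    weights = [math.exp(-n * bernoulli_kl(m, mu_max)) for n, m in zip(counts, means)]
    total = sum(weights)  # at least 1, because the empirical best arm has weight 1
    return [w / total for w in weights]


def run_kl_ms(true_means, horizon, seed=0):
    """Simulate the assumed rule on Bernoulli arms and return per-arm pull counts."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # assumption: pull each arm once before sampling
        else:
            means = [s / n for s, n in zip(sums, counts)]
            probs = kl_ms_probabilities(counts, means)
            arm = rng.choices(range(k), weights=probs)[0]
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


if __name__ == "__main__":
    print(kl_ms_probabilities([10, 10], [0.5, 0.8]))
    print(run_kl_ms([0.2, 0.5, 0.8], horizon=2000))
```

Because the probabilities are available in closed form, they can be logged alongside each pull, which is the property the abstract highlights as useful for offline policy evaluation.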

Related papers

Follow-the-Perturbed-Leader Approaches Best-of-Both-Worlds for the m-Set Semi-Bandit Problems [18.74680577173648]
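We consider the general case of the $m$-set semi-bandit problem, in which the learner selects exactly $m$ arms out of $d$ arms. We also show that FTPL with a Fréchet perturbation achieves a near-optimal regret bound of $\mathcal{O}(\sqrt{nm}(\sqrt{d\log(d)}+m^{5/6}))$ in the adversarial setting.
Paper reference translation (metadata) (2025-04-09T22:07:01Z)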
Communication-Constrained Bandits under Additive Gaussian Noise [111.06688156723018]
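We study a distributed multi-armed bandit in which a client supplies the learner with communication-constrained feedback. We propose a multi-phase bandit algorithm, $\mathtt{UE\text{-}UCB++}$, which matches this lower bound up to a small additive factor.
Paper reference translation (metadata) (2023-04-25T09:31:20Z)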
Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits [88.21288104408556]
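We study the regret of Thompson sampling (TS) algorithms for exponential family bandits. We propose a Thompson sampling algorithm (ExpTS) that uses a novel sampling distribution to avoid underestimating the optimal arm.
Paper reference translation (metadata) (2022-06-07T18:08:21Z)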
On Submodular Contextual Bandits [92.45432756301231]
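We consider the problem of contextual bandits in which actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function. We show that, using an Inverse Gap Weighting strategy, the proposed algorithm efficiently randomizes around local optima of the estimated functions.
Paper reference translation (metadata) (2021-12-03T21:42:33Z)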
Maillard Sampling: Boltzmann Exploration Done Optimally [11.282341369957216]
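This paper presents a randomized algorithm for the $K$-armed bandit problem. Maillard sampling (MS) computes the probability of choosing each arm in closed form. We propose a variant of MS, called MS$^+$, that improves the minimax bound to $\sqrt{KT\log K}$ without losing asymptotic optimality.
Paper reference translation (metadata) (2021-11-05T06:50:22Z)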
Combinatorial Bandits without Total Order for Arms [52.93972547896022]
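We propose a reward model that captures set-dependent reward distributions and does not assume a total order over the arms. We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound and an $O\left(k^2\sqrt{n T \log T}\right)$ gap-independent regret bound.
Paper reference translation (metadata) (2021-03-03T23:08:59Z)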
Stochastic Bandits with Linear Constraints [69.757694218456]
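We study a constrained contextual linear bandit setting in which the goal of the agent is to produce a sequence of policies. We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
Paper reference translation (metadata) (2020-06-17T22:32:19Z)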
MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
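Whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems has remained an open problem. We propose a variant of Thompson sampling, called MOTS, that adaptively clips the sampling instance of the chosen arm at each time step. We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for a finite time horizon $T$, as well as the asymptotically optimal regret bound for Gaussian rewards as $T$ approaches infinity.
Paper reference translation (metadata) (2020-03-03T21:24:39Z)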

The related papers list is generated automatically from the titles and abstracts of papers on this site.

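This is the information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from its use (including all information and translations).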