Fugu-MT Paper Translation (Abstract): An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits

Paper summary: An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits

arxiv url: http://arxiv.org/abs/2412.02861v2
Date: Thu, 20 Feb 2025 18:24:53 GMT
Status: Translation complete
Last updated in system: 2025-02-21 15:38:29.469714
Title: An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits
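Title (reference translation): An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits
Authors: Amaury Gouverneur, Borja Rodríguez-Gálvez, Tobias J. Oechtering, Mikael Skoglund
Abstract summary: This paper studies the performance of the Thompson Sampling algorithm for logistic bandit problems. We derive a bound of order $O(d/\alpha\sqrt{T \log(\beta T/d)})$ on the Bayesian expected regret of Thompson Sampling incurred after $T$ time steps.
Reference score (site-computed attention score): 36.37704574907495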
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, $\exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle))$, with slope parameter $\beta>0$, and where both the action $a\in \mathcal{A}$ and parameter $\theta \in \mathcal{O}$ lie within the $d$-dimensional unit ball. Adopting the information-theoretic framework introduced by Russo and Van Roy (2016), we analyze the information ratio, a statistic that quantifies the trade-off between the immediate regret incurred and the information gained about the optimal action. We improve upon previous results by establishing that the information ratio is bounded by $\tfrac{9}{2}d\alpha^{-2}$, where $\alpha$ is a minimax measure of the alignment between the action space $\mathcal{A}$ and the parameter space $\mathcal{O}$, and is independent of $\beta$. Using this result, we derive a bound of order $O(d/\alpha\sqrt{T \log(\beta T/d)})$ on the Bayesian expected regret of Thompson Sampling incurred after $T$ time steps. To our knowledge, this is the first regret bound for logistic bandits that depends only logarithmically on $\beta$ while being independent of the number of actions. In particular, when the action space contains the parameter space, the bound on the expected regret is of order $\tilde{O}(d \sqrt{T})$.
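Abstract (reference translation): This paper studies the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, $\exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle))$, with slope parameter $\beta>0$, where both the action $a\in \mathcal{A}$ and the parameter $\theta \in \mathcal{O}$ lie within the $d$-dimensional unit ball. Applying the information-theoretic framework introduced by Russo and Van Roy (2016), we analyze the information ratio, a statistic that quantifies the trade-off between the immediate regret incurred and the information gained about the optimal action. We improve on previous results by establishing that the information ratio is bounded by $\tfrac{9}{2}d\alpha^{-2}$, where $\alpha$ is a minimax measure of the alignment between the action space $\mathcal{A}$ and the parameter space $\mathcal{O}$ and is independent of $\beta$. Using this result, we derive a bound of order $O(d/\alpha\sqrt{T \log(\beta T/d)})$ on the Bayesian expected regret of Thompson Sampling incurred after $T$ time steps. To our knowledge, this is the first regret bound for logistic bandits that depends only logarithmically on $\beta$ while being independent of the number of actions. In particular, when the action space contains the parameter space, the bound on the expected regret is of order $\tilde{O}(d \sqrt{T})$.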
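The setting described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's analysis or code. It assumes a finite action set and a uniform prior over a finite set of candidate parameters, both drawn from the unit ball, so that the posterior can be updated exactly. The names `reward_prob`, `sample_unit_ball`, and `thompson_sampling` are hypothetical.

```python
import math
import random


def reward_prob(a, theta, beta):
    """Logistic reward probability exp(b<a,t>) / (1 + exp(b<a,t>))."""
    z = beta * sum(x * y for x, y in zip(a, theta))
    return 1.0 / (1.0 + math.exp(-z))


def sample_unit_ball(rng, d):
    """Draw one point uniformly from the d-dimensional unit ball."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    radius = rng.random() ** (1.0 / d)
    return [radius * x / norm for x in g]


def thompson_sampling(actions, candidates, true_theta, beta, T, rng):
    """Run T steps of Thompson Sampling and return the cumulative regret.

    Assumption (not from the paper): the prior is uniform over the finite
    list `candidates`, so the posterior is a weight vector over that list.
    """
    log_post = [0.0] * len(candidates)  # unnormalised log-posterior
    best_p = max(reward_prob(a, true_theta, beta) for a in actions)
    regret = 0.0
    for _ in range(T):
        # Sample a parameter from the current posterior.
        m = max(log_post)
        weights = [math.exp(lp - m) for lp in log_post]
        theta_hat = rng.choices(candidates, weights=weights, k=1)[0]
        # Play the action that is optimal for the sampled parameter.
        a = max(actions, key=lambda act: reward_prob(act, theta_hat, beta))
        # Observe a binary reward generated by the true parameter.
        p = reward_prob(a, true_theta, beta)
        reward = rng.random() < p
        regret += best_p - p
        # Bayes update with the Bernoulli likelihood of the observation.
        for i, cand in enumerate(candidates):
            q = reward_prob(a, cand, beta)
            log_post[i] += math.log(q if reward else 1.0 - q)
    return regret


if __name__ == "__main__":
    rng = random.Random(0)
    d, beta, T = 3, 2.0, 500  # illustrative values only
    actions = [sample_unit_ball(rng, d) for _ in range(50)]
    candidates = [sample_unit_ball(rng, d) for _ in range(200)]
    true_theta = rng.choice(candidates)  # parameter drawn from the prior
    print(thompson_sampling(actions, candidates, true_theta, beta, T, rng))
```

Drawing `true_theta` from the same candidate set mirrors the Bayesian setting, in which regret is averaged over the prior. Averaging the returned value over many seeds would give a rough empirical estimate of the Bayesian expected regret for this toy instance.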

Related papers

New Rates in Stochastic Decision-Theoretic Online Learning under Differential Privacy [17.711455925206298]
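Hu and Mehta (2024) posed an open problem: what is the optimal instance-dependent rate for stochastic decision-theoretic online learning (with $K$ actions and $T$ rounds) under $\varepsilon$-differential privacy? This paper partially addresses the problem with two new results. First, it gives an upper bound of $O\left(\frac{\log K}{\Delta_{\min}} + \frac{\log^2 K}{\varepsilon}\right)$. Second, ...
Paper reference translation (metadata) (2025-02-16T05:13:51Z)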
Cooperative Multi-Agent Constrained Stochastic Linear Bandits [2.099922236065961]
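A network of $N$ agents communicates locally to minimize the overall regret while keeping the expected cost below a given threshold $\tau$. We propose a safe distributed upper confidence bound algorithm called MA-OPLB and establish a high-probability bound on its $T$-round regret. We show that our regret bound is of order $\mathcal{O}\left(\frac{d}{\tau-c_0}\frac{\log(NT)^2}{\sqrt{N}}\sqrt{T}\log(1/|\lambda|)\right)$.
Paper reference translation (metadata) (2024-10-22T19:34:53Z)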
Low-rank Matrix Bandits with Heavy-tailed Rewards [55.03293214439741]
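We study the problem of low-rank matrix bandits with heavy-tailed rewards (LowHTR). Using truncation of the observed payoffs together with dynamic exploration, we propose a new algorithm called LOTUS.
Paper reference translation (metadata) (2024-04-26T21:54:31Z)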
Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks [93.00280593719513]
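This paper studies high-dimensional multi-armed contextual bandits with batched feedback, where the $T$ steps of online interaction are divided into batches. Specifically, each batch collects data according to a policy that depends on previous batches, and the rewards are revealed only at the end of the batch. Our algorithm achieves regret bounds comparable to those of the fully sequential setting with only $\mathcal{O}(\log T)$ batches.
Paper reference translation (metadata) (2023-11-22T06:06:54Z)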
Context-lumpable stochastic bandits [49.024050919419366]
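We consider a contextual bandit problem with $S$ contexts and $K$ actions. We propose an algorithm that outputs an $\epsilon$-optimal policy using at most $\widetilde{O}(r(S+K)/\epsilon^2)$ samples. In the regret setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde{O}(\sqrt{r^3(S+K)T})$.
Paper reference translation (metadata) (2023-06-22T17:20:30Z)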
Asymptotically Optimal Pure Exploration for Infinite-Armed Bandits [4.811176167998627]
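We study pure exploration with infinitely many bandit arms drawn from an unknown distribution. Our goal is to efficiently select a single high-quality arm whose mean reward is, with probability $1-\delta$, within $\varepsilon$ of being among the top $\eta$-fraction of arms.
Paper reference translation (metadata) (2023-06-03T04:00:47Z)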
Sparse Recovery with Shuffled Labels: Statistical Limits and Practical Estimators [23.313461266708877]
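We reconstruct the permutation matrix $\mathbf{\Pi}^{\mathrm{true}}$ and the sparse signal $\boldsymbol{\beta}^{\mathrm{true}}$ from shuffled labels. We show that, under mild conditions, the proposed estimator recovers the ground truth $(\mathbf{\Pi}^{\mathrm{true}}, \mathrm{supp}(\boldsymbol{\beta}^{\mathrm{true}}))$.
Paper reference translation (metadata) (2023-03-20T16:14:58Z)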
Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
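This work studies the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov decision processes (MDPs) under a constraint on the number of batches. We design an algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms in $(S,A,H,K)$. The technical contributions are twofold: 1) a near-optimal design scheme for exploration ...
Paper reference translation (metadata) (2022-10-15T09:22:22Z)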
Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP [46.86114958340962]
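We propose a regret-minimization algorithm for contextual MDPs under a minimum reachability assumption. Our approach is the first optimistic approach applied to contextual MDPs with general function approximation.
Paper reference translation (metadata) (2022-07-22T15:00:15Z)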
Threshold Phenomena in Learning Halfspaces with Massart Noise [56.01192577666607]
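We study the problem of PAC learning halfspaces on $\mathbb{R}^d$ with Massart noise under Gaussian marginals. Our results qualitatively characterize the complexity of learning halfspaces in the Massart model.
Paper reference translation (metadata) (2021-08-19T16:16:48Z)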
Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
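We propose an efficient algorithm for task-agnostic reinforcement learning. The algorithm requires only $\widetilde{\mathcal{O}}(1/\epsilon \cdot (H^3SA/\rho + H^4 S^2 A))$ episodes of exploration. We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H>1$.
Paper reference translation (metadata) (2021-08-11T20:42:46Z)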
Cascading Bandit under Differential Privacy [21.936577816668944]
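This work studies differential privacy (DP) and local differential privacy (LDP) in cascading bandits. Under DP, we propose an algorithm that guarantees $\epsilon$-indistinguishability and a regret of $\mathcal{O}(\frac{\log^3 T}{\epsilon})$ for an arbitrarily small $\xi$. Under ($\epsilon$, $\delta$)-LDP, we relax the $K^2$ dependence through the trade-off between the privacy budget and the error probability.
Paper reference translation (metadata) (2021-05-24T07:19:01Z)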
Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation [92.3161051419884]
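We study reinforcement learning for finite-horizon episodic Markov decision processes with adversarial rewards and full-information feedback. We show that $\tilde{O}(dH\sqrt{T})$ regret is achievable, where $H$ is the length of an episode. We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
Paper reference translation (metadata) (2021-02-17T18:54:08Z)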
Thresholded Lasso Bandit [70.17389393497125]
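Thresholded Lasso bandit is an algorithm that estimates the vector defining the reward function as well as its sparse support. We establish non-asymptotic regret upper bounds that scale as $\mathcal{O}(\log d + \sqrt{T})$ in general.
Paper reference translation (metadata) (2020-10-22T19:14:37Z)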
$Q$-learning with Logarithmic Regret [60.24952657636464]
Optimistic $Q$-learning enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
Paper reference translation (metadata) (2020-06-16T13:01:33Z)
Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed Payoffs [35.988644745703645]
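We analyze linear bandits with heavy-tailed payoffs, where the payoffs have finite moments only up to order $1+\epsilon$. We propose two new algorithms that satisfy a sublinear regret bound of $\widetilde{O}(d^{\frac{1}{2}}T^{\frac{1}{1+\epsilon}})$.
Paper reference translation (metadata) (2020-04-28T13:01:38Z)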

The related papers list is generated automatically from the titles and abstracts of the papers on this site.

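This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).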