Fugu-MT 論文翻訳(概要): Doubly robust Thompson sampling for linear payoffs

論文の概要: Doubly robust Thompson sampling for linear payoffs

arxiv url: http://arxiv.org/abs/2102.01229v3
Date: Sun, 30 Apr 2023 21:19:54 GMT
ステータス: 翻訳完了
システム内更新日: 2023-05-02 22:09:58.583125
Title: Doubly robust Thompson sampling for linear payoffs
Title（参考訳）: リニアペイオフに対する2重ロバストなトンプソンサンプリング
Authors: Wonyoung Kim, Gi-soo Kim, Myunghee Cho Paik
Abstract要約: 本稿では,Douubly Robust (DR) Thompson Sampling と呼ばれる新しいマルチアームコンテキスト帯域幅アルゴリズムを提案する。提案アルゴリズムは, 新たな補遺分解を許容し, $tildeO(phi-2sqrtT)$の順序で補遺を改良する。
参考スコア（独自算出の注目度）: 12.375561840897742
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A challenging aspect of the bandit problem is that a stochastic reward is observed only for the chosen arm and the rewards of other arms remain missing. The dependence of the arm choice on the past context and reward pairs compounds the complexity of regret analysis. We propose a novel multi-armed contextual bandit algorithm called Doubly Robust (DR) Thompson Sampling employing the doubly-robust estimator used in missing data literature to Thompson Sampling with contexts (\texttt{LinTS}). Different from previous works relying on missing data techniques (\citet{dimakopoulou2019balanced}, \citet{kim2019doubly}), the proposed algorithm is designed to allow a novel additive regret decomposition leading to an improved regret bound with the order of $\tilde{O}(\phi^{-2}\sqrt{T})$, where $\phi^2$ is the minimum eigenvalue of the covariance matrix of contexts. This is the first regret bound of \texttt{LinTS} using $\phi^2$ without the dimension of the context, $d$. Applying the relationship between $\phi^2$ and $d$, the regret bound of the proposed algorithm is $\tilde{O}(d\sqrt{T})$ in many practical scenarios, improving the bound of \texttt{LinTS} by a factor of $\sqrt{d}$. A benefit of the proposed method is that it utilizes all the context data, chosen or not chosen, thus allowing to circumvent the technical definition of unsaturated arms used in theoretical analysis of \texttt{LinTS}. Empirical studies show the advantage of the proposed algorithm over \texttt{LinTS}.
Abstract（参考訳）: バンドイット問題における挑戦的な側面は、選択された腕のみに確率的な報酬が観察され、他の腕の報酬が失われることである。 arm選択が過去の状況と報酬対に依存することは、後悔分析の複雑さを複雑にする。本稿では,不足データ文学において用いられる二重ロバスト推定器を用いて,文脈付きトンプソンサンプリング(\texttt{lints})を用いた,dubly robust (dr) thompson samplingと呼ばれる新しいマルチアーム付きコンテキストバンディットアルゴリズムを提案する。不足するデータ技術に依存する以前の著作と異なり(\citet{dimakopoulou 2019balanced}, \citet{kim2019doubly})、提案されたアルゴリズムは、文脈の共分散行列の最小固有値である$\tilde{o}(\phi^{-2}\sqrt{t})$という順序で結びついた、新しい加法的な後悔分解を可能にするように設計されている。これは、コンテキストの次元を持たない$\phi^2$ を使用した \textt{lints} の最初の後悔値である。 $\phi^2$ と $d$ の関係を適用すると、提案アルゴリズムの後悔境界は $\tilde{O}(d\sqrt{T})$ であり、多くの現実シナリオにおいて$\sqrt{d}$ の係数で \texttt{LinTS} の境界を改善する。提案手法の利点は、選択されるか選択されないかの全ての文脈データを利用することで、<texttt{LinTS} の理論解析に使用される不飽和アームの技術的定義を回避することができることである。実験的な研究は、提案アルゴリズムの利点を \texttt{LinTS} に対して示す。

関連論文リスト

Contextual Bandits for Unbounded Context Distributions [20.923880900127774]
非パラメトリックな文脈的包帯は、シーケンシャルな意思決定問題の重要なモデルである。 UCB探査と近接する2つの手法を提案する。本手法は, 限界条件が弱い場合に, 最小限の後悔を達成できることを示す。 2つ目の方法は、$tildeOleft(T1-frac(alpha+1)betaalpha+(d+2)beta+T1-betaright)$の期待された後悔を達成する。
論文参考訳（メタデータ） (2024-08-19T02:30:37Z)
Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback [58.66941279460248]
人からのフィードバックから学ぶことは、大言語モデル(LLM)のような生成モデルを調整する上で重要な役割を果たす本稿では,このドメイン内のモデルについて考察する。-文脈的デュエルバンディット(contextual dueling bandits)と,正の選好ラベルを相手によって反転させることができる対向フィードバック(reversarial feedback)について考察する。本稿では,不確実性重み付き最大推定に基づく頑健なコンテキストデュエルバンドイット(RCDB)を提案する。
論文参考訳（メタデータ） (2024-04-16T17:59:55Z)
Feel-Good Thompson Sampling for Contextual Dueling Bandits [49.450050682705026]
FGTS.CDBという名前のトンプソンサンプリングアルゴリズムを提案する。われわれのアルゴリズムの核心は、デュエルバンディットに適した新しいFeel-Good探索用語である。我々のアルゴリズムは最小限の誤差、すなわち $tildemathcalO(dsqrt T)$, $d$ はモデル次元、$T$ は時間水平線である。
論文参考訳（メタデータ） (2024-04-09T04:45:18Z)
LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits [38.41164102066483]
本研究では、独立かつ同一に分散したコンテキストを持つ線形文脈帯域問題について考察する。提案アルゴリズムは、Tsallisエントロピーを持つFollow-The-Regularized-Leaderに基づいており、$alpha$-textual-Con (LC)-Tsallis-INFと呼ばれている。
論文参考訳（メタデータ） (2024-03-05T18:59:47Z)
Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
一般化線形モデル (GLM) フレームワークを用いて, citelu2021low で提案した一般化低ランク行列帯域問題について検討する。既存のアルゴリズムの計算不可能性と理論的制約を克服するため,まずG-ESTTフレームワークを提案する。 G-ESTT は $tildeO(sqrt(d_1+d_2)3/2Mr3/2T)$ bound of regret を達成でき、G-ESTS は $tildeO を達成できることを示す。
論文参考訳（メタデータ） (2024-01-14T14:14:19Z)
Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
本稿では, 一般化線形モデル(GLM)から, デュエルアームのバイナリ比較を生成するコンテキストデュエルバンド問題について検討する。本稿では,SupLinUCB型アルゴリズムを提案する。このアルゴリズムは,計算効率と分散を意識したリセットバウンド$tilde Obig(dsqrtsum_t=1Tsigma_t2 + dbig)$を提案する。我々の後悔は、比較が決定論的である場合の直感的な期待と自然に一致し、アルゴリズムは$tilde O(d)$ regretにのみ悩まされる。
論文参考訳（メタデータ） (2023-10-02T08:15:52Z)
Double Doubly Robust Thompson Sampling for Generalized Linear Contextual Bandits [8.508198765617198]
一般化線形報酬に$tildeO(sqrtkappa-1 phi T)$ regret over $T$ roundsを提案する。また、確率的マージン条件下では、$O(kappa-1 phi log (NT) log T)$ regret bound for $N$ arms も提供する。
論文参考訳（メタデータ） (2022-09-15T00:20:38Z)
Squeeze All: Novel Estimator and Self-Normalized Bound for Linear Contextual Bandits [18.971564419292893]
我々は、$O(sqrtdTlog T)$ regret boundで、$d$は文脈の次元、$T$は時間地平線であるような線形文脈的帯域幅アルゴリズムを提案する。提案アルゴリズムは,探索を明示的ランダム化により埋め込んだ新しい推定器を備える。
論文参考訳（メタデータ） (2022-06-11T02:43:17Z)
Breaking the $\sqrt{T}$ Barrier: Instance-Independent Logarithmic Regret in Stochastic Contextual Linear Bandits [10.127456032874978]
線形ペイオフを伴う文脈的包帯に対する対数的後悔(多元的後悔)を証明した。コンテキストは、$sqrtT$から$polylog(T)$への後悔を減らすのに役立ちます。
論文参考訳（メタデータ） (2022-05-19T23:41:46Z)
Stochastic Contextual Dueling Bandits under Linear Stochastic Transitivity Models [25.336599480692122]
我々は,コンテキスト情報を用いた決闘バンディット問題における後悔の最小化タスクについて検討する。本稿では,フィードバックプロセスの模倣に基づく計算効率のよいアルゴリズムである$texttCoLSTIM$を提案する。本実験は,CoLSTモデルの特殊事例に対する最先端アルゴリズムよりも優れていることを示す。
論文参考訳（メタデータ） (2022-02-09T17:44:19Z)
Thresholded Lasso Bandit [70.17389393497125]
Thresholded Lasso banditは、報酬関数を定義するベクトルとスパースサポートを推定するアルゴリズムである。一般には $mathcalO( log d + sqrtT )$ や $mathcalO( log d + sqrtT )$ としてスケールする非漸近的後悔の上界を確立する。
論文参考訳（メタデータ） (2020-10-22T19:14:37Z)
Stochastic Linear Bandits Robust to Adversarial Attacks [117.665995707568]
我々はロバスト位相除去アルゴリズムの2つの変種を提供し、その1つは$C$を知っており、もう1つはそうでない。いずれの変種も、倒壊しない場合には、それぞれ$C = 0$ となり、それぞれ追加の加法項が生じる。文脈的設定では、単純な欲求的アルゴリズムは、明示的な探索を行わず、C$を知らないにもかかわらず、ほぼ最適加法的後悔項で証明可能な堅牢性を示す。
論文参考訳（メタデータ） (2020-07-07T09:00:57Z)
Stochastic Bandits with Linear Constraints [69.757694218456]
制約付き文脈線形帯域設定について検討し、エージェントの目標は一連のポリシーを作成することである。楽観的悲観的線形帯域(OPLB)と呼ばれる,この問題に対する高信頼束縛アルゴリズムを提案する。
論文参考訳（メタデータ） (2020-06-17T22:32:19Z)
Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
我々は,損失予測の助けを借りて,文脈的包帯における学習を研究する。最適な後悔は$mathcalO(minsqrtT, sqrtmathcalETfrac13)$である。
論文参考訳（メタデータ） (2020-03-04T07:36:38Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。