Fugu-MT 論文翻訳(概要): Doubly-Adaptive Thompson Sampling for Multi-Armed and Contextual Bandits

論文の概要: Doubly-Adaptive Thompson Sampling for Multi-Armed and Contextual Bandits

arxiv url: http://arxiv.org/abs/2102.13202v1
Date: Thu, 25 Feb 2021 22:29:25 GMT
ステータス: 翻訳完了
システム内更新日: 2021-03-02 00:00:08.734578
Title: Doubly-Adaptive Thompson Sampling for Multi-Armed and Contextual Bandits
Title（参考訳）: マルチアーマドおよびコンテクスチュアルバンドのための二重適応トンプソンサンプリング
Authors: Maria Dimakopoulou, Zhimei Ren, Zhengyuan Zhou
Abstract要約: 本稿では,トンプソンサンプリングに基づくアルゴリズムの変種について,両腕の真の平均報酬に対する2倍頑健な推定器の項を適応的に再検討する。提案アルゴリズムは, 半合成実験における最適(最小)後悔率とその経験的評価に適合する。このアプローチは、適応データ収集とは別に、より多くのバイアス源が存在するコンテキスト的包帯に拡張する。
参考スコア（独自算出の注目度）: 28.504921333436833
License: http://creativecommons.org/licenses/by/4.0/
Abstract: To balance exploration and exploitation, multi-armed bandit algorithms need to conduct inference on the true mean reward of each arm in every time step using the data collected so far. However, the history of arms and rewards observed up to that time step is adaptively collected and there are known challenges in conducting inference with non-iid data. In particular, sample averages, which play a prominent role in traditional upper confidence bound algorithms and traditional Thompson sampling algorithms, are neither unbiased nor asymptotically normal. We propose a variant of a Thompson sampling based algorithm that leverages recent advances in the causal inference literature and adaptively re-weighs the terms of a doubly robust estimator on the true mean reward of each arm -- hence its name doubly-adaptive Thompson sampling. The regret of the proposed algorithm matches the optimal (minimax) regret rate and its empirical evaluation in a semi-synthetic experiment based on data from a randomized control trial of a web service is performed: we see that the proposed doubly-adaptive Thompson sampling has superior empirical performance to existing baselines in terms of cumulative regret and statistical power in identifying the best arm. Further, we extend this approach to contextual bandits, where there are more sources of bias present apart from the adaptive data collection -- such as the mismatch between the true data generating process and the reward model assumptions or the unequal representations of certain regions of the context space in initial stages of learning -- and propose the linear contextual doubly-adaptive Thompson sampling and the non-parametric contextual doubly-adaptive Thompson sampling extensions of our approach.
Abstract（参考訳）: 探索と搾取のバランスをとるために、マルチアームのバンディットアルゴリズムは、これまでに収集されたデータを使用して、各腕の真の平均報酬に関する推論を行う必要があります。しかし、その段階で観測された腕と報酬の歴史は適応的に収集され、非iidデータによる推論を行う上での課題が知られている。特に、従来の高信頼結合アルゴリズムや伝統的なトンプソンサンプリングアルゴリズムにおいて顕著な役割を果たすサンプル平均は、偏りなくも漸近的にも正常でもない。本稿では,Thompsonサンプリングに基づくアルゴリズムの変種を提案し,因果推論文献の最近の進歩を利用して,各アームの真の平均報酬に対する2倍堅牢な推定値の条件を適応的に再重み付けする。提案アルゴリズムの後悔は、Webサービスのランダム化制御試験のデータに基づく半合成実験において、最適(最小)後悔率とその経験的評価と一致し、提案した2倍適応型トンプソンサンプリングは、最適腕を特定する際の累積的後悔と統計的パワーの観点から、既存のベースラインよりも優れた経験的性能を有することを示す。 Further, we extend this approach to contextual bandits, where there are more sources of bias present apart from the adaptive data collection -- such as the mismatch between the true data generating process and the reward model assumptions or the unequal representations of certain regions of the context space in initial stages of learning -- and propose the linear contextual doubly-adaptive Thompson sampling and the non-parametric contextual doubly-adaptive Thompson sampling extensions of our approach.

関連論文リスト

Adaptive Data Augmentation for Thompson Sampling [4.441866681085518]
線形文脈的包帯において、その目的は累積報酬を最大化する行動を選択することである。トンプソンサンプリングは経験的にうまく機能するが、最適の後悔境界は達成しない。本稿では,線形文脈帯域に対するほぼ最小のトンプソンサンプリングを提案する。
論文参考訳（メタデータ） (2025-06-17T12:57:33Z)
BOTS: Batch Bayesian Optimization of Extended Thompson Sampling for Severely Episode-Limited RL Settings [11.008537121214104]
線形トンプソンサンプリングバンドイットを状態作用ユーティリティ関数に基づいた選択動作に拡張する。提案手法は, トータルリターンの観点から, 標準トンプソンサンプリングを著しく上回り得ることを示す。
論文参考訳（メタデータ） (2024-11-30T01:27:44Z)
Neural Dueling Bandits [58.90189511247936]
ニューラルネットワークを用いて、予め選択した腕の好みフィードバックを用いて報酬関数を推定する。次に、理論結果を二項フィードバックによる文脈的帯域幅問題に拡張し、それはそれ自体は自明な寄与ではない。
論文参考訳（メタデータ） (2024-07-24T09:23:22Z)
VITS : Variational Inference Thompson Sampling for contextual bandits [10.028119153832346]
我々は、文脈的帯域幅に対するトンプソンサンプリング(TS)アルゴリズムの変種を導入・解析する。ガウス変分推論に基づく新しいアルゴリズムであるValational Inference Thompson sample VITSを提案する。我々は,VITS が線形文脈帯域に対して従来の TS の次元とラウンド数で同じ順序のサブ線形後悔境界を達成することを示す。
論文参考訳（メタデータ） (2023-07-19T17:53:22Z)
Incentivizing Exploration with Linear Contexts and Combinatorial Actions [9.15749739027059]
インセンティブ付きバンディット探索では、腕の選択は推奨され、ベイズ的なインセンティブと互換性が求められる。最近の研究は、十分な初期サンプルを収集した後、人気のあるトンプソンサンプリングアルゴリズムがインセンティブ互換になる、という一定の独立性の仮定の下で示されている。線形包帯に対してこの結果の類似性を与え、そこでは前者の独立性を自然凸条件に置き換える。
論文参考訳（メタデータ） (2023-06-03T03:30:42Z)
Thompson Sampling with Virtual Helping Agents [0.0]
我々は、オンラインのシーケンシャルな意思決定の問題、すなわち、現在の知識を活用して即時パフォーマンスを最大化し、新しい情報を探索して長期的な利益を得るというトレードオフに対処する。本稿では,マルチアームバンディット問題に対する2つのアルゴリズムを提案し,累積的後悔に関する理論的境界を提供する。
論文参考訳（メタデータ） (2022-09-16T23:34:44Z)
Langevin Monte Carlo for Contextual Bandits [72.00524614312002]
Langevin Monte Carlo Thompson Sampling (LMC-TS) が提案されている。提案アルゴリズムは,文脈的帯域幅の特別な場合において,最高のトンプソンサンプリングアルゴリズムと同じサブ線形残差を達成できることを示す。
論文参考訳（メタデータ） (2022-06-22T17:58:23Z)
Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization [50.725191156128645]
トンプソンサンプリングのようなマルチアームバンディットアルゴリズムは適応的な実験を行うのに利用できる。統計的解析のための一様ランダム化の利点を組み合わせた2つのアルゴリズムを探索する2つのアーム実験のシミュレーションを提案する。
論文参考訳（メタデータ） (2021-12-15T22:11:58Z)
Local policy search with Bayesian optimization [73.0364959221845]
強化学習は、環境との相互作用によって最適な政策を見つけることを目的としている。局所探索のための政策勾配は、しばしばランダムな摂動から得られる。目的関数の確率モデルとその勾配を用いたアルゴリズムを開発する。
論文参考訳（メタデータ） (2021-06-22T16:07:02Z)
Analysis and Design of Thompson Sampling for Stochastic Partial Monitoring [91.22679787578438]
部分モニタリングのためのトンプソンサンプリングに基づく新しいアルゴリズムを提案する。局所可観測性を持つ問題の線形化変種に対して,新たなアルゴリズムが対数問題依存の擬似回帰$mathrmO(log T)$を達成することを証明した。
論文参考訳（メタデータ） (2020-06-17T05:48:33Z)
Ensemble Sampling [18.85309520133554]
本稿では,ニューラルネットワークのような複雑なモデルに直面した場合でも,トラクタビリティを維持しつつ,トンプソンサンプリングを近似するアンサンブルサンプリングを開発する。我々は、このアプローチを支持する理論的基盤を確立し、さらなる洞察を提供する計算結果を示す。
論文参考訳（メタデータ） (2017-05-20T19:36:36Z)
The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime [52.38455827779212]
エミュレータと呼ばれる適応サンプリングを解析するための新しい手法を提案する。適切なログファクタを組み込んだトップk問題の最初のインスタンスベースの下位境界を証明します。我々の新しい分析は、後者の問題に対するこの種の最初のエミュレータであるベストアームとトップkの識別に、シンプルでほぼ最適であることを示した。
論文参考訳（メタデータ） (2017-02-16T23:42:02Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。