Fugu-MT Paper Translation (Summary): Rethinking Langevin Thompson Sampling from A Stochastic Approximation Perspective

Paper summary: Rethinking Langevin Thompson Sampling from A Stochastic Approximation Perspective

arxiv url: http://arxiv.org/abs/2510.05023v1
Date: Mon, 06 Oct 2025 17:01:29 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.999796
Title: Rethinking Langevin Thompson Sampling from A Stochastic Approximation Perspective
Authors: Weixin Wang, Haoyang Zheng, Guang Lin, Wei Deng, Pan Xu
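Abstract summary: This paper introduces TS-SA, which incorporates stochastic approximation (SA) within the Thompson Sampling (TS) framework. In each round, TS-SA constructs a posterior approximation using only the most recent reward(s) and applies an SA step to average noisy proposals over time. This can be interpreted as approximating a stationary posterior target throughout the entire algorithm.
Reference score (independently computed attention score): 17.53150194998013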
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Most existing approximate Thompson Sampling (TS) algorithms for multi-armed bandits use Stochastic Gradient Langevin Dynamics (SGLD) or its variants in each round to sample from the posterior, relaxing the need for conjugacy assumptions between priors and reward distributions in vanilla TS. However, they often require approximating a different posterior distribution in different rounds of the bandit problem. This requires tricky, round-specific tuning of hyperparameters such as dynamic learning rates, causing challenges in both theoretical analysis and practical implementation. To alleviate this non-stationarity, we introduce TS-SA, which incorporates stochastic approximation (SA) within the TS framework. In each round, TS-SA constructs a posterior approximation using only the most recent reward(s), performs a Langevin Monte Carlo (LMC) update, and applies an SA step to average noisy proposals over time. This can be interpreted as approximating a stationary posterior target throughout the entire algorithm, which further yields a fixed step size, a unified convergence analysis framework, and improved posterior estimates through temporal averaging. We establish near-optimal regret bounds for TS-SA, with a simplified and more intuitive theoretical analysis enabled by interpreting the entire algorithm as a simulation of a stationary SGLD process. Our empirical results demonstrate that even a single-step Langevin update with a certain warm-up outperforms existing methods substantially on bandit tasks.
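The per-round update described in the abstract (a Langevin step driven by the most recent reward, followed by an SA averaging step with a fixed step size) can be illustrated for a single arm. The following is a minimal sketch under loudly stated assumptions, not the authors' algorithm: the quadratic potential 0.5 * (x - reward)^2, the 1/n averaging weight, the `temperature` parameter, and the names `lmc_sa_update` and `demo` are all hypothetical choices made for illustration, and arm selection is omitted entirely.

```python
import numpy as np


def lmc_sa_update(x, theta, reward, n, step=0.1, temperature=1.0, rng=None):
    """One illustrative round for a single arm (assumed form, not from the paper).

    x      : current Langevin iterate (the noisy proposal)
    theta  : current SA (running-average) estimate
    reward : most recent reward observed for this arm
    n      : number of updates applied to this arm so far, counting this one
    """
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: the potential uses only the latest reward, U(x) = 0.5 * (x - reward)^2.
    grad = x - reward
    # Langevin Monte Carlo step with a fixed step size.
    noise = rng.standard_normal()
    x_new = x - step * grad + np.sqrt(2.0 * step * temperature) * noise
    # Stochastic-approximation step: running average of the noisy proposals (weight 1/n).
    theta_new = theta + (x_new - theta) / n
    return x_new, theta_new


def demo(true_mean=0.7, rounds=5000, seed=0):
    """Feed unit-variance Gaussian rewards to one arm and return the SA estimate."""
    rng = np.random.default_rng(seed)
    x, theta = 0.0, 0.0
    for n in range(1, rounds + 1):
        reward = true_mean + rng.standard_normal()
        x, theta = lmc_sa_update(x, theta, reward, n, rng=rng)
    return theta


if __name__ == "__main__":
    print("SA estimate of the arm mean:", demo())
```

In this sketch the Langevin iterate `x` keeps fluctuating because the step size never decays, while the averaged value `theta` settles near the arm's mean reward. That is one simple way to picture the temporal averaging mentioned in the abstract.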
