Fugu-MT Paper Translation (Abstract): Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift

Paper summary: Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift

arxiv url: http://arxiv.org/abs/2605.10289v1
Date: Mon, 11 May 2026 09:50:58 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.707891
Title: Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift
Title (reference translation): Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift
Authors: Bochao Li, Yao Fu, Wei Chen, Fang Kong
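Abstract summary: A central challenge in offline-to-online learning is the distribution shift between offline and online data. This paper proposes a novel median-based anchoring rule that defines the arm index as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean. The authors establish theoretical guarantees showing that the proposed algorithm safely leverages offline data to accelerate online learning.
Reference score (independently computed attention measure): 24.048629084196904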
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Offline-to-online learning aims to improve online decision-making by leveraging offline logged data. A central challenge in this setting is the distribution shift between offline and online environments. While some existing works attempt to leverage shifted offline data, they largely rely on UCB-type algorithms. Thompson sampling (TS) represents another canonical class of bandit algorithms, well known for its strong empirical performance and naturally suited to offline-to-online learning through its Bayesian formulation. However, unlike UCB indices, posterior samples in TS are not guaranteed to be optimistic with respect to the true arm means. This makes indices constructed from purely online and hybrid data difficult to compare and complicates their use. To address this issue, we propose sample-mean anchored TS (Anchor-TS), which introduces a novel median-based anchoring rule that defines the arm index as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean. The median anchoring systematically corrects bias induced by distribution shift by mitigating over-estimation for suboptimal arms and under-estimation for optimal arms, while exploiting offline information to obtain more accurate estimates when the shift is small. We establish theoretical guarantees showing that the proposed algorithm safely leverages offline data to accelerate online learning, and quantifying how the degree of distribution shift and the size of offline data affect the resulting regret reduction. Extensive experiments demonstrate consistent improvements of our algorithm over baselines.
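Abstract (reference translation): Offline-to-online learning aims to improve online decision-making by making use of offline logged data. The central challenge in this setting is the distribution shift between the offline and online environments. Some existing works try to exploit shifted offline data, but they rely mainly on UCB-type algorithms. Thompson sampling (TS) is another standard class of bandit algorithms, known for its strong empirical performance and naturally suited to offline-to-online learning through its Bayesian formulation. Unlike UCB indices, however, posterior samples in TS are not guaranteed to be optimistic with respect to the true arm means. As a result, indices built from purely online data and from hybrid data are hard to compare, which complicates their use. To address this, the paper proposes sample-mean anchored TS (Anchor-TS), which introduces a new median-based anchoring rule defining the arm index as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean. Median anchoring systematically corrects the bias caused by distribution shift by mitigating over-estimation for suboptimal arms and under-estimation for optimal arms, while using offline information to obtain more accurate estimates when the shift is small. The authors give theoretical guarantees that the proposed algorithm safely leverages offline data to accelerate online learning, and quantify how the degree of distribution shift and the size of the offline data affect the resulting regret reduction. Extensive experiments show consistent improvements over baselines.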
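
The median-based anchoring rule described in the abstract can be illustrated with a minimal sketch. Only the rule itself (the index is the median of an online posterior sample, a hybrid posterior sample, and the online sample mean) comes from the abstract. Everything else here is an assumption made for illustration: the Gaussian posteriors with unit-variance rewards, the naive pooling of offline and online rewards to form the hybrid posterior, and the function names `anchor_index` and `sample_index`. The paper's actual posterior construction may differ.

```python
import random
import statistics


def anchor_index(online_sample, hybrid_sample, online_mean):
    """Median of the three quantities named in the abstract."""
    return statistics.median([online_sample, hybrid_sample, online_mean])


def sample_index(online_rewards, offline_rewards, rng):
    """Draw one anchored index for a single arm (illustrative sketch only).

    Requires at least one online reward for the arm.
    """
    n_online = len(online_rewards)
    online_mean = sum(online_rewards) / n_online

    # Assumption: Gaussian posterior, unit-variance rewards, flat prior.
    online_sample = rng.gauss(online_mean, 1.0 / n_online ** 0.5)

    # Assumption: the hybrid posterior naively pools offline and online rewards.
    pooled = list(online_rewards) + list(offline_rewards)
    hybrid_mean = sum(pooled) / len(pooled)
    hybrid_sample = rng.gauss(hybrid_mean, 1.0 / len(pooled) ** 0.5)

    return anchor_index(online_sample, hybrid_sample, online_mean)


if __name__ == "__main__":
    rng = random.Random(0)
    online = [0.0, 1.0, 0.0, 1.0]   # hypothetical online rewards
    offline = [100.0] * 1000        # hypothetical, heavily shifted offline log
    print(sample_index(online, offline, rng))
```

Because the median of three numbers always lies between any two of them, the index in this sketch never moves farther from the online sample mean than the online posterior sample does, however badly shifted the offline data is. When the hybrid sample falls between the other two quantities, it becomes the index, which is one way to read the abstract's claim that offline information is exploited when the shift is small.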

Related papers