Fugu-MT Paper Translation (Abstract): Design Experiments to Compare Multi-armed Bandit Algorithms

Paper overview: Design Experiments to Compare Multi-armed Bandit Algorithms

arxiv url: http://arxiv.org/abs/2603.05919v1
Date: Fri, 06 Mar 2026 05:17:47 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.103302
Title: Design Experiments to Compare Multi-armed Bandit Algorithms
Title (reference translation): Designing Experiments to Compare Multi-armed Bandit Algorithms
Authors: Huiling Meng, Ningyuan Chen, Xuefeng Gao,
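Abstract summary: Online platforms routinely compare multi-armed bandit algorithms, such as UCB and Thompson Sampling, to select the best-performing policy. Unlike standard A/B tests for static treatments, each run of a bandit algorithm over $T$ users produces only one dependent trajectory. This paper proposes Artificial Replay (AR) as a new experimental design for this problem.
Reference score (independently computed attention score): 6.741852800770004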
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Online platforms routinely compare multi-armed bandit algorithms, such as UCB and Thompson Sampling, to select the best-performing policy. Unlike standard A/B tests for static treatments, each run of a bandit algorithm over $T$ users produces only one dependent trajectory, because the algorithm's decisions depend on all past interactions. Reliable inference therefore demands many independent restarts of the algorithm, making experimentation costly and delaying deployment decisions. We propose Artificial Replay (AR) as a new experimental design for this problem. AR first runs one policy and records its trajectory. When the second policy is executed, it reuses a recorded reward whenever it selects an action the first policy already took, and queries the real environment only otherwise. We develop a new analytical framework for this design and prove three key properties of the resulting estimator: it is unbiased; it requires only $T + o(T)$ user interactions instead of $2T$ for a run of the treatment and control policies, nearly halving the experimental cost when both policies have sub-linear regret; and its variance grows sub-linearly in $T$, whereas the estimator from a naïve design has a linearly-growing variance. Numerical experiments with UCB, Thompson Sampling, and $ε$-greedy policies confirm these theoretical gains.
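Abstract (reference translation): Online platforms routinely compare multi-armed bandit algorithms, such as UCB and Thompson Sampling, to select the best-performing policy. Unlike standard A/B tests for static treatments, each run of a bandit algorithm over $T$ users produces only one dependent trajectory, because the algorithm's decisions depend on all past interactions. Reliable inference therefore requires many independent restarts of the algorithm, which makes experimentation costly and delays deployment decisions. This paper proposes Artificial Replay (AR) as a new experimental design for this problem. AR first runs one policy and records its trajectory. When the second policy is executed, it reuses a recorded reward whenever it selects an action the first policy already took, and queries the real environment only otherwise. The authors develop a new analytical framework for this design and prove three key properties of the resulting estimator: it is unbiased; it requires only $T + o(T)$ user interactions, rather than $2T$, for a run of the treatment and control policies, nearly halving the experimental cost when both policies have sub-linear regret; and its variance grows sub-linearly in $T$, whereas the estimator from a naive design has linearly growing variance. Numerical experiments with UCB, Thompson Sampling, and $ε$-greedy policies confirm these theoretical gains.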
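
The replay step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `BernoulliBandit`, `UCB1`, `EpsilonGreedy`, and `artificial_replay` are hypothetical, the Bernoulli reward model is a toy choice, and the rule that each recorded reward is reused at most once, in the order it was recorded, is an assumption the abstract does not spell out.

```python
import math
import random
from collections import deque


class BernoulliBandit:
    """Toy environment (assumption): arm k pays 1 with probability means[k]."""

    def __init__(self, means, seed=0):
        self.means = means
        self.rng = random.Random(seed)
        self.queries = 0  # number of real user interactions consumed

    def pull(self, arm):
        self.queries += 1
        return 1.0 if self.rng.random() < self.means[arm] else 0.0


class UCB1:
    """Standard UCB1 index policy."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def select(self, t):
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm  # play every arm once before using the index
        return max(
            range(len(self.counts)),
            key=lambda a: self.sums[a] / self.counts[a]
            + math.sqrt(2 * math.log(t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


class EpsilonGreedy:
    """Epsilon-greedy policy with its own random source."""

    def __init__(self, n_arms, epsilon=0.1, seed=1):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self, t):
        # Explore at random until every arm has data, then with prob. epsilon.
        if 0 in self.counts or self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.counts)),
                   key=lambda a: self.sums[a] / self.counts[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


def artificial_replay(env, first, second, horizon):
    """Run `first` on the real environment, then `second` with replay."""
    recorded = [deque() for _ in env.means]  # per-arm rewards seen by `first`
    total_first = 0.0
    for t in range(1, horizon + 1):
        arm = first.select(t)
        reward = env.pull(arm)
        first.update(arm, reward)
        recorded[arm].append(reward)
        total_first += reward

    total_second = 0.0
    for t in range(1, horizon + 1):
        arm = second.select(t)
        if recorded[arm]:
            reward = recorded[arm].popleft()  # reuse a recorded reward
        else:
            reward = env.pull(arm)  # no unused record left: query for real
        second.update(arm, reward)
        total_second += reward
    return total_first, total_second, env.queries


env = BernoulliBandit([0.3, 0.5, 0.7], seed=0)
r1, r2, queries = artificial_replay(env, UCB1(3), EpsilonGreedy(3), horizon=2000)
print(f"cumulative rewards: {r1:.0f} vs {r2:.0f}; real queries: {queries} of {2 * 2000}")
```

In this sketch the number of real queries always lies between `horizon` and `2 * horizon`. It moves toward the lower end when the two policies concentrate on the same arms, which is the intuition behind the $T + o(T)$ cost stated in the abstract.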
