Fugu-MT Paper Translation (Abstract): Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards

Paper summary: Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards

arxiv url: http://arxiv.org/abs/2506.16658v1
Date: Fri, 20 Jun 2025 00:09:39 GMT
Status: Translation complete
Last updated in system: 2025-06-23 19:00:05.295833
Title: Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards
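Title (reference translation): Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards
Authors: Wenlong Ji, Yihan Pan, Ruihao Zhu, Lihua Lei
Abstract summary: Multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. We introduce a new setting for MAB where pre-trained machine learning (ML) models are applied to convert side information and historical data into surrogate rewards. A prominent feature of this setting is that the surrogate rewards may exhibit substantial bias, as true reward data is typically unavailable in the offline phase.
Reference score (independently computed attention score): 4.12484724941528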
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Traditional bandit algorithms rely solely on online data, which tends to be scarce as it must be gathered during the online phase when the arms are actively pulled. However, in many practical settings, rich auxiliary data, such as covariates of past users, is available prior to deploying any arms. We introduce a new setting for MAB where pre-trained machine learning (ML) models are applied to convert side information and historical data into surrogate rewards. A prominent feature of this setting is that the surrogate rewards may exhibit substantial bias, as true reward data is typically unavailable in the offline phase, forcing ML predictions to heavily rely on extrapolation. To address the issue, we propose the Machine Learning-Assisted Upper Confidence Bound (MLA-UCB) algorithm, which can be applied to any reward prediction model and any form of auxiliary data. When the predicted and true rewards are jointly Gaussian, it provably improves the cumulative regret, provided that the correlation is non-zero, even in cases where the mean surrogate reward completely misaligns with the true mean rewards. Notably, our method requires no prior knowledge of the covariance matrix between true and surrogate rewards. We compare MLA-UCB with the standard UCB on a range of numerical studies and show a sizable efficiency gain even when the size of the offline data and the correlation between predicted and true rewards are moderate.
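Abstract (reference translation): Multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Traditional bandit algorithms rely only on online data, which tends to be scarce because it must be collected during the online phase while the arms are actively pulled. However, in many practical settings, rich auxiliary data, such as covariates of past users, is available before any arms are deployed. We propose a new setting for MAB in which pre-trained machine learning (ML) models are applied to convert side information and historical data into surrogate rewards. A prominent feature of this setting is that the surrogate rewards may exhibit substantial bias, because true reward data is typically unavailable in the offline phase and ML predictions are forced to rely heavily on extrapolation. To address this issue, we propose the Machine Learning-Assisted Upper Confidence Bound (MLA-UCB) algorithm. When the predicted and true rewards are jointly Gaussian, it provably improves the cumulative regret, provided that the correlation is non-zero. Notably, the method requires no prior knowledge of the covariance matrix between true and surrogate rewards. We compare MLA-UCB with standard UCB and show a sizable efficiency gain even when the size of the offline data and the correlation between predicted and true rewards are moderate.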
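
Illustrative sketch (not from the paper): the abstract states that a surrogate reward can help even when its mean is badly misaligned with the true mean, as long as the two are correlated. One standard way to see why is a control-variate mean estimator, shown below in Python with NumPy. This is an assumption-laden illustration of the general idea, not the authors' MLA-UCB algorithm, whose details are not given in the abstract. The function names `control_variate_mean` and `simulate`, the sample sizes, and the simulated means are all hypothetical.

```python
import numpy as np


def control_variate_mean(y_online, s_online, s_offline):
    """Estimate E[Y] from a few online (Y, S) pairs plus many offline S values.

    Only the difference between the online and offline surrogate means is
    used, so a constant bias in the surrogate cancels out.
    """
    y_online = np.asarray(y_online, dtype=float)
    s_online = np.asarray(s_online, dtype=float)
    s_offline = np.asarray(s_offline, dtype=float)
    var_s = s_online.var(ddof=1)
    # Estimated regression coefficient of Y on S (no prior covariance needed).
    beta = 0.0 if var_s == 0 else np.cov(y_online, s_online, ddof=1)[0, 1] / var_s
    return y_online.mean() - beta * (s_online.mean() - s_offline.mean())


def simulate(n_trials=2000, n_online=30, n_offline=5000, rho=0.8, seed=0):
    """Compare mean squared error of the plain mean and the control-variate mean.

    Assumed setup: true mean 1.0, surrogate mean 5.0 (heavily biased),
    jointly Gaussian rewards with correlation rho.
    """
    rng = np.random.default_rng(seed)
    true_mean, surrogate_mean = 1.0, 5.0
    err_plain, err_cv = [], []
    for _ in range(n_trials):
        z1 = rng.standard_normal(n_online)
        z2 = rng.standard_normal(n_online)
        y = true_mean + z1
        s = surrogate_mean + rho * z1 + np.sqrt(1 - rho**2) * z2
        s_off = surrogate_mean + rng.standard_normal(n_offline)
        err_plain.append((y.mean() - true_mean) ** 2)
        err_cv.append((control_variate_mean(y, s, s_off) - true_mean) ** 2)
    return float(np.mean(err_plain)), float(np.mean(err_cv))


if __name__ == "__main__":
    mse_plain, mse_cv = simulate()
    print(f"plain mean MSE: {mse_plain:.4f}, control-variate MSE: {mse_cv:.4f}")
```

In this sketch the surrogate mean (5.0) is far from the true mean (1.0), yet the estimator stays centered on the true mean because the surrogate enters only through a difference of two surrogate averages. A tighter mean estimate would in turn allow a narrower confidence bound inside a UCB-style rule, which is the intuition behind the efficiency gain the abstract reports.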


Related papers