Fugu-MT paper translation (abstract): An LP-based Sampling Policy for Multi-Armed Bandits with Side-Observations and Stochastic Availability

Paper overview: An LP-based Sampling Policy for Multi-Armed Bandits with Side-Observations and Stochastic Availability

arxiv url: http://arxiv.org/abs/2603.26647v1
Date: Fri, 27 Mar 2026 17:50:42 GMT
Status: translation complete
Last updated in system: 2026-03-30 21:49:48.625166
Title: An LP-based Sampling Policy for Multi-Armed Bandits with Side-Observations and Stochastic Availability
Title (reference translation): An LP-based sampling policy for multi-armed bandits with side-observations and stochastic availability
Authors: Ashutosh Soni, Peizhong Ju, Atilla Eryilmaz, Ness B. Shroff
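Abstract summary: This paper studies a multi-armed bandit (MAB) problem in which a network structure enables side-observations across related actions. A bipartite graph links actions to a set of unknowns, so that selecting an action reveals observations for all the unknowns it is connected to.
Reference score (independently computed attention metric): 40.54677625701658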
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the stochastic multi-armed bandit (MAB) problem where an underlying network structure enables side-observations across related actions. We use a bipartite graph to link actions to a set of unknowns, such that selecting an action reveals observations for all the unknowns it is connected to. While previous works rely on the assumption that all actions are permanently accessible, we investigate the more practical setting of stochastic availability, where the set of feasible actions (the "activation set") varies dynamically in each round. This framework models real-world systems with both structural dependencies and volatility, such as social networks where users provide side-information about their peers' preferences, yet are not always online to be queried. To address this challenge, we propose UCB-LP-A, a novel policy that leverages a Linear Programming (LP) approach to optimize exploration-exploitation trade-offs under stochastic availability. Unlike standard network bandit algorithms that assume constant access, UCB-LP-A computes an optimal sampling distribution over the realizable activation sets, ensuring that the necessary observations are gathered using only the currently active arms. We derive a theoretical upper bound on the regret of our policy, characterizing the impact of both the network structure and the activation probabilities. Finally, we demonstrate through numerical simulations that UCB-LP-A significantly outperforms existing heuristics that ignore either the side-information or the availability constraints.
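Abstract (reference translation): This paper studies the stochastic multi-armed bandit (MAB) problem in which a network structure enables side-observations across related actions. We use a bipartite graph to link actions to a set of unknowns, so that selecting an action reveals observations for all the unknowns it is connected to. Previous work relies on the assumption that all actions are permanently accessible; we instead examine the more practical setting of stochastic availability, in which the set of feasible actions (the "activation set") changes dynamically in each round. This framework models real-world systems that have both structural dependencies and volatility, such as social networks. To address this challenge, we propose UCB-LP-A, a new policy that uses a linear programming (LP) approach to optimize the exploration-exploitation trade-off under stochastic availability. Unlike standard network bandit algorithms that assume constant access, UCB-LP-A computes an optimal sampling distribution over the realizable activation sets and gathers the necessary observations using only the currently active arms. We derive a theoretical upper bound on the regret of the policy that characterizes the impact of both the network structure and the activation probabilities. Finally, we show that UCB-LP-A outperforms existing heuristics that ignore either the side-information or the availability constraints.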
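
The observation model described in the abstract (a bipartite graph from actions to unknowns, with a randomly drawn activation set each round) can be illustrated with a minimal Python sketch. This is not the authors' UCB-LP-A policy and it solves no linear program: the graph, the availability probabilities, the Bernoulli means, the independence of availability across actions, and the selection rule are all hypothetical placeholders chosen only to show how one play yields side-observations.

```python
import random

# Hypothetical bipartite graph: action -> unknowns it reveals when played.
GRAPH = {
    "a0": {"u0", "u1"},
    "a1": {"u1", "u2"},
    "a2": {"u2"},
}
# Hypothetical per-round availability probability of each action
# (assumed independent across actions for this sketch).
AVAILABILITY = {"a0": 0.5, "a1": 0.9, "a2": 0.8}
# Hypothetical Bernoulli means of the unknowns.
MEANS = {"u0": 0.3, "u1": 0.6, "u2": 0.8}


def draw_activation_set(rng):
    """Sample which actions are available in this round."""
    return [a for a, p in AVAILABILITY.items() if rng.random() < p]


def play(action, rng, counts, sums):
    """Play one action and record an observation for every linked unknown."""
    for u in GRAPH[action]:
        reward = 1.0 if rng.random() < MEANS[u] else 0.0
        counts[u] += 1
        sums[u] += reward


def simulate(rounds, seed=0):
    """Run the toy model and return per-unknown observation counts and sums."""
    rng = random.Random(seed)
    counts = {u: 0 for u in MEANS}
    sums = {u: 0.0 for u in MEANS}
    for _ in range(rounds):
        active = draw_activation_set(rng)
        if not active:
            continue  # no action is available this round
        # Placeholder rule (NOT UCB-LP-A): among available actions, pick the
        # one whose least-observed linked unknown has the fewest observations.
        action = min(active, key=lambda a: min(counts[u] for u in GRAPH[a]))
        play(action, rng, counts, sums)
    return counts, sums


if __name__ == "__main__":
    counts, sums = simulate(200)
    for u in sorted(counts):
        mean = sums[u] / counts[u] if counts[u] else float("nan")
        print(u, counts[u], round(mean, 3))
```

Because each play updates every unknown attached to the chosen action, one round can add up to two observations in this toy graph, which is the side-observation effect the abstract refers to. A policy in the spirit of the paper would replace the placeholder rule with a sampling distribution computed from an LP over the activation sets.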


Related papers