Fugu-MT Paper Translation (Abstract): Multi-Armed Bandits With Best-Action Queries

Paper summary: Multi-Armed Bandits With Best-Action Queries

arxiv url: http://arxiv.org/abs/2605.08287v1
Date: Fri, 08 May 2026 08:14:50 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.532654
Title: Multi-Armed Bandits With Best-Action Queries
Authors: Francesco Bacchiocchi, Matteo Castiglioni, Alberto Marchesi, Francesco Emanuele Stradi,
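Abstract summary: A study of \emph{multi-armed bandits} (MABs) augmented with \emph{best-action queries}. In the full-feedback model, $k$ best-action queries reduce the optimal $\widetilde{\mathcal{O}}(\sqrt{T})$ regret to $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T}\})$.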
Reference score (independently computed attention measure): 29.740898640511336
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study \emph{multi-armed bandits} (MABs) augmented with \emph{best-action queries}, in which the learner may additionally query an oracle that reveals the best arm in the current round. This setting was recently characterized by Russo et al. [2024] in the \emph{full-feedback} model, where the learner observes the rewards of all arms after each round. They show that, in both \emph{stochastic} and \emph{adversarial} environments, $k$ best-action queries reduce the optimal $\widetilde{\mathcal{O}}(\sqrt{T})$ regret to $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T}\})$. Whether this improvement extends to the more realistic \emph{bandit-feedback} model -- where the learner observes only the reward of the played arm -- was left as an open problem. We fully resolve this question. When rewards are stochastic but correlated among arms, we show that the full-feedback result does not carry over: any algorithm must incur regret at least $\Omega(\sqrt{T-k})$. This lower bound directly extends to adversarial environments. On the positive side, we show that $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T-k}\})$ regret is still achievable when rewards are stochastic and i.i.d., and establish a matching lower bound, up to logarithmic factors. Together, these results provide a complete characterization of the benefits of \emph{best-action queries} in the \emph{bandit-feedback} model.
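To get a feel for how the rates quoted in the abstract compare, the following is a minimal Python sketch that evaluates them numerically. It is an illustration only, not the paper's algorithm: constants and logarithmic factors are dropped, and the function names (`full_feedback_rate`, `bandit_iid_rate`, `bandit_correlated_lower_rate`) are hypothetical labels introduced here. The case $k = 0$ is treated as "no queries", so the $T/k$ term is taken to be infinite.

```python
import math


def _check(T: int, k: int) -> None:
    # Assumption: the number of queries k cannot exceed the horizon T.
    if T <= 0 or k < 0 or k > T:
        raise ValueError("require T > 0 and 0 <= k <= T")


def _t_over_k(T: int, k: int) -> float:
    # With no queries the T/k term never binds, so treat it as infinite.
    return math.inf if k == 0 else T / k


def full_feedback_rate(T: int, k: int) -> float:
    """min{T/k, sqrt(T)}: full-feedback rate, constants and logs omitted."""
    _check(T, k)
    return min(_t_over_k(T, k), math.sqrt(T))


def bandit_iid_rate(T: int, k: int) -> float:
    """min{T/k, sqrt(T-k)}: bandit-feedback rate for i.i.d. stochastic rewards."""
    _check(T, k)
    return min(_t_over_k(T, k), math.sqrt(T - k))


def bandit_correlated_lower_rate(T: int, k: int) -> float:
    """sqrt(T-k): lower-bound rate when rewards are correlated among arms."""
    _check(T, k)
    return math.sqrt(T - k)


if __name__ == "__main__":
    T = 10_000
    for k in (0, 10, 100, 1_000, 5_000, 10_000):
        print(
            f"k={k:>6}  full={full_feedback_rate(T, k):8.2f}  "
            f"bandit_iid={bandit_iid_rate(T, k):8.2f}  "
            f"correlated_lower={bandit_correlated_lower_rate(T, k):8.2f}"
        )
```

In this sketch the $T/k$ term only becomes the smaller one once $k$ is roughly $\sqrt{T}$ or larger. In the correlated case the rate stays at $\sqrt{T-k}$ however large the $T/k$ saving would otherwise be, which is the gap between the two bandit-feedback results.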

Related papers