Fugu-MT 論文翻訳(概要): Contexts can be Cheap: Solving Stochastic Contextual Bandits with Linear Bandit Algorithms

論文の概要: Contexts can be Cheap: Solving Stochastic Contextual Bandits with Linear Bandit Algorithms

arxiv url: http://arxiv.org/abs/2211.05632v1
Date: Tue, 8 Nov 2022 22:18:53 GMT
ステータス: 翻訳完了
システム内更新日: 2022-11-11 14:39:45.748335
Title: Contexts can be Cheap: Solving Stochastic Contextual Bandits with Linear Bandit Algorithms
Title（参考訳）: コンテキストはチープになる:線形帯域アルゴリズムで確率的コンテキスト帯域を解く
Authors: Osama A. Hanna, Lin F. Yang, Christina Fragouli
Abstract要約: 我々は,意思決定者がコンテキストを提供するコンテキスト線形帯域問題に対処する。文脈問題を線形バンディット問題として解くことができることを示す。この結果から,文脈的線形包帯に対して$O(dsqrtTlog T)$高確率残差が生じることが示唆された。
参考スコア（独自算出の注目度）: 39.70492757288025
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we address the stochastic contextual linear bandit problem, where a decision maker is provided a context (a random set of actions drawn from a distribution). The expected reward of each action is specified by the inner product of the action and an unknown parameter. The goal is to design an algorithm that learns to play as close as possible to the unknown optimal policy after a number of action plays. This problem is considered more challenging than the linear bandit problem, which can be viewed as a contextual bandit problem with a \emph{fixed} context. Surprisingly, in this paper, we show that the stochastic contextual problem can be solved as if it is a linear bandit problem. In particular, we establish a novel reduction framework that converts every stochastic contextual linear bandit instance to a linear bandit instance, when the context distribution is known. When the context distribution is unknown, we establish an algorithm that reduces the stochastic contextual instance to a sequence of linear bandit instances with small misspecifications and achieves nearly the same worst-case regret bound as the algorithm that solves the misspecified linear bandit instances. As a consequence, our results imply a $O(d\sqrt{T\log T})$ high-probability regret bound for contextual linear bandits, making progress in resolving an open problem in (Li et al., 2019), (Li et al., 2021). Our reduction framework opens up a new way to approach stochastic contextual linear bandit problems, and enables improved regret bounds in a number of instances including the batch setting, contextual bandits with misspecifications, contextual bandits with sparse unknown parameters, and contextual bandits with adversarial corruption.
Abstract（参考訳）: 本稿では,意思決定者が文脈(分布から引き出されたランダムな動作の集合)を提供する確率的文脈線形バンディット問題に対処する。各アクションの期待される報酬は、アクションの内部積と未知のパラメータによって指定される。ゴールは、多くのアクションが実行された後、未知の最適ポリシーにできるだけ近くプレイすることを学ぶアルゴリズムを設計することである。この問題は線型バンディット問題よりも困難であると考えられており、これは文脈的バンディット問題として \emph{fixed} コンテキストとみなすことができる。驚くべきことに,本稿では,確率的文脈問題は線形バンディット問題であるかのように解くことができることを示した。特に、文脈分布が知られている場合、全ての確率的文脈線形バンドイットインスタンスを線形バンドイットインスタンスに変換する新しい還元フレームワークを確立する。文脈分布が不明な場合には、確率的文脈インスタンスを小さな誤特定を伴う線形バンディットインスタンス列に縮小し、誤特定された線形バンディットインスタンスを解決するアルゴリズムとほぼ同じ最悪の場合の後悔を実現できるアルゴリズムを確立する。その結果、文脈線形バンドイットに縛られた高い確率の後悔がo(d\sqrt{t\log t})$を示し、(li et al., 2019), (li et al., 2021) の未解決問題を解く過程が進行した。私たちの還元フレームワークは,確率的文脈線形バンディット問題へのアプローチの新たな方法を開き,バッチ設定,不特定化を伴うコンテキストバンディット,未知パラメータの少ないコンテキストバンディット,対向汚職を伴うコンテキストバンディットなど,多数のインスタンスにおける後悔境界の改善を可能にします。

関連論文リスト

Nearly Tight Bounds for Cross-Learning Contextual Bandits with Graphical Feedback [17.43748116766233]
グラフィカルフィードバックを用いたクロスラーニングの文脈的帯域幅問題に注目が集まっている。鍵となる理論的問題は、$widetildeO(sqrtalpha T)$ regret のアルゴリズムが存在するかどうかである。
論文参考訳（メタデータ） (2025-02-07T05:52:56Z)
A conversion theorem and minimax optimality for continuum contextual bandits [70.71582850199871]
本研究では,学習者が側情報ベクトルを逐次受信し,凸集合内の行動を選択する,文脈連続帯域幅問題について検討する。目標は、受信したコンテキストのすべての基盤関数を最小化することです。サブ線形の静的な後悔を達成するアルゴリズムを拡張して、サブ線形の文脈的後悔を実現することができることを示す。
論文参考訳（メタデータ） (2024-06-09T10:12:08Z)
On the Optimal Regret of Locally Private Linear Contextual Bandit [18.300225068036642]
局所的なプライベートな文脈的帯域に対して,$tilde O(sqrtT)$ regret upper bound を達成可能であることを示す。我々の解決策は、いくつかの新しいアルゴリズム的および分析的アイデアに依存している。
論文参考訳（メタデータ） (2024-04-15T02:00:24Z)
Feel-Good Thompson Sampling for Contextual Dueling Bandits [49.450050682705026]
FGTS.CDBという名前のトンプソンサンプリングアルゴリズムを提案する。われわれのアルゴリズムの核心は、デュエルバンディットに適した新しいFeel-Good探索用語である。我々のアルゴリズムは最小限の誤差、すなわち $tildemathcalO(dsqrt T)$, $d$ はモデル次元、$T$ は時間水平線である。
論文参考訳（メタデータ） (2024-04-09T04:45:18Z)
Optimal cross-learning for contextual bandits with unknown context distributions [28.087360479901978]
本稿では,バルセイロ等のクロスラーニング環境において,文脈的包括的アルゴリズムを設計する際の問題点について考察する。コンテクスト数によらずに$widetildeO(sqrtTK)$というほぼ厳密な(対数的要因まで)後悔境界を持つ効率的なアルゴリズムを提供する。アルゴリズムのコアとなるのは,複数のエポックにまたがるアルゴリズムの実行をコーディネートする新しい手法である。
論文参考訳（メタデータ） (2024-01-03T18:02:13Z)
Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes [59.61248760134937]
本稿では,$tildeO(sqrtT+zeta)$を後悔するアルゴリズムを提案する。提案アルゴリズムは、最近開発された線形文脈帯域からの不確実性重み付き最小二乗回帰に依存する。本稿では,提案アルゴリズムをエピソディックなMDP設定に一般化し,まず汚職レベル$zeta$への付加的依存を実現する。
論文参考訳（メタデータ） (2022-12-12T15:04:56Z)
Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
政策後悔は、適応的な敵に対してオンライン学習アルゴリズムのパフォーマンスを測定するという、よく確立された概念である。我々は,不完全な政策後悔を効果的に最小化できる敵の制限について検討する。我々は、$tildemathcalO(mKsqrtT)$の完全なポリシーを後悔するアルゴリズムを提供し、$tildemathcalO$表記は対数要素だけを隠す。
論文参考訳（メタデータ） (2022-04-24T03:10:27Z)
Stochastic Linear Bandits Robust to Adversarial Attacks [117.665995707568]
我々はロバスト位相除去アルゴリズムの2つの変種を提供し、その1つは$C$を知っており、もう1つはそうでない。いずれの変種も、倒壊しない場合には、それぞれ$C = 0$ となり、それぞれ追加の加法項が生じる。文脈的設定では、単純な欲求的アルゴリズムは、明示的な探索を行わず、C$を知らないにもかかわらず、ほぼ最適加法的後悔項で証明可能な堅牢性を示す。
論文参考訳（メタデータ） (2020-07-07T09:00:57Z)
Stochastic Bandits with Linear Constraints [69.757694218456]
制約付き文脈線形帯域設定について検討し、エージェントの目標は一連のポリシーを作成することである。楽観的悲観的線形帯域(OPLB)と呼ばれる,この問題に対する高信頼束縛アルゴリズムを提案する。
論文参考訳（メタデータ） (2020-06-17T22:32:19Z)
Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits under Realizability [18.45278329799526]
オフライン回帰オラクルへの$O(log T)$呼び出しだけで統計的に最適な後悔を実現する高速で単純なアルゴリズムを設計する。この結果は、文脈的帯域幅からオフライン回帰への最初の普遍的かつ最適の還元を提供する。
論文参考訳（メタデータ） (2020-03-28T04:16:52Z)
Contextual Blocking Bandits [35.235375147227124]
我々は,多腕バンディット問題の新たな変種について検討し,各ステップごとに,腕の平均報酬を決定する独立したサンプルコンテキストをプレイヤーが観察する。アームを再生することで(すべてのコンテキストにわたって)将来の時間ステップの固定および既知の回数をブロックする。我々は、$mathcalO(log T)$-regret w.r.t.$alpha$regret戦略を$Tタイムステップで保証し、$Omega(log(T)$low boundと一致する UCB ベースのフル情報アルゴリズムの変種を提案する。
論文参考訳（メタデータ） (2020-03-06T20:34:42Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。