Fugu-MT 論文翻訳(概要): Best-of-Both-Worlds Linear Contextual Bandits

論文の概要: Best-of-Both-Worlds Linear Contextual Bandits

arxiv url: http://arxiv.org/abs/2312.16489v1
Date: Wed, 27 Dec 2023 09:32:18 GMT
ステータス: 翻訳完了
システム内更新日: 2023-12-29 19:16:37.989725
Title: Best-of-Both-Worlds Linear Contextual Bandits
Title（参考訳）: 両世界の最良のリニア・コンテクスト・バンディット
Authors: Masahiro Kato and Shinji Ito
Abstract要約: 本研究は, 対向汚職下での多武装盗賊問題の事例である$K$腕線形文脈盗賊の問題を考察する。我々は,理論的保証のもと,双方の敵環境に有効な戦略を開発する。両体制の理論的保証から,我々の戦略をBest-of-Both-Worlds (BoBW) RealFTRLと呼んでいる。
参考スコア（独自算出の注目度）: 45.378265414553226
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This study investigates the problem of $K$-armed linear contextual bandits, an instance of the multi-armed bandit problem, under an adversarial corruption. At each round, a decision-maker observes an independent and identically distributed context and then selects an arm based on the context and past observations. After selecting an arm, the decision-maker incurs a loss corresponding to the selected arm. The decision-maker aims to minimize the cumulative loss over the trial. The goal of this study is to develop a strategy that is effective in both stochastic and adversarial environments, with theoretical guarantees. We first formulate the problem by introducing a novel setting of bandits with adversarial corruption, referred to as the contextual adversarial regime with a self-bounding constraint. We assume linear models for the relationship between the loss and the context. Then, we propose a strategy that extends the RealLinExp3 by Neu & Olkhovskaya (2020) and the Follow-The-Regularized-Leader (FTRL). The regret of our proposed algorithm is shown to be upper-bounded by $O\left(\min\left\{\frac{(\log(T))^3}{\Delta_{*}} + \sqrt{\frac{C(\log(T))^3}{\Delta_{*}}},\ \ \sqrt{T}(\log(T))^2\right\}\right)$, where $T \in\mathbb{N}$ is the number of rounds, $\Delta_{*} > 0$ is the constant minimum gap between the best and suboptimal arms for any context, and $C\in[0, T] $ is an adversarial corruption parameter. This regret upper bound implies $O\left(\frac{(\log(T))^3}{\Delta_{*}}\right)$ in a stochastic environment and by $O\left( \sqrt{T}(\log(T))^2\right)$ in an adversarial environment. We refer to our strategy as the Best-of-Both-Worlds (BoBW) RealFTRL, due to its theoretical guarantees in both stochastic and adversarial regimes.
Abstract（参考訳）: 本研究は, 対向汚職下での多武装盗賊問題の事例である$K$腕線形文脈盗賊の問題を考察する。各ラウンドにおいて、意思決定者は独立かつ同一に分散したコンテキストを観察し、そのコンテキストと過去の観察に基づいてarmを選択する。意思決定者は、アームを選択すると、選択されたアームに対応する損失を負う。意思決定者は裁判での累積損失を最小限に抑えることを目指している。本研究の目的は,理論的保証のある確率的・対角的両環境に有効な戦略を開発することである。まず, 自己拘束的制約を伴う文脈的逆境体制と呼ばれる, 逆境汚職を伴うバンディットの新たな設定を導入することで, 問題を定式化する。損失と文脈の関係を線形モデルとして仮定する。次に,Neu & Olkhovskaya (2020) と FTRL (Follow-The-Regularized-Leader) による RealLinExp3 の拡張戦略を提案する。 o\left(\min\left\{\frac{(\log(t))^3}{\delta_{*}} + \sqrt{\frac{c(\log(t))^3}{\delta_{*}}},\ \sqrt{t}(\log(t))^2\right\}\right)$, ここで$t \in\mathbb{n}$ はラウンド数、$\delta_{*} > 0$ は任意のコンテキストにおいて最善と亜最適のアームの間の一定最小ギャップであり、$c\in[0, t] $ は逆の腐敗パラメータである。この後悔の上限は、確率的な環境では$o\left(\frac{(\log(t))^3}{\delta_{*}}\right)$であり、敵対的な環境では$o\left( \sqrt{t}(\log(t))^2\right)$である。我々は、我々の戦略を「最高の両世界」(bobw)リアルフトルル(realftrl)と呼んでいる。

関連論文リスト

Follow-the-Perturbed-Leader Approaches Best-of-Both-Worlds for the m-Set Semi-Bandit Problems [20.47953854427799]
半帯域問題の場合、学習者は、合計$d$アームから$m$アームを正確に選択する。また, Fr'echet 摂動を持つFTPL は, 逆向き設定で $mathcalO(sqrtnmdlog(d))$ に近い最適リセットも享受できることを示す。
論文参考訳（メタデータ） (2025-04-09T22:07:01Z)
Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback [58.66941279460248]
人からのフィードバックから学ぶことは、大言語モデル(LLM)のような生成モデルを調整する上で重要な役割を果たす本稿では,このドメイン内のモデルについて考察する。-文脈的デュエルバンディット(contextual dueling bandits)と,正の選好ラベルを相手によって反転させることができる対向フィードバック(reversarial feedback)について考察する。本稿では,不確実性重み付き最大推定に基づく頑健なコンテキストデュエルバンドイット(RCDB)を提案する。
論文参考訳（メタデータ） (2024-04-16T17:59:55Z)
LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits [38.41164102066483]
本研究では、独立かつ同一に分散したコンテキストを持つ線形文脈帯域問題について考察する。提案アルゴリズムは、Tsallisエントロピーを持つFollow-The-Regularized-Leaderに基づいており、$alpha$-textual-Con (LC)-Tsallis-INFと呼ばれている。
論文参考訳（メタデータ） (2024-03-05T18:59:47Z)
Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
両世界のベスト・オブ・ワールドズ・アルゴリズムを$K$武器付き線形文脈包帯に対して検討する。我々のアルゴリズムは、敵対的体制と敵対的体制の両方において、ほぼ最適の後悔の限界を提供する。
論文参考訳（メタデータ） (2023-12-24T08:27:30Z)
Federated Linear Bandits with Finite Adversarial Actions [20.1041278044797]
我々は、M$のクライアントが中央サーバと通信し、線形文脈の帯域幅問題を解決するための連合線形帯域幅モデルについて検討する。逆有限作用集合のユニークな問題に対処するため、FedSupLinUCBアルゴリズムを提案する。我々は、FedSupLinUCBが$tildeO(sqrtd T)$の完全後悔を達成したことを証明している。
論文参考訳（メタデータ） (2023-11-02T03:41:58Z)
Corruption-Robust Offline Reinforcement Learning with General Function Approximation [60.91257031278004]
一般関数近似を用いたオフライン強化学習(RL)における劣化問題について検討する。我々のゴールは、崩壊しないマルコフ決定プロセス(MDP)の最適方針に関して、このような腐敗に対して堅牢で、最適でないギャップを最小限に抑える政策を見つけることである。
論文参考訳（メタデータ） (2023-10-23T04:07:26Z)
Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs [7.125769932993104]
リスク・逆条件下での線形ペイオフに対するコンテキスト多重武装バンディット問題について考察する。各ラウンドにおいて、各アームのコンテキストが明らかにされ、意思決定者は1つのアームを選択して、対応する報酬を受け取ります。解離モデルに対してトンプソンサンプリングアルゴリズムを適用し,提案アルゴリズムの変種に対する包括的後悔解析を行う。
論文参考訳（メタデータ） (2022-06-24T18:48:35Z)
Variance-Aware Sparse Linear Bandits [64.70681598741417]
余分な線形包帯に対する最悪のミニマックスは$widetildeThetaleft(sqrtdTright)$である。ノイズがなく、アクションセットが単位球面である良性設定では、ディビジョン・アンド・コンカーを使用して、$widetildemathcal O(1)$ regretを達成することができる。我々は,任意の分散対応線形帯域幅アルゴリズムを分散対応線形帯域幅アルゴリズムに変換する汎用フレームワークを開発した。
論文参考訳（メタデータ） (2022-05-26T15:55:44Z)
Linear Contextual Bandits with Adversarial Corruptions [91.38793800392108]
本稿では,敵対的腐敗の存在下での線形文脈的包帯問題について検討する。逆汚染レベルに適応する分散認識アルゴリズムをC$で提案する。
論文参考訳（メタデータ） (2021-10-25T02:53:24Z)
An Algorithm for Stochastic and Adversarial Bandits with Switching Costs [10.549307055348596]
そこで本研究では,マルチアームバンディットのスイッチングコストを考慮したアルゴリズムを提案し,そのアルゴリズムがアームを切り替える度に$lambda$を支払う。私たちのアルゴリズムは、Zimmert and Seldin(2021)のTsallis-INFアルゴリズムの適応に基づいています。
論文参考訳（メタデータ） (2021-02-19T11:03:51Z)
Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
我々は、C$2$UCBアルゴリズムが分割マトロイド制約に対して最適な後悔結合$tildeO(dsqrtkT + dk)$を有することを示した。一般的な制約に対して,C$2$UCBアルゴリズムで腕の報酬推定値を変更するアルゴリズムを提案する。
論文参考訳（メタデータ） (2021-01-20T04:29:18Z)
Stochastic Bandits with Linear Constraints [69.757694218456]
制約付き文脈線形帯域設定について検討し、エージェントの目標は一連のポリシーを作成することである。楽観的悲観的線形帯域(OPLB)と呼ばれる,この問題に対する高信頼束縛アルゴリズムを提案する。
論文参考訳（メタデータ） (2020-06-17T22:32:19Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。