Fugu-MT 論文翻訳(概要): Stochastic Contextual Bandits with Long Horizon Rewards

論文の概要: Stochastic Contextual Bandits with Long Horizon Rewards

arxiv url: http://arxiv.org/abs/2302.00814v2
Date: Fri, 3 Feb 2023 20:01:20 GMT
ステータス: 翻訳完了
システム内更新日: 2023-02-07 12:57:48.220564
Title: Stochastic Contextual Bandits with Long Horizon Rewards
Title（参考訳）: 長い水平後退を伴う確率的文脈帯域
Authors: Yuzhen Qin, Yingcong Li, Fabio Pasqualetti, Maryam Fazel, Samet Oymak
Abstract要約: この研究は、現在の報酬がほとんどの$s$以前のアクションに依存する文脈的線形包帯を調査することによって、この方向に一歩踏み出す。本稿では,親和性を利用して依存パターンと腕パラメータを共同で発見するアルゴリズムを提案する。以上の結果から,データ間の時間的長期依存に対処し,報奨地平線への依存を避けるための新たな分析が必要である。
参考スコア（独自算出の注目度）: 31.981315673799806
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most $s$ prior actions and contexts (not necessarily consecutive), up to a time horizon of $h$. In order to avoid polynomial dependence on $h$, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor ($T<h$) and data-rich ($T\ge h$) regimes, and derive respective regret upper bounds $\tilde O(d\sqrt{sT} +\min\{ q, T\})$ and $\tilde O(\sqrt{sdT})$, with sparsity $s$, feature dimension $d$, total time horizon $T$, and $q$ that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning over a single trajectory brings inherent challenges: While the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon $h$. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.
Abstract（参考訳）: 複雑な意思決定や言語モデリング問題への関心が高まる中、非常に長い地平線のサンプル効率の高い学習の重要性が浮き彫りになっている。この研究は、現在の報酬が少なくとも$s$以前のアクションとコンテキスト(必ずしも連続ではない)に依存している文脈的線形包帯を$h$の時間的地平線まで調べることによって、この方向に一歩踏み出す。 h$ に対する多項式依存を避けるために,スパルシリティを活用し,従属パターンとアームパラメータを協調的に発見する新しいアルゴリズムを提案する。 data-poor(T<h$) と data-rich(T\ge h$) のレギュレーションの両方を検討し、それぞれの後悔の上限を $\tilde O(d\sqrt{sT} +\min\{ q, T\})$ と $\tilde O(\sqrt{sdT})$ と $\tilde O(\sqrt{sdT})$ で表す。従属パターンとアームパラメータは rank-1 行列を形成するが、循環行列は rank-1 多様体よりも等尺的ではなく、サンプル複雑性はスパース報酬依存構造から恩恵を受ける。以上の結果から,データ全体にわたる長期時間依存性に対する新たな解析が必要となり,報奨地平線上の多項式依存性を回避できた。具体的には、依存準ガウスベクトルによって形成される循環行列の制限等尺性への接続を利用し、独立性を持つ新しい保証を確立する。

関連論文リスト

Restless Linear Bandits [5.00389879175348]
未知の$mathbbRd$-valued stationary $varphi$-mixing sequence of parameters $(theta_t,t in mathbbN)$ が存在すると仮定される。指数混合率が$theta_t$の場合、LinMix-UCBと呼ばれる楽観的なアルゴリズムが提案される。
論文参考訳（メタデータ） (2024-05-17T14:37:39Z)
Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
線形マルコフ決定過程(MDP)を学習するための新たな報酬なしアルゴリズムを提案する。我々のアルゴリズムの核心は、探索駆動の擬似回帰を用いた不確実性重み付き値目標回帰である。我々のアルゴリズムは$tilde O(d2varepsilon-2)$ episodesを探索するだけで、$varepsilon$-optimal policyを見つけることができる。
論文参考訳（メタデータ） (2023-03-17T17:53:28Z)
Online Learning and Bandits with Queried Hints [28.270453093780382]
従来のオンライン学習とマルチアーム・バンディット(MAB)の問題について考察する。残差が指数関数的に時間的地平線に依存するアルゴリズムを導出する。オンライン線形および凸最適化のための時間非依存の後悔境界を達成するために,$k=2$ suffices を用いて探索を行うことが示される。
論文参考訳（メタデータ） (2022-11-04T18:41:08Z)
Exploration in Linear Bandits with Rich Action Sets and its Implications for Inference [23.364534479492715]
期待行列の最小固有値は、アルゴリズムの累積後悔が$sqrtn)$であるときに、$Omega(sqrtn)として増加することを示す。本研究は, 線形帯域におけるEmphmodel selectionとEmphclusteringの2つの実践シナリオに適用する。
論文参考訳（メタデータ） (2022-07-23T20:25:07Z)
Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies [88.75843804630772]
我々は既存の境界に対して,$Oleft(mathrmpoly(S,A,log K)sqrtKright)を後悔するアルゴリズムを設計する。この結果は、定常政策の近似力、安定性、および濃度特性を確立する新しい構造補題の列に依存している。
論文参考訳（メタデータ） (2022-03-24T08:14:12Z)
Leverage Score Sampling for Tensor Product Matrices in Input Sparsity Time [54.65688986250061]
我々は,$q$-foldカラムワイドテンソル積の$q$行列に対応するグラム行列を近似するための入力空間時間サンプリングアルゴリズムを提案する。我々のサンプリング技術は、合計時間でデータセット$X$に同時に適用できる$q$部分相関ランダムプロジェクションのコレクションに依存している。
論文参考訳（メタデータ） (2022-02-09T15:26:03Z)
Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning [82.31436758872715]
我々は,環境相互作用の$O(1)$のエピソードのみを用いて,同一のPAC保証を実現するアルゴリズムを開発した。値関数と有限水平マルコフ決定過程の接続を確立する。
論文参考訳（メタデータ） (2021-11-01T00:21:24Z)
On the Sample Complexity of Privately Learning Axis-Aligned Rectangles [16.092248433189816]
有限格子上で軸-アラインド-矩形を学習する基本的な問題を再考する。サンプルの複雑さを$tildeOleftdcdotleft(log*|X|right)1.5right$に減らす新しいアルゴリズムを提案する。
論文参考訳（メタデータ） (2021-07-24T04:06:11Z)
High-Dimensional Sparse Linear Bandits [67.9378546011416]
データ・ポーア・システマティクスにおける疎線形包帯に対して、新しい$Omega(n2/3)$ dimension-free minimax regret lower boundを導出する。また、関連する特徴に対する信号の大きさに関する追加の仮定の下で、次元のない$O(sqrtn)$ regret上界も証明する。
論文参考訳（メタデータ） (2020-11-08T16:48:11Z)
Linear-Sample Learning of Low-Rank Distributions [56.59844655107251]
ktimes k$, rank-r$, matrices to normalized $L_1$ distance requires $Omega(frackrepsilon2)$ sample。我々は、$cal O(frackrepsilon2log2fracepsilon)$ sample, a number linear in the high dimension, and almost linear in the matrices, usually low, rank proofs.というアルゴリズムを提案する。
論文参考訳（メタデータ） (2020-09-30T19:10:32Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。