Fugu-MT 論文翻訳(概要): Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

論文の概要: Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

arxiv url: http://arxiv.org/abs/2306.06836v3
Date: Thu, 7 Mar 2024 15:29:58 GMT
ステータス: 翻訳完了
システム内更新日: 2024-03-08 18:17:38.266447
Title: Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds
Title（参考訳）: 関数近似を用いた強化学習における重機付きリワードの処理:ミニマックス最適およびインスタンス依存レグレト境界
Authors: Jiayi Huang, Han Zhong, Liwei Wang, Lin F. Yang
Abstract要約: 本研究では,線形関数近似を用いた強化学習におけるそのような報奨の課題に対処する。我々はまず,重み付き線形包帯に対するtextscHeavy-OFUL というアルゴリズムを設計し,インセンス依存の$T$-round regret of $tildeObig を実現した。我々の結果は、オンライン回帰問題全般において、重くノイズを扱うことに独立した関心を持つような、新しい自己正規化集中不等式によって達成される。
参考スコア（独自算出の注目度）: 26.277745106128197
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample or time-efficient algorithms for RL with large state-action space exist when the rewards are \emph{heavy-tailed}, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL settings with linear function approximation. Our algorithm, termed as \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}(d \sqrt{H \mathcal{U}^*} K^\frac{1}{1+\epsilon} + d \sqrt{H \mathcal{V}^* K})$. Here, $H$ is length of the episode, and $\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with the central moment of reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K})$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
Abstract（参考訳）: 多くの研究は、一様有界の報酬を持つ強化学習(rl)のための効率的なアルゴリズムを考案することに焦点をあてているが、いくつかの$\epsilon\in(0,1]$ に対して有限$(1+\epsilon)$-th moments の報酬が \emph{heavy-tailed} である場合、大きな状態作用空間を持つrlのサンプルまたは時間効率のよいアルゴリズムが存在するかどうかという疑問が残されている。本稿では、線形関数近似を用いたRLにおけるそのような報酬の課題に対処する。まず,重尾付き線形バンドイットのアルゴリズムである \textsc{heavy-oful} を設計し,$\tilde{o}\big(d t^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^t \nu_t^2} + d t^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$,この種の \emph{first} を達成する。ここで、$d$は特徴次元であり、$\nu_t^{1+\epsilon}$は$(1+\epsilon)$-th central moment of the reward at the $t$-th roundである。さらに, 確率的および決定論的線形バンドイットの最悪の場合に適用した場合, 上記の境界はミニマックス最適であることを示した。次に、このアルゴリズムを線形関数近似を用いてRL設定に拡張する。このアルゴリズムは \textsc{heavy-lsvi-ucb} と呼ばれ、計算効率のよい \emph{instance-dependent} $k$-episode regret of $\tilde{o}(d \sqrt{h \mathcal{u}^*} k^\frac{1}{1+\epsilon} + d \sqrt{h \mathcal{v}^* k})$ を達成する。ここで、$H$はエピソードの長さであり、$\mathcal{U}^* と \mathcal{V}^*$ はそれぞれ、報酬と値関数の中心モーメントを持つインスタンス依存の量スケーリングである。また、マッチングされたミニマックス下界 $\Omega(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3K})$ を提供し、最悪の場合、アルゴリズムの最適性を示す。我々の結果は、オンライン回帰問題全般において重み付きノイズを扱うことに独立した関心を持つような、新しい堅牢な自己正規化集中不等式によって達成される。

関連論文リスト

Heavy-Tailed Linear Bandits: Huber Regression with One-Pass Update [62.96781471194877]
ヘビーテール付きバンディットには、ヘビーテール付きノイズ、トランケーション、中央値の2つの基本戦略が導入されている。本稿では,オンラインミラー降下フレームワークに基づくEmphone-passアルゴリズムを提案する。
論文参考訳（メタデータ） (2025-03-01T09:41:45Z)
Efficient Continual Finite-Sum Minimization [52.5238287567572]
連続有限サム最小化(continuous finite-sum minimization)と呼ばれる有限サム最小化の鍵となるツイストを提案する。我々のアプローチは$mathcalO(n/epsilon)$ FOs that $mathrmStochasticGradientDescent$で大幅に改善されます。また、$mathcalOleft(n/epsilonalpharight)$ complexity gradient for $alpha 1/4$という自然な一階法は存在しないことを証明し、この方法の第一階法がほぼ密であることを示す。
論文参考訳（メタデータ） (2024-06-07T08:26:31Z)
Sample-Efficient Linear Regression with Self-Selection Bias [7.605563562103568]
未知のインデックス設定における自己選択バイアスを伴う線形回帰の問題を考察する。我々は,$mathbfw_1,ldots,mathbfw_kinを復元する,新しい,ほぼ最適なサンプル効率($k$)アルゴリズムを提案する。このアルゴリズムは雑音の仮定をかなり緩めることに成功し、従って関連する最大線形回帰の設定にも成功している。
論文参考訳（メタデータ） (2024-02-22T02:20:24Z)
Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension [18.57735939471469]
我々は注意問題のスパシフィケーションを考慮する。超大規模特徴量の場合、文の長さをほぼ線形に縮めることができる。
論文参考訳（メタデータ） (2023-04-10T05:52:38Z)
Eluder-based Regret for Stochastic Contextual MDPs [43.19667415823089]
文脈マルコフ決定過程(CMDP)における後悔最小化のためのE-UC$3$RLアルゴリズムを提案する。我々のアルゴリズムは効率的であり(効率的なオフライン回帰オラクルを仮定すると)、$ widetildeO(H3 sqrtT |S| |A|d_mathrmE(mathcalP)$の後悔の保証を享受する。
論文参考訳（メタデータ） (2022-11-27T20:38:47Z)
Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
モノトン活性化に対する $mathbfxmapstosigma(mathbfwcdotmathbfx)$ の関数について検討する。学習者の目標は仮説ベクトル $mathbfw$ that $F(mathbbw)=C, epsilon$ を高い確率で出力することである。
論文参考訳（メタデータ） (2022-06-17T17:55:43Z)
Threshold Phenomena in Learning Halfspaces with Massart Noise [56.01192577666607]
ガウス境界の下でのマスアートノイズ付きmathbbRd$におけるPAC学習ハーフスペースの問題について検討する。この結果は,Massartモデルにおける学習ハーフスペースの複雑さを定性的に特徴づけるものである。
論文参考訳（メタデータ） (2021-08-19T16:16:48Z)
Faster Rates of Differentially Private Stochastic Convex Optimization [7.93728520583825]
人口リスク関数がTysbakovノイズ条件(TNC)をパラメータ$theta>1$で満たす場合について検討した。第2部では,人口リスク関数が強く凸する特殊な事例に着目した。
論文参考訳（メタデータ） (2021-07-31T22:23:39Z)
Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
我々は,バンディットフィードバックを用いてオンライン学習を学習する。 learnerは、コスト/リワード関数が"pseudo-1d"構造を許可するゼロ次オラクルのみにアクセスできる。我々は、$T$がラウンドの数である任意のアルゴリズムの後悔のために$min(sqrtdT、T3/4)$の下限を示しています。ランダム化オンライングラデーション下降とカーネル化指数重み法を組み合わせた新しいアルゴリズムsbcalgを提案し,疑似-1d構造を効果的に活用する。
論文参考訳（メタデータ） (2021-02-15T08:16:51Z)
Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes [91.38793800392108]
本稿では,マルコフ決定過程(MDP)の遷移確率核が線形混合モデルである線形関数近似による強化学習について検討する。上記の線形混合 MDP に対して$textUCRL-VTR+$ という線形関数近似を用いた計算効率の良い新しいアルゴリズムを提案する。我々の知る限り、これらは線形関数近似を持つRLのための計算効率が良く、ほぼ最小のアルゴリズムである。
論文参考訳（メタデータ） (2020-12-15T18:56:46Z)
On the Complexity of Minimizing Convex Finite Sums Without Using the Indices of the Individual Functions [62.01594253618911]
有限和の有限ノイズ構造を利用して、大域オラクルモデルの下での一致する$O(n2)$-upper境界を導出する。同様のアプローチを踏襲したSVRGの新規な適応法を提案し、これはオラクルと互換性があり、$tildeO(n2+nsqrtL/mu)log (1/epsilon)$と$O(nsqrtL/epsilon)$, for $mu>0$と$mu=0$の複雑さ境界を実現する。
論文参考訳（メタデータ） (2020-02-09T03:39:46Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。