Fugu-MT Paper Translation (Abstract): Variance-aware robust reinforcement learning with linear function approximation with heavy-tailed rewards

Paper overview: Variance-aware robust reinforcement learning with linear function approximation with heavy-tailed rewards

arxiv url: http://arxiv.org/abs/2303.05606v1
Date: Thu, 9 Mar 2023 22:16:28 GMT
Status: Translation complete
Last updated in system: 2023-03-13 16:45:18.875698
Title: Variance-aware robust reinforcement learning with linear function approximation with heavy-tailed rewards
Authors: Xiang Li, Qiang Sun
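Abstract summary: Two algorithms, AdaOFUL and VARA, are proposed for online sequential decision-making in the presence of heavy-tailed rewards. AdaOFUL achieves a state-of-the-art regret bound of $\widetilde{\mathcal{O}}\big(d\big(\sum_{t=1}^T \nu_{t}^2\big)^{1/2}+d\big)$. VARA achieves a tighter variance-aware regret bound of $\widetilde{\mathcal{O}}(d\sqrt{H\mathcal{G}^*K})$.
Reference score (attention score computed by this site): 6.932056534450556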
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper presents two algorithms, AdaOFUL and VARA, for online sequential decision-making in the presence of heavy-tailed rewards with only finite variances. For linear stochastic bandits, we address the issue of heavy-tailed rewards by modifying the adaptive Huber regression and proposing AdaOFUL. AdaOFUL achieves a state-of-the-art regret bound of $\widetilde{\mathcal{O}}\big(d\big(\sum_{t=1}^T \nu_{t}^2\big)^{1/2}+d\big)$ as if the rewards were uniformly bounded, where $\nu_{t}^2$ is the observed conditional variance of the reward at round $t$, $d$ is the feature dimension, and $\widetilde{\mathcal{O}}(\cdot)$ hides logarithmic dependence. Building upon AdaOFUL, we propose VARA for linear MDPs, which achieves a tighter variance-aware regret bound of $\widetilde{\mathcal{O}}(d\sqrt{H\mathcal{G}^*K})$. Here, $H$ is the length of episodes, $K$ is the number of episodes, and $\mathcal{G}^*$ is a smaller instance-dependent quantity that can be bounded by other instance-dependent quantities when additional structural conditions on the MDP are satisfied. Our regret bound is superior to the current state-of-the-art bounds in three ways: (1) it depends on a tighter instance-dependent quantity and has optimal dependence on $d$ and $H$, (2) we can obtain further instance-dependent bounds of $\mathcal{G}^*$ under additional structural conditions on the MDP, and (3) our regret bound is valid even when rewards have only finite variances, achieving a level of generality unmatched by previous works. Overall, our modified adaptive Huber regression algorithm may serve as a useful building block in the design of algorithms for online problems with heavy-tailed rewards.
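The abstract refers to Huber regression without defining it. As a minimal sketch, assuming the standard Huber loss with a fixed, hypothetical threshold `tau`, the following shows why a Huber-type estimate is less sensitive to a single heavy-tailed reward than the plain average. This is not the paper's modified adaptive Huber regression or AdaOFUL; the helper names `huber_loss` and `huber_mean` and the sample data are illustrative assumptions.

```python
def huber_loss(residual, tau):
    """Standard Huber loss: quadratic for |residual| <= tau, linear beyond."""
    a = abs(residual)
    if a <= tau:
        return 0.5 * residual * residual
    return tau * a - 0.5 * tau * tau


def huber_mean(samples, tau, iterations=100):
    """Estimate a location parameter by minimizing the summed Huber loss.

    Uses iteratively reweighted averaging: points within tau of the current
    estimate get weight 1, points farther away get weight tau / |residual|.
    (Illustrative helper, not taken from the paper.)
    """
    ordered = sorted(samples)
    estimate = ordered[len(ordered) // 2]  # start from a median-like point
    for _ in range(iterations):
        weights = []
        for x in samples:
            r = abs(x - estimate)
            weights.append(1.0 if r <= tau else tau / r)
        estimate = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return estimate


# Hypothetical rewards: five well-behaved observations and one extreme value.
rewards = [1.0, 1.2, 0.8, 1.1, 0.9, 50.0]
plain_mean = sum(rewards) / len(rewards)
robust_mean = huber_mean(rewards, tau=1.0)
print(plain_mean, robust_mean)
```

The plain average is pulled far toward the extreme value, while the Huber-type estimate stays near the bulk of the data, because the outlier contributes only a bounded amount to the estimating equation. In regression the same idea is applied to the residuals of a linear model instead of to the deviations from a single location parameter.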

Related papers

Catoni Contextual Bandits are Robust to Heavy-tailed Rewards [31.381627608971414]
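The algorithmic approach is built on the Catoni estimator from robust statistics. The authors establish a regret bound that depends on the cumulative reward variance and only logarithmically on the reward range $R$. The algorithm also enjoys a variance-based bound with logarithmic dependence on the reward range.
Reference translation of paper (metadata) (2025-02-04T17:03:32Z)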
Provably Adaptive Average Reward Reinforcement Learning for Metric Spaces [2.2984209387877628]
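This work develops ZoRL, an algorithm that adaptively discretizes the state-action space and zooms in on its promising regions. In experiments, ZoRL outperforms other state-of-the-art algorithms.
Reference translation of paper (metadata) (2024-10-25T18:14:42Z)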
Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism [1.4999444543328293]
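This paper proposes a model-based algorithm built on novel cost and reward function estimators. The algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}((\bar C - \bar C_b)^{-1}H^{2.5} S\sqrt{AK})$.
Reference translation of paper (metadata) (2024-10-14T04:51:06Z)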
Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
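The authors propose algorithms that exploit the variance of the reward distribution as well as $B_K$. Two new algorithms are introduced: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$. In particular, when $V_K$ is much smaller than $K$, these algorithms outperform previous state-of-the-art results for non-stationary linear bandits under different settings.
Reference translation of paper (metadata) (2024-03-15T23:36:55Z)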
Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
The algorithm achieves $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping that linearly parametrizes the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
Reference translation of paper (metadata) (2023-05-15T05:37:32Z)
Towards Theoretical Understanding of Inverse Reinforcement Learning [45.3190496371625]
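Inverse reinforcement learning (IRL) is a powerful family of algorithms for recovering a reward function that justifies the behavior demonstrated by an expert. This paper closes the theoretical gap for IRL in finite-horizon problems with a generative model.
Reference translation of paper (metadata) (2023-04-25T16:21:10Z)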
Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes [62.90204655228324]
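The authors study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight. They design a novel model-based algorithmic framework that can be instantiated with both a model-optimistic and a value-optimistic solver.
Reference translation of paper (metadata) (2022-10-20T21:32:01Z)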
Causal Bandits for Linear Structural Equation Models [58.2875460517691]
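This paper studies the problem of designing an optimal sequence of interventions in a causal graphical model. The structure of the graph is known and it has $N$ nodes. Two algorithms are proposed, one for the frequentist (UCB-based) setting and one for the Bayesian setting.
Reference translation of paper (metadata) (2022-08-26T16:21:31Z)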
Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
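The authors propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED). PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards. To the authors' knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise rewards.
Reference translation of paper (metadata) (2022-06-13T19:11:22Z)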
Differentially Private Covariance Revisited [16.743341747437054]
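Three new error bounds are presented for covariance estimation under differential privacy. The corresponding algorithms are simple and efficient. Experimental results show significant improvements over prior work.
Reference translation of paper (metadata) (2022-05-28T03:53:11Z)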
Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs [12.450760567361531]
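In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees. This work proposes new analyses that significantly improve the regret bounds. The analysis relies on a novel elliptical potential count lemma.
Reference translation of paper (metadata) (2021-11-05T06:47:27Z)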
Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
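The authors show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov decision processes (MDPs). For linear bandits, they obtain a regret bound of $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^{K}\sigma_i^2})$, where $d$ is the feature dimension. For linear mixture MDPs, they obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{K})$ regret bound.
Reference translation of paper (metadata) (2021-01-29T18:57:52Z)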
Curse of Dimensionality on Randomized Smoothing for Certifiable Robustness [151.67113334248464]
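The authors show that extending the smoothing technique to defend against other attack models is challenging. They present experimental results on CIFAR to validate the theory.
Reference translation of paper (metadata) (2020-02-08T22:02:14Z)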

The related papers list is generated automatically from the titles and abstracts of the papers on this site.
