Fugu-MT Paper Translation (Summary): Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model

Paper summary: Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model

arxiv url: http://arxiv.org/abs/2503.08934v3
Date: Mon, 24 Mar 2025 01:36:25 GMT
Status: Translation complete
Last updated in system: 2025-03-25 14:21:41.291044
Title: Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model
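Authors: Zilong Deng, Simon Khan, Shaofeng Zou
Abstract summary: We study the sample complexity of risk-sensitive reinforcement learning (RL) with a generative model. We establish nearly matching upper and lower bounds on the sample complexity of this problem.
Reference score (attention score computed by this site): 13.582475656749775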
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this work, we study the sample complexity problem of risk-sensitive Reinforcement Learning (RL) with a generative model, where we aim to maximize the Conditional Value at Risk (CVaR) with risk tolerance level $\tau$ at each step, a criterion we refer to as Iterated CVaR. We first build a connection between Iterated CVaR RL and $(s, a)$-rectangular distributional robust RL with a specific uncertainty set for CVaR. We establish nearly matching upper and lower bounds on the sample complexity of this problem. Specifically, we first prove that a value iteration-based algorithm, ICVaR-VI, achieves an $\epsilon$-optimal policy with at most $\tilde{O} \left(\frac{SA}{(1-\gamma)^4\tau^2\epsilon^2} \right)$ samples, where $\gamma$ is the discount factor, and $S, A$ are the sizes of the state and action spaces. Furthermore, when $\tau \geq \gamma$, the sample complexity improves to $\tilde{O} \left( \frac{SA}{(1-\gamma)^3\epsilon^2} \right)$. We further show a minimax lower bound of $\tilde{O} \left(\frac{(1-\gamma \tau)SA}{(1-\gamma)^4\tau\epsilon^2} \right)$. For a fixed risk level $\tau \in (0,1]$, our upper and lower bounds match, demonstrating the tightness and optimality of our analysis. We also investigate a limiting case with a small risk level $\tau$, called Worst-Path RL, where the objective is to maximize the minimum possible cumulative reward. We develop matching upper and lower bounds of $\tilde{O} \left(\frac{SA}{p_{\min}} \right)$, where $p_{\min}$ denotes the minimum non-zero reaching probability of the transition kernel.
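
To make the Iterated CVaR criterion in the abstract concrete, here is a minimal Python sketch. It is not the paper's ICVaR-VI algorithm: it runs value iteration with a CVaR backup on a fully known tabular MDP, whereas the paper works from generative-model samples. The function names, the toy two-state MDP, and the convention that CVaR at level $\tau$ is the mean of the worst $\tau$-fraction of next-state values are assumptions made for illustration.

```python
def cvar_lower_tail(values, probs, tau):
    """Mean of the worst (lowest-value) tau-fraction of a discrete distribution."""
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    remaining = tau  # probability mass still to be assigned to the lower tail
    total = 0.0
    # Visit outcomes from worst to best and fill the tail until it holds mass tau.
    for value, prob in sorted(zip(values, probs)):
        weight = min(prob, remaining)
        total += weight * value
        remaining -= weight
        if remaining <= 0.0:
            break
    return total / tau


def iterated_cvar_value_iteration(P, R, gamma, tau, iters=500):
    """Value iteration with a CVaR backup at every step (assumed known model).

    P[s][a] is a list of next-state probabilities and R[s][a] is the reward.
    Returns the value estimates and a greedy policy.
    """
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(iters):
        Q = [[R[s][a] + gamma * cvar_lower_tail(V, P[s][a], tau)
              for a in range(len(P[s]))]
             for s in range(n_states)]
        V = [max(q) for q in Q]
    policy = [q.index(max(q)) for q in Q]
    return V, policy


# Hypothetical toy MDP (not from the paper).
# State 0: action 0 is "safe" (reward 0.5, stays in state 0);
#          action 1 is "risky" (reward 1.0, falls into state 1 w.p. 0.1).
# State 1: absorbing with zero reward.
P = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.0, 1.0]],
]
R = [[0.5, 1.0], [0.0, 0.0]]

if __name__ == "__main__":
    for tau in (1.0, 0.1):
        V, policy = iterated_cvar_value_iteration(P, R, gamma=0.9, tau=tau)
        print(f"tau={tau}: V={V}, policy={policy}")
```

With these hypothetical numbers, a risk-neutral learner ($\tau = 1$) prefers the risky action in state 0, while $\tau = 0.1$ makes the safe action preferable, because the CVaR backup then only sees the 10% chance of falling into the zero-reward absorbing state.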

Related papers

Near-Optimal Bounds for Learning Gaussian Halfspaces with Random Classification Noise [50.64137465792738]
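Any efficient SQ algorithm for this problem requires sample complexity at least $\Omega(d^{1/2}/(\max\{p, \epsilon\})^2)$. Our lower bound suggests that this quadratic dependence on $1/\epsilon$ is inherent for efficient algorithms.
Paper reference translation (metadata) (2023-07-13T18:59:28Z)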
Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes [21.77276136591518]
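We develop provably efficient model-free reinforcement learning (RL) algorithms for Markov decision processes (MDPs). In the simulator setting, we find an $\epsilon$-optimal policy using $\widetilde{O} \left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ samples.
Paper reference translation (metadata) (2023-06-28T17:43:19Z)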
Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
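This work studies risk-sensitive reinforcement learning (RL), focusing on the Conditional Value at Risk (CVaR) objective with risk tolerance $\tau$. The minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes. We show that the algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and a near-optimal regret in general.
Paper reference translation (metadata) (2023-02-07T02:22:31Z)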
Best Policy Identification in Linear MDPs [70.57916977441262]
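We investigate the best policy identification problem in discounted linear Markov decision processes in the fixed-confidence setting under a generative model. The lower bound, expressed as the solution of an intricate non-convex optimization program, can be used as a starting point for devising such algorithms.
Paper reference translation (metadata) (2022-08-11T04:12:50Z)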
Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure [9.759209713196718]
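We consider a class of MDPs for which the associated optimal $Q^*$ function is low rank. Under stronger low-rank structural assumptions, and given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) are shown to achieve the desired sample complexity of $\tilde{O}\left((|S|+|A|)\,\mathrm{poly}(d,H)/\epsilon^2\right)$ for rank $d$.
Paper reference translation (metadata) (2022-06-07T20:39:51Z)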
Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
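Offline reinforcement learning (RL) learns from pre-collected data without further exploration. Prior algorithms and analyses either suffer from suboptimal sample complexity or incur a high burn-in cost to reach sample optimality. We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without any burn-in cost.
Paper reference translation (metadata) (2022-04-11T17:26:19Z)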
Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
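We propose an efficient algorithm for task-agnostic reinforcement learning. The algorithm needs only $\widetilde{\mathcal{O}}\left(1/\epsilon \cdot (H^3SA/\rho + H^4 S^2 A)\right)$ episodes of exploration. We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H>1$.
Paper reference translation (metadata) (2021-08-11T20:42:46Z)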
Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity [59.34067736545355]
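Given an MDP with $S$ states, $A$ actions, discount factor $\gamma \in (0,1)$, and approximation threshold $\epsilon > 0$, we provide a model-free algorithm for learning an $\epsilon$-optimal policy. For sufficiently small $\epsilon$, we present an improved algorithm with better sample complexity.
Paper reference translation (metadata) (2020-06-06T13:34:41Z)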
Curse of Dimensionality on Randomized Smoothing for Certifiable Robustness [151.67113334248464]
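We show that extending the smoothing technique to other attack models is challenging. We present experimental results on CIFAR that validate the theory.
Paper reference translation (metadata) (2020-02-08T22:02:14Z)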

The list of related papers is generated automatically from the titles and abstracts of papers on this site.
