Fugu-MT Paper Translation (Summary): Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

Paper summary: Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

arxiv url: http://arxiv.org/abs/2606.04812v1
Date: Wed, 03 Jun 2026 12:36:43 GMT
Status: Translation complete
System update date: 2026-06-04 20:44:18.75144
Title: Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees
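Title (reference translation): Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees
Authors: Mohit Prashant, Arvind Easwaran
Abstract summary: Reinforcement learning (RL) policies can be susceptible to transition perturbations that result in unknown or unsafe behaviour. One method of policy verification is to construct probabilistic barrier certificates by sampling policy trajectories with respect to safety constraints. A variational autoencoder (VAE) is used to approximate the distribution of the encountered state space, and upper- and lower-bound barrier certificates are constructed from it.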
Reference score (independently calculated attention score): 3.959033903731638
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real world, especially as policies learned using deep RL may demonstrate susceptibility to transition perturbations that result in unknown or unsafe behaviour. A method of policy verification is to construct probabilistic barrier-certificates by sampling policy trajectories with respect to safety constraints, thereby demarcating known safe behaviour from unknown behaviour. Obtaining tight upper and lower bounds on the probability of violation of these constraints may be difficult if the policy is susceptible to transition uncertainty or perturbation that places the agent in insufficiently explored states. To address this, we approximate the distribution of the encountered state-space using a variational autoencoder (VAE) and construct upper- and lower-bound barrier-certificates using latent characteristics of states to optimize for regions of known, safe behaviour with high confidence. We frame this in our work as a dual optimization problem where the lower-bound barrier-certificate presents a more conservative estimate of the safe region than the upper-bound barrier-certificate. Sampling states that lie within the set difference of the two during training, i.e. the non-robust region, allows us to tighten the upper and lower bounds to provide sharper probabilistic guarantees on safety. Within our study, we describe the guarantees placed and demonstrate the tightness of our bounds experimentally.
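Abstract (reference translation): Ensuring safety is essential for deploying reinforcement learning (RL) agents in the real world, especially because policies learned with deep RL can be susceptible to transition perturbations that lead to unknown or unsafe behaviour. One method of policy verification is to construct probabilistic barrier certificates by sampling policy trajectories with respect to safety constraints, thereby separating known safe behaviour from unknown behaviour. Obtaining tight upper and lower bounds on the probability of violating these constraints can be difficult if the policy is susceptible to transition uncertainty or perturbations that place the agent in insufficiently explored states. To address this, a variational autoencoder (VAE) is used to approximate the distribution of the encountered state space, and latent characteristics of states are used to construct upper- and lower-bound barrier certificates that optimize for regions of known safe behaviour with high confidence. This is framed as a dual optimization problem in which the lower-bound barrier certificate gives a more conservative estimate of the safe region than the upper-bound barrier certificate. Sampling states that lie in the set difference of the two during training, i.e. the non-robust region, makes it possible to tighten the upper and lower bounds and provide sharper probabilistic guarantees on safety. The study describes the proposed guarantees and demonstrates the tightness of the bounds experimentally.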
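The abstract combines two ingredients: a sampled, high-confidence bound on the probability of violating a safety constraint, and sampling from the gap between a conservative and an optimistic safe region. A minimal sketch of both follows. It is an illustration only and not the authors' method. It assumes a two-sided Hoeffding interval as the "probably approximately" bound, and the paper's own bound may differ. Each barrier certificate is treated as a function B whose sublevel set {z : B(z) <= 0} is the certified safe region over latent points z. The names `hoeffding_bounds`, `samples_needed` and `sample_non_robust` are hypothetical, and toy disc-shaped barriers stand in for a trained VAE encoder and learned certificates.

```python
import math
import random
from typing import Callable, List, Tuple

# Hypothetical 2-D latent point; in the paper's setting z would come from a VAE encoder.
Latent = Tuple[float, float]
Barrier = Callable[[Latent], float]


def hoeffding_bounds(violations: int, n: int, delta: float) -> Tuple[float, float]:
    """Two-sided Hoeffding interval for a violation probability (assumed bound).

    With probability at least 1 - delta over n i.i.d. sampled trajectories,
    the true violation probability lies in the returned [lower, upper].
    """
    if n <= 0 or not 0 <= violations <= n or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0, 0 <= violations <= n, 0 < delta < 1")
    p_hat = violations / n
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)


def samples_needed(eps: float, delta: float) -> int:
    """Smallest n whose Hoeffding half-width is at most eps at confidence 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def sample_non_robust(
    b_lower: Barrier,
    b_upper: Barrier,
    proposal: Callable[[], Latent],
    k: int,
    max_tries: int = 100_000,
) -> List[Latent]:
    """Rejection-sample up to k latent points in S_upper minus S_lower.

    S = {z : B(z) <= 0}. b_lower is the conservative certificate, so
    S_lower is assumed to be a subset of S_upper.
    """
    accepted: List[Latent] = []
    for _ in range(max_tries):
        if len(accepted) >= k:
            break
        z = proposal()
        # Keep z only if the optimistic certificate calls it safe
        # but the conservative one does not (the non-robust region).
        if b_upper(z) <= 0.0 and b_lower(z) > 0.0:
            accepted.append(z)
    return accepted


# Toy certificates: the conservative safe set is the unit disc,
# the optimistic one is the disc of radius 2.
def b_lower(z: Latent) -> float:
    return z[0] ** 2 + z[1] ** 2 - 1.0


def b_upper(z: Latent) -> float:
    return z[0] ** 2 + z[1] ** 2 - 4.0


if __name__ == "__main__":
    lo, hi = hoeffding_bounds(violations=12, n=1000, delta=0.05)
    print(f"P(violation) in [{lo:.4f}, {hi:.4f}] with confidence >= 0.95")
    print("trajectories needed for eps=0.01:", samples_needed(0.01, 0.05))

    rng = random.Random(0)
    points = sample_non_robust(
        b_lower, b_upper, lambda: (rng.uniform(-3, 3), rng.uniform(-3, 3)), k=5
    )
    print(points)
```

The Hoeffding half-width shrinks as 1/sqrt(n), so tightening the bounds by sampling more costs quadratically many trajectories. Concentrating the samples in the non-robust region, as the abstract describes, is one way to spend that budget where the two certificates disagree. Decoding the accepted latent points back into states for further rollouts is a further assumption about how such a region would be used.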


Related papers