Fugu-MT paper translation (abstract): From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning

Paper overview: From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.18841v1
Date: Wed, 13 May 2026 03:34:13 GMT
Status: translation complete
Last updated in system: 2026-05-20 15:03:08.658899
Title: From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning
Title (reference translation): From Cumulative Constraints to Adaptive Runtime Safety Control in Nonstationary Reinforcement Learning
Authors: Timofey Tomashevskiy
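Abstract summary: Constraint Projection Safety Shield (CPSS) is a runtime mechanism that converts a cumulative safety budget into adaptive state-level control constraints during execution. CPSS tracks the remaining safety budget, projects it into an admissible risk threshold, and filters policy actions whose predicted safety cost exceeds the active threshold. The resulting shielded policy is analyzed, and the mechanism is shown to guarantee per-state threshold satisfaction for executed actions.
Reference score (independently computed attention measure): 0.0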
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Safety in reinforcement learning is often specified through cumulative cost constraints, but these trajectory-level guarantees do not directly prevent unsafe individual decisions, especially under nonstationarity. In continual and nonstationary settings, the difficulty is amplified because the risk associated with the same action can vary across contexts, while a fixed state-level threshold may be either too conservative or too weak. We propose Constraint Projection Safety Shield (CPSS), a runtime mechanism that converts a cumulative safety budget into adaptive state-level control constraints during execution. CPSS tracks the remaining safety budget, projects it into a time-varying admissible risk threshold, and filters policy actions whose predicted safety cost exceeds the active threshold. The threshold is adjusted online using contextual signals so that enforcement becomes stricter in more demanding or rapidly changing regimes and less restrictive when the available safety budget is sufficient. We analyze the resulting shielded policy and show that the mechanism guarantees per-state threshold satisfaction for executed actions, induces finite-horizon cumulative cost bounds, and yields a performance degradation bound in terms of intervention frequency and per-step reward distortion. We evaluate CPSS in nonstationary highway merging scenarios using highway-env. Across multiple seeds, CPSS substantially reduces proximity-based safety violations and increases separation margins while intervening selectively rather than dominating the learned policy. These results support adaptive budget-to-threshold projection as a practical way to transform cumulative safety specifications into effective local safety control for continual reinforcement learning systems.
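Abstract (reference translation): Safety in reinforcement learning is often specified through cumulative cost constraints, but these trajectory-level guarantees do not directly prevent unsafe individual decisions, particularly under nonstationarity. In continual and nonstationary settings the difficulty is amplified because the risk associated with the same action varies across contexts. This work proposes Constraint Projection Safety Shield (CPSS), a runtime mechanism that converts a cumulative safety budget into adaptive state-level control constraints. CPSS tracks the remaining safety budget, projects it into an admissible risk threshold, and filters policy actions whose predicted safety cost exceeds the active threshold. The threshold is adjusted online using contextual signals, so that enforcement becomes stricter in more demanding or rapidly changing regimes and less restrictive when the safety budget is sufficient. The resulting shielded policy is analyzed, showing that the mechanism guarantees per-state threshold satisfaction for executed actions, induces finite-horizon cumulative cost bounds, and yields a performance degradation bound in terms of intervention frequency and per-step reward distortion. CPSS is evaluated in nonstationary highway merging scenarios using highway-env. Across multiple seeds, CPSS substantially reduces proximity-based safety violations and increases separation margins while intervening selectively rather than dominating the learned policy. These results support adaptive budget-to-threshold projection as a practical way to transform cumulative safety specifications into effective local safety control for continual reinforcement learning systems.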
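
The track, project, and filter loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the projection formula, so the even split of the remaining budget over the remaining steps, the `context_severity` scaling, the fallback to the lowest-cost candidate, and every name below (`BudgetShield`, `threshold`, `filter_action`, `record`) are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence, Tuple


@dataclass
class BudgetShield:
    """Toy budget-to-threshold shield (hypothetical; not the paper's CPSS)."""

    budget: float        # total cumulative safety budget for one episode
    horizon: int         # number of decision steps in the episode
    spent: float = 0.0   # safety cost accumulated so far
    step: int = 0        # number of steps already executed

    def threshold(self, context_severity: float = 0.0) -> float:
        """Project the remaining budget into a per-step risk threshold.

        Assumed rule: split the remaining budget evenly over the remaining
        steps, then shrink it as context_severity (clipped to [0, 1]) grows.
        """
        remaining_budget = max(self.budget - self.spent, 0.0)
        remaining_steps = max(self.horizon - self.step, 1)
        severity = min(max(context_severity, 0.0), 1.0)
        return (remaining_budget / remaining_steps) * (1.0 - severity)

    def filter_action(
        self,
        proposed: Hashable,
        candidates: Sequence[Hashable],
        predicted_cost: Callable[[Hashable], float],
        context_severity: float = 0.0,
    ) -> Tuple[Hashable, bool]:
        """Return (action, intervened) for the current step."""
        tau = self.threshold(context_severity)
        if predicted_cost(proposed) <= tau:
            return proposed, False
        # Assumed fallback: the candidate with the lowest predicted cost.
        # This sketch does not check that the fallback itself is below tau.
        return min(candidates, key=predicted_cost), True

    def record(self, realized_cost: float) -> None:
        """Charge the realized cost to the budget and advance one step."""
        self.spent += realized_cost
        self.step += 1


# Usage with made-up predicted costs for three driving actions.
costs = {"accelerate": 2.0, "keep": 0.5, "brake": 0.1}
shield = BudgetShield(budget=10.0, horizon=10)
action, intervened = shield.filter_action("accelerate", list(costs), costs.get)
shield.record(costs[action])
```

In this sketch the threshold tightens automatically as the budget is consumed and loosens when spending stays below the even-split rate, which mirrors the qualitative behavior the abstract describes. The per-state guarantee stated in the abstract would additionally require a fallback action whose predicted cost is known to satisfy the threshold.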
