Fugu-MT paper translation (abstract): Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning

Paper abstract: Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.00667v1
Date: Fri, 01 May 2026 13:46:30 GMT
Status: Translation complete
Last updated in system: 2026-05-04 17:43:28.975058
Title: Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning
Authors: Jiaming Zhang, Yujie Yang, Yao Lyu, Shengbo Eben Li, Liping Zhang
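Abstract summary: This work proposes an augmented Lagrangian multiplier network (ALaM) framework for stable learning of state-wise multipliers. First, a quadratic penalty is introduced into the augmented Lagrangian to compensate for delayed multiplier updates. Second, the multiplier network is trained via supervised regression toward a dual target, which stabilizes training and promotes convergence.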
Reference score (independently computed attention score): 31.147991636318633
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Safety is a primary challenge in real-world reinforcement learning (RL). Formulating safety requirements as state-wise constraints has become a prominent paradigm. Handling state-wise constraints with the Lagrangian method requires a distinct multiplier for every state, necessitating neural networks to approximate them as a multiplier network. However, applying standard dual gradient ascent to multiplier networks induces severe training oscillations. This is because the inherent instability of dual ascent is exacerbated by network generalization -- local overshoots and delayed updates propagate to adjacent states, further amplifying policy fluctuations. Existing stabilization techniques are designed for scalar multipliers, which are inadequate for state-dependent multiplier networks. To address this challenge, we propose an augmented Lagrangian multiplier network (ALaM) framework for stable learning of state-wise multipliers. ALaM consists of two key components. First, a quadratic penalty is introduced into the augmented Lagrangian to compensate for delayed multiplier updates and establish the local convexity near the optimum, thereby mitigating policy oscillations. Second, the multiplier network is trained via supervised regression toward a dual target, which stabilizes training and promotes convergence. Theoretically, we show that ALaM guarantees multiplier convergence and thus recovers the optimal policy of the constrained problem. Building on this framework, we integrate soft actor-critic (SAC) with ALaM to develop the SAC-ALaM algorithm. Experiments demonstrate that SAC-ALaM outperforms state-of-the-art safe RL baselines in both safety and return, while also stabilizing training dynamics and learning well-calibrated multipliers for risk identification.
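
The two components named in the abstract can be illustrated on a toy problem. The sketch below is not the paper's SAC-ALaM algorithm: it assumes the textbook augmented Lagrangian for an inequality constraint, a per-state table as a stand-in for the multiplier network, and a closed-form primal step in place of a learned policy. The toy objective, the constraint, and the names `RHO`, `LR`, `LIMIT`, `primal_step`, `dual_target`, and `train` are all hypothetical.

```python
import numpy as np

# Toy per-state problem (assumed for illustration, not from the paper):
#   minimise (a - s)^2  subject to  g(s, a) = a - LIMIT <= 0
RHO = 1.0    # quadratic penalty coefficient (assumed value)
LR = 0.5     # step size of the regression update (assumed value)
LIMIT = 0.5  # constraint threshold (assumed value)


def primal_step(s, lam):
    """Exact minimiser over a of the textbook augmented Lagrangian
    (a - s)^2 + (max(0, lam + RHO * (a - LIMIT))^2 - lam^2) / (2 * RHO)."""
    # The penalty term is active when the unconstrained optimum a = s
    # would leave lam + RHO * g strictly positive.
    active = lam + RHO * (s - LIMIT) > 0.0
    a_active = (2.0 * s - lam + RHO * LIMIT) / (2.0 + RHO)
    return np.where(active, a_active, s)


def dual_target(a, lam):
    """Dual target: the standard method-of-multipliers update, clipped at 0."""
    return np.maximum(0.0, lam + RHO * (a - LIMIT))


def train(states, iters=200):
    # One multiplier per state; a table stands in for the multiplier network.
    lam = np.zeros_like(states)
    for _ in range(iters):
        a = primal_step(states, lam)
        target = dual_target(a, lam)
        # Supervised regression: one gradient step on 0.5 * (lam - target)^2,
        # with the target treated as a constant.
        lam = lam + LR * (target - lam)
    return primal_step(states, lam), lam


states = np.linspace(0.0, 1.0, 5)
actions, multipliers = train(states)
print("actions    :", np.round(actions, 4))
print("multipliers:", np.round(multipliers, 4))
```

In this toy setting the multipliers settle at zero for states where the constraint is inactive and grow with the amount of violation the unconstrained optimum would cause, which is the sense in which a state-wise multiplier can act as a risk indicator. A real multiplier network would additionally share parameters across states, which is where the generalization effects described in the abstract arise.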


Related papers