Fugu-MT Paper Translation (Summary): Provably Efficient Generalized Lagrangian Policy Optimization for Safe Multi-Agent Reinforcement Learning

Paper summary: Provably Efficient Generalized Lagrangian Policy Optimization for Safe Multi-Agent Reinforcement Learning

arxiv url: http://arxiv.org/abs/2306.00212v1
Date: Wed, 31 May 2023 22:09:24 GMT
Status: Translation complete
Last updated in system: 2023-06-02 19:08:34.761019
Title: Provably Efficient Generalized Lagrangian Policy Optimization for Safe Multi-Agent Reinforcement Learning
Title (reference translation): Generalized Lagrangian Policy Optimization for Safe Multi-Agent Reinforcement Learning
Authors: Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo R. Jovanović
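Abstract summary: We study online safe multi-agent reinforcement learning using constrained Markov games. We develop an upper confidence reinforcement learning algorithm to solve the resulting Lagrangian problem. The proposed algorithm updates the minimax decision primal variables via online mirror descent and the dual variable via a projected gradient step.
Reference score (independently computed attention score): 105.7510838453122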
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We examine online safe multi-agent reinforcement learning using constrained Markov games in which agents compete by maximizing their expected total rewards under a constraint on expected total utilities. Our focus is confined to an episodic two-player zero-sum constrained Markov game with independent transition functions that are unknown to agents, adversarial reward functions, and stochastic utility functions. For such a Markov game, we employ an approach based on the occupancy measure to formulate it as an online constrained saddle-point problem with an explicit constraint. We extend the Lagrange multiplier method in constrained optimization to handle the constraint by creating a generalized Lagrangian with minimax decision primal variables and a dual variable. Next, we develop an upper confidence reinforcement learning algorithm to solve this Lagrangian problem while balancing exploration and exploitation. Our algorithm updates the minimax decision primal variables via online mirror descent and the dual variable via a projected gradient step, and we prove that it enjoys a sublinear rate $O((|X|+|Y|) L \sqrt{T(|A|+|B|)})$ for both regret and constraint violation after playing $T$ episodes of the game. Here, $L$ is the horizon of each episode, and $(|X|,|A|)$ and $(|Y|,|B|)$ are the state/action space sizes of the min-player and the max-player, respectively. To the best of our knowledge, we provide the first provably efficient online safe reinforcement learning algorithm in constrained Markov games.
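Abstract (reference translation): We study online safe multi-agent reinforcement learning using constrained Markov games, in which agents compete by maximizing their expected total rewards subject to a constraint on their expected total utilities. We restrict attention to an episodic two-player zero-sum constrained Markov game whose independent transition functions are unknown to the agents, with adversarial reward functions and stochastic utility functions. For such a Markov game, we adopt an occupancy-measure-based approach and formulate it as an online constrained saddle-point problem with an explicit constraint. We extend the Lagrange multiplier method from constrained optimization and handle the constraint by constructing a generalized Lagrangian with minimax decision primal variables and a dual variable. We then develop an upper confidence reinforcement learning algorithm that solves this Lagrangian problem while balancing exploration and exploitation. The proposed algorithm updates the minimax decision primal variables via online mirror descent and the dual variable via a projected gradient step, and we prove that it attains a sublinear rate $O((|X|+|Y|) L \sqrt{T(|A|+|B|)})$ for both regret and constraint violation after $T$ episodes of the game have been played. Here, $L$ is the horizon of each episode, and $(|X|,|A|)$ and $(|Y|,|B|)$ are the state/action space sizes of the min-player and the max-player, respectively. To the best of our knowledge, this is the first provably efficient online safe reinforcement learning algorithm for constrained Markov games.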
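
The two update types named in the abstract (online mirror descent for a primal variable, a projected gradient step for a dual variable) can be illustrated on a much simpler problem. The following is a minimal sketch only, not the paper's algorithm: it assumes a single decision vector on the probability simplex, a known linear cost and utility, and no exploration bonus or second player. The names `omd_simplex_step`, `projected_dual_step`, `run_primal_dual`, `budget`, and `lam_max` are hypothetical and chosen for illustration.

```python
import math


def omd_simplex_step(x, grad, eta):
    """One entropic mirror-descent step on the probability simplex.

    With the negative-entropy mirror map this reduces to a
    multiplicative-weights update followed by renormalization.
    """
    weights = [xi * math.exp(-eta * gi) for xi, gi in zip(x, grad)]
    total = sum(weights)
    return [w / total for w in weights]


def projected_dual_step(lam, violation, eta, lam_max):
    """One gradient-ascent step on the dual variable, projected onto [0, lam_max]."""
    return min(max(lam + eta * violation, 0.0), lam_max)


def run_primal_dual(cost, util, budget, steps=2000, eta=0.05, lam_max=10.0):
    """Toy problem (an assumption, not from the paper):
    minimize cost . x  subject to  util . x <= budget,  x on the simplex,
    using the Lagrangian cost . x + lam * (util . x - budget).
    Returns the averaged primal iterate and the final dual variable.
    """
    n = len(cost)
    x = [1.0 / n] * n
    lam = 0.0
    avg = [0.0] * n
    for _ in range(steps):
        # Gradient of the Lagrangian with respect to x at the current lam.
        grad = [c + lam * u for c, u in zip(cost, util)]
        x_new = omd_simplex_step(x, grad, eta)
        # Constraint violation at the current x drives the dual update.
        violation = sum(u * xi for u, xi in zip(util, x)) - budget
        lam = projected_dual_step(lam, violation, eta, lam_max)
        x = x_new
        avg = [a + xi / steps for a, xi in zip(avg, x)]
    return avg, lam
```

A call such as `run_primal_dual([0.0, 1.0], [1.0, 0.0], 0.5)` runs the loop on a two-action example in which the cheaper action consumes the constrained utility. The upper bound `lam_max` on the dual variable is a design choice of this sketch that keeps the projection set compact.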

Related papers