Fugu-MT Paper Translation (Abstract): Smoothing Policy Iteration for Zero-sum Markov Games

Paper overview: Smoothing Policy Iteration for Zero-sum Markov Games

arxiv url: http://arxiv.org/abs/2212.01623v1
Date: Sat, 3 Dec 2022 14:39:06 GMT
Status: Translation complete
Last updated in system: 2022-12-06 18:10:28.785460
Title: Smoothing Policy Iteration for Zero-sum Markov Games
Title (reference translation): Smoothing Policy Iteration for Zero-sum Markov Games
Authors: Yangang Ren, Yao Lyu, Wenxuan Wang, Shengbo Eben Li, Zeyang Li, Jingliang Duan
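Abstract summary: We propose the smoothing policy iteration (SPI) algorithm to solve zero-sum MGs. In particular, the adversarial policy serves as the weight function that enables efficient sampling over action spaces. We also propose a model-based algorithm called Smooth adversarial Actor-critic (SaAC) by extending SPI with function approximation.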
Reference score (independently computed attention measure): 9.158672246275348
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Zero-sum Markov Games (MGs) have been an efficient framework for multi-agent systems and robust control, wherein a minimax problem is constructed to solve for the equilibrium policies. At present, this formulation is well studied under tabular settings, wherein the maximum operator is primarily solved exactly to calculate the worst-case value function. However, it is non-trivial to extend such methods to complex tasks, as finding the maximum over large-scale action spaces is usually cumbersome. In this paper, we propose the smoothing policy iteration (SPI) algorithm to solve zero-sum MGs approximately, where the maximum operator is replaced by the weighted LogSumExp (WLSE) function to obtain nearly optimal equilibrium policies. Specifically, the adversarial policy serves as the weight function to enable efficient sampling over action spaces. We also prove the convergence of SPI and analyze its approximation error in the $\infty$-norm based on the contraction mapping theorem. Besides, we propose a model-based algorithm called Smooth adversarial Actor-critic (SaAC) by extending SPI with function approximation. The target value related to the WLSE function is evaluated from sampled trajectories, a mean squared error is then constructed to optimize the value function, and gradient ascent-descent methods are adopted to optimize the protagonist and adversarial policies jointly. In addition, we incorporate the reparameterization technique in model-based gradient back-propagation to prevent gradients from vanishing due to sampling from the stochastic policies. We verify our algorithm in both tabular and function approximation settings. Results show that SPI can approximate the worst-case value function with high accuracy and that SaAC can stabilize the training process and improve adversarial robustness by a large margin.
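Abstract (reference translation): Zero-sum Markov games (MGs) are an efficient framework for multi-agent systems and robust control, in which a minimax problem is constructed to solve for the equilibrium policies. At present, this formulation is well studied in tabular settings, where the maximum operator is primarily solved exactly to compute the worst-case value function. However, extending such methods to complex tasks is not straightforward. In this paper, we propose the smoothing policy iteration (SPI) algorithm to solve zero-sum MGs approximately, in which the maximum operator is replaced by the weighted LogSumExp (WLSE) function to obtain nearly optimal equilibrium policies. In particular, the adversarial policy serves as the weight function that enables efficient sampling over action spaces; we also prove the convergence of SPI and analyze its approximation error in the $\infty$-norm based on the contraction mapping theorem. Furthermore, we propose a model-based algorithm called Smooth adversarial Actor-critic (SaAC) by extending SPI with function approximation. The target value related to the WLSE function is evaluated from sampled trajectories, a mean squared error is constructed to optimize the value function, and gradient ascent-descent methods are used to optimize the protagonist and adversarial policies jointly. In addition, we incorporate the reparameterization technique in model-based gradient back-propagation to prevent gradients from vanishing due to sampling from the stochastic policies. We verify the algorithm in both tabular and function approximation settings. The results show that SPI can approximate the worst-case value function with high accuracy and that SaAC can stabilize the training process and substantially improve adversarial robustness.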
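The abstract's central idea, replacing a hard maximum with a weighted LogSumExp, can be illustrated with a small sketch. This page does not give the exact WLSE definition, so the form used here, $\frac{1}{\beta}\log\sum_a \rho(a)\exp(\beta\, q(a))$ with weights $\rho$ summing to one and a temperature $\beta > 0$, is an assumption. The names `weighted_logsumexp`, `q_values`, `adversary_probs`, and `beta` are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
import math


def weighted_logsumexp(values, weights, beta):
    """Smooth stand-in for max(values): (1/beta) * log(sum_i w_i * exp(beta * v_i)).

    Assumed form, not taken from the paper. Entries with zero weight are ignored.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    pairs = [(v, w) for v, w in zip(values, weights) if w > 0]
    if not pairs:
        raise ValueError("at least one weight must be positive")
    # Subtract the largest value before exponentiating to avoid overflow.
    v_max = max(v for v, _ in pairs)
    total = sum(w * math.exp(beta * (v - v_max)) for v, w in pairs)
    return v_max + math.log(total) / beta


if __name__ == "__main__":
    # Hypothetical action values for one state, weighted by a uniform adversary.
    q_values = [1.0, 2.5, 0.3, 2.4]
    adversary_probs = [0.25, 0.25, 0.25, 0.25]
    for beta in (1.0, 10.0, 100.0):
        smooth = weighted_logsumexp(q_values, adversary_probs, beta)
        # With weights summing to one, the result never exceeds max(q_values),
        # and the gap is at most -log(weight of a maximizing action) / beta.
        print(f"beta={beta:6.1f}  WLSE={smooth:.6f}  max={max(q_values):.6f}")
```

Under this assumed form, increasing `beta` tightens the approximation to the true maximum, while the weights determine which actions contribute, which is consistent with the abstract's description of the adversarial policy acting as a sampling distribution over actions.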

Related papers

Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
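We propose a new gradient-free algorithm for solving convex optimization problems. Such problems arise in medicine, physics, and machine learning. We provide convergence guarantees for the proposed algorithm under both types of noise.
Reference translation of paper (metadata) (2024-11-21T10:26:17Z)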
Nonconvex Stochastic Bregman Proximal Gradient Method for Nonconvex Composite Problems [9.202586157819693]
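Gradient methods for nonconvex composite objective functions typically rely on Lipschitz smoothness of the differentiable part. We propose an approximation model that handles non-Lipschitz gradients of nonconvex objectives. We show that the method achieves optimal robustness with respect to step-size selection.
Reference translation of paper (metadata) (2023-06-26T08:54:46Z)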
Covariance Matrix Adaptation Evolutionary Strategy with Worst-Case Ranking Approximation for Min--Max Optimization and its Application to Berthing Control Tasks [19.263468901608785]
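We consider the continuous min-max optimization problem $\min_{x \in \mathbb{X}} \max_{y \in \mathbb{Y}} f(x,y)$. We propose a new approach that directly minimizes the worst-case objective function $F(x) = \max_{y} f(x,y)$.
Reference translation of paper (metadata) (2023-03-28T15:50:56Z)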
Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
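Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy. Many IRL algorithms inherently have a nested structure. We develop a new single-loop algorithm for IRL that does not compromise reward estimation accuracy.
Reference translation of paper (metadata) (2022-10-04T17:13:45Z)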
META-STORM: Generalized Fully-Adaptive Variance Reduced SGD for Unbounded Functions [23.746620619512573]
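Recent work overcomes the drawback of having to compute "mega-batch" gradients. The updates studied are widely used in competitive deep learning tasks.
Reference translation of paper (metadata) (2022-09-29T15:12:54Z)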
Implicitly Regularized RL with Implicit Q-Values [42.87920755961722]
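The $Q$-function is a central quantity in many reinforcement learning (RL) algorithms, in which RL agents behave according to a (soft-)greedy policy. We propose to implicitly parameterize the $Q$-function as the sum of a log-policy and a value function. We derive a practical off-policy deep RL algorithm suitable for large action spaces that enforces the softmax relation between the policy and the $Q$-values.
Reference translation of paper (metadata) (2021-08-16T12:20:47Z)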
Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
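We study the optimization of the area under the precision-recall curve (AUPRC), which is widely used for imbalanced tasks. We develop new momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution. We also design a new family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
Reference translation of paper (metadata) (2021-07-02T16:21:52Z)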
Sparse Bayesian Learning via Stepwise Regression [1.2691047660244335]
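We propose a coordinate ascent algorithm for sparse Bayesian learning (SBL) called Relevance Matching Pursuit (RMP). As the noise variance parameter goes to zero, RMP exhibits a surprising connection to stepwise regression. We derive new guarantees for stepwise regression algorithms, which also shed light on RMP.
Reference translation of paper (metadata) (2021-06-11T00:20:27Z)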
Stochastic Optimization of Areas Under Precision-Recall Curves with Provable Convergence [66.83161885378192]
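The areas under the ROC curve (AUROC) and the precision-recall curve (AUPRC) are common metrics for evaluating classification performance on imbalanced problems. In this paper, we propose a method for optimizing AUPRC in deep learning.
Reference translation of paper (metadata) (2021-04-18T06:22:21Z)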
Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
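We study the problem of best-policy identification in discounted Markov decision processes (MDPs) when the learner has access to a generative model. Advantages relative to state-of-the-art algorithms are discussed and illustrated.
Reference translation of paper (metadata) (2020-09-28T15:22:24Z)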
Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
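We study safe reinforcement learning (SRL) problems using constrained Markov decision processes (CMDPs). We propose an efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
Reference translation of paper (metadata) (2020-03-01T17:47:03Z)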

The related papers list is generated automatically from the titles and abstracts of papers on this site.

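This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and assumes no responsibility for any consequences arising from the use of this site (including all information and translations).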