Fugu-MT Paper Translation (Abstract): ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

Paper overview: ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.23562v1
Date: Fri, 22 May 2026 12:29:29 GMT
Status: Translation complete
System update date: 2026-05-25 17:29:20.342952
Title: ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning
Authors: Elie Abboud, Oren Gal
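Abstract summary: We propose Automatic Reward-shaping in Multi-agent Systems (ARMS), a reward shaping framework for multi-agent reinforcement learning. ARMS learns dense shaping signals from sparse environmental rewards through trajectory ranking. ARMS alternates between policy learning and reward learning, sharing shaping parameters across agents for efficiency.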
Reference score (independently computed attention score): 2.2801444394060257
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in the multi-agent setting it must preserve the strategic structure of the problem rather than merely improve short-term optimization. We propose Automatic Reward-shaping in Multi-agent Systems (ARMS), a self-supervised reward shaping framework for MARL that learns dense shaping signals from sparse environmental rewards through trajectory ranking. Since single-agent trajectory-ranking guarantees do not directly transfer to MARL, we reformulate policy invariance through conditional best-response reasoning, and show that if certain conditions hold, then using shaping rewards preserves each agent's best-response set under fixed opponent policies, and consequently preserves the set of Nash equilibria. Guided by this perspective, ARMS alternates between policy learning and reward learning while sharing shaping parameters across agents for efficiency. Experiments in a partially observable multi-agent pathfinding domain show that ARMS improves sampling efficiency under increasing reward sparsity and agent count, generalizes to unseen environments, and reveals a MARL-specific failure mode in which limited exploration and coupled policy-reward dynamics induce oscillatory behavior. Increasing exploration mitigates this effect and stabilizes learning. To the best of our knowledge, ARMS is the first automatic reward shaping framework for MARL whose design is motivated by a game-theoretic equilibrium-preservation result.
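Illustrative sketch: the abstract says only that ARMS learns dense shaping signals from sparse rewards "through trajectory ranking" and shares shaping parameters across agents. The Python below is a minimal, hypothetical illustration of that general idea, not the authors' implementation. It assumes a linear shaping reward over per-step feature vectors and a Bradley-Terry pairwise ranking loss. Both are assumptions, as are all function names. The policy-learning half of the alternation and the equilibrium-preservation conditions are not modeled.

```python
import math
import random


def shaped_return(theta, trajectory):
    """Sum of linear shaping rewards theta . phi(step) over one trajectory."""
    return sum(sum(t * f for t, f in zip(theta, step)) for step in trajectory)


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def ranking_loss_and_grad(theta, better, worse):
    """Bradley-Terry loss -log P(better ranked above worse) and its gradient."""
    diff = shaped_return(theta, better) - shaped_return(theta, worse)
    # Stable form of log(1 + exp(-diff)).
    loss = max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    p_better = _sigmoid(diff)
    dim = len(theta)
    feat_better = [sum(step[k] for step in better) for k in range(dim)]
    feat_worse = [sum(step[k] for step in worse) for k in range(dim)]
    grad = [-(1.0 - p_better) * (fb - fw)
            for fb, fw in zip(feat_better, feat_worse)]
    return loss, grad


def fit_shaping_reward(trajectories, sparse_returns, dim,
                       steps=200, lr=0.1, seed=0):
    """Fit one shared theta by ranking trajectories by sparse return.

    Trajectories from all agents are pooled, so every agent would use the
    same shaping parameters (an assumed reading of "sharing shaping
    parameters across agents").
    """
    rng = random.Random(seed)
    theta = [0.0] * dim
    n = len(trajectories)
    # Ordered pairs (i, j) where trajectory i earned a strictly higher return.
    pairs = [(i, j) for i in range(n) for j in range(n)
             if sparse_returns[i] > sparse_returns[j]]
    if not pairs:
        return theta  # no ranking signal available yet
    for _ in range(steps):
        i, j = rng.choice(pairs)
        _, grad = ranking_loss_and_grad(theta, trajectories[i], trajectories[j])
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


# Toy data: two-step trajectories with made-up 2-dimensional step features.
reached_goal = [[1.0, 0.0], [1.0, 0.0]]  # sparse return 1.0
wandered = [[0.0, 1.0], [0.0, 1.0]]      # sparse return 0.0
theta = fit_shaping_reward([reached_goal, wandered], [1.0, 0.0], dim=2)
```

After fitting, `shaped_return(theta, reached_goal)` exceeds `shaped_return(theta, wandered)`, so the learned per-step reward gives a dense signal consistent with the sparse ranking. In an alternating scheme of the kind the abstract describes, a fit like this would be interleaved with policy updates that collect fresh trajectories.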


Related papers