Fugu-MT Paper Translation (Abstract): Scalable Multi-Objective Robot Reinforcement Learning through Gradient Conflict Resolution

Paper overview: Scalable Multi-Objective Robot Reinforcement Learning through Gradient Conflict Resolution

arxiv url: http://arxiv.org/abs/2509.14816v1
Date: Thu, 18 Sep 2025 10:18:07 GMT
Status: Translation complete
Last updated in system: 2025-09-19 17:26:53.164002
Title: Scalable Multi-Objective Robot Reinforcement Learning through Gradient Conflict Resolution
Authors: Humphrey Munn, Brendan Tidd, Peter Böhm, Marcus Gallagher, David Howard
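Abstract summary: The paper addresses the conflict between task-based rewards and terms that regularise the policy towards realistic behaviour. It proposes GCR-PPO, a modification to actor-critic optimisation that decomposes the actor update into objective-wise gradients. GCR-PPO improves on large-scale proximal policy optimisation (PPO) by 9.5% on average, with high-conflict tasks seeing a greater improvement.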
Reference score (independently computed attention score): 2.359524447776588
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning (RL) robot controllers usually aggregate many task objectives into one scalar reward. While large-scale proximal policy optimisation (PPO) has enabled impressive results such as robust robot locomotion in the real world, many tasks still require careful reward tuning and are brittle to local optima. Tuning cost and sub-optimality grow with the number of objectives, limiting scalability. Modelling reward vectors and their trade-offs can address these issues; however, multi-objective methods remain underused in RL for robotics because of computational cost and optimisation difficulty. In this work, we investigate the conflict between gradient contributions for each objective that emerge from scalarising the task objectives. In particular, we explicitly address the conflict between task-based rewards and terms that regularise the policy towards realistic behaviour. We propose GCR-PPO, a modification to actor-critic optimisation that decomposes the actor update into objective-wise gradients using a multi-headed critic and resolves conflicts based on the objective priority. Our methodology, GCR-PPO, is evaluated on the well-known IsaacLab manipulation and locomotion benchmarks and additional multi-objective modifications on two related tasks. We show superior scalability compared to parallel PPO (p = 0.04), without significant computational overhead. We also show higher performance with more conflicting tasks. GCR-PPO improves on large-scale PPO with an average improvement of 9.5%, with high-conflict tasks observing a greater improvement. The code is available at https://github.com/humphreymunn/GCR-PPO.
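The abstract describes splitting the actor update into one gradient per objective and resolving conflicts according to objective priority, but it does not state the resolution rule itself. The following is only a minimal sketch of one plausible rule, not the authors' implementation: a lower-priority gradient that points against a higher-priority one (negative dot product) has the opposing component projected away before the gradients are summed. The function names, the integer priority encoding, and the projection rule are all assumptions made for illustration; in GCR-PPO the per-objective gradients would come from the heads of a multi-headed critic, which is not modelled here.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def resolve_conflicts(grads: Sequence[Vector], priorities: Sequence[int]) -> Vector:
    """Sum per-objective gradients after removing conflicts (hypothetical rule).

    A gradient conflicts with a strictly higher-priority gradient when their
    dot product is negative. In that case the lower-priority gradient is
    projected onto the plane orthogonal to the higher-priority one, so it can
    no longer undo progress on the more important objective.
    """
    # Visit objectives from highest to lowest priority.
    order = sorted(range(len(grads)), key=lambda i: -priorities[i])
    resolved: List[Vector] = []
    for i in order:
        g = list(grads[i])
        for j in order:
            if priorities[j] <= priorities[i]:
                continue  # only strictly higher-priority objectives constrain g
            h = grads[j]
            overlap = dot(g, h)
            norm_sq = dot(h, h)
            if overlap < 0 and norm_sq > 0:
                scale = overlap / norm_sq
                g = [x - scale * y for x, y in zip(g, h)]
        resolved.append(g)
    # Element-wise sum of the resolved gradients gives the actor update direction.
    return [sum(column) for column in zip(*resolved)]


if __name__ == "__main__":
    task_grad = [1.0, 0.0]          # assumed high-priority task objective
    regulariser_grad = [-1.0, 1.0]  # assumed low-priority regularisation term
    print(resolve_conflicts([task_grad, regulariser_grad], [1, 0]))  # [1.0, 1.0]
```

In this toy case a plain sum of the two gradients would be `[0.0, 1.0]`, so the regulariser would cancel all progress on the task; after projection the task component survives. Treating the task reward as higher priority than the regulariser is itself an assumption, since the abstract only says that conflicts are resolved by objective priority.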

Related papers