Fugu-MT Paper Translation (Abstract): TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

Paper overview: TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

arxiv url: http://arxiv.org/abs/2605.11473v1
Date: Tue, 12 May 2026 03:40:28 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.552138
Title: TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing
Authors: Yuanpeng Li, Gefei Lin, Annie Qu, Rui Miao
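Abstract summary: We propose TOPPO, a set of modules that improves gradient conditioning and balances learning across tasks. TOPPO achieves stronger mean and tail-task performance than published SAC-family and ARS-family baselines. The results demonstrate that, with proper optimization, on-policy methods can rival or exceed off-policy approaches in MTRL.
Reference score (independently computed attention score): 1.9552387050709823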
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Soft Actor-Critic (SAC) and its variants dominate Multi-Task Reinforcement Learning (MTRL) due to their off-policy sample efficiency, while on-policy methods such as Proximal Policy Optimization (PPO) remain underexplored. We diagnose that PPO in MTRL suffers from a previously overlooked issue: critic-side gradient ill-conditioning, which may cause tail tasks to stall while easy tasks dominate the value function's updates. To address this, we propose TOPPO (Tail-Optimized PPO), a reformulation of PPO via Critic Balancing -- a set of modules that improve gradient conditioning and balance learning dynamics across tasks. Unlike prior approaches that rely on modular architectures or large models, TOPPO targets the optimization bottleneck within PPO itself. Empirically, TOPPO achieves stronger mean and tail-task performance than published SAC-family and ARS-family baselines while using substantially fewer parameters and environment steps on Meta-World+ benchmark. Notably, TOPPO matches or surpasses strong SAC baselines early in training and maintains superior performance at full budget. Ablations confirm the effectiveness of each module in TOPPO and provide insights into their interactions. Our results demonstrate that, with proper optimization, on-policy methods can rival or exceed off-policy approaches in MTRL, challenging the prevailing reliance on SAC and highlighting critic-side gradient conditioning as the central bottleneck.
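The imbalance described in the abstract, where easy tasks dominate the shared value function's updates, can be illustrated with a toy calculation. The sketch below is not TOPPO and does not reproduce its Critic Balancing modules, which the abstract does not specify. It assumes, purely for illustration, that the imbalance comes from tasks having different return scales, and it uses generic per-task target standardization as a stand-in remedy. The function name `critic_grad_share` and the task names and return values are hypothetical.

```python
def critic_grad_share(returns_by_task, normalize=False):
    """Share of critic gradient magnitude contributed by each task.

    Toy setting (an assumption, not from the paper): a shared critic that
    currently predicts 0 for every state, trained with 0.5 * (v - target)^2.
    The gradient w.r.t. v at v = 0 is -target, so its magnitude is |target|.
    """
    grads = {}
    for task, returns in returns_by_task.items():
        targets = list(returns)
        if normalize:
            # Generic per-task standardization of value targets
            # (a stand-in technique, not TOPPO's method).
            mean = sum(targets) / len(targets)
            var = sum((r - mean) ** 2 for r in targets) / len(targets)
            std = var ** 0.5 or 1.0  # guard against constant returns
            targets = [(r - mean) / std for r in targets]
        # Mean absolute gradient of the squared error at v = 0.
        grads[task] = sum(abs(t) for t in targets) / len(targets)
    total = sum(grads.values())
    return {task: g / total for task, g in grads.items()}


# Hypothetical returns: an "easy" task with large returns, a "tail" task with small ones.
batch = {"easy": [100, 200, 300, 400], "tail": [1, 2, 3, 4]}
raw = critic_grad_share(batch)
balanced = critic_grad_share(batch, normalize=True)
print(raw, balanced)
```

With these made-up numbers the easy task accounts for roughly 99% of the raw gradient magnitude, while standardizing targets per task brings the two tasks to about equal shares. Return scale is only one candidate source of ill-conditioning; the paper's own diagnosis may differ.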
