Fugu-MT Paper Translation (Abstract): Trust-Region Diffusion Policies for Massively Parallel On-Policy RL

Paper overview: Trust-Region Diffusion Policies for Massively Parallel On-Policy RL

arxiv url: http://arxiv.org/abs/2606.15260v1
Date: Sat, 13 Jun 2026 11:35:26 GMT
Status: Translation complete
Last updated in system: 2026-06-16 16:21:33.166337
Title: Trust-Region Diffusion Policies for Massively Parallel On-Policy RL
Authors: Huy Le, Onur Celik, Denis Blessing, Tai Hoang, Claas A Voelcker, Axel Brunnbauer, Felix Richter, Michael Volpp, Gerhard Neumann
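Abstract summary: Diffusion models provide a more expressive policy class and have shown strong performance on challenging control problems. Most diffusion-based RL methods are designed for offline or off-policy training. Trust-region Diffusion Policies (TruDi) enable diffusion policies for on-policy RL with massively parallel simulations.
Reference score (attention score computed independently by the site): 24.462107727661223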
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Reinforcement learning with massively parallel simulations has become a standard framework for developing robust, deployable policies; however, most existing approaches still rely on simple Gaussian policy parameterizations. Diffusion models provide a more expressive policy class and have shown strong performance on challenging control problems, yet most diffusion-based RL methods are designed for offline or off-policy training. In this work, we ask whether diffusion policies can be trained effectively in the massively parallel, on-policy regime. To this end, we introduce Trust-region Diffusion Policies (TruDi), which enables diffusion policies for on-policy RL with massively parallel simulations. This setting is particularly challenging because the data distribution changes quickly across updates, making stable training with complex policies difficult. TruDi addresses this by integrating a trust-region optimization rule to enforce a KL-divergence constraint over the entire diffusion trajectory. Empirically, we evaluate TruDi on a diverse set of 4 massively parallel RL benchmarks comprising a total of 73 tasks. Across these tasks, TruDi consistently outperforms or is on-par with strong baselines on standard tasks and achieves clear gains on more challenging humanoid control tasks, establishing a strong new baseline for massively parallel on-policy RL.


Related papers