Fugu-MT Paper Translation (Summary): Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving

Paper summary: Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving

arxiv url: http://arxiv.org/abs/2604.11734v2
Date: Tue, 14 Apr 2026 07:22:13 GMT
Status: Translation complete
Last updated in system: 2026-04-15 14:01:13.527037
Title: Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving
Authors: Haojie Bai, Aimin Li, Ruoyu Yao, Xiongwei Zhao, Tingting Zhang, Xing Zhang, Lin Gao, and Jun Ma
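Abstract summary: The paper proposes Multi-ORFT, which couples scene-conditioned diffusion pre-training with stable online reinforcement post-training. In pre-training, the planner uses inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning. In post-training, the authors formulate a two-level MDP that exposes step-wise reverse-kernel likelihoods for online optimization.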
Reference score (site-computed attention score): 22.627579758896967
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. Existing diffusion planners can model multimodal behaviors from demonstrations, but they often exhibit weak scene consistency and remain poorly aligned with closed-loop objectives; meanwhile, stable online post-training in reactive multi-agent environments remains difficult. We present Multi-ORFT, which couples scene-conditioned diffusion pre-training with stable online reinforcement post-training. In pre-training, the planner uses inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning to improve scene consistency and road adherence of joint trajectories. In post-training, we formulate a two-level MDP that exposes step-wise reverse-kernel likelihoods for online optimization, and combine dense trajectory-level rewards with variance-gated group-relative policy optimization (VG-GRPO) to stabilize training. On the WOMD closed-loop benchmark, Multi-ORFT reduces collision rate from 2.04% to 1.89% and off-road rate from 1.68% to 1.36%, while increasing average speed from 8.36 to 8.61 m/s relative to the pre-trained planner, and it outperforms strong open-source baselines including SMART-large, SMART-tiny-CLSFT, and VBD on the primary safety and efficiency metrics. These results show that coupling scene-consistent denoising with stable online diffusion-policy optimization improves the reliability of closed-loop cooperative driving.
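The abstract reports absolute before/after figures only. As a reading aid, the short sketch below converts them into relative changes against the pre-trained planner. The helper `relative_change` and the `reported` dictionary are illustrative names introduced here, not taken from the paper; the numbers themselves are copied from the abstract.

```python
def relative_change(before: float, after: float) -> float:
    """Return the signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures quoted in the abstract: (pre-trained planner, Multi-ORFT).
# The dictionary layout is an assumption made for this example.
reported = {
    "collision rate (%)": (2.04, 1.89),
    "off-road rate (%)": (1.68, 1.36),
    "average speed (m/s)": (8.36, 8.61),
}

# Relative change per metric; negative means the value went down.
changes = {name: relative_change(b, a) for name, (b, a) in reported.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

With the abstract's numbers this prints roughly -7.4% for collision rate, -19.0% for off-road rate, and +3.0% for average speed, so the off-road improvement is the largest in relative terms.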


Related papers