Fugu-MT Paper Translation (Abstract): Dichotomous Diffusion Policy Optimization

Paper overview: Dichotomous Diffusion Policy Optimization

arxiv url: http://arxiv.org/abs/2601.00898v1
Date: Wed, 31 Dec 2025 16:56:56 GMT
Status: Translation complete
Last updated in system: 2026-01-06 16:25:21.83708
Title: Dichotomous Diffusion Policy Optimization
Title (reference translation): Dichotomous Diffusion Policy Optimization
Authors: Ruiming Liang, Yinan Zheng, Kexin Zheng, Tianyi Tan, Jianxiong Li, Liyuan Mao, Zhihao Wang, Guang Chen, Hangjun Ye, Jingjing Liu, Jinqiao Wang, Xianyuan Zhan
Abstract summary: DIPOLE is a novel RL algorithm designed for stable and controllable diffusion policy optimization. The authors also use DIPOLE to train a large vision-language-action model for end-to-end autonomous driving.
Reference score (independently computed attention score): 46.51375996317989
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion-based policies have gained growing popularity in solving a wide range of decision-making tasks due to their superior expressiveness and controllable generation during inference. However, effectively training large diffusion policies using reinforcement learning (RL) remains challenging. Existing methods either suffer from unstable training due to directly maximizing value objectives, or face computational issues due to relying on crude Gaussian likelihood approximation, which requires a large amount of sufficiently small denoising steps. In this work, we propose DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm designed for stable and controllable diffusion policy optimization. We begin by revisiting the KL-regularized objective in RL, which offers a desirable weighted regression objective for diffusion policy extraction, but often struggles to balance greediness and stability. We then formulate a greedified policy regularization scheme, which naturally enables decomposing the optimal policy into a pair of stably learned dichotomous policies: one aims at reward maximization, and the other focuses on reward minimization. Under such a design, optimized actions can be generated by linearly combining the scores of dichotomous policies during inference, thereby enabling flexible control over the level of greediness. Evaluations in offline and offline-to-online RL settings on ExORL and OGBench demonstrate the effectiveness of our approach. We also use DIPOLE to train a large vision-language-action (VLA) model for end-to-end autonomous driving (AD) and evaluate it on the large-scale real-world AD benchmark NAVSIM, highlighting its potential for complex real-world applications.
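Abstract (reference translation): Diffusion-based policies have grown in popularity for solving a wide range of decision-making tasks, owing to their superior expressiveness and controllable generation during inference. However, effectively training large diffusion policies with reinforcement learning (RL) remains difficult. Existing methods either suffer from unstable training caused by directly maximizing value objectives, or face computational issues because they rely on a crude Gaussian likelihood approximation. In this work, we propose DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm for stable and controllable diffusion policy optimization. We begin by revisiting the KL-regularized objective in RL, which provides a desirable weighted regression objective for diffusion policy extraction but often struggles to balance greediness and stability. We then formulate a greedified policy regularization scheme that naturally decomposes the optimal policy into a pair of stably learned dichotomous policies: one aims at reward maximization, the other at reward minimization. Under this design, optimized actions can be generated by linearly combining the scores of the dichotomous policies during inference, which allows flexible control over the level of greediness. Evaluations in offline and offline-to-online RL settings on ExORL and OGBench demonstrate the effectiveness of our approach. We also use DIPOLE to train a large vision-language-action (VLA) model for end-to-end autonomous driving (AD) and evaluate it on the large-scale real-world AD benchmark NAVSIM, highlighting its potential for complex real-world applications.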
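The abstract's idea of generating actions by linearly combining the scores of two policies can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes one-dimensional actions, replaces the learned diffusion policies with two analytic Gaussians (a hypothetical "reward-maximizing" policy centered at `mean_pos` and a "reward-minimizing" one at `mean_neg`), and assumes the combination takes the form `(1 + w) * s_pos - w * s_neg`, where `w` is a hypothetical greediness weight. The exact weighting used by DIPOLE may differ.

```python
import numpy as np


def gaussian_score(a, mean, std):
    """Score (gradient of the log-density) of N(mean, std^2) evaluated at a."""
    return -(a - mean) / std**2


def combined_score(a, w, mean_pos=1.0, mean_neg=-1.0, std=1.0):
    """Assumed linear combination of the two policy scores.

    w = 0 recovers the positive policy alone; larger w pushes samples
    further away from the negative policy (a greedier action choice).
    """
    s_pos = gaussian_score(a, mean_pos, std)  # stand-in for the reward-maximizing policy
    s_neg = gaussian_score(a, mean_neg, std)  # stand-in for the reward-minimizing policy
    return (1.0 + w) * s_pos - w * s_neg


def langevin_sample(w, n_samples=5000, n_steps=500, step=0.01, seed=0):
    """Draw actions with unadjusted Langevin dynamics driven by the combined score."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(n_samples)  # start from pure noise
    for _ in range(n_steps):
        noise = rng.standard_normal(n_samples)
        a = a + step * combined_score(a, w) + np.sqrt(2.0 * step) * noise
    return a


if __name__ == "__main__":
    for w in (0.0, 0.5, 1.0):
        actions = langevin_sample(w)
        print(f"w={w:.1f}  mean action={actions.mean():.2f}")
```

In this toy setting both Gaussians share the same variance, so the combined score is itself the score of a Gaussian with mean `(1 + w) * mean_pos - w * mean_neg`; the sampled mean should therefore move from roughly 1 toward larger values as `w` increases. In a real diffusion policy the two scores would come from trained networks conditioned on the state and the noise level, and sampling would follow the model's denoising schedule rather than plain Langevin steps.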

Related papers