Fugu-MT paper translation (abstract): dVLA-RL: Reinforcement Learning over Denoising Trajectories for Discrete Diffusion Vision-Language-Action Models

Paper abstract: dVLA-RL: Reinforcement Learning over Denoising Trajectories for Discrete Diffusion Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2606.23623v1
Date: Mon, 22 Jun 2026 17:19:03 GMT
Status: translation complete
Last updated in system: 2026-06-24 17:40:38.764333
Title: dVLA-RL: Reinforcement Learning over Denoising Trajectories for Discrete Diffusion Vision-Language-Action Models
Authors: Yuhao Wu, Yitian Liu, Weijie Shen, Mishuo Han, Wenjie Xu, Haotian Liang, Zhongshan Liu, Yinan Mao, Lei Xu, Xinping Guan, Ru Ying, Ran Zheng, Wei Sui, Xiaokang Yang, Wenbo Ding, Yao Mu
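Abstract summary: We propose dVLA-RL, which shifts the learning objective from the marginal action probability to the joint probability of the sampled generation path. The method achieves a 99.7% success rate on LIBERO. It also establishes strong VLA-based results on RoboTwin 2.0, delivering a 30.6% improvement over the SFT baseline.
Reference score (independently computed attention score): 49.497309561043004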
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language-Action (VLA) models have established a powerful paradigm for generalist robotic manipulation by grounding control in the semantic reasoning of VLMs. Prevailing architectures typically model actions continuously via diffusion or flow processes, or discretely through either autoregressive generation or parallel decoding. Recently, Discrete Diffusion VLAs (dVLAs) have emerged as a distinct alternative, unifying vision, language, and action into a single discrete token space via masked generative modeling. Although dVLAs combine iterative refinement with unified representations, their training has thus far been restricted to Supervised Fine-Tuning (SFT), leaving the potential of Reinforcement Learning (RL) for further policy refinement largely unexplored. A fundamental challenge in RL for dVLAs is that the marginal probability of the final action they generate is intractable. To solve this problem, we propose dVLA-RL, which shifts the learning objective from the marginal action probability to the joint probability of the sampled generation path. Specifically, by modeling the denoising process as a Markov Decision Process (MDP), we mathematically formulate this path probability as a product of step-wise transitions. This trajectory-level objective provides a unified formulation that natively accommodates variable numbers of denoising steps. Leveraging this intrinsic flexibility, we introduce a unified step scheduling approach for complex multi-task learning, tailoring the number of denoising steps to each task's complexity to maximize both success rates and computational efficiency. Extensive evaluations demonstrate that our approach achieves a success rate of 99.7% on LIBERO. Furthermore, it establishes strong VLA-based results on RoboTwin 2.0 by delivering a 30.6% improvement over the SFT baseline, remaining competitive with strong World-Action Model baselines.
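Illustration (not from the paper): the abstract states that the path probability is a product of step-wise transitions, which in log space becomes a sum over denoising steps. The sketch below shows only that factorization on toy numbers. The function names, the data layout, and the assumption that each transition is scored solely by the tokens it unmasks (ignoring how positions are chosen for unmasking) are hypothetical and may differ from the authors' formulation.

```python
import math

def step_log_prob(token_probs, unmasked):
    """Log-probability of one denoising transition (assumed form).

    token_probs: dict mapping position -> list of token probabilities
    unmasked:    dict mapping position -> token id revealed at this step
    """
    return sum(math.log(token_probs[pos][tok]) for pos, tok in unmasked.items())

def path_log_prob(steps):
    """Joint log-probability of a sampled path: the sum of step-wise terms."""
    return sum(step_log_prob(token_probs, unmasked) for token_probs, unmasked in steps)

# Hypothetical 2-step path over a 2-token action chunk with a 2-symbol vocabulary.
steps = [
    ({0: [0.5, 0.5], 1: [0.9, 0.1]}, {0: 1}),  # step 1 reveals position 0 as token 1
    ({1: [0.75, 0.25]}, {1: 1}),               # step 2 reveals position 1 as token 1
]
log_p = path_log_prob(steps)  # log(0.5) + log(0.25)
```

Because the sum runs over however many steps the path contains, the same function scores paths of different lengths, which is the property the abstract relies on for variable denoising steps.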
