Fugu-MT Paper Translation (Abstract): BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

Paper overview: BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

arxiv url: http://arxiv.org/abs/2605.13382v1
Date: Wed, 13 May 2026 11:37:51 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:28.017137
Title: BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
Authors: Ruiheng Wang, Shuanghao Bai, Haoran Zhang, Badong Chen, Xiangyu Xu
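Abstract summary: BlockVLA is a framework that adapts pretrained AR backbones into an efficient discrete diffusion policy. It is evaluated extensively on the LIBERO and SimplerEnv benchmarks. The model shows superior training efficiency, with success rates converging substantially faster than baselines.
Reference score (independently computed attention score): 41.5997751218601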
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While autoregressive (AR) Vision-Language-Action (VLA) models have demonstrated formidable reasoning capabilities in robotic tasks, their sequential decoding process often incurs high inference latency and may amplify error accumulation during long-horizon execution. Discrete Diffusion Language Models (dLLMs) provide a promising alternative through parallel token refinement, but their practical deployment in robotics remains limited by repeated denoising function evaluations (NFEs) and the difficulty of directly applying standard KV caching to bidirectional iterative decoding. To bridge these paradigms, we propose BlockVLA, a framework that adapts pretrained AR backbones into an efficient discrete diffusion policy through a block diffusion paradigm. BlockVLA maintains autoregressive dependencies at the block level while enabling parallel denoising within each block, thereby combining global causal coherence with local parallel generation. This design enables prefix KV-cache reuse across completed blocks, reduces the effective cost of iterative denoising, and provides a smoother transition from AR pretraining to diffusion-based policy fine-tuning. We conduct extensive evaluations on the LIBERO and SimplerEnv benchmarks. Experimental results demonstrate that our BlockVLA achieves a 3.3$\times$ inference acceleration over standard discrete diffusion baselines. Furthermore, our model exhibits superior training efficiency, with success rates converging substantially faster than baselines, a gain that is particularly pronounced in complex, long-horizon tasks, where BlockVLA achieves significant performance gains in the early stages of training. This work establishes Block Diffusion as a robust bridge between large-scale pretrained AR models and efficient, high-frequency real-time robotic control.
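The block-level decoding described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the names `block_causal_mask`, `decode_blocks`, `toy_denoise`, and the `MASK` sentinel are hypothetical, the denoiser is a deterministic stand-in for a model forward pass, and the left-to-right unmasking schedule is an assumption chosen for simplicity (practical systems often commit tokens by confidence instead).

```python
from typing import Callable, List

MASK = -1  # hypothetical sentinel for an action token that is not decoded yet


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Positions attend bidirectionally inside their own block and causally
    to every earlier block, never to a later block.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def decode_blocks(
    denoise: Callable[[List[int], List[int]], List[int]],
    prefix: List[int],
    num_blocks: int,
    block_size: int,
    steps_per_block: int,
) -> List[int]:
    """Generate blocks left to right, refining each block in parallel.

    `denoise(context, block)` stands in for one forward pass. `context`
    (prefix plus completed blocks) never changes while a block is being
    refined, which is the part a prefix KV cache could hold.
    """
    context = list(prefix)
    for _ in range(num_blocks):
        block = [MASK] * block_size
        for step in range(steps_per_block):
            proposal = denoise(context, block)
            # Assumed schedule: commit a growing share of positions per
            # step; the final step commits every remaining position.
            keep = block_size * (step + 1) // steps_per_block
            block = [
                block[k] if block[k] != MASK
                else (proposal[k] if k < keep else MASK)
                for k in range(block_size)
            ]
        context.extend(block)  # completed block joins the frozen context
    return context[len(prefix):]


def toy_denoise(context: List[int], block: List[int]) -> List[int]:
    """Toy stand-in: predicts each token as its absolute position index."""
    return [len(context) + k for k in range(len(block))]


mask = block_causal_mask(seq_len=4, block_size=2)
actions = decode_blocks(toy_denoise, prefix=[100, 101], num_blocks=2,
                        block_size=2, steps_per_block=2)
```

In this sketch the number of denoiser calls is `num_blocks * steps_per_block`, and each call only has to recompute the current block, since everything in `context` is fixed once its block is complete.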

