Fugu-MT Paper Translation (Abstract): Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM

Paper overview: Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM

arxiv url: http://arxiv.org/abs/2604.06832v1
Date: Wed, 08 Apr 2026 08:50:08 GMT
Status: Translation complete
System update date: 2026-04-09 17:30:51.436631
Title: Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM
Authors: Chengyue Wu, Shiyi Lan, Yonggan Fu, Sensen Gao, Jin Wang, Jincheng Yu, Jose M. Alvarez, Pavlo Molchanov, Ping Luo, Song Han, Ligeng Zhu, Enze Xie
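Abstract summary: We present Fast-dVLM, a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding. With SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.
Reference score (independently computed attention score): 58.322826487307765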
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one, making AR decoding memory-bandwidth-bound and leaving hardware parallelism underutilized. While block-wise discrete diffusion has shown promise for parallel text generation, extending it to VLMs remains challenging due to the need to jointly handle continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities. We present Fast-dVLM, a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. We systematically compare two AR-to-diffusion conversion strategies: a two-stage approach that first adapts the LLM backbone with text-only diffusion fine-tuning before multimodal training, and a direct approach that converts the full AR VLM in one stage. Under comparable training budgets, direct conversion proves substantially more efficient by leveraging the already multimodally aligned VLM; we therefore adopt it as our recommended recipe. We introduce a suite of multimodal diffusion adaptations, block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation, that collectively enable effective block diffusion in the VLM setting. Extensive experiments across 11 multimodal benchmarks show Fast-dVLM matches its autoregressive counterpart in generation quality. With SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.
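The abstract's point that batch-size-one AR decoding is memory-bandwidth-bound can be illustrated with a back-of-envelope bound. The sketch below is not from the paper. It assumes every forward pass streams all model weights from memory exactly once, and it ignores KV-cache reads, compute time, and the vision encoder. The parameter count, bandwidth, and tokens-per-pass values are hypothetical placeholders, not measured figures.

```python
def decode_tokens_per_second(num_params, bytes_per_param,
                             bandwidth_bytes_per_s, tokens_per_pass=1.0):
    """Upper bound on decode throughput under a weights-only bandwidth model.

    Assumes each forward pass reads every weight once, so the number of
    passes per second is bandwidth divided by total weight bytes.
    """
    weight_bytes = num_params * bytes_per_param
    passes_per_second = bandwidth_bytes_per_s / weight_bytes
    return passes_per_second * tokens_per_pass


# Hypothetical edge-device setting (illustrative only, not from the paper).
NUM_PARAMS = 3e9    # assumed 3B-parameter model
BANDWIDTH = 200e9   # assumed 200 GB/s memory bandwidth

# AR decoding: one token committed per pass, 2-byte (16-bit) weights.
ar_16bit = decode_tokens_per_second(NUM_PARAMS, 2, BANDWIDTH)

# Parallel decoding: assume 2 tokens committed per pass on average.
parallel_16bit = decode_tokens_per_second(NUM_PARAMS, 2, BANDWIDTH,
                                          tokens_per_pass=2.0)

# Same parallel decoding with 1-byte (FP8) weights.
parallel_fp8 = decode_tokens_per_second(NUM_PARAMS, 1, BANDWIDTH,
                                        tokens_per_pass=2.0)

print(f"AR, 16-bit:       {ar_16bit:.1f} tokens/s")
print(f"Parallel, 16-bit: {parallel_16bit:.1f} tokens/s")
print(f"Parallel, FP8:    {parallel_fp8:.1f} tokens/s")
```

Under this simplified model the bound scales linearly with the number of tokens committed per pass and inversely with bytes per weight, which is why parallel decoding and quantization compound. Real speedups depend on factors the model omits, such as how many passes a block needs and kernel efficiency.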

Related papers