DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
- URL: http://arxiv.org/abs/2512.15713v2
- Date: Wed, 24 Dec 2025 03:37:34 GMT
- Title: DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
- Authors: Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang
- Abstract summary: The performance of the diffusion vision language model (dVLM) still lags significantly behind that of mainstream models. We propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. DiffusionVL achieves a comprehensive performance improvement, including a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cognition) benchmark, alongside a 2x inference speedup.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive (AR) paradigm, owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of the diffusion vision language model (dVLM) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: is it possible to construct dVLMs from existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. Through simple fine-tuning, we successfully adapt AR pre-trained models to the diffusion paradigm. This approach yields two key observations: (1) the paradigm shift from AR-based multimodal models to diffusion is remarkably effective; (2) direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual instruction tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV-cache reuse, achieving a significant inference speedup. We conduct extensive experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, including a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cognition) benchmark, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.
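The abstract does not spell out the block-decoding algorithm, so the following is only a minimal sketch of how block-wise masked-diffusion decoding with a reusable prefix typically works, not the authors' implementation. All names here (`ToyDenoiser`, `MASK_ID`, `BLOCK`, `STEPS`) and the confidence-based unmasking schedule are assumptions; in particular, the toy denoiser re-encodes the prefix, where a real dVLM would reuse cached keys/values for every finished block.

```python
import torch

MASK_ID, VOCAB = 0, 32000   # assumed mask-token id and vocabulary size
BLOCK, STEPS = 8, 4         # assumed block size / denoising steps per block

class ToyDenoiser(torch.nn.Module):
    """Stand-in for the adapted VLM; purely illustrative."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, 64)
        self.head = torch.nn.Linear(64, VOCAB)

    def forward(self, block_ids, prefix_ids):
        # A real dVLM would attend to *cached* prefix keys/values here;
        # this stub just conditions the block on a mean prefix embedding.
        ctx = self.emb(prefix_ids).mean(dim=1, keepdim=True)
        return self.head(self.emb(block_ids) + ctx)   # (1, BLOCK, VOCAB)

@torch.no_grad()
def generate(model, prompt_ids, num_blocks=3):
    seq = prompt_ids                                  # finished tokens only
    for _ in range(num_blocks):                       # arbitrary-length generation
        block = torch.full((1, BLOCK), MASK_ID)       # start fully masked
        for step in range(STEPS):                     # iterative unmasking
            conf, pred = model(block, seq).softmax(-1).max(-1)
            masked = block == MASK_ID
            if not masked.any():
                break
            # commit the most confident masked positions this step
            k = max(1, int(masked.sum()) // (STEPS - step))
            idx = conf.masked_fill(~masked, -1.0).flatten().topk(k).indices
            block.view(-1)[idx] = pred.view(-1)[idx]
        seq = torch.cat([seq, block], dim=1)          # block finalized: KV reusable
    return seq

print(generate(ToyDenoiser(), torch.randint(1, VOCAB, (1, 5))).shape)  # (1, 29)
```

Because a committed block never changes again, its attention keys and values can be computed once and reused for all later blocks and denoising steps; this is the property that makes KV-cache reuse, and hence the reported speedup, possible in a block-decoding dVLM.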
Related papers
- Analyzing Diffusion and Autoregressive Vision Language Models in Multimodal Embedding Space [52.34072027212278]
Embedding models are a fundamental component of modern AI systems such as semantic search and retrieval-augmented generation. Recent advances in large foundation models have substantially accelerated the development of embedding models. We present the first systematic study of converting multimodal dLLMs into embedding models.
arXiv Detail & Related papers (2026-01-19T06:51:15Z)
- Breaking the Bottleneck with DiffuApriel: High-Throughput Diffusion LMs with Mamba Backbone [6.76700377196741]
We introduce DiffuApriel, a masked diffusion language model built on a bidirectional Mamba backbone. Our results demonstrate that bidirectional state-space architectures serve as strong denoisers in masked diffusion LMs.
arXiv Detail & Related papers (2025-11-19T23:23:49Z) - Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies [62.653984010274485]
Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions.<n> prevailingAs either generate actions auto-regressively in a fixed left-to-right order or attach separate or diffusion heads outside the backbone.<n>We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion.
arXiv Detail & Related papers (2025-08-27T17:39:11Z)
- Diffusion Beats Autoregressive in Data-Constrained Settings [50.56893491038853]
Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. We systematically study masked diffusion models in data-constrained settings where training involves repeated passes over limited data. Our results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm.
arXiv Detail & Related papers (2025-07-21T17:59:57Z)
- LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning [71.98260064022452]
We introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and a connector that projects visual features into the language embedding space.
arXiv Detail & Related papers (2025-05-22T17:23:26Z)
- Scaling Diffusion Language Models via Adaptation from Autoregressive Models [105.70889434492143]
Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling. We show that AR models ranging from 127M to 7B parameters can be converted into the diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts.
arXiv Detail & Related papers (2024-10-23T14:04:22Z)
- Transfer Learning for Text Diffusion Models [16.97230119564891]
We explore the potential for text diffusion to replace autoregressive (AR) decoding for the training and deployment of large language models (LLMs). We use a lightweight adaptation procedure we call "AR2Diff" to transform pretrained AR models into text diffusion models; a rough sketch of this style of masked-diffusion objective appears after this list.
arXiv Detail & Related papers (2024-01-30T17:11:56Z)
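Several of the papers above (AR2Diff, DiffuGPT/DiffuLLaMA, and DiffusionVL itself) share one core move: continue training a pretrained AR checkpoint with a masked-diffusion objective under bidirectional attention. The sketch below is a generic, hedged illustration of such an objective; the exact masking schedule, loss weighting, and attention-mask annealing differ across the papers, and `masked_diffusion_loss` and `MASK_ID` are illustrative names, not any paper's API.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed id of the [MASK] token

def masked_diffusion_loss(model, token_ids):
    """Generic masked-diffusion training step: corrupt a random fraction of
    tokens, then train the (bidirectional) model to reconstruct them.
    Illustrative only; not any single paper's exact loss."""
    b, n = token_ids.shape
    t = torch.rand(b, 1)                                # corruption level per sequence
    masked = torch.rand(b, n) < t                       # mask each token w.p. t
    corrupted = token_ids.masked_fill(masked, MASK_ID)

    logits = model(corrupted)                           # (b, n, vocab), full attention
    per_tok = F.cross_entropy(logits.transpose(1, 2), token_ids, reduction="none")
    per_tok = per_tok * masked / t                      # loss on masked slots, ~1/t weight
    return per_tok.sum() / masked.sum().clamp(min=1)

# Toy usage with a stand-in "model" (embedding + linear head):
toy = torch.nn.Sequential(torch.nn.Embedding(1000, 32), torch.nn.Linear(32, 1000))
print(masked_diffusion_loss(toy, torch.randint(1, 1000, (2, 16))))
```

The other half of the adaptation, not shown here, is architectural: the causal attention mask of the AR model is relaxed to (block-)bidirectional attention so that masked positions can condition on context from both sides of the sequence.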