LaViDa: A Large Diffusion Language Model for Multimodal Understanding
- URL: http://arxiv.org/abs/2505.16839v3
- Date: Wed, 18 Jun 2025 15:17:40 GMT
- Title: LaViDa: A Large Diffusion Language Model for Multimodal Understanding
- Authors: Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, Aditya Grover
- Abstract summary: LaViDa is a family of Vision-Language Models built on discrete diffusion models. DMs offer parallel decoding for faster inference and bidirectional context for controllable generation. LaViDa achieves competitive or superior performance to AR VLMs on multi-modal benchmarks.
- Score: 70.99233885354028
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tuning the combined parts for multimodal instruction following. To address challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multi-modal benchmarks such as MMMU, while offering unique advantages of DMs, including flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with 1.92x speedup. On bidirectional tasks, it achieves +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.
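A minimal sketch of the parallel decoding idea claimed above, assuming a masked discrete diffusion LM: every answer position starts as a mask token and a fixed number of refinement steps reveals the most confident predictions, so fewer steps trade quality for speed. The names `model`, `prefix_ids`, and `mask_id` are illustrative assumptions, and the sketch deliberately omits LaViDa's prefix KV cache and timestep shifting; it is not the released LaViDa code.

```python
import torch

def parallel_masked_decode(model, prefix_ids, answer_len, mask_id, num_steps=8):
    # Start from a fully masked answer following the (image + prompt) prefix.
    answer = torch.full((answer_len,), mask_id, dtype=torch.long, device=prefix_ids.device)
    for step in range(num_steps):
        still_masked = answer == mask_id
        if not still_masked.any():
            break
        # One bidirectional forward pass predicts all answer positions at once.
        logits = model(torch.cat([prefix_ids, answer]))[-answer_len:]   # (answer_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        # Reveal an equal share of the remaining masks per step, most confident first.
        num_reveal = max(1, int(still_masked.sum().item() / (num_steps - step)))
        conf = conf.masked_fill(~still_masked, float("-inf"))
        idx = conf.topk(num_reveal).indices
        answer[idx] = pred[idx]
    return answer
```

With `num_steps=8` and a 64-token answer this issues 8 forward passes rather than 64 sequential ones, which is the knob behind the speed-quality tradeoff the abstract describes; a prefix KV cache would additionally avoid recomputing attention over the image-and-prompt prefix at every step.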
Related papers
- LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning [71.98260064022452]
We introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and connector that projects visual features into the language embedding space.
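As a rough illustration of the vision encoder plus connector design mentioned in this summary, here is a minimal LLaVA-style projector sketch; the module name and dimensions are assumptions, not LLaDA-V's actual code.

```python
import torch
import torch.nn as nn

class VisionConnector(nn.Module):
    """Projects vision-encoder patch features into the language embedding space."""
    def __init__(self, vision_dim=1024, lm_dim=4096):
        super().__init__()
        # Two-layer MLP projector, a common choice in LLaVA-style models.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features):      # (num_patches, vision_dim)
        return self.proj(patch_features)    # (num_patches, lm_dim)

# The projected patches are concatenated with text token embeddings and fed to
# the (diffusion) language model as if they were ordinary tokens.
image_tokens = VisionConnector()(torch.randn(576, 1024))
```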
arXiv Detail & Related papers (2025-05-22T17:23:26Z)
- Speculative Decoding Reimagined for Multimodal Large Language Models [48.115777709178595]
This paper introduces Multimodal Speculative Decoding (MSD) to accelerate Multimodal Large Language Model (MLLM) inference. Experiments show that MSD boosts inference speed by up to 2.29x for LLaVA-1.5-7B and up to 2.46x for LLaVA-1.5-13B on multimodal benchmarks.
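For context, a minimal sketch of plain greedy speculative decoding, the general mechanism that MSD adapts to the multimodal setting; `draft` and `target` are assumed to be causal LMs returning per-position logits for a 1-D token id tensor, and nothing here reflects MSD's multimodal-specific changes.

```python
import torch

def speculative_decode_step(draft, target, ids, k=4):
    # 1) The small draft model proposes k tokens autoregressively (cheap).
    proposal = ids.clone()
    for _ in range(k):
        nxt = draft(proposal)[-1].argmax()
        proposal = torch.cat([proposal, nxt.view(1)])
    # 2) The large target model scores the whole proposal in one forward pass.
    target_pred = target(proposal).argmax(-1)      # target_pred[i] predicts token i+1
    drafted = proposal[len(ids):]                  # the k drafted tokens
    verify = target_pred[len(ids) - 1:]            # target's greedy picks for those slots (+1 extra)
    # 3) Accept the longest agreeing prefix, then append the target's own token at
    #    the first disagreement, so at least one verified token is emitted per call.
    n_accept = int((drafted == verify[:k]).long().cumprod(0).sum())
    return torch.cat([ids, drafted[:n_accept], verify[n_accept].view(1)])
```

Because the target verifies all drafted tokens in a single forward pass, the output matches ordinary greedy decoding while amortizing the expensive model over several tokens per step.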
arXiv Detail & Related papers (2025-05-20T12:12:17Z)
- MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention [61.025422435235456]
MMInference is a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. We show that MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy.
arXiv Detail & Related papers (2025-04-22T17:59:51Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models [77.2078051555533]
We propose a novel and affordable solution for the effective VL adaptation of large language models (LLMs).
Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters.
MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions.
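A minimal sketch of the two ideas named in this summary: a residual bottleneck adapter inserted into a frozen LLM layer, plus a router that weighs a text-only and a multimodal adapter path. The soft routing shown here and all names and dimensions are assumptions for illustration, not MMA's actual design.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight residual bottleneck adapter."""
    def __init__(self, dim=4096, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class RoutedAdapter(nn.Module):
    """Chooses between a text-only and a multimodal adapter path."""
    def __init__(self, dim=4096):
        super().__init__()
        self.text_adapter = Adapter(dim)
        self.mm_adapter = Adapter(dim)
        self.router = nn.Linear(dim, 2)     # scores the two paths

    def forward(self, hidden):              # hidden: (seq, dim)
        weights = self.router(hidden.mean(0)).softmax(-1)
        return weights[0] * self.text_adapter(hidden) + weights[1] * self.mm_adapter(hidden)
```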
arXiv Detail & Related papers (2023-05-24T11:06:15Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
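For readers unfamiliar with distillation, a generic temperature-scaled KL distillation loss is sketched below; VLKD's actual objectives are more specific, so treat this only as background on what transferring the PLM's behavior means.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between softened teacher and student token distributions.
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)
```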
arXiv Detail & Related papers (2022-03-12T09:33:37Z)