A Survey on Diffusion Language Models
- URL: http://arxiv.org/abs/2508.10875v1
- Date: Thu, 14 Aug 2025 17:47:22 GMT
- Title: A Survey on Diffusion Language Models
- Authors: Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen
- Abstract summary: Diffusion Language Models (DLMs) are an alternative to the dominant autoregressive (AR) paradigm. DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context. Recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs reduce inference latency, capture bidirectional context, and enable fine-grained control over the generation process. Recent advancements have allowed DLMs to achieve performance comparable to their autoregressive counterparts while delivering a several-fold speed-up, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. The project GitHub repository is available at https://github.com/VILA-Lab/Awesome-DLMs.
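To make the "parallel generation through iterative denoising" idea concrete, the following is a minimal sketch of confidence-based parallel decoding for a masked diffusion LM. The `model` interface, the `mask_id` convention, and the linear unmasking schedule are illustrative assumptions, not the API of any specific model covered by the survey.

```python
import torch

def parallel_denoise(model, seq_len, mask_id, num_steps=8):
    """Iteratively fill a fully masked sequence, committing the most
    confident predictions at each denoising step (sketch only)."""
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)  # start fully masked
    pred = tokens.clone()  # fallback if the loop never runs
    for step in range(num_steps):
        still_masked = tokens == mask_id
        if not still_masked.any():
            break
        logits = model(tokens)                 # (seq_len, vocab) scores in one forward pass
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)         # per-position confidence and argmax
        # Unmask a growing fraction of the remaining masked positions each step.
        num_to_unmask = max(1, int(still_masked.sum().item() * (step + 1) / num_steps))
        conf = conf.masked_fill(~still_masked, float("-inf"))  # keep committed tokens fixed
        chosen = conf.topk(num_to_unmask).indices
        tokens[chosen] = pred[chosen]
    # Commit any leftover masked positions greedily.
    leftover = tokens == mask_id
    tokens[leftover] = pred[leftover]
    return tokens

# Toy usage: a random-logits stand-in for a trained masked DLM.
if __name__ == "__main__":
    vocab, length = 100, 16
    mask_id = vocab - 1
    dummy_model = lambda toks: torch.randn(len(toks), vocab)
    print(parallel_denoise(dummy_model, length, mask_id))
```

Real DLM samplers layer remasking, caching of unchanged positions, and tuned unmasking schedules on top of a loop like this, which is the space of decoding-parallelism and caching optimizations the survey reviews.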
Related papers
- WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens [69.97021957331326]
We propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization. We also introduce a VAE branch with linear projection to recover fine-grained image details.
arXiv Detail & Related papers (2025-12-02T09:02:20Z)
- A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks. Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z)
- Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models [82.87985794856803]
Large Language Models (LLMs) have achieved state-of-the-art performance on a broad range of Natural Language Processing (NLP) tasks. Recently, Diffusion Language Models (DLMs) have emerged as a promising alternative architecture.
arXiv Detail & Related papers (2025-10-05T10:50:52Z)
- The Evolution of Video Anomaly Detection: A Unified Framework from DNN to MLLM [27.800308082023285]
Video anomaly detection (VAD) aims to identify and ground anomalous behaviors or events in videos. The continuous evolution of deep model architectures has driven innovation in VAD methodologies. The rapid development of multi-modal large language models (MLLMs) and large language models (LLMs) has introduced new opportunities and challenges to the VAD field.
arXiv Detail & Related papers (2025-07-29T10:07:24Z)
- Discrete Diffusion in Large Language and Multimodal Models: A Survey [56.31088116526825]
We provide a systematic survey of Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs). Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token, parallel decoding paradigm (see the formula sketch after this list). We trace the historical development of dLLMs and dMLLMs, formalize the underlying mathematical frameworks, and categorize representative models.
arXiv Detail & Related papers (2025-06-16T17:59:08Z)
- DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs [5.074812070492738]
We introduce DaMO, a data-efficient Video LLM specifically designed for accurate temporal reasoning and multimodal understanding. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. Our work establishes a promising direction for data-efficient video-language modeling.
arXiv Detail & Related papers (2025-06-13T08:13:05Z)
- DLM-One: Diffusion Language Models for One-Step Sequence Generation [63.43422118066493]
DLM-One is a score-distillation-based framework for one-step sequence generation with continuous diffusion language models. We investigate whether DLM-One can achieve substantial gains in sampling efficiency for language modeling.
arXiv Detail & Related papers (2025-05-30T22:42:23Z)
- Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model [63.14883657299359]
Multi-modal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering. Tuning MLLMs for downstream tasks encounters two key challenges: Task-Expert, where distribution shifts between pre-training and target datasets constrain target performance, and OpenWorld Stabilization, where catastrophic forgetting erases the model's general knowledge.
arXiv Detail & Related papers (2025-03-06T15:29:13Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities. In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting LLMs for downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
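Several of the entries above (the discrete diffusion survey and DLM-One in particular) build on the masked, absorbing-state diffusion formulation. The following is a minimal sketch of that common formulation, with notation assumed here rather than taken from any single paper: each token of a clean sequence is independently replaced by a [MASK] symbol with probability determined by the noise schedule, and training reduces to a weighted cross-entropy over the masked positions.

```latex
% Sketch of the absorbing-state (masked) forward process commonly used by discrete
% diffusion LMs; \alpha_t is a noise schedule with \alpha_0 = 1 and \alpha_1 = 0.
\[
q\!\left(x_t^{i} \mid x_0^{i}\right) =
\begin{cases}
\alpha_t, & x_t^{i} = x_0^{i},\\[2pt]
1-\alpha_t, & x_t^{i} = \texttt{[MASK]},
\end{cases}
\qquad i = 1,\dots,L .
\]
% The variational training objective then simplifies to a schedule-weighted
% masked cross-entropy over the noised positions:
\[
\mathcal{L} \;=\; \mathbb{E}_{t,\,x_0,\,x_t}\!\left[\, w(t) \sum_{i\,:\,x_t^{i}=\texttt{[MASK]}} -\log p_\theta\!\left(x_0^{i}\mid x_t\right) \right].
\]
```

Here $w(t)$ is a schedule-dependent weight (for example, $1/t$ under a linear masking schedule); reversing this process a few positions at a time yields the parallel decoding loop sketched after the main abstract above.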