SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding
- URL: http://arxiv.org/abs/2512.14068v1
- Date: Tue, 16 Dec 2025 04:12:52 GMT
- Title: SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding
- Authors: Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, Bowen Zhou,
- Abstract summary: Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling.<n>We present textbfSDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding.<n>We show that SDAR-VL consistently improves emphtraining efficiency, emphconvergence stability, and emphtask performance over conventional block diffusion.
- Score: 25.2227348401136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present \textbf{SDAR-VL}, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an \emph{integrated framework for efficient and stable training}. This framework unifies three components: (1) \textbf{Asynchronous Block-wise Noise Scheduling} to diversify supervision within each batch; (2) \textbf{Effective Mask Ratio Scaling} for unbiased loss normalization under stochastic masking; and (3) a \textbf{Progressive Beta Noise Curriculum} that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves \emph{training efficiency}, \emph{convergence stability}, and \emph{task performance} over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.
Related papers
- OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL [63.388513841293616]
Existing forgery detection methods fail to handle the interleaved text, images, and videos prevalent in real-world misinformation.<n>To bridge this gap, this paper targets to develop a unified framework for omnibus vision-language forgery detection and grounding.<n>We propose textbf OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding.
arXiv Detail & Related papers (2026-02-11T09:41:36Z) - Causal Autoregressive Diffusion Language Model [70.7353007255797]
CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass.<n>Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation.
arXiv Detail & Related papers (2026-01-29T17:38:29Z) - Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose textbfTACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks.<n>Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it incurs significant computational benefits.
arXiv Detail & Related papers (2025-12-02T14:42:54Z) - $\mathcal{E}_0$: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion [65.77755100137728]
We introduce E0, a continuized discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens.<n>E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average.
arXiv Detail & Related papers (2025-11-26T16:14:20Z) - Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model [62.889356203346985]
We propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict.<n>DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods.<n>On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%.
arXiv Detail & Related papers (2025-10-31T16:32:12Z) - Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies [62.653984010274485]
Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions.<n> prevailingAs either generate actions auto-regressively in a fixed left-to-right order or attach separate or diffusion heads outside the backbone.<n>We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion.
arXiv Detail & Related papers (2025-08-27T17:39:11Z) - DiffCAP: Diffusion-based Cumulative Adversarial Purification for Vision Language Models [45.126261544696185]
Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to perturbations poses a significant threat to their reliability in real-world applications.<n>This paper introduces DiffCAP, a novel diffusion-based purification strategy that can effectively neutralize adversarial corruptions in VLMs.
arXiv Detail & Related papers (2025-06-04T13:26:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.