A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
- URL: http://arxiv.org/abs/2511.15098v1
- Date: Wed, 19 Nov 2025 04:13:36 GMT
- Title: A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
- Authors: Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu
- Abstract summary: We study how visual token redundancy evolves with different dMLLM architectures and tasks. Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
- Score: 85.30893355216486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs incur significant computational overhead during inference due to the full-sequence attention computation in each denoising step. Pioneering studies attempt to resolve this issue from a modality-agnostic perspective via key-value cache optimization or efficient sampling, but most of them overlook modality-specific visual token redundancy. In this work, we conduct a comprehensive study on how visual token redundancy evolves with different dMLLM architectures and tasks and how visual token pruning affects dMLLM responses and efficiency. Specifically, our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. In addition, we validate that visual token pruning introduces non-negligible information loss in dMLLMs and only from-scratch dMLLMs can recover the lost information progressively during late denoising steps. Furthermore, our study shows that layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs. Overall, this work offers a new perspective on efficiency optimization for dMLLMs, greatly advancing their applicability across various multimodal understanding tasks.
Related papers
- Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs [10.443777669301983]
Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval. But their practical application is often hindered by the substantial computational cost incurred from processing a large number of tokens from visual inputs. We propose Magic-MM-Embedding, a series of novel models that achieve both high efficiency and state-of-the-art performance in universal multimodal embedding.
arXiv Detail & Related papers (2026-02-05T04:01:01Z)
- Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models [58.91911788912665]
We propose Latent Visual Reconstruction (LaVer), a novel training framework that facilitates MLLMs in learning more discriminative visual representations. Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information.
arXiv Detail & Related papers (2025-12-06T04:20:13Z)
- WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens [69.97021957331326]
We propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization. We also introduce a VAE branch with linear projection to recover fine-grained image details.
arXiv Detail & Related papers (2025-12-02T09:02:20Z)
- Sparse Training Scheme for Multimodal LLM [26.81140959413325]
Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variety of domains. We propose a novel training-efficient framework based on sparse representations, termed the Sparse Training Scheme (STS). This scheme consists of two key components: the Visual Token, which reduces the information load by compressing visual tokens, and the Layer Dynamic Skipper, which mitigates the computational overhead by skipping unnecessary layers in the language model during both forward and backward passes.
arXiv Detail & Related papers (2025-09-16T11:33:20Z)
- Visual Instruction Bottleneck Tuning [11.675789515164269]
Multimodal large language models (MLLMs) suffer performance degradation when encountering unfamiliar queries under distribution shifts. We take an alternative approach to enhance the generalization and robustness of MLLMs from a representation learning perspective. Inspired by the information bottleneck (IB) principle, we derive a variational lower bound of the IB for MLLMs and devise a practical implementation, Visual Instruction Bottleneck Tuning (Vittle).
arXiv Detail & Related papers (2025-05-20T05:24:53Z)
- Grounded Chain-of-Thought for Multimodal Large Language Models [66.04061083611863]
We propose a new learning task for multimodal large language models (MLLMs) called Grounded Chain-of-Thought (GCoT). GCoT aims to help MLLMs recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis. To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT) consisting of 24,022 GCoT examples for 5,033 images.
arXiv Detail & Related papers (2025-03-17T04:07:47Z)
- Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs [62.9348974370985]
We propose attention reallocation (AttnReal) to mitigate hallucinations with nearly zero extra cost. Our approach is motivated by the key observation that an MLLM's unreasonable attention distribution causes features to be dominated by historical output tokens. Based on this observation, AttnReal recycles excessive attention from output tokens and reallocates it to visual tokens, which reduces the MLLM's reliance on language priors.
arXiv Detail & Related papers (2025-03-11T11:52:37Z)
- FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance [9.782362715017596]
We introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence. We analyze the information loss introduced by different reduction strategies and develop FOLDER to preserve key information while removing visual redundancy. FOLDER achieves comparable or even better performance than the original models, while dramatically reducing complexity by removing up to 70% of visual tokens.
arXiv Detail & Related papers (2025-01-05T03:28:45Z)
- Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [66.04061083611863]
Excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z)
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [72.68665884790002]
We propose a novel framework to transfer knowledge from l-MLLMs to s-MLLMs. We introduce Multimodal Distillation (MDist) to transfer the teacher model's robust representations across both visual and linguistic modalities. We also propose a three-stage training scheme to fully exploit the potential of the proposed distillation strategy.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
- Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs [14.533229831531168]
We introduce a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance. Inspired by human attention patterns in Visual Question Answering (VQA) tasks, TRIM presents a fresh perspective on the selection and reduction of image tokens. The results demonstrate a significant reduction in computational overhead while maintaining a consistent level of performance.
arXiv Detail & Related papers (2024-09-17T08:56:27Z)
- CoMMIT: Coordinated Multimodal Instruction Tuning [90.1532838391285]
Multimodal large language models (MLLMs) generally involve cooperative learning between a backbone LLM and a feature encoder of non-text input modalities. In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives. We propose a Multimodal Balance Coefficient that enables quantitative measurement of the balance of learning.
arXiv Detail & Related papers (2024-07-29T23:18:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.