D$^{3}$ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs
- URL: http://arxiv.org/abs/2511.12280v1
- Date: Sat, 15 Nov 2025 16:24:12 GMT
- Title: D$^{3}$ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs
- Authors: Shuochen Chang, Xiaofeng Zhang, Qingyang Liu, Li Niu
- Abstract summary: Diffusion-based multimodal large language models (Diffusion MLLMs) exhibit substantially slower inference than autoregressive models. We propose D$^{3}$ToM, a Decider-guided dynamic token merging method to accelerate inference in Diffusion MLLMs. Experiments show that D$^{3}$ToM accelerates inference while preserving competitive performance.
- Score: 22.78575203353886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion-based multimodal large language models (Diffusion MLLMs) have recently demonstrated impressive non-autoregressive generative capabilities across vision-and-language tasks. However, Diffusion MLLMs exhibit substantially slower inference than autoregressive models: each denoising step employs full bidirectional self-attention over the entire sequence, resulting in cubic decoding complexity that becomes computationally impractical with thousands of visual tokens. To address this challenge, we propose D$^{3}$ToM, a Decider-guided dynamic token merging method that merges redundant visual tokens at different denoising steps to accelerate inference in Diffusion MLLMs. At each denoising step, D$^{3}$ToM uses decider tokens, i.e., the tokens generated in the previous denoising step, to build an importance map over all visual tokens. It then retains a proportion of the most salient tokens and merges the remainder through similarity-based aggregation. This plug-and-play module integrates into a single transformer layer, physically shortening the visual token sequence for all subsequent layers without altering model parameters. Moreover, D$^{3}$ToM employs a merge ratio that varies dynamically with each denoising step, aligning with the native decoding process of Diffusion MLLMs and achieving superior performance under equivalent computational budgets. Extensive experiments show that D$^{3}$ToM accelerates inference while preserving competitive performance. The code is released at https://github.com/bcmi/D3ToM-Diffusion-MLLM.
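The abstract fully specifies the control flow of a single D$^{3}$ToM step. Below is a minimal PyTorch sketch reconstructed only from that description; the dot-product importance map, the top-k retention, the nearest-neighbor averaging, and the linear merge-ratio schedule are all illustrative assumptions, not the authors' released implementation (see the GitHub link above for the official code).

```python
import torch
import torch.nn.functional as F

def merge_ratio(step: int, total_steps: int, r_max: float = 0.7) -> float:
    """Assumed linear schedule: merge aggressively at early denoising steps
    and keep more tokens as decoding converges. The abstract only states that
    the ratio varies per step; this particular schedule is a placeholder."""
    return r_max * (1.0 - step / total_steps)

def d3tom_step(visual: torch.Tensor,   # (N_v, d) visual token hidden states
               decider: torch.Tensor,  # (N_d, d) tokens decoded at the previous step
               step: int,
               total_steps: int) -> torch.Tensor:
    # 1) Importance map: cross-attention-like relevance of each visual token
    #    to the decider tokens (an assumed proxy for the paper's map).
    scores = (decider @ visual.T) / visual.shape[-1] ** 0.5  # (N_d, N_v)
    importance = scores.softmax(dim=-1).mean(dim=0)          # (N_v,)

    # 2) Retain the most salient tokens; the rest become merge candidates.
    r = merge_ratio(step, total_steps)
    n_keep = max(1, int(visual.shape[0] * (1.0 - r)))
    keep_idx = importance.topk(n_keep).indices
    merge_mask = torch.ones(visual.shape[0], dtype=torch.bool)
    merge_mask[keep_idx] = False
    kept, rest = visual[keep_idx], visual[merge_mask]
    if rest.shape[0] == 0:
        return kept

    # 3) Similarity-based aggregation: fold each discarded token into its
    #    most similar kept token by averaging (one simple aggregation choice).
    sim = F.normalize(rest, dim=-1) @ F.normalize(kept, dim=-1).T  # (N_m, n_keep)
    assign = sim.argmax(dim=-1)
    merged = kept.clone()
    counts = torch.ones(n_keep, 1)
    merged.index_add_(0, assign, rest)
    counts.index_add_(0, assign, torch.ones(rest.shape[0], 1))
    # Physically shorter visual sequence for all subsequent layers.
    return merged / counts
```

Because the merge physically shortens the sequence at one layer, every subsequent transformer layer attends over fewer visual tokens at every remaining denoising step, which is where the speedup comes from.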
Related papers
- Divide and Conquer: Accelerating Diffusion-Based Large Language Models via Adaptive Parallel Decoding [6.755667885643806]
Diffusion-based large language models (dLLMs) have shown promising performance across various reasoning tasks. We introduce an adaptive parallel decoding approach, namely DiCo, which features a three-phase divide-and-conquer paradigm. Extensive experiments demonstrate that DiCo can achieve significant inference speedups while maintaining competitive generation quality.
arXiv Detail & Related papers (2026-02-27T08:36:06Z)
- Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models [96.0074341403456]
Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding. We propose Prism, an efficient TTS framework for dLLMs.
arXiv Detail & Related papers (2026-02-02T09:14:51Z)
- Beyond Hard Masks: Progressive Token Evolution for Diffusion Language Models [46.151072011636444]
EvoToken-DLM is a novel diffusion-based language modeling approach that replaces hard binary masks with evolving soft token distributions. EvoToken-DLM consistently achieves superior performance, outperforming strong diffusion-based and masked DLM baselines.
arXiv Detail & Related papers (2026-01-12T09:25:14Z)
- LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs [52.24096832965001]
We present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method. The PVC method can be seamlessly integrated into a standard Vision Transformer (ViT) to enable efficient native-resolution encoding. Building upon ViT-UHD, LLaVA-UHD v3 also achieves performance competitive with Qwen2-VL, while further reducing TTFT by 1.9x.
arXiv Detail & Related papers (2025-11-26T08:11:10Z)
- $\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs [26.779915891040236]
We propose VisiPruner, a training-free pruning framework that reduces up to 99% of vision-related attention computations and 53.9% of FLOPs on LLaVA-v1.5 7B. Our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics.
arXiv Detail & Related papers (2025-10-20T06:40:17Z)
- Diffusion Language Models Know the Answer Before Decoding [56.96815863705218]
Diffusion language models (DLMs) have emerged as an alternative to autoregressive approaches. Our work highlights and leverages an overlooked property of DLMs: early answer convergence. We introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding.
arXiv Detail & Related papers (2025-08-27T15:40:25Z)
- DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs). Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
arXiv Detail & Related papers (2025-04-23T18:38:18Z)
- ST$^3$: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming [14.937905258757635]
ST$^3$ is a framework designed to accelerate MLLM inference without retraining. ST$^3$ can be seamlessly integrated into existing pre-trained MLLMs.
arXiv Detail & Related papers (2024-12-28T10:17:29Z)
- Multimodal Latent Language Modeling with Next-Token Diffusion [111.93906046452125]
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z)
- Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [66.04061083611863]
Excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings prohibitively expensive computation. We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z)