DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
- URL: http://arxiv.org/abs/2602.18846v1
- Date: Sat, 21 Feb 2026 14:22:49 GMT
- Title: DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
- Authors: Aditya Kumar Singh, Hitesh Kandala, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
- Abstract summary: DUET-VLM is a versatile plug-and-play dual compression framework. It enables robust adaptation to reduced visual (image/video) input without sacrificing accuracy. Our results highlight end-to-end training with DUET-VLM.
- Score: 14.714791872881397
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in the language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only, redundancy-aware compression of the vision encoder's output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains >97% even at 89% reduction. With this dual-stage compression during training, it achieves 99.7% accuracy at 67% and 97.6% at 89%, surpassing prior SoTA visual token reduction methods across multiple benchmarks. When integrated into Video-LLaVA-7B, it even surpasses the baseline, achieving >100% accuracy with a substantial 53.1% token reduction and retaining 97.6% accuracy under an extreme 93.4% setting. These results highlight end-to-end training with DUET-VLM, enabling robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget. Our code is available at https://github.com/AMD-AGI/DUET-VLM.
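The abstract's two stages can be illustrated with a toy numeric sketch. This is not the authors' implementation: the greedy cosine-similarity merge, the fixed similarity threshold, and the single mean-pooled text query used for relevance scoring are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def merge_redundant_tokens(tokens, sim_thresh=0.9):
    # Stage (a) sketch: vision-only redundancy-aware compression.
    # Greedily keep a token only if its cosine similarity to every
    # already-kept token stays below sim_thresh (assumed threshold).
    kept = []
    for t in tokens:
        if all(np.dot(t, k) / (np.linalg.norm(t) * np.linalg.norm(k) + 1e-8) < sim_thresh
               for k in kept):
            kept.append(t)
    return np.stack(kept)

def text_guided_drop(tokens, text_query, keep_ratio=0.5):
    # Stage (b) sketch: keep the visual tokens whose dot-product
    # relevance to the text query is highest, preserving order.
    scores = tokens @ text_query
    k = max(1, int(len(tokens) * keep_ratio))
    idx = np.sort(np.argsort(scores)[-k:])
    return tokens[idx]

# 64 random "visual tokens" plus 32 near-duplicates to create redundancy.
visual = rng.normal(size=(64, 16))
visual = np.concatenate([visual, visual[:32] + 0.01 * rng.normal(size=(32, 16))])
text = rng.normal(size=16)  # stand-in for a pooled text embedding

stage_a = merge_redundant_tokens(visual, sim_thresh=0.9)
stage_b = text_guided_drop(stage_a, text, keep_ratio=0.5)
print(len(visual), len(stage_a), len(stage_b))
```

In this toy setting, stage (a) removes the near-duplicate tokens and stage (b) then halves the remainder by text relevance; the paper's actual method applies stage (b) layer-wise inside the language backbone rather than once.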
Related papers
- Stateful Token Reduction for Long-Video Hybrid VLMs [69.6930118088911]
We study query-conditioned token reduction for hybrid video vision-language models (VLMs). We propose a low-to-high progressive reduction schedule and a unified language-aware scoring mechanism for both attention and Mamba blocks. Under an aggressive compression setting, our approach delivers substantial prefilling speedups with near-baseline accuracy at test time.
arXiv Detail & Related papers (2026-02-27T08:11:06Z) - ApET: Approximation-Error Guided Token Compression for Efficient VLMs [16.4657793751671]
We present ApET, an Approximation-Error guided Token compression framework. We show that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical.
arXiv Detail & Related papers (2026-02-23T14:15:37Z) - Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models [34.12135666939555]
Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all layers. We introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance.
arXiv Detail & Related papers (2026-02-13T04:49:27Z) - PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective [59.24570811503256]
We propose PIO-FVLM to reduce redundant visual tokens in vision-language models (VLMs) and accelerate inference. The proposed PIO-FVLM is training-free and compatible with FlashAttention, making it friendly to practical application and deployment. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens yet maintains 97.2% of the original performance.
arXiv Detail & Related papers (2026-02-04T15:33:10Z) - D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning [49.16227597771663]
D2Pruner is a framework that combines debiased importance with a structural pruning mechanism. It reduces FLOPs by 74.2% while retaining 99.2% of its original performance. It marks a significant advancement, with up to 63.53% improvement over existing methods.
arXiv Detail & Related papers (2025-12-22T14:42:31Z) - Variation-aware Vision Token Dropping for Faster Large Vision-Language Models [24.952668143243542]
Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency. We propose Variation-aware Vision Token Dropping (i.e., V²Drop), which progressively removes visual tokens with minimal variation during LVLM inference.
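The variation-aware idea can be sketched minimally as follows. This is an assumption-laden toy, not the paper's method: scoring each token by its L2 difference from the preceding token, and always keeping the first token, are choices made here for illustration.

```python
import numpy as np

def variation_aware_drop(tokens, keep_ratio=0.5):
    # Score each token by how much it differs from its predecessor
    # (L2 variation, an assumed proxy for informativeness) and keep
    # the top-scoring fraction; the first token is always kept.
    diffs = np.linalg.norm(np.diff(tokens, axis=0), axis=1)
    scores = np.concatenate([[np.inf], diffs])  # inf keeps token 0
    k = max(1, int(len(tokens) * keep_ratio))
    idx = np.sort(np.argsort(scores)[-k:])      # preserve token order
    return tokens[idx]

rng = np.random.default_rng(1)
toks = rng.normal(size=(10, 8))
toks[3] = toks[2]  # a token with zero variation from its neighbor
kept = variation_aware_drop(toks, keep_ratio=0.5)
print(kept.shape)  # (5, 8)
```

A zero-variation token like `toks[3]` gets the lowest possible score and is dropped first, which is the intuition the entry above describes.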
arXiv Detail & Related papers (2025-09-01T15:28:44Z) - METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding [55.38256656122857]
We propose METok, a training-free, Multi-stage Event-based Token compression framework. We show METok achieves an optimal trade-off between efficiency and accuracy by dynamically selecting informative visual tokens. For instance, equipping LongVA-7B with METok realizes an 80.6% FLOPs reduction and 93.5% KV Cache memory savings.
arXiv Detail & Related papers (2025-06-03T13:19:41Z) - VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models [35.38573641029626]
We introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. On this task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark.
arXiv Detail & Related papers (2025-03-21T09:46:31Z) - TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models [8.636574530055817]
TokenCarve is a training-free, plug-and-play, two-stage token compression framework. It can even reduce the number of visual tokens to 22.2% of the original count, achieving a 1.23x speedup in inference, a 64% reduction in KV cache storage, and only a 1.54% drop in accuracy.
arXiv Detail & Related papers (2025-03-13T16:04:31Z) - Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image. We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z) - SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference [45.11612407862277]
In vision-language models (VLMs), visual tokens usually carry significant computational overhead despite bearing sparser information than text tokens. We propose SparseVLM, a text-guided, training-free token optimization mechanism that eliminates the need for extra parameters or fine-tuning costs.
arXiv Detail & Related papers (2024-10-06T09:18:04Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is highly inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.