Towards Lossless Ultimate Vision Token Compression for VLMs
- URL: http://arxiv.org/abs/2512.09010v1
- Date: Tue, 09 Dec 2025 15:40:13 GMT
- Title: Towards Lossless Ultimate Vision Token Compression for VLMs
- Authors: Dehua Zheng, Mouxiao Huang, Borui Jiang, Hailin Hu, Xinghao Chen,
- Abstract summary: Lossless Ultimate Vision tokens Compression (LUVC) framework is proposed. LUVC compresses visual tokens until complete elimination at the final layer of the language model. Experiments show that LUVC achieves a 2x inference speedup in the language model with negligible accuracy degradation.
- Score: 11.485425012979052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual language models encounter challenges in computational efficiency and latency, primarily due to the substantial redundancy in the token representations of high-resolution images and videos. Current attention/similarity-based compression algorithms suffer from either position bias or class imbalance, leading to significant accuracy degradation. They also fail to generalize to shallow LLM layers, which exhibit weaker cross-modal interactions. To address this, we extend token compression to the visual encoder through an effective iterative merging scheme that operates orthogonally along the spatial axes to accelerate computation across the entire VLM. Furthermore, we integrate a spectrum pruning unit into the LLM through an attention/similarity-free low-pass filter, which gradually prunes redundant visual tokens and is fully compatible with modern FlashAttention. On this basis, we propose the Lossless Ultimate Vision tokens Compression (LUVC) framework. LUVC systematically compresses visual tokens until complete elimination at the final layer of the LLM, so that the high-dimensional visual features are gradually fused into the multimodal queries. The experiments show that LUVC achieves a 2x inference speedup in the language model with negligible accuracy degradation, and its training-free characteristic enables immediate deployment across multiple VLMs.
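The "attention/similarity-free low-pass filter" idea from the abstract can be sketched in a minimal form. Note this is an illustrative interpretation, not the paper's actual algorithm: the function name `lowpass_prune`, the FFT-based resampling, and all shapes are assumptions for demonstration. The point it shows is that keeping only the low-frequency components of the visual token sequence reduces the token count without computing any attention or pairwise similarity scores.

```python
import numpy as np

def lowpass_prune(tokens: np.ndarray, keep: int) -> np.ndarray:
    """Illustrative spectrum-based token reduction (hypothetical sketch).

    tokens: (num_tokens, dim) array of visual token features.
    keep:   target number of tokens after pruning.
    """
    n, _ = tokens.shape
    # Real FFT along the token axis: shape (n // 2 + 1, dim) frequency bins.
    spectrum = np.fft.rfft(tokens, axis=0)
    # Low-pass: discard frequency bins above the cutoff implied by `keep`.
    cutoff = keep // 2 + 1
    spectrum = spectrum[:cutoff]
    # Inverse FFT at the shorter length resamples the sequence to `keep`
    # tokens; the keep/n factor rescales so the mean feature is preserved.
    return np.fft.irfft(spectrum, n=keep, axis=0) * (keep / n)

feats = np.random.default_rng(0).normal(size=(64, 8))
out = lowpass_prune(feats, keep=16)
print(out.shape)  # (16, 8)
```

Because the reduction uses only a per-axis FFT and no attention maps, a filter of this kind stays compatible with fused attention kernels such as FlashAttention, which never materialize attention scores.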
Related papers
- Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation [51.743225614196774]
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning. They remain vulnerable to hallucination, where generated content deviates from visual evidence. Recent vision enhancement methods attempt to address this issue by reinforcing visual tokens during decoding. We propose Adaptive Visual Reinforcement (AIR), a training-free framework for MLLMs.
arXiv Detail & Related papers (2026-02-27T14:18:51Z) - Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models [34.12135666939555]
Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all layers. We introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance.
arXiv Detail & Related papers (2026-02-13T04:49:27Z) - Adaptive-VoCo: Complexity-Aware Visual Token Compression for Vision-Language Models [19.536595270049016]
We propose Adaptive-VoCo, a framework that augments VoCo-LLaMA with a lightweight predictor for adaptive compression. Experimental results show that our method consistently outperforms fixed-rate baselines across multiple multimodal tasks.
arXiv Detail & Related papers (2025-12-20T20:24:07Z) - Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference [68.4758228017823]
ParVTS partitions visual tokens into subject and non-subject groups, processes them in parallel to transfer their semantics into question tokens, and discards the non-subject path mid-inference. Experiments show that ParVTS prunes up to 88.9% of visual tokens with minimal performance drop, achieving 1.77x speedup and 70% FLOPs reduction.
arXiv Detail & Related papers (2025-11-24T08:29:36Z) - A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks. Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z) - Variation-aware Vision Token Dropping for Faster Large Vision-Language Models [24.952668143243542]
Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency. We propose Variation-aware Vision Token Dropping (i.e., V²Drop), which progressively removes visual tokens with minimal variation during LVLM inference.
arXiv Detail & Related papers (2025-09-01T15:28:44Z) - HoliTom: Holistic Token Merging for Fast Video Large Language Models [32.620504076794795]
Video language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens. We introduce HoliTom, a novel training-free holistic token merging framework. We also introduce a robust inner-LLM token similarity-based merging approach, designed for superior performance and compatibility with outer-LLM pruning.
arXiv Detail & Related papers (2025-05-27T15:28:45Z) - DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs). Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
arXiv Detail & Related papers (2025-04-23T18:38:18Z) - Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
Processing high-resolution images and videos poses a barrier to the broader adoption of MLLMs. Compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z) - DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models [28.379533608574814]
We present DyCoke, a training-free token compression method to optimize token representation and accelerate video large language models. DyCoke incorporates a plug-and-play temporal compression module to minimize temporal redundancy by merging redundant tokens across frames. It ensures high-quality inference by dynamically retaining the critical tokens at each decoding step.
arXiv Detail & Related papers (2024-11-22T15:55:19Z) - Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image. We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.