VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
- URL: http://arxiv.org/abs/2510.16598v1
- Date: Sat, 18 Oct 2025 17:54:18 GMT
- Title: VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
- Authors: Jiaying Zhu, Yurui Zhu, Xin Lu, Wenrui Yan, Dong Li, Kunlin Liu, Xueyang Fu, Zheng-Jun Zha
- Abstract summary: Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks. Previous token compression techniques are often constrained by rules that risk discarding critical information. We reformulate token compression as an end-to-end learnable decision process within a lightweight plug-and-play framework.
- Score: 82.72388893596555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks from the massive number of visual tokens generated by high-resolution images or multi-image inputs. Previous token compression techniques are often constrained by heuristic rules that risk discarding critical information. They may also suffer from biases, such as attention sinks, that lead to sharp performance drops under aggressive compression ratios. To address these limitations, we reformulate token compression as an end-to-end learnable decision process within a lightweight plug-and-play framework. Specifically, we propose VisionSelector, a scorer module decoupled from the MLLM backbone that incorporates a differentiable Top-K mechanism and a curriculum annealing strategy to bridge the training-inference gap, enabling efficient and adaptive token selection at arbitrary compression rates. Remarkably lightweight with only 12.85M trainable parameters, VisionSelector generalizes across various compression rates and adaptively identifies critical tokens. This leads to superior performance across all compression budgets, evidenced by preserving 100% accuracy on MME with a 30% retention budget, outperforming prior methods by 12.14% at a 10% retention budget, and doubling prefill speed. Our code is available at https://github.com/JulietChoo/VisionSelector .
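The abstract names a differentiable Top-K mechanism and a curriculum annealing strategy without giving their form. The following is a minimal sketch of one common relaxation, not the authors' implementation: a sigmoid mask around a threshold found by bisection so that the mask sums to the token budget, with a temperature that is annealed so the soft mask approaches hard Top-K selection. All function names, the geometric temperature schedule, and the default temperatures are assumptions made for illustration; the token scores are taken as given rather than produced by a learned scorer.

```python
import math


def keep_count(num_tokens, retention):
    """Tokens kept under a retention budget, e.g. 0.3 keeps 30% (at least one)."""
    return max(1, round(num_tokens * retention))


def _sigmoid(x):
    # Numerically stable logistic function: never exponentiates a positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_topk_mask(scores, k, tau):
    """Relaxed top-k (assumed form): mask_i = sigmoid((s_i - t) / tau),
    with the threshold t chosen by bisection so that sum(mask) is close to k."""
    if not 0 < k < len(scores):
        raise ValueError("k must satisfy 0 < k < len(scores)")
    lo = min(scores) - 40.0 * tau  # here every mask entry is ~1
    hi = max(scores) + 40.0 * tau  # here every mask entry is ~0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        total = sum(_sigmoid((s - mid) / tau) for s in scores)
        if total > k:
            lo = mid  # mask too large, so raise the threshold
        else:
            hi = mid  # mask too small, so lower the threshold
    t = 0.5 * (lo + hi)
    return [_sigmoid((s - t) / tau) for s in scores]


def hard_topk_indices(scores, k):
    """Inference-time selection: indices of the k highest scores, in input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def anneal_tau(step, total_steps, tau_start=1.0, tau_end=0.01):
    """Hypothetical schedule: geometric decay from tau_start to tau_end."""
    frac = step / max(1, total_steps - 1)
    return tau_start * (tau_end / tau_start) ** frac
```

In this sketch, training would use `soft_topk_mask` with the temperature from `anneal_tau`, so gradients reach every score early on, while inference would use `hard_topk_indices` with `k = keep_count(num_tokens, retention)`. As the temperature shrinks, the soft mask concentrates on the same tokens the hard selection picks, which is one way to narrow a training-inference gap of the kind the abstract mentions.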
Related papers
- ApET: Approximation-Error Guided Token Compression for Efficient VLMs [16.4657793751671]
We present ApET, an Approximation-Error guided Token compression framework. We show that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical.
arXiv Detail & Related papers (2026-02-23T14:15:37Z)
- Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models [34.12135666939555]
Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all layers. We introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance.
arXiv Detail & Related papers (2026-02-13T04:49:27Z)
- Arbitrary Ratio Feature Compression via Next Token Prediction [52.10426317889982]
The Arbitrary Ratio Feature Compression (ARFC) framework supports any compression ratio with a single model. ARC is an auto-regressive model that performs compression via next-token prediction. The MoS module refines the compressed tokens by utilizing multiple compression results. ERGC is integrated into the training process to preserve semantic and structural relationships during compression.
arXiv Detail & Related papers (2026-02-12T02:38:57Z)
- Compressing Many-Shots in In-Context Learning [61.231471139896506]
We study an approach to improve the memory and computational efficiency of ICL inference by compressing the many-shot prompts. We first show that existing prompt compression methods are ineffective for many-shot compression. We propose MemCom, a layer-wise compression method.
arXiv Detail & Related papers (2025-10-17T16:57:42Z)
- MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding [13.02027465520324]
We propose MARC, which integrates structured retrieval and RL-based distillation. MARC achieves near-baseline accuracy using only one frame's tokens. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings.
arXiv Detail & Related papers (2025-10-09T08:07:19Z)
- LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models [62.240460476785934]
We propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder. LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts.
arXiv Detail & Related papers (2025-07-03T03:42:54Z)
- Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models. We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z)
- Vision-centric Token Compression in Large Language Model [51.92055188780033]
Vision Centric Token Compression (Vist) is a slow-fast compression framework that mirrors human reading. On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%.
arXiv Detail & Related papers (2025-02-02T13:10:06Z)
- Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models [21.36437021964681]
"Global Compression Commander" (GlobalCom$^2$) is a novel token compression framework for HR-LVLMs. GlobalCom$^2$ maintains over 90% performance while compressing 90% of visual tokens, reducing FLOPs and peak memory to 9.1% and 60%, respectively.
arXiv Detail & Related papers (2025-01-09T11:57:58Z)
- DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models [28.379533608574814]
We present DyCoke, a training-free token compression method to optimize token representation and accelerate video large language models. DyCoke incorporates a plug-and-play temporal compression module to minimize temporal redundancy by merging redundant tokens across frames. It ensures high-quality inference by dynamically retaining the critical tokens at each decoding step.
arXiv Detail & Related papers (2024-11-22T15:55:19Z)
- VoCo-LLaMA: Towards Vision Compression with Large Language Models [31.398537194299752]
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. Our method achieves minimal performance loss with a compression ratio of 576$\times$, resulting in up to 94.8% fewer FLOPs and 69.6% acceleration in inference time.
arXiv Detail & Related papers (2024-06-18T05:05:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.