Adaptive-VoCo: Complexity-Aware Visual Token Compression for Vision-Language Models
- URL: http://arxiv.org/abs/2512.18496v1
- Date: Sat, 20 Dec 2025 20:24:07 GMT
- Title: Adaptive-VoCo: Complexity-Aware Visual Token Compression for Vision-Language Models
- Authors: Xiaoyang Guo, Keze Wang
- Abstract summary: We propose Adaptive-VoCo, a framework that augments VoCo-LLaMA with a lightweight predictor for adaptive compression. Experimental results show that our method consistently outperforms fixed-rate baselines across multiple multimodal tasks.
- Score: 19.536595270049016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, large-scale vision-language models (VLMs) have demonstrated remarkable performance on multimodal understanding and reasoning tasks. However, handling high-dimensional visual features often incurs substantial computational and memory costs. VoCo-LLaMA alleviates this issue by compressing visual patch tokens into a few VoCo tokens, reducing computational overhead while preserving strong cross-modal alignment. Nevertheless, such approaches typically adopt a fixed compression rate, limiting their ability to adapt to varying levels of visual complexity. To address this limitation, we propose Adaptive-VoCo, a framework that augments VoCo-LLaMA with a lightweight predictor for adaptive compression. This predictor dynamically selects an optimal compression rate by quantifying an image's visual complexity using statistical cues from the vision encoder, such as patch token entropy and attention map variance. Furthermore, we introduce a joint loss function that integrates rate regularization with complexity alignment. This enables the model to balance inference efficiency with representational capacity, particularly in challenging scenarios. Experimental results show that our method consistently outperforms fixed-rate baselines across multiple multimodal tasks, highlighting the potential of adaptive visual compression for creating more efficient and robust VLMs.
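The abstract describes the mechanism concretely enough to sketch. The Python sketch below shows one way the complexity cues (patch token entropy, attention map variance), the rate selection, and the joint loss could fit together. All function names, thresholds, and the way the cues are combined are illustrative assumptions, not the authors' released implementation.

```python
import torch

def visual_complexity(patch_tokens, attn_maps):
    # patch_tokens: (N, d) vision-encoder outputs; attn_maps: (H, N, N).
    # Entropy of the patch-token magnitude distribution as a diversity proxy.
    p = patch_tokens.norm(dim=-1).softmax(dim=0)
    entropy = -(p * p.clamp_min(1e-9).log()).sum()
    # Variance of the attention maps as a spatial-structure proxy.
    attn_var = attn_maps.var()
    return entropy, attn_var

def select_num_voco_tokens(entropy, attn_var,
                           budgets=(128, 64, 16, 1),
                           thresholds=(6.0, 5.5, 5.0)):
    # Naive rule for illustration: higher measured complexity keeps
    # more VoCo tokens (i.e., a lower compression rate).
    score = entropy.item() + attn_var.item()
    for t, b in zip(thresholds, budgets):
        if score > t:
            return b
    return budgets[-1]

def joint_loss(task_loss, num_tokens, max_tokens, complexity,
               lam=0.01, mu=0.01):
    # Rate regularization penalizes large token budgets; complexity
    # alignment pushes the budget to track the measured complexity
    # (assumed here to be normalized to [0, 1]).
    rate = num_tokens / max_tokens
    return task_loss + lam * rate + mu * (rate - complexity) ** 2
```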
Related papers
- On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression [22.436953683970007]
We show that existing encoder-based attacks can substantially overestimate the robustness of compressed large vision-language models (LVLMs). We propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget.
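The core idea, optimizing the perturbation through a surrogate of the compression step rather than against the raw encoder alone, can be sketched as a PGD-style loop. `encode_image`, `surrogate_compress`, and `adv_loss` below are hypothetical stand-ins for differentiable callables, not CAGE's actual interfaces:

```python
import torch

def compression_aligned_attack(encode_image, surrogate_compress, adv_loss,
                               image, steps=100, eps=8 / 255, alpha=1 / 255):
    # PGD-style loop: the perturbation is optimized through a surrogate
    # compression step so it survives token compression at inference time.
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        tokens = surrogate_compress(encode_image(image + delta))
        loss = adv_loss(tokens)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient-sign ascent step
            delta.clamp_(-eps, eps)              # stay inside the L-inf ball
        delta.grad = None
    return (image + delta).detach()
```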
arXiv Detail & Related papers (2026-01-29T10:47:21Z)
- Towards Lossless Ultimate Vision Token Compression for VLMs [11.485425012979052]
The Lossless Ultimate Vision Token Compression (LUVC) framework is proposed. LUVC compresses visual tokens until their complete elimination at the final layer of the language model. Experiments show that LUVC achieves a 2x inference speedup in the language model with negligible accuracy degradation.
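The summary does not give LUVC's actual schedule; a minimal sketch of the "compress until elimination" idea is a per-layer token budget that decays to zero at the last language-model layer. The linear decay here is an assumption for illustration:

```python
def visual_token_budget(layer_idx, num_layers, initial_tokens):
    # Hypothetical linear decay: full budget at layer 0, zero at the
    # final layer, matching "complete elimination at the final layer".
    frac = 1.0 - layer_idx / (num_layers - 1)
    return max(0, round(initial_tokens * frac))

# e.g. a 32-layer LM with 576 visual tokens:
# layer 0 -> 576, layer 16 -> 279, layer 31 -> 0
```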
arXiv Detail & Related papers (2025-12-09T15:40:13Z)
- Compressor-VLA: Instruction-Guided Visual Token Compression for Efficient Robotic Manipulation [8.316354570715491]
Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. We propose a novel hybrid instruction-conditioned token compression framework designed for efficient, task-oriented compression of visual information. We show that our approach achieves a competitive success rate on the LIBERO benchmark while reducing FLOPs by 59% and the visual token count by over 3x compared to its baseline.
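A common way to realize instruction-conditioned compression is cross-attention from a small set of learned queries, biased by the instruction embedding, onto the visual tokens. The module below is a generic sketch under that assumption, not Compressor-VLA's actual architecture:

```python
import torch
import torch.nn as nn

class InstructionConditionedCompressor(nn.Module):
    # Hypothetical sketch: num_queries learned queries, shifted by the
    # instruction embedding, cross-attend to the visual tokens and emit
    # num_queries compressed tokens (dim must divide by num_heads).
    def __init__(self, dim, num_queries=16, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.instr_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens, instr_embedding):
        # visual_tokens: (B, N, dim); instr_embedding: (B, dim)
        q = self.queries.unsqueeze(0) + self.instr_proj(instr_embedding).unsqueeze(1)
        compressed, _ = self.attn(q, visual_tokens, visual_tokens)
        return compressed  # (B, num_queries, dim): N -> num_queries tokens
```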
arXiv Detail & Related papers (2025-11-24T10:06:41Z)
- Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation [75.58269386927076]
Autoregressive (AR) models are often dismissed as impractical due to prohibitive computational cost. This work rethinks this paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation. Experiments on diverse datasets (natural, satellite, medical) validate that our method achieves new state-of-the-art compression.
arXiv Detail & Related papers (2025-11-14T06:27:58Z)
- A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models [94.49953824684853]
We introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. An enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate.
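A single-pass, prompt-conditioned pruning step can be sketched as scoring each visual token against a pooled prompt embedding and keeping the top fraction. This is a generic reading of the "glimpse" idea, not GlimpsePrune's actual scoring function:

```python
import torch

def glimpse_prune(visual_tokens, prompt_embedding, keep_ratio=0.3):
    # visual_tokens: (N, d); prompt_embedding: (d,) pooled text embedding.
    # Score each token's relevance to the prompt in one pass, keep top-k.
    scores = visual_tokens @ prompt_embedding
    k = max(1, int(keep_ratio * visual_tokens.size(0)))
    idx = scores.topk(k).indices.sort().values  # preserve spatial order
    return visual_tokens[idx]
```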
arXiv Detail & Related papers (2025-08-03T02:15:43Z)
- DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs). Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
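Similarity-based token merging is typically done with a bipartite matching step (as in ToMe); the sketch below follows that recipe as a stand-in for DToMe, with an unweighted average for brevity and a fixed merge count `r` where DyMU would derive `r` from image complexity:

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(x, r):
    # x: (N, d) visual tokens; r: how many tokens to merge away.
    a, b = x[0::2], x[1::2]                    # bipartite split
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T
    best_sim, best_dst = sim.max(dim=-1)       # best partner in b per a-token
    src = best_sim.topk(min(r, a.size(0))).indices  # most redundant a-tokens
    keep = torch.ones(a.size(0), dtype=torch.bool)
    keep[src] = False
    b = b.clone()
    for s in src.tolist():                     # fold each merged token into b
        d = best_dst[s].item()
        b[d] = (b[d] + a[s]) / 2               # unweighted mean, for brevity
    return torch.cat([a[keep], b], dim=0)      # N - r tokens remain
```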
arXiv Detail & Related papers (2025-04-23T18:38:18Z)
- Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference [54.53508601749513]
We propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework. To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features. Results show that TOFC achieves up to a 52% reduction in data transmission overhead and a 63% reduction in system latency.
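Adaptive entropy-model selection can be sketched as routing each feature to whichever model codes it most cheaply; the interface below is a hypothetical illustration of the idea, not TOFC's actual selection rule:

```python
def select_entropy_model(feature, models):
    # models: callables mapping a feature to its estimated bit cost.
    # Route the feature to the model that assigns it the lowest rate.
    costs = [m(feature) for m in models]
    return min(range(len(models)), key=lambda i: costs[i])
```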
arXiv Detail & Related papers (2025-03-17T08:37:22Z)
- VRVVC: Variable-Rate NeRF-Based Volumetric Video Compression [59.14355576912495]
NeRF-based video has revolutionized visual media by delivering photorealistic Free-Viewpoint Video (FVV) experiences. However, the substantial data volumes pose significant challenges for storage and transmission. We propose VRVVC, a novel end-to-end joint variable-rate framework for volumetric video compression.
arXiv Detail & Related papers (2024-12-16T01:28:04Z)
- Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation. We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations. We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z)
- LVQAC: Lattice Vector Quantization Coupled with Spatially Adaptive Companding for Efficient Learned Image Compression [24.812267280543693]
We present a novel Lattice Vector Quantization scheme coupled with a spatially Adaptive Companding (LVQAC) mapping.
For any end-to-end CNN image compression model, replacing the uniform quantizer with LVQAC achieves better rate-distortion performance without significantly increasing model complexity.
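To make the two ingredients concrete, here is a minimal NumPy sketch of lattice vector quantization (the D4 lattice, decoded with the standard Conway-Sloane rule) wrapped in an elementwise power-law companding. LVQAC itself learns a spatially adaptive companding inside a CNN codec, which this fixed-exponent sketch does not attempt:

```python
import numpy as np

def compand(x, alpha=0.5):
    # Elementwise power-law compander (fixed here; spatially adaptive in LVQAC).
    return np.sign(x) * np.abs(x) ** alpha

def uncompand(y, alpha=0.5):
    return np.sign(y) * np.abs(y) ** (1.0 / alpha)

def nearest_d4(x):
    # Nearest point in D4 = {v in Z^4 : sum(v) even} (Conway & Sloane rule).
    f = np.rint(x)
    if int(f.sum()) % 2 != 0:
        # Re-round the worst coordinate the other way to restore even parity.
        i = int(np.argmax(np.abs(x - f)))
        f[i] += 1.0 if x[i] > f[i] else -1.0
    return f

def lvq_with_companding(x, alpha=0.5, step=1.0):
    # Quantize a flat vector whose length is a multiple of 4:
    # compand, scale, snap each 4-d block to D4, then invert both maps.
    assert x.size % 4 == 0
    y = compand(x, alpha) / step
    q = np.array([nearest_d4(b) for b in y.reshape(-1, 4)])
    return uncompand(q.reshape(y.shape) * step, alpha)
```

The parity-repair step is what distinguishes D4 decoding from plain elementwise rounding: it guarantees the output is a valid lattice point while moving only the coordinate whose rounding error was largest.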
arXiv Detail & Related papers (2023-03-25T23:34:15Z)