VisionZip: Longer is Better but Not Necessary in Vision Language Models
- URL: http://arxiv.org/abs/2412.04467v1
- Date: Thu, 05 Dec 2024 18:59:53 GMT
- Title: VisionZip: Longer is Better but Not Necessary in Vision Language Models
- Authors: Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia
- Abstract summary: Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens. Visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. We introduce VisionZip, a method that selects a set of informative tokens for input to the language model.
- Score: 53.199716363090154
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% performance gains across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip .
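The token-selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how tokens are scored, so the per-token `scores` below are an assumption (for example, attention received inside the vision encoder), and the function names `select_token_indices` and `gather_tokens` are hypothetical, not taken from the VisionZip code.

```python
from typing import List, Sequence


def select_token_indices(scores: Sequence[float], keep: int) -> List[int]:
    """Return the indices of the `keep` highest-scoring tokens, in original order.

    `scores` is an assumed per-token importance score; the abstract does not
    specify the criterion.
    """
    if not 0 < keep <= len(scores):
        raise ValueError("keep must be between 1 and the number of tokens")
    # Rank token indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top `keep` indices and restore their original (spatial) order.
    return sorted(ranked[:keep])


def gather_tokens(tokens: Sequence[Sequence[float]], indices: Sequence[int]) -> List[List[float]]:
    """Collect the selected token vectors to pass on to the language model."""
    return [list(tokens[i]) for i in indices]


# Toy example: five 2-dimensional visual tokens with made-up scores.
tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
kept = select_token_indices(scores, keep=2)
reduced = gather_tokens(tokens, kept)
```

Here the language model would receive 2 visual tokens instead of 5. Sorting the kept indices preserves the original token order, which is a design choice of this sketch rather than something the abstract states.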
Related papers
- A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models [94.49953824684853]
We introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. An enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate.
arXiv Detail & Related papers (2025-08-03T02:15:43Z)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [95.89543460132413]
Vision-language models (VLMs) have improved performance by increasing the number of visual tokens. However, most real-world scenarios do not require such an extensive number of visual tokens. We present a new paradigm for visual token compression, namely, VisionThink.
arXiv Detail & Related papers (2025-07-17T17:59:55Z)
- Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment [38.04426918886084]
Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). We introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention.
arXiv Detail & Related papers (2025-06-27T14:55:40Z)
- VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models [57.2662376527586]
VScan is a two-stage visual token reduction framework. It addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4% of the original performance.
arXiv Detail & Related papers (2025-05-28T17:59:08Z)
- FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression [16.53645461974695]
Current training-free visual token compression methods exhibit serious performance degradation in tasks involving high-resolution images.
We propose an efficient visual token compression framework for text-oriented Vision Large Language Models (VLLMs) in high-resolution scenarios.
Our approach significantly reduces computational overhead while outperforming the baselines across a range of text-oriented benchmarks.
arXiv Detail & Related papers (2025-02-22T16:05:33Z)
- FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression [45.37530855889661]
High-resolution images lead to a quadratic increase in the number of visual tokens input into Multi-modal Large Language Models.
Current work develops visual token compression methods to achieve efficiency improvements, often at the expense of performance.
We build a coarse-to-fine visual token compression method, with a vision-guided sampler for compressing redundant regions with low information density, and a text-guided sampler for selecting visual tokens that are strongly correlated with the user instructions.
arXiv Detail & Related papers (2024-11-21T15:37:52Z)
- Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image.
We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
- Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information [41.50379737105869]
We propose a text information-guided dynamic visual token recovery mechanism that does not require training.
Our proposed method achieves comparable performance to the original approach while compressing the visual tokens to an average of 10% of the original quantity.
arXiv Detail & Related papers (2024-09-02T11:19:54Z)
- VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with the removal of 30% of vision tokens across 12 ViT layers, ELIP maintains comparable performance.
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
- Effective End-to-End Vision Language Pretraining with Semantic Visual Loss [58.642954383282216]
Current vision language pretraining models are dominated by methods using region visual features extracted from object detectors.
We introduce three types of visual losses that enable much faster convergence and better finetuning accuracy.
Compared with region feature models, our end-to-end models could achieve similar or better performance on downstream tasks and run more than 10 times faster during inference.
arXiv Detail & Related papers (2023-01-18T00:22:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.