Related papers: VisionZip: Longer is Better but Not Necessary in Vision Language Models

VisionZip: Longer is Better but Not Necessary in Vision Language Models

URL: http://arxiv.org/abs/2412.04467v1
Date: Thu, 05 Dec 2024 18:59:53 GMT
Title: VisionZip: Longer is Better but Not Necessary in Vision Language Models
Authors: Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia,
Abstract summary: Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens.<n>Visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy.<n>We introduce VisionZip, a method that selects a set of informative tokens for input to the language model.
Score: 53.199716363090154
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% performance gains across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip .

Related papers

FCoT-VL:Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression [16.53645461974695]
Current training-free visual token compression methods exhibit serious performance degradation in tasks involving high-resolution images. We propose an efficient visual token compression framework for text-oriented Vision Large Language Models (VLLMs) in high-resolution scenarios. Our approach significantly reduces computational overhead while outperforming the baselines across a range of text-oriented benchmarks.
arXiv Detail & Related papers (2025-02-22T16:05:33Z)
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression [45.37530855889661]
High-resolution images lead to a quadratic increase in the number of visual tokens input into Multi-modal Large Language Models. Current work develop visual token compression methods to achieve efficiency improvements, often at the expense of performance. We build a coarse-to-fine visual token compression method, with a vision-guided sampler for compressing redundant regions with low information density, and a text-guided sampler for selecting visual tokens that are strongly correlated with the user instructions.
arXiv Detail & Related papers (2024-11-21T15:37:52Z)
Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image. We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information [41.50379737105869]
We propose a text information-guided dynamic visual token recovery mechanism that does not require training. Our proposed method achieves comparable performance to the original approach while compressing the visual tokens to an average of 10% of the original quantity.
arXiv Detail & Related papers (2024-09-02T11:19:54Z)
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs. We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose a vision token pruning and merging method ELIP, to remove less influential tokens based on the supervision of language outputs. Our experiments demonstrate that with the removal of 30$%$ vision tokens across 12 ViT layers, ELIP maintains significantly comparable performance.
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
Effective End-to-End Vision Language Pretraining with Semantic Visual Loss [58.642954383282216]
Current vision language pretraining models are dominated by methods using region visual features extracted from object detectors. We introduce three types of visual losses that enable much faster convergence and better finetuning accuracy. Compared with region feature models, our end-to-end models could achieve similar or better performance on downstream tasks and run more than 10 times faster during inference.
arXiv Detail & Related papers (2023-01-18T00:22:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.