AdaFV: Rethinking of Visual-Language alignment for VLM acceleration
- URL: http://arxiv.org/abs/2501.09532v2
- Date: Sat, 01 Feb 2025 06:13:27 GMT
- Title: AdaFV: Rethinking of Visual-Language alignment for VLM acceleration
- Authors: Jiayi Han, Liang Du, Yiwen Wu, Xiangguo Zhou, Hongwei Du, Weibo Zheng
- Abstract summary: Some approaches reduce the visual tokens according to the self-attention of VLMs, which is biased and leads to inaccurate responses. We propose a self-adaptive cross-modality attention mixture mechanism that dynamically leverages the effectiveness of visual saliency and text-to-image similarity. The proposed approach achieves state-of-the-art training-free VLM acceleration performance, especially when the reduction rate is sufficiently large.
- Score: 7.9213473377478865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The success of VLMs often relies on the dynamic high-resolution scheme that adaptively augments the input image into multiple crops, so that the details of the image can be retained. However, such approaches produce a large number of redundant visual tokens, significantly reducing the efficiency of the VLMs. To improve the VLMs' efficiency without introducing extra training costs, many works have been proposed to reduce the visual tokens by filtering out uninformative visual tokens or aggregating their information. Some approaches reduce the visual tokens according to the self-attention of VLMs, which is biased and leads to inaccurate responses. Token reduction approaches that rely solely on visual cues are text-agnostic and fail to focus on the areas most relevant to the question, especially when the queried objects are not salient in the image. In this work, we first conduct experiments to show that the original text embeddings are aligned with the visual tokens, without bias toward the tailed visual tokens. We then propose a self-adaptive cross-modality attention mixture mechanism that dynamically leverages the effectiveness of visual saliency and text-to-image similarity in the pre-LLM layers to select the visual tokens that are informative. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art training-free VLM acceleration performance, especially when the reduction rate is sufficiently large.
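To make the selection step concrete, below is a minimal sketch, not the authors' released implementation, of text-aware visual token selection: it mixes a visual-saliency cue (e.g., [CLS]-to-patch attention from the vision encoder) with text-to-image cosine similarity and keeps only the top-k visual tokens before they enter the LLM. The function name, tensor shapes, and the adaptive mixing rule are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def select_visual_tokens(
    visual_tokens: torch.Tensor,   # (N, d) projected visual tokens for one image
    cls_attention: torch.Tensor,   # (N,) [CLS]-to-patch attention, a visual-saliency cue
    text_embedding: torch.Tensor,  # (d,) pooled text/query embedding
    keep_ratio: float = 0.25,
) -> torch.Tensor:
    """Keep the most informative visual tokens under a fixed reduction rate."""
    eps = 1e-6

    # Text-to-image similarity: cosine similarity between each visual token
    # and the pooled text embedding.
    sim = F.cosine_similarity(visual_tokens, text_embedding.unsqueeze(0), dim=-1)

    # Normalize both cues to [0, 1] so they can be mixed on a common scale.
    saliency = (cls_attention - cls_attention.min()) / (
        cls_attention.max() - cls_attention.min() + eps
    )
    sim = (sim - sim.min()) / (sim.max() - sim.min() + eps)

    # Self-adaptive mixing weight (an illustrative heuristic, not the paper's exact rule):
    # lean on text-to-image similarity when it is sharply peaked across tokens,
    # otherwise fall back on visual saliency.
    alpha = sim.var() / (sim.var() + saliency.var() + eps)
    score = alpha * sim + (1.0 - alpha) * saliency

    k = max(1, int(keep_ratio * visual_tokens.size(0)))
    keep_idx = score.topk(k).indices.sort().values  # preserve the original spatial order
    return visual_tokens[keep_idx]
```

In an actual VLM pipeline such a selection would run in the pre-LLM layers on the projected visual tokens, so the keep ratio directly controls the LLM's input sequence length and therefore the inference cost.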
Related papers
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference [28.24397677839652]
Multimodal large language models (MLLMs) improve performance on vision-language tasks by integrating visual features from pre-trained vision encoders into large language models.
How MLLMs process and utilize visual information remains unclear.
We propose Hierarchical Modality-Aware Pruning (HiMAP), a plug-and-play inference acceleration method that dynamically prunes image tokens at specific layers, reducing computational costs by approximately 65% without sacrificing performance.
arXiv Detail & Related papers (2025-03-17T12:31:23Z) - Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection [53.558449071113245]
Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM).
Recent advancements in vision-language modeling introduce image cropping techniques that feed all encoded sub-images into the model.
We propose a lightweight, universal framework that seamlessly integrates with existing VLMs to enhance their ability to process fine-grained details.
arXiv Detail & Related papers (2025-03-14T18:33:31Z) - Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images [7.823336661261962]
Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors.
We propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details.
arXiv Detail & Related papers (2025-02-19T18:05:42Z) - [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs [66.5266435598799]
Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision tasks. However, their efficient deployment remains a substantial challenge due to high computational costs and memory requirements. We introduce a simple yet effective method for training-free visual token compression, called VTC-compression.
arXiv Detail & Related papers (2024-12-08T05:29:39Z) - A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs [65.00970402080351]
A promising approach to accelerating large vision-language models (VLMs) is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. Our study reveals three key insights: (i) partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains comparable performance under aggressive pruning; and (iii) the global attention map aggregated from a small VLM closely resembles that of a large VLM.
arXiv Detail & Related papers (2024-12-04T13:56:44Z) - Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to their broader adoption. Compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z) - FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder.
Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
arXiv Detail & Related papers (2024-11-21T14:22:38Z) - Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image.
We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z) - Efficient Large Multi-modal Models via Visual Context Compression [23.966237939194514]
We present a study analyzing the redundancy of visual tokens and efficient training within large language models.
Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simply average pooling only leads to a minimal 3% reduction in visual question answering accuracy.
We introduce a visual context compression method, evaluated on the GQA benchmark, which reduces the number of visual tokens to enhance training and inference efficiency without sacrificing performance.
arXiv Detail & Related papers (2024-06-28T17:57:14Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.