SmartTrim: Adaptive Tokens and Attention Pruning for Efficient
Vision-Language Models
- URL: http://arxiv.org/abs/2305.15033v2
- Date: Mon, 26 Feb 2024 15:26:20 GMT
- Title: SmartTrim: Adaptive Tokens and Attention Pruning for Efficient
Vision-Language Models
- Authors: Zekun Wang, Jingchang Chen, Wangchunshu Zhou, Haichao Zhu, Jiafeng
Liang, Liping Shan, Ming Liu, Dongliang Xu, Qing Yang, Bing Qin
- Abstract summary: We propose SmartTrim, an adaptive acceleration framework for Vision-Language Models (VLMs).
We integrate lightweight modules into the original backbone to identify and prune redundant token representations and attention heads within each layer.
We devise a self-distillation strategy to enhance the consistency between the predictions of the pruned model and its full-capacity counterpart.
- Score: 35.5601603013045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite achieving remarkable performance on various vision-language tasks,
Transformer-based Vision-Language Models (VLMs) suffer from redundancy in
inputs and parameters, significantly hampering their efficiency in real-world
applications. Moreover, the degree of redundancy in token representations and
model parameters, such as attention heads, varies significantly for different
inputs. In light of the challenges, we propose SmartTrim, an adaptive
acceleration framework for VLMs, which adjusts the computational overhead per
instance. Specifically, we integrate lightweight modules into the original
backbone to identify and prune redundant token representations and attention
heads within each layer. Furthermore, we devise a self-distillation strategy to
enhance the consistency between the predictions of the pruned model and its
full-capacity counterpart. Experimental results across various vision-language
tasks consistently demonstrate that SmartTrim accelerates the original model by
2-3 times with minimal performance degradation, highlighting its effectiveness
and efficiency compared to previous approaches. Code will be available at
https://github.com/kugwzk/SmartTrim.
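A minimal PyTorch sketch of the two ingredients the abstract describes is given below: a lightweight gating module that scores token representations and attention heads per instance so low-scoring ones can be pruned, and a self-distillation loss that keeps the pruned model's predictions consistent with its full-capacity counterpart. The names (TokenHeadGate, self_distillation_loss), the sigmoid/threshold gating, and the KL formulation are illustrative assumptions rather than the authors' implementation; see the linked repository for the real code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenHeadGate(nn.Module):
    """Lightweight gate that scores tokens and attention heads for one layer."""

    def __init__(self, hidden_dim: int, num_heads: int):
        super().__init__()
        self.token_scorer = nn.Linear(hidden_dim, 1)          # one logit per token
        self.head_scorer = nn.Linear(hidden_dim, num_heads)   # one logit per head

    def forward(self, hidden_states: torch.Tensor, temperature: float = 1.0):
        # hidden_states: (batch, seq_len, hidden_dim)
        token_logits = self.token_scorer(hidden_states).squeeze(-1)   # (B, L)
        head_logits = self.head_scorer(hidden_states.mean(dim=1))     # (B, H)
        if self.training:
            # Soft, differentiable masks during training; a Gumbel-Softmax or
            # straight-through estimator could be substituted for harder decisions.
            token_mask = torch.sigmoid(token_logits / temperature)
            head_mask = torch.sigmoid(head_logits / temperature)
        else:
            # Hard pruning at inference time: drop low-scoring tokens and heads.
            token_mask = (token_logits > 0).float()
            head_mask = (head_logits > 0).float()
        return token_mask, head_mask


def self_distillation_loss(pruned_logits: torch.Tensor,
                           full_logits: torch.Tensor,
                           temperature: float = 2.0) -> torch.Tensor:
    """KL term pulling the pruned forward pass toward the full-capacity one."""
    teacher = F.softmax(full_logits.detach() / temperature, dim=-1)
    student = F.log_softmax(pruned_logits / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2


if __name__ == "__main__":
    gate = TokenHeadGate(hidden_dim=768, num_heads=12)
    x = torch.randn(2, 50, 768)                  # e.g. fused vision-language tokens
    token_mask, head_mask = gate(x)
    print(token_mask.shape, head_mask.shape)     # torch.Size([2, 50]) torch.Size([2, 12])
    loss = self_distillation_loss(torch.randn(2, 10), torch.randn(2, 10))
    print(float(loss))
```

In a full model, one such gate would sit in every layer, and the distillation term would be added to the task loss so the pruned forward pass tracks the unpruned one.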
Related papers
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average of 0.77% accuracy on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
- Exploring Token Pruning in Vision State Space Models [38.122017567843905]
State Space Models (SSMs) have the advantage of linear computational complexity, in contrast to the attention modules in transformers.
We take the novel step of enhancing the efficiency of SSM-based vision models through token-based pruning.
We achieve 81.7% accuracy on ImageNet with a 41.6% reduction in FLOPs for the pruned PlainMamba-L3.
arXiv Detail & Related papers (2024-09-27T17:59:50Z)
- Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving [9.900979396513687]
Multimodal large language models (MLLMs) have demonstrated remarkable potential for enhancing scene understanding in autonomous driving systems.
One major limitation arises from the large number of visual tokens required to capture fine-grained and long-context visual information.
We propose Video Token Sparsification (VTS) to significantly reduce the total number of visual tokens while preserving the most salient information.
arXiv Detail & Related papers (2024-09-16T05:31:01Z)
- Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
SIMA is a framework that enhances visual and language modality alignment through self-improvement.
It employs an in-context self-critic mechanism to select response pairs for preference tuning.
We demonstrate that SIMA achieves superior modality alignment, outperforming previous approaches.
arXiv Detail & Related papers (2024-05-24T23:09:27Z)
- E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E^2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
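The eP-ALM recipe summarized above (freeze essentially all LM parameters, train a single linear projection from visual features into the LM's input space, and prepend one trainable token) is concrete enough to sketch. The class and argument names below (PerceptualAdapter, visual_dim) and the stand-in Transformer encoder are assumptions for illustration, not the authors' code; with a real multi-billion-parameter LM the trainable share would be well under 1%.

```python
import torch
import torch.nn as nn


class PerceptualAdapter(nn.Module):
    """Hypothetical wrapper: frozen LM + one trainable projection + one soft token."""

    def __init__(self, language_model: nn.Module, visual_dim: int, embed_dim: int):
        super().__init__()
        self.lm = language_model
        for p in self.lm.parameters():                     # freeze the whole LM
            p.requires_grad = False
        self.proj = nn.Linear(visual_dim, embed_dim)       # only trainable layer
        self.soft_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)

    def forward(self, visual_feats: torch.Tensor, text_embeds: torch.Tensor):
        # visual_feats: (B, N_v, visual_dim); text_embeds: (B, N_t, embed_dim)
        vis = self.proj(visual_feats)
        tok = self.soft_token.expand(vis.size(0), -1, -1)
        return self.lm(torch.cat([tok, vis, text_embeds], dim=1))


if __name__ == "__main__":
    # Small Transformer encoder as a stand-in for a frozen pretrained LM.
    lm = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=2,
    )
    model = PerceptualAdapter(lm, visual_dim=768, embed_dim=512)
    out = model(torch.randn(2, 16, 768), torch.randn(2, 10, 512))
    print(out.shape)  # (2, 27, 512): 1 soft token + 16 visual + 10 text positions
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable fraction: {trainable / total:.2%}")  # tiny vs. a real LLM
```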
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
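CALM realizes this per-timestep compute allocation through early exiting: intermediate layers produce a prediction, and decoding stops propagating through the remaining layers once a confidence measure clears a threshold. The sketch below illustrates the idea only; the stand-in encoder layers, the max-softmax confidence, and the fixed threshold are simplifications of the calibrated measures the paper uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def early_exit_forward(layers: nn.ModuleList,
                       lm_head: nn.Linear,
                       hidden: torch.Tensor,
                       threshold: float = 0.9):
    """Run layers one at a time; exit early when the prediction looks confident.

    hidden: (1, seq_len, d_model) hidden states for a single decoding step.
    Returns the next-token logits and the number of layers actually executed.
    """
    used = 0
    logits = None
    for layer in layers:
        hidden = layer(hidden)
        used += 1
        logits = lm_head(hidden[:, -1])                 # intermediate prediction
        confidence = F.softmax(logits, dim=-1).max().item()
        if confidence >= threshold:                     # confident: skip the rest
            break
    return logits, used


if __name__ == "__main__":
    d_model, vocab = 256, 1000
    # Encoder layers stand in for the decoder layers of a real LM.
    layers = nn.ModuleList(
        [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
         for _ in range(12)]
    )
    lm_head = nn.Linear(d_model, vocab)
    logits, used = early_exit_forward(layers, lm_head, torch.randn(1, 8, d_model))
    print(logits.shape, f"layers used: {used}/12")
```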
- Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks [88.77951448313486]
We present a new approach for model acceleration by exploiting spatial sparsity in visual data.
We propose a dynamic token sparsification framework to prune redundant tokens.
We extend our method to hierarchical models including CNNs and hierarchical vision Transformers.
arXiv Detail & Related papers (2022-07-04T17:00:51Z)
- MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation [14.866949449862226]
Vision Transformer (ViT) models are too computationally expensive to be deployed on real-world resource-constrained devices.
We propose a Multi-grained Input-adaptive Vision Transformer framework dubbed MIA-Former that can adaptively adjust the structure of ViTs to each input.
Experiments and ablation studies validate that the proposed MIA-Former framework can effectively allocate budgets adaptive to the difficulty of input images.
arXiv Detail & Related papers (2021-12-21T22:06:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.