MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for
Accelerating Vision-Language Transformer
- URL: http://arxiv.org/abs/2403.02991v1
- Date: Tue, 5 Mar 2024 14:13:50 GMT
- Title: MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for
Accelerating Vision-Language Transformer
- Authors: Jianjian Cao and Peng Ye and Shengze Li and Chong Yu and Yansong Tang
and Jiwen Lu and Tao Chen
- Abstract summary: Vision-Language Transformers (VLTs) have shown great success recently, but are accompanied by heavy computation costs.
We propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs.
- Score: 66.71930982549028
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Transformers (VLTs) have shown great success recently, but they come with heavy computation costs, largely attributable to the large number of visual and language tokens. Existing token pruning research for compressing VLTs mainly follows a single-modality scheme and ignores the critical role of cross-modal alignment in guiding the token pruning process, so tokens that are important for one modality can be falsely pruned in the other modality's branch. Meanwhile, existing VLT pruning works also lack the flexibility to dynamically compress each layer based on different input samples. To this end, we propose a novel framework
named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for
accelerating various VLTs. Specifically, we first introduce a well-designed
Multi-modality Alignment Guidance (MAG) module that can align features of the
same semantic concept from different modalities, to ensure the pruned tokens
are less important for all modalities. We further design a novel Dynamic Token
Pruning (DTP) module, which can adaptively adjust the token compression ratio
in each layer based on different input instances. Extensive experiments on
various benchmarks demonstrate that MADTP significantly reduces the computational complexity of various kinds of multimodal models while preserving competitive performance. Notably, when applied to the BLIP model on the NLVR2 dataset, MADTP reduces GFLOPs by 80% with less than 4% performance degradation.
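To make the idea concrete, below is a minimal, hypothetical sketch (PyTorch; not the authors' released MAG/DTP code) of alignment-guided dynamic pruning: each visual token is scored by its best cosine similarity to the language tokens, and a per-sample threshold decides which tokens to keep, so the effective compression ratio varies with the input. Function and parameter names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def alignment_guided_keep_mask(vis_tokens: torch.Tensor,
                               txt_tokens: torch.Tensor,
                               keep_floor: float = 0.2) -> torch.Tensor:
    """Return a boolean mask (B, Nv) marking which visual tokens to keep.

    vis_tokens: (B, Nv, D) visual token features at some layer
    txt_tokens: (B, Nt, D) language token features at the same layer
    keep_floor: minimum fraction of visual tokens retained per sample
    """
    B, Nv, _ = vis_tokens.shape
    v = F.normalize(vis_tokens, dim=-1)
    t = F.normalize(txt_tokens, dim=-1)
    # Cross-modal alignment score: best cosine match of each visual token
    # against any language token.
    scores = torch.einsum("bvd,btd->bvt", v, t).max(dim=-1).values   # (B, Nv)
    # Dynamic, per-sample threshold: here simply the mean score, so inputs
    # whose tokens align weakly with the text are pruned more aggressively.
    thresh = scores.mean(dim=-1, keepdim=True)                        # (B, 1)
    keep = scores >= thresh                                           # (B, Nv)
    # Safety floor: always retain the top-k best-aligned tokens per sample.
    min_keep = max(1, int(keep_floor * Nv))
    topk = scores.topk(min_keep, dim=-1).indices                      # (B, min_keep)
    keep[torch.arange(B).unsqueeze(1), topk] = True
    return keep
```

In MADTP itself the alignment guidance (MAG) and the per-layer compression ratios (DTP) are learned modules; the fixed mean-score threshold above only illustrates how a per-sample, per-layer keep mask could be derived.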
Related papers
- VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation [18.9885501527331]
Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance, but at a substantial computational cost.
Image token pruning is one of the most effective strategies to address this complexity.
This work introduces Vision Language Guided Token Pruning (VLTP), a novel token pruning mechanism that can accelerate ViT-based segmentation models.
arXiv Detail & Related papers (2024-09-13T01:30:24Z) - AdapMTL: Adaptive Pruning Framework for Multitask Learning Model [5.643658120200373]
AdapMTL is an adaptive pruning framework for multitask models.
It balances sparsity allocation and accuracy performance across multiple tasks.
It showcases superior performance compared to state-of-the-art pruning methods.
arXiv Detail & Related papers (2024-08-07T17:19:15Z) - MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning [28.254318215697527]
Vision-Language models (VLMs) come with high computational costs due to their large number of parameters.
Existing techniques for VLMs are task-specific, and thus require pruning the network from scratch for each new task of interest.
We explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP).
We propose Multimodal Flow Pruning (MULTIFLOW), a first gradient-free pruning framework for TA-VLP.
arXiv Detail & Related papers (2024-04-08T15:51:21Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - Lightweight In-Context Tuning for Multimodal Unified Models [57.10831399642176]
MultiModal In-conteXt Tuning (M$^2$IXT) is a lightweight module to enhance the ICL capabilities of multimodal unified models.
When tuned on as few as 50K multimodal examples, M$^2$IXT can boost few-shot ICL performance significantly.
arXiv Detail & Related papers (2023-10-08T10:47:24Z) - FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing [88.6654909354382]
We present a pure transformer-based framework, dubbed the Flexible Modal Vision Transformer (FM-ViT) for face anti-spoofing.
FM-ViT can flexibly target any single-modal (i.e., RGB) attack scenario with the help of available multi-modal data.
Experiments demonstrate that the single model trained based on FM-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-05-05T04:28:48Z) - MA-ViT: Modality-Agnostic Vision Transformers for Face Anti-Spoofing [3.3031006227198003]
We present Modality-Agnostic Vision Transformer (MA-ViT), which aims to improve the performance of arbitrary modal attacks with the help of multi-modal data.
Specifically, MA-ViT adopts the early fusion to aggregate all the available training modalities data and enables flexible testing of any given modal samples.
Experiments demonstrate that the single model trained on MA-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-04-15T13:03:44Z) - Learning Visual Representation from Modality-Shared Contrastive
Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
Under the studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z) - Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z) - Dynamic Clone Transformer for Efficient Convolutional Neural Networks [0.0]
We introduce a novel concept termed multi-path fully connected pattern (MPFC) to rethink the interdependencies of topology pattern, accuracy and efficiency for ConvNets.
Inspired by MPFC, we propose a dual-branch module named dynamic clone transformer (DCT), where one branch generates multiple replicas of the inputs and another branch reforms those clones through a series of difference vectors conditioned on the inputs themselves to produce more variants.
arXiv Detail & Related papers (2021-06-12T13:42:28Z)