PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
- URL: http://arxiv.org/abs/2305.17530v1
- Date: Sat, 27 May 2023 17:16:27 GMT
- Title: PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
- Authors: Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi
- Abstract summary: PuMer is a framework that uses text-informed Pruning and modality-aware Merging strategies to progressively reduce the tokens of the input image and text.
PuMer increases inference throughput by up to 2x and reduces the memory footprint by over 50% while incurring less than a 1% accuracy drop.
- Score: 41.81484883647005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale vision language (VL) models use Transformers to perform
cross-modal interactions between the input text and image. These cross-modal
interactions are computationally expensive and memory-intensive due to the
quadratic complexity of processing the input image and text. We present PuMer:
a token reduction framework that uses text-informed Pruning and modality-aware
Merging strategies to progressively reduce the tokens of the input image and text,
improving model inference speed and reducing memory footprint. PuMer learns to
keep salient image tokens related to the input text and merges similar textual
and visual tokens by adding lightweight token reducer modules at several
cross-modal layers in the VL model. Training PuMer is mostly the same as
finetuning the original VL model but faster. Our evaluation of two vision
language models on four downstream VL tasks shows that PuMer increases inference
throughput by up to 2x and reduces the memory footprint by over 50% while
incurring less than a 1% accuracy drop.
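To make the mechanism in the abstract concrete, below is a minimal, hypothetical PyTorch-style sketch of the kind of lightweight token reducer it describes: image tokens are scored against a pooled text representation and only the top-scoring ones are kept (text-informed pruning), and the most similar tokens within a modality are then folded together (modality-aware merging). The `TokenReducer` interface, the linear saliency scorer, the keep ratio, and the ToMe-style bipartite merge are illustrative assumptions, not PuMer's released design.

```python
# A minimal, hypothetical sketch of a lightweight token reducer in the spirit of
# PuMer. The saliency scorer, keep ratio, and ToMe-style bipartite merge below
# are illustrative assumptions, not the paper's released design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenReducer(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.7, merge_num: int = 8):
        super().__init__()
        self.keep_ratio = keep_ratio         # fraction of image tokens kept after pruning
        self.merge_num = merge_num           # number of token pairs merged per call
        self.scorer = nn.Linear(2 * dim, 1)  # text-informed saliency head (assumed form)

    def prune_image_tokens(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        """Text-informed pruning: keep the image tokens scored as salient for the text."""
        ctx = txt.mean(dim=1, keepdim=True).expand(-1, img.size(1), -1)    # pooled text context
        saliency = self.scorer(torch.cat([img, ctx], dim=-1)).squeeze(-1)  # (B, N_img)
        k = max(1, int(self.keep_ratio * img.size(1)))
        idx = saliency.topk(k, dim=1).indices                              # indices of kept tokens
        return img.gather(1, idx.unsqueeze(-1).expand(-1, -1, img.size(-1)))

    def merge_tokens(self, x: torch.Tensor) -> torch.Tensor:
        """Modality-aware merging: fold the most similar token pairs within one modality
        into single tokens (a ToMe-style bipartite-matching stand-in)."""
        B, N, _ = x.shape
        r = min(self.merge_num, N // 2)
        if r == 0:
            return x
        a, b = x[:, 0::2], x[:, 1::2].clone()            # split tokens into two alternating sets
        sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(1, 2)
        best_sim, best_b = sim.max(dim=-1)               # best partner in b for each a-token
        merge_idx = best_sim.topk(r, dim=1).indices      # a-tokens to merge away
        out = []
        for i in range(B):                               # per-example loop, written for clarity
            bi = b[i]
            for j in merge_idx[i].tolist():
                p = best_b[i, j].item()
                bi[p] = 0.5 * (bi[p] + a[i, j])          # average the a-token into its partner
            keep = torch.ones(a.size(1), dtype=torch.bool, device=x.device)
            keep[merge_idx[i]] = False
            out.append(torch.cat([a[i][keep], bi], dim=0))
        return torch.stack(out)

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        img = self.prune_image_tokens(img, txt)          # drop image tokens unrelated to the text
        return self.merge_tokens(img), self.merge_tokens(txt)
```

In the full model, a reducer of this kind would sit after several cross-modal layers so that every subsequent layer attends over progressively fewer tokens, which is where the reported throughput and memory savings would come from.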
Related papers
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing the translation in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of the parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment [40.63340635482609]
Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner.
We advocate for assigning distinct contributions for each text token based on its visual correlation.
We introduce Contrastive ALignment (CAL), a simple yet effective re-weighting strategy that prioritizes training visually correlated tokens.
arXiv Detail & Related papers (2024-05-28T06:44:13Z)
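A rough sketch of the re-weighting idea in the CAL entry above. It assumes that a text token's visual correlation can be measured by how much its log-likelihood improves when the image is present versus absent, and that this score is used to re-weight the per-token autoregressive loss; the function below and its normalization choices are illustrative, not CAL's exact formulation.

```python
# Hypothetical sketch of visually-correlated token re-weighting in the spirit of
# the CAL summary above. Assumption: a text token's visual correlation is the
# improvement in its log-likelihood when the image is present versus absent.
import torch
import torch.nn.functional as F


def reweighted_lm_loss(logits_with_img: torch.Tensor,
                       logits_without_img: torch.Tensor,
                       labels: torch.Tensor,
                       ignore_index: int = -100) -> torch.Tensor:
    """logits_*: (B, T, V) next-token logits; labels: (B, T) target token ids."""
    logp_img = F.log_softmax(logits_with_img, dim=-1)
    logp_txt = F.log_softmax(logits_without_img, dim=-1)
    safe = labels.clamp(min=0)                                    # avoid gather on ignore_index
    ll_img = logp_img.gather(-1, safe.unsqueeze(-1)).squeeze(-1)  # (B, T)
    ll_txt = logp_txt.gather(-1, safe.unsqueeze(-1)).squeeze(-1)
    mask = (labels != ignore_index).float()
    # Tokens whose likelihood improves when the image is visible count as visually correlated.
    weight = (ll_img - ll_txt).clamp(min=0.0).detach() * mask
    weight = weight / (weight.sum(dim=-1, keepdim=True) + 1e-6)   # normalize per example
    token_nll = -ll_img                                           # standard per-token NLL
    return (weight * token_nll).sum() / labels.size(0)
```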
- Accelerating Transformers with Spectrum-Preserving Token Merging [43.463808781808645]
PiToMe prioritizes the preservation of informative tokens using an additional metric termed the energy score.
Experimental findings demonstrate that PiToMe saves 40-60% of the base models' FLOPs.
arXiv Detail & Related papers (2024-05-25T09:37:01Z)
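A toy reading of the PiToMe entry above, assuming the energy score of a token is its average cosine similarity to the other tokens, so that high-energy tokens sitting in dense, redundant clusters are merged into their nearest neighbours first while isolated, informative tokens survive. The greedy loop below is only a sketch and ignores the spectrum-preserving aspect named in the title.

```python
# Toy energy-score merging, as one possible reading of the PiToMe summary above.
# Assumption: energy(token) = mean cosine similarity to the other tokens, so
# tokens in dense, redundant clusters are merged first and outliers are kept.
import torch
import torch.nn.functional as F


def energy_merge(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """tokens: (N, D); r: number of tokens to merge away. Returns (N - r, D) when r < N."""
    x = tokens
    for _ in range(r):
        n = x.size(0)
        if n < 2:
            break
        xn = F.normalize(x, dim=-1)
        sim = xn @ xn.T                                  # pairwise cosine similarities
        sim.fill_diagonal_(0.0)
        energy = sim.sum(dim=-1) / (n - 1)               # mean similarity = "energy"
        i = int(energy.argmax())                         # most redundant token
        row = sim[i].clone()
        row[i] = -2.0                                    # never merge a token with itself
        j = int(row.argmax())                            # its nearest neighbour
        merged = 0.5 * (x[i] + x[j])
        keep = [k for k in range(n) if k not in (i, j)]
        x = torch.cat([x[keep], merged.unsqueeze(0)], dim=0)  # replace the pair by its mean
    return x
```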
- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models [35.88374542519597]
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model.
Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly.
We propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs.
arXiv Detail & Related papers (2024-03-22T17:59:52Z)
- Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We propose a learning-based vision-language pre-training approach that represents both modalities with finite discrete tokens rather than the patch and token embeddings used in models such as CLIP.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)
- Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training).
The proposed EPIC method is easily combined with existing pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance to single-stream methods while being 10,800x faster at inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state-of-the-art on the widely used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.