PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
- URL: http://arxiv.org/abs/2305.17530v1
- Date: Sat, 27 May 2023 17:16:27 GMT
- Title: PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
- Authors: Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi
- Abstract summary: PuMer is a framework that uses text-informed Pruning and modality-aware Merging strategies to progressively reduce the number of input image and text tokens.
PuMer increases inference throughput by up to 2x and reduces memory footprint by over 50% while incurring less than a 1% accuracy drop.
- Score: 41.81484883647005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale vision language (VL) models use Transformers to perform
cross-modal interactions between the input text and image. These cross-modal
interactions are computationally expensive and memory-intensive due to the
quadratic complexity of processing the input image and text. We present PuMer:
a token reduction framework that uses text-informed Pruning and modality-aware
Merging strategies to progressively reduce the number of input image and text tokens,
improving model inference speed and reducing memory footprint. PuMer learns to
keep salient image tokens related to the input text and merges similar textual
and visual tokens by adding lightweight token reducer modules at several
cross-modal layers in the VL model. Training PuMer is mostly the same as
finetuning the original VL model but faster. Our evaluation of two vision
language models on four downstream VL tasks shows that PuMer increases inference
throughput by up to 2x and reduces memory footprint by over 50% while incurring
less than a 1% accuracy drop.
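To make the abstract's pruning-and-merging idea concrete, the following is a minimal PyTorch sketch of a token reducer module. It is an illustrative assumption rather than the authors' released implementation: the class name TokenReducer, the linear saliency head, the keep/merge ratios, and the simplified bipartite merging over image tokens only are hypothetical choices loosely following the description of text-informed pruning and similarity-based merging.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenReducer(nn.Module):
    """Hypothetical token reducer: text-informed pruning, then similarity merging."""

    def __init__(self, dim: int, keep_ratio: float = 0.8, merge_ratio: float = 0.1):
        super().__init__()
        self.keep_ratio = keep_ratio       # fraction of image tokens kept after pruning
        self.merge_ratio = merge_ratio     # fraction of kept tokens merged away
        self.saliency = nn.Linear(dim, 1)  # lightweight text-conditioned scoring head

    def forward(self, img_tokens: torch.Tensor, text_cls: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N, D) image tokens; text_cls: (B, D) pooled text representation.
        B, N, D = img_tokens.shape

        # 1) Text-informed pruning: score each image token conditioned on the text
        #    and keep only the top-k most salient tokens.
        scores = self.saliency(img_tokens + text_cls.unsqueeze(1)).squeeze(-1)  # (B, N)
        k = max(2, int(N * self.keep_ratio))
        keep_idx = scores.topk(k, dim=1).indices
        kept = img_tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))   # (B, k, D)

        # 2) Similarity-based merging (simplified bipartite matching): split the kept
        #    tokens into two sets, pick the most redundant source tokens, and average
        #    each into its nearest destination token.
        m = max(1, int(k * self.merge_ratio))
        src, dst = kept[:, ::2], kept[:, 1::2]
        sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).transpose(1, 2)
        best_sim, best_dst = sim.max(dim=-1)                                     # (B, S)
        merge_src = best_sim.topk(m, dim=1).indices                              # (B, m)

        out = []
        for b in range(B):
            d = dst[b].clone()
            s_idx = merge_src[b]
            d_idx = best_dst[b, s_idx]
            counts = torch.ones(d.size(0), 1, device=d.device)
            d.index_add_(0, d_idx, src[b, s_idx])                # accumulate merged tokens
            counts.index_add_(0, d_idx, torch.ones(m, 1, device=d.device))
            d = d / counts                                       # average merged slots
            keep_src = torch.ones(src.size(1), dtype=torch.bool, device=d.device)
            keep_src[s_idx] = False                              # drop merged-away sources
            out.append(torch.cat([src[b][keep_src], d], dim=0))  # (k - m, D)
        return torch.stack(out)


if __name__ == "__main__":
    reducer = TokenReducer(dim=768)
    img = torch.randn(2, 196, 768)     # e.g. 14x14 ViT patch tokens
    txt = torch.randn(2, 768)          # pooled text representation
    print(reducer(img, txt).shape)     # fewer than 196 tokens per image
```

Per the abstract, reducers like this would sit at several cross-modal layers so the token sequences shrink progressively during inference; the sketch above covers only image tokens, whereas PuMer also merges similar textual tokens.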
Related papers
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
A multiple instance learning (MIL)-based framework has become the mainstream approach for processing whole slide images (WSIs).
We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z) - Enhancing Multimodal Large Language Models Complex Reasoning via Similarity Computation [7.742746565876165]
The interpretability of LVLMs remains an under-explored area.
In models such as LLaVA1.5, image tokens that are semantically related to text are more likely to have information flow convergence.
We propose a new image token reduction method, Simignore, which aims to improve the complex reasoning ability of LVLMs (a rough similarity-scoring sketch appears after this list).
arXiv Detail & Related papers (2024-12-13T03:13:44Z) - The Narrow Gate: Localized Image-Text Communication in Vision-Language Models [36.33608889682152]
We compare vision-language models (VLMs) that generate both images and text with those that output only text.
We find that in models with multimodal outputs, image and text embeddings are more separated within the residual stream.
In contrast, models trained for image and text generation rely on a single token that acts as a narrow gate for the visual information.
arXiv Detail & Related papers (2024-12-09T16:39:40Z) - iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models [24.0346607116299]
We introduce iLLaVA, a simple method that can be seamlessly deployed upon current Large Vision-Language Models (LVLMs).
iLLaVA achieves this by finding and gradually merging the redundant tokens with an accurate and fast algorithm.
On tasks across different domains, including single-image, multi-image, and video settings, iLLaVA demonstrates strong generalizability with consistently promising efficiency.
arXiv Detail & Related papers (2024-12-09T07:22:19Z) - AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions.
We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images.
Our model is capable of processing images with resolutions up to $1008 \times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z) - Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing translations in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z) - LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT achieves a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z) - Revisiting Multimodal Representation in Contrastive Learning: From Patch
and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
Building on contrastive learning-based vision-language pre-training approaches such as CLIP, we propose to represent both images and texts with finite discrete tokens.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z) - Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training).
The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)
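For contrast with PuMer's learned pruning head, the Simignore entry above describes selecting image tokens by their similarity to the text. The sketch below is a hypothetical, training-free illustration of that general idea, not the Simignore authors' code; the function name, the mean-pooled text query, and the keep ratio are assumptions.

```python
import torch
import torch.nn.functional as F


def select_image_tokens(image_tokens: torch.Tensor,
                        text_tokens: torch.Tensor,
                        keep_ratio: float = 0.3) -> torch.Tensor:
    """Keep the image tokens most similar to the text and ignore the rest.

    image_tokens: (N, D) image patch embeddings
    text_tokens:  (M, D) text token embeddings
    """
    # Pool the text into a single query vector (a simple mean; an assumption).
    text_query = F.normalize(text_tokens.mean(dim=0), dim=-1)   # (D,)
    image_norm = F.normalize(image_tokens, dim=-1)               # (N, D)
    similarity = image_norm @ text_query                         # (N,) cosine similarities
    k = max(1, int(image_tokens.size(0) * keep_ratio))
    keep_idx = similarity.topk(k).indices.sort().values          # keep original token order
    return image_tokens[keep_idx]                                # (k, D)


# Example: keep ~30% of 576 visual tokens (a 24x24 grid) given a 32-token prompt.
img = torch.randn(576, 4096)
txt = torch.randn(32, 4096)
print(select_image_tokens(img, txt).shape)  # torch.Size([172, 4096])
```

A training-free heuristic like this can be applied without changing the model, whereas PuMer's token reducers are learned while finetuning the VL model.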