Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
- URL: http://arxiv.org/abs/2410.14072v1
- Date: Thu, 17 Oct 2024 22:45:13 GMT
- Title: Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
- Authors: Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, Mahyar Najibi
- Abstract summary: We propose a method that reduces the number of visual tokens by summarizing them into a smaller set of register tokens.
Victor shows less than a 4% accuracy drop while reducing the total training time by 43% and boosting the inference throughput by 3.3X.
- Score: 32.167072183575925
- Abstract: Recent advancements in vision-language models (VLMs) have expanded their potential for real-world applications, enabling these models to perform complex reasoning on images. In the widely used fully autoregressive transformer-based models like LLaVA, projected visual tokens are prepended to textual tokens. Oftentimes, visual tokens significantly outnumber prompt tokens, resulting in increased computational overhead during both training and inference. In this paper, we propose Visual Compact Token Registers (Victor), a method that reduces the number of visual tokens by summarizing them into a smaller set of register tokens. Victor adds a few learnable register tokens after the visual tokens and summarizes the visual information into these registers using the first few layers in the language tower of VLMs. After these few layers, all visual tokens are discarded, significantly improving computational efficiency for both training and inference. Notably, our method is easy to implement and requires a small number of new trainable parameters with minimal impact on model performance. In our experiment, with merely 8 visual registers--about 1% of the original tokens--Victor shows less than a 4% accuracy drop while reducing the total training time by 43% and boosting the inference throughput by 3.3X.
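The mechanism described in the abstract can be sketched in a few lines. The following is a minimal toy illustration, not the authors' implementation: `toy_layer` stands in for a real transformer layer, and the names (`victor_forward`, `summarize_layers`) are illustrative. The key idea it shows is that registers are appended after the visual tokens, attend to them for only the first few layers, and then the visual tokens are dropped so later layers run on a much shorter sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_layer(x):
    # Stand-in for a transformer layer: softmax self-attention-style mixing.
    attn = np.exp(x @ x.T / np.sqrt(x.shape[1]))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ x

def victor_forward(visual, registers, text, summarize_layers=2, total_layers=6):
    """Sketch of register summarization: visual tokens are kept only for
    the first `summarize_layers` layers, then discarded."""
    n_vis = visual.shape[0]
    # Sequence layout: [visual | registers | text]
    x = np.concatenate([visual, registers, text])
    for _ in range(summarize_layers):
        x = toy_layer(x)                 # registers absorb visual information
    x = x[n_vis:]                        # drop all visual tokens
    for _ in range(total_layers - summarize_layers):
        x = toy_layer(x)                 # remaining layers see a short sequence
    return x

visual = rng.normal(size=(576, 16))      # e.g. LLaVA-style 576 visual tokens
registers = rng.normal(size=(8, 16))     # 8 learnable registers (~1% of 576)
text = rng.normal(size=(32, 16))
out = victor_forward(visual, registers, text)
print(out.shape)  # (40, 16): 8 registers + 32 text tokens survive
```

In the real model the registers would be learned parameters and the layers would be the first blocks of the language tower; the sketch only illustrates where the sequence is truncated and why the later layers become cheaper.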
Related papers
- FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder.
Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
arXiv Detail & Related papers (2024-11-21T14:22:38Z)
- SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference [45.11612407862277]
In vision-language models (VLMs), visual tokens usually consume a significant amount of computational overhead.
We propose an efficient training-free token optimization mechanism dubbed SparseVLM without extra parameters or fine-tuning costs.
Experimental results show that our SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks.
arXiv Detail & Related papers (2024-10-06T09:18:04Z)
- VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach that reduces vision compute by letting redundant vision tokens skip layers, rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
- Matryoshka Query Transformer for Large Vision-Language Models [103.84600181927884]
We introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference.
We train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens.
Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576.
arXiv Detail & Related papers (2024-05-29T17:39:42Z)
- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
- Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference.
Our approach is inspired by two intriguing phenomena we have observed.
Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
arXiv Detail & Related papers (2024-05-09T14:38:53Z)
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models [35.88374542519597]
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model.
Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly.
We propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs.
arXiv Detail & Related papers (2024-03-22T17:59:52Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose a vision token pruning and merging method ELIP, to remove less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with the removal of 30% of vision tokens across 12 ViT layers, ELIP maintains comparable performance.
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
- How can objects help action recognition? [74.29564964727813]
We investigate how we can use knowledge of objects to design better video models.
First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens.
Second, we propose an object-aware attention module that enriches our feature representation with object information.
arXiv Detail & Related papers (2023-06-20T17:56:16Z)
- Revisiting Token Pruning for Object Detection and Instance Segmentation [25.3324628669201]
We investigate token pruning to accelerate inference for object detection and instance segmentation.
We show a reduction in performance decline from 1.5 mAP to 0.3 mAP in both boxes and masks, compared to existing token pruning methods.
arXiv Detail & Related papers (2023-06-12T11:55:33Z)
- TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [89.17394772676819]
We introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens.
Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks.
arXiv Detail & Related papers (2021-06-21T17:55:59Z)
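Several of the listed methods (FoPru, SparseVLM, LLaVA-PruMerge) share a common core: score visual tokens by how much attention they receive, then keep only the top-k. The sketch below illustrates that family in its simplest form; the scoring rule, names, and keep ratio are illustrative assumptions, not taken from any of the papers.

```python
import numpy as np

rng = np.random.default_rng(1)

def prune_by_attention(tokens, cls_query, keep_ratio=0.25):
    """Keep the tokens that receive the most attention from a query
    vector (e.g. a [CLS] token) -- a rough stand-in for the
    attention-significance pruning used by these methods."""
    # Scaled dot-product scores of the query against every token.
    scores = tokens @ cls_query / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    k = max(1, int(keep_ratio * tokens.shape[0]))
    # Indices of the top-k tokens, restored to sequence order.
    keep = np.sort(np.argsort(weights)[-k:])
    return tokens[keep], keep

tokens = rng.normal(size=(576, 16))   # 576 visual tokens, dim 16
cls_query = rng.normal(size=16)
pruned, kept = prune_by_attention(tokens, cls_query)
print(pruned.shape)  # (144, 16): 25% of the 576 tokens kept
```

The papers differ mainly in where the scores come from (vision-encoder attention, text-conditioned attention, or learned queries) and in whether discarded tokens are merged into the kept ones rather than dropped outright.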
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.