iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
- URL: http://arxiv.org/abs/2412.06263v1
- Date: Mon, 09 Dec 2024 07:22:19 GMT
- Title: iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
- Authors: Lianyu Hu, Fanhua Shang, Liang Wan, Wei Feng
- Abstract summary: We introduce iLLaVA, a simple method that can be seamlessly deployed on current Large Vision-Language Models (LVLMs).
iLLaVA achieves this by finding and gradually merging the redundant tokens with an accurate and fast algorithm.
On tasks across different domains, including single-image, multi-image and video, iLLaVA demonstrates strong generalizability with consistently promising efficiency.
- Score: 24.0346607116299
- Abstract: In this paper, we introduce iLLaVA, a simple method that can be seamlessly deployed on current Large Vision-Language Models (LVLMs) to greatly increase throughput with nearly lossless performance, without requiring further training. iLLaVA achieves this by finding and gradually merging redundant tokens with an accurate and fast algorithm that can merge hundreds of tokens within a single step. While some previous methods have explored directly pruning or merging tokens at the inference stage to accelerate models, our method excels in both performance and throughput thanks to two key designs. First, whereas most previous methods only try to save computation in the Large Language Model (LLM), our method accelerates the forward pass of both the image encoder and the LLM in LVLMs, each of which occupies a significant share of inference time. Second, our method recycles the beneficial information from pruned tokens into the remaining tokens, avoiding the performance loss that previous methods incur by directly dropping context tokens. iLLaVA can increase throughput by nearly 2$\times$ and reduce memory costs by half, with only a 0.2\%-0.5\% performance drop across models of different scales, including 7B, 13B and 34B. On tasks across different domains, including single-image, multi-image and video, iLLaVA demonstrates strong generalizability with consistently promising efficiency. We finally provide extensive visualizations of iLLaVA's merging process at each step, which offer insights into how computing resources are distributed in LVLMs. Code is available at https://github.com/hulianyuyy/iLLaVA.
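As a rough illustration of the token-merging idea described in the abstract, the sketch below implements a generic similarity-based merge in PyTorch: tokens are split into two sets, each token is matched to its most similar partner, and the most redundant ones are averaged into their partners so their information is recycled rather than dropped. The function name, the alternating split, and the plain averaging are illustrative assumptions, not the authors' exact algorithm.

```python
# Sketch of similarity-based token merging (assumed mechanism, not iLLaVA's
# released algorithm): the most redundant tokens are averaged into their most
# similar partners, so their information is recycled rather than discarded.
import torch
import torch.nn.functional as F


def merge_redundant_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """x: (B, N, C) visual tokens; r: number of tokens to remove by merging."""
    B, _, C = x.shape
    a, b = x[:, ::2, :], x[:, 1::2, :]            # alternating split keeps matching cheap
    sim = F.normalize(b, dim=-1) @ F.normalize(a, dim=-1).transpose(-1, -2)  # (B, Nb, Na)
    best_sim, best_dst = sim.max(dim=-1)          # most similar partner in A for each B token
    order = best_sim.argsort(dim=-1, descending=True)
    merge_src, keep_src = order[:, :r], order[:, r:]   # the r most redundant B tokens get merged
    out_a = a.clone()
    for i in range(B):                            # plain loop for clarity
        dst = best_dst[i, merge_src[i]]           # destination indices in A
        out_a[i].index_add_(0, dst, b[i, merge_src[i]])
        counts = torch.ones(a.shape[1], 1, dtype=x.dtype, device=x.device)
        counts.index_add_(0, dst, torch.ones(dst.shape[0], 1, dtype=x.dtype, device=x.device))
        out_a[i] = out_a[i] / counts              # average merged tokens into their partners
    kept_b = torch.gather(b, 1, keep_src.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([out_a, kept_b], dim=1)      # (B, N - r, C)
```

For a 576-token image, a single call such as `merge_redundant_tokens(tokens, 200)` would leave 376 tokens, which gives the flavor of the one-step, hundreds-of-tokens merge the abstract describes.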
Related papers
- ST$^3$: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming [14.937905258757635]
ST$^3$ is a framework designed to accelerate MLLM inference without retraining.
ST$^3$ can be seamlessly integrated into existing pre-trained MLLMs.
arXiv Detail & Related papers (2024-12-28T10:17:29Z)
- PruneVid: Visual Token Pruning for Efficient Video Large Language Models [24.889834611542955]
We introduce PruneVid, a visual token pruning method designed to enhance the efficiency of multi-modal video understanding.
LLMs have shown promising performance in video tasks due to their extended capabilities in comprehending visual modalities.
We validate our method across multiple video benchmarks, which demonstrate that PruneVid can prune over 80% of tokens.
arXiv Detail & Related papers (2024-12-20T18:01:58Z)
- Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction [62.8375542401319]
Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone.
The number of vision tokens increases quadratically with image resolution, leading to huge computational costs.
We propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer from the shallow to the deep.
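A minimal sketch of what such a greedy, shallow-to-deep budget search could look like is given below; the `eval_accuracy` callback, the candidate ratios, and the tolerance are illustrative assumptions rather than the paper's exact procedure.

```python
# Hedged sketch of a greedy, shallow-to-deep search for per-layer vision-token
# budgets. The evaluation callback and candidate ratios are assumptions made
# for illustration, not the G-Search algorithm as published.
from typing import Callable, List


def greedy_token_budget_search(
    num_layers: int,
    full_tokens: int,
    eval_accuracy: Callable[[List[int]], float],  # accuracy given per-layer token budgets
    tolerance: float = 0.005,                     # allowed absolute accuracy drop
) -> List[int]:
    budgets = [full_tokens] * num_layers
    baseline = eval_accuracy(budgets)
    candidates = [int(full_tokens * r) for r in (0.75, 0.5, 0.25, 0.1)]
    for layer in range(num_layers):               # shallow to deep
        for budget in candidates:                 # try progressively fewer tokens
            trial = budgets.copy()
            # Assume budgets are non-increasing with depth: tokens dropped at
            # this layer cannot reappear in deeper layers.
            for deeper in range(layer, num_layers):
                trial[deeper] = min(trial[deeper], budget)
            if eval_accuracy(trial) >= baseline - tolerance:
                budgets = trial                   # accept the smaller budget
            else:
                break                             # keep the last budget that passed
    return budgets
```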
arXiv Detail & Related papers (2024-11-30T18:54:32Z)
- ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models [32.6661928486072]
ATP-LLaVA is a novel approach that adaptively determines instance-specific token pruning ratios for each Large Language Model layer.
Our approach reduces the average token count by 75% while maintaining performance, with only a minimal 1.9% degradation across seven widely used benchmarks.
arXiv Detail & Related papers (2024-11-30T11:42:35Z)
- FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens according to attention-based token significance derived from the vision encoder.
Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
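The sketch below shows one common way to realize attention-based significance pruning: score each patch token by the attention it receives from the vision encoder's [CLS] token and keep only the top-k. The CLS-attention scoring and the top-k rule are assumptions about the general recipe, not necessarily FoPru's exact scoring.

```python
# Hedged sketch of attention-based visual-token pruning: keep the patch tokens
# that receive the most [CLS] attention in the vision encoder's last layer.
import torch


def prune_by_cls_attention(
    tokens: torch.Tensor,      # (B, N, C) patch tokens from the vision encoder
    cls_attn: torch.Tensor,    # (B, H, N) last-layer attention from [CLS] to patches
    keep: int,                 # number of visual tokens to keep
) -> torch.Tensor:
    scores = cls_attn.mean(dim=1)                       # average over heads -> (B, N)
    top_idx = scores.topk(keep, dim=-1).indices         # most significant tokens
    top_idx, _ = top_idx.sort(dim=-1)                   # preserve spatial order
    idx = top_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return torch.gather(tokens, 1, idx)                 # (B, keep, C)
```

Keeping the surviving indices sorted preserves the original spatial order of the tokens, which matters when positional information is reused downstream.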
arXiv Detail & Related papers (2024-11-21T14:22:38Z)
- VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach that reduces vision compute by letting redundant vision tokens skip layers, rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
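A hedged sketch of this kind of token-level layer skipping is shown below: a lightweight router keeps only a fraction of vision tokens on the full attention/FFN path of a layer, while the rest ride the residual stream unchanged. The router design and capacity ratio are illustrative assumptions; the released VideoLLM-MoD may differ.

```python
# Hedged sketch of mixture-of-depths-style layer skipping for vision tokens.
# The router-weight scaling used for end-to-end training is omitted in this
# inference-style illustration.
import torch
import torch.nn as nn


class TokenDepthRouter(nn.Module):
    def __init__(self, dim: int, capacity: float = 0.25):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # per-token routing score
        self.capacity = capacity         # fraction of vision tokens given full compute

    def forward(self, x: torch.Tensor, layer: nn.Module) -> torch.Tensor:
        """x: (B, N, C) vision tokens; layer: a transformer block mapping (B, k, C) -> (B, k, C)."""
        B, N, C = x.shape
        k = max(1, int(N * self.capacity))
        weights = self.score(x).squeeze(-1)              # (B, N)
        top_idx = weights.topk(k, dim=-1).indices        # tokens that get full compute
        top_idx, _ = top_idx.sort(dim=-1)
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, C)
        selected = torch.gather(x, 1, gather_idx)        # (B, k, C)
        processed = layer(selected)                      # full attention + FFN
        out = x.clone()                                  # skipped tokens pass through unchanged
        out.scatter_(1, gather_idx, processed)           # write processed tokens back in place
        return out
```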
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects their different informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
- Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens.
We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens.
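The nesting itself can be illustrated by repeatedly average-pooling the visual-token grid, as in the sketch below (576 → 144 → 36 → 9 tokens for a 24×24 grid). The pooling scheme is an assumption about the general idea; M3 additionally trains the model to answer from any granularity.

```python
# Hedged sketch of building nested ("Matryoshka") sets of visual tokens by
# repeated 2x2 average pooling over the spatial token grid.
import torch
import torch.nn.functional as F


def nested_token_sets(tokens: torch.Tensor, levels: int = 3) -> list:
    """tokens: (B, N, C) with N a perfect square (e.g. 24 * 24 = 576)."""
    B, N, C = tokens.shape
    side = int(N ** 0.5)
    grid = tokens.transpose(1, 2).reshape(B, C, side, side)  # (B, C, H, W)
    sets = [tokens]
    for _ in range(levels):
        grid = F.avg_pool2d(grid, kernel_size=2)              # halve each spatial side
        sets.append(grid.flatten(2).transpose(1, 2))          # (B, H*W, C)
    return sets                                               # e.g. [576, 144, 36, 9] tokens
```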
arXiv Detail & Related papers (2024-05-27T17:59:56Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
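A minimal sketch of the kind of rule FastV applies is shown below: after an early LLM layer, rank visual tokens by the attention they receive and keep only the top fraction for all deeper layers. The layer index, ratio, and helper name are assumptions for illustration, not a drop-in for the released implementation.

```python
# Hedged sketch of FastV-style pruning inside the LLM: after an early layer,
# drop the visual tokens that receive the least attention.
import torch


def fastv_keep_indices(
    attn: torch.Tensor,      # (B, H, Q, K_len) attention weights at the chosen layer
    vis_start: int,          # first visual-token position in the sequence
    vis_len: int,            # number of visual tokens
    drop_ratio: float = 0.5,
) -> torch.Tensor:
    # Average attention each key position receives, over heads and query positions.
    received = attn.mean(dim=1).mean(dim=1)                     # (B, K_len)
    vis_scores = received[:, vis_start:vis_start + vis_len]     # (B, vis_len)
    keep = vis_len - int(vis_len * drop_ratio)
    top = vis_scores.topk(keep, dim=-1).indices + vis_start     # absolute positions to keep
    return top.sort(dim=-1).values                              # kept visual-token positions
```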
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with 30% of vision tokens removed across 12 ViT layers, ELIP maintains comparable performance.
arXiv Detail & Related papers (2023-09-28T05:31:07Z)