Related papers: LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

URL: http://arxiv.org/abs/2403.15388v5
Date: Wed, 22 May 2024 20:50:37 GMT
Title: LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Authors: Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan,
Abstract summary: Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly. We propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs.
Score: 35.88374542519597
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically take in a fixed and large amount of visual tokens, such as the penultimate layer features in the CLIP visual encoder, as the prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly. However, due to the inherent design of the Transformer architecture, the computational costs of these models tend to increase quadratically with the number of input tokens. To tackle this problem, we explore a token reduction mechanism that identifies significant spatial redundancy among visual tokens. In response, we propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs. Specifically, to metric the importance of each token, we exploit the sparsity observed in the visual encoder, characterized by the sparse distribution of attention scores between the class token and visual tokens. This sparsity enables us to dynamically select the most crucial visual tokens to retain. Subsequently, we cluster the selected (unpruned) tokens based on their key similarity and merge them with the unpruned tokens, effectively supplementing and enhancing their informational content. Empirically, when applied to LLaVA-1.5, our approach can compress the visual tokens by 14 times on average, and achieve comparable performance across diverse visual question-answering and reasoning tasks. Code and checkpoints are at https://llava-prumerge.github.io/.

Related papers

Window Token Concatenation for Efficient Visual Large Language Models [59.6094005814282]
We propose Window Token Concatenation (WiCo) to reduce visual tokens in Visual Large Language Models (VLLMs) WiCo group diverse tokens into one, and thus obscure some fine details. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance compared with existing token reduction projectors.
arXiv Detail & Related papers (2025-04-05T02:32:58Z)
ToFu: Visual Tokens Reduction via Fusion for Multi-modal, Multi-patch, Multi-image Task [34.269081635534526]
We propose ToFu, a visual encoder-agnostic, training-free Token Fusion strategy for high-resolution, multi-image, tasks. We validate our approach on the well-established LLaVA-Interleave Bench, which covers challenging multi-image tasks.
arXiv Detail & Related papers (2025-03-06T14:00:59Z)
[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs [66.5266435598799]
Multi-language Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision tasks. However, their efficient deployment remains a substantial challenge due to high computational costs and memory requirements. We introduce a simple yet effective method for train-free visual compression, called VTC- compression.
arXiv Detail & Related papers (2024-12-08T05:29:39Z)
Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to their broader adoption. compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z)
FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder. Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
arXiv Detail & Related papers (2024-11-21T14:22:38Z)
Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. VLMs are often constrained by high latency during inference due to substantial compute required to process the large number of input tokens. We take some initial steps towards building approaches tailored for high token compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See [37.7015406019386]
Multimodal Large Language Models (MLLMs) treat visual tokens from visual encoders as text tokens. As token counts grow, the quadratic scaling of computation in LLMs introduces an efficiency bottleneck. In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA.
arXiv Detail & Related papers (2024-10-08T16:13:24Z)
AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity [85.44800864697464]
We introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction. We show that AVG-LLaVA achieves superior performance across 11 benchmarks, as well as significantly reduces the number of visual tokens and speeds up inference.
arXiv Detail & Related papers (2024-09-20T10:50:21Z)
Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction [6.467840081978855]
multimodal large language models (MM-LLMs) have achieved great success in many multimodal tasks, but their high computational costs limit their further promotion and application. We studied the visual tokens of MM-LLMs and designed a dynamic pruning algorithm to address this issue. Our proposed method can achieve performance that competes with the original performance when using an average of 22% of the original token quantity.
arXiv Detail & Related papers (2024-09-02T10:49:10Z)
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language. This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok) SeTok groups visual features into semantic units via a dynamic clustering algorithm. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z)
Visual Perception by Large Language Model's Weights [34.34876575183736]
We propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM's weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency.
arXiv Detail & Related papers (2024-05-30T17:59:47Z)
Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens. We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens.
arXiv Detail & Related papers (2024-05-27T17:59:56Z)
Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference. Our approach is inspired by two intriguing phenomena we have observed. Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
arXiv Detail & Related papers (2024-05-09T14:38:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.