LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models
- URL: http://arxiv.org/abs/2501.13652v2
- Date: Sun, 09 Mar 2025 21:32:52 GMT
- Title: LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models
- Authors: Yizheng Sun, Yanze Xin, Hao Li, Jingyuan Sun, Chenghua Lin, Riza Batista-Navarro
- Abstract summary: We introduce Language-Guided Vision Token Pruning (LVPruning) for MLLMs. LVPruning employs cross-attention modules to compute the importance of vision tokens based on their interaction with language tokens. Experiments show that LVPruning can effectively reduce up to 90% of vision tokens by the middle layer of LLaVA-1.5.
- Score: 18.489240454283834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal Large Language Models (MLLMs) have achieved remarkable success by integrating visual and textual modalities. However, they incur significant computational overhead due to the large number of vision tokens processed, limiting their practicality in resource-constrained environments. We introduce Language-Guided Vision Token Pruning (LVPruning) for MLLMs, an effective yet simple method that significantly reduces the computational burden while preserving model performance. LVPruning employs cross-attention modules to compute the importance of vision tokens based on their interaction with language tokens, determining which to prune. Importantly, LVPruning can be integrated without modifying the original MLLM parameters, which makes LVPruning simple to apply or remove. Our experiments show that LVPruning can effectively reduce up to 90% of vision tokens by the middle layer of LLaVA-1.5, resulting in a 62.1% decrease in inference Tera Floating-Point Operations (TFLOPs), with an average performance loss of just 0.45% across nine multi-modal benchmarks.
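A minimal PyTorch sketch of the mechanism the abstract describes is given below: each vision token is scored by its cross-attention with the language tokens, and only the highest-scoring tokens are kept. The projection layers, `keep_ratio`, and single-module scoring are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of language-guided vision token pruning (not the authors' code).
# The scoring layers below are new, lightweight parameters added alongside the
# frozen MLLM; the original MLLM weights stay untouched, as the abstract emphasizes.
import torch
import torch.nn.functional as F

def score_vision_tokens(vision_tokens, language_tokens, q_proj, k_proj):
    """Score each vision token by its cross-attention with the language tokens.

    vision_tokens:   (num_vision, dim)
    language_tokens: (num_lang, dim)
    """
    q = q_proj(language_tokens)                              # (num_lang, dim)
    k = k_proj(vision_tokens)                                 # (num_vision, dim)
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)    # (num_lang, num_vision)
    return attn.mean(dim=0)                                   # importance per vision token

def prune_vision_tokens(vision_tokens, language_tokens, q_proj, k_proj, keep_ratio=0.1):
    """Keep only the most language-relevant vision tokens (e.g. ~10% by the middle layers)."""
    scores = score_vision_tokens(vision_tokens, language_tokens, q_proj, k_proj)
    num_keep = max(1, int(keep_ratio * vision_tokens.shape[0]))
    keep_idx = scores.topk(num_keep).indices.sort().values    # preserve original token order
    return vision_tokens[keep_idx], keep_idx

# Toy usage: 576 vision tokens (LLaVA-1.5 resolution), 32 language tokens, hidden dim 4096.
dim = 4096
q_proj, k_proj = torch.nn.Linear(dim, dim), torch.nn.Linear(dim, dim)
vision, language = torch.randn(576, dim), torch.randn(32, dim)
kept, idx = prune_vision_tokens(vision, language, q_proj, k_proj, keep_ratio=0.1)
print(kept.shape)  # torch.Size([57, 4096])
```

Because the scoring modules are separate from the frozen MLLM, they can be attached or removed without changing the original parameters, which is the property the abstract highlights.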
Related papers
- Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers [45.233150828317164]
Large vision-language models (LVLMs) have demonstrated impressive capabilities in multi-modal understanding and reasoning. Recent efforts from natural language processing (NLP) have shown the effectiveness of layer pruning, offering a training-free compression solution. However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques remain effective in LVLMs.
arXiv Detail & Related papers (2025-07-31T09:17:53Z) - Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study [64.26593350748401]
Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs). We propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training.
arXiv Detail & Related papers (2025-07-28T11:57:52Z) - Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models [70.59376970630387]
This paper focuses on monolithic Multimodal Large Language Models (MLLMs). Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning.
arXiv Detail & Related papers (2025-07-16T18:31:23Z) - LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation [17.318287255400175]
We present LaVi, a novel LVLM that enables seamless and efficient vision-language fusion. Unlike dominant LVLMs that rely on visual token concatenation, LaVi bypasses long-context expansion. Compared to LLaVA-OV-7B, LaVi reduces FLOPs by 94.0%, improves inference speed by 3.1 times, and cuts memory usage in half.
arXiv Detail & Related papers (2025-06-20T02:25:33Z) - Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective [6.258220461022373]
Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational costs and inefficiency. We show that token compression is feasible at the input stage of the LLM with negligible performance loss. We propose to learn a mapping from the attention map of the first LLM layer to the explanation results, thereby avoiding the need for a full inference pass.
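The sketch below illustrates, under stated assumptions, one way such a mapping could look: a small learned head turns first-layer attention statistics into per-token importance scores that are used to drop visual tokens at the LLM input. The head shape, `keep_ratio`, and tensor layout are hypothetical, not taken from the paper.

```python
# Hedged sketch: map first-layer attention onto visual tokens to importance scores,
# then keep only the top tokens at the LLM input. All module shapes are assumptions.
import torch
import torch.nn as nn

class AttentionToImportance(nn.Module):
    """Predict per-visual-token importance from first-layer attention received by that token."""
    def __init__(self, num_heads):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(num_heads, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, attn_first_layer):
        # attn_first_layer: (heads, query_len, num_visual) attention onto visual tokens.
        per_token = attn_first_layer.mean(dim=1).T          # (num_visual, heads)
        return self.head(per_token).squeeze(-1)             # (num_visual,)

def compress(visual_tokens, attn_first_layer, scorer, keep_ratio=0.25):
    scores = scorer(attn_first_layer)
    keep = scores.topk(int(keep_ratio * visual_tokens.shape[0])).indices.sort().values
    return visual_tokens[keep]

# Toy usage: 576 visual tokens, 32 attention heads, 40 query tokens in the first layer.
scorer = AttentionToImportance(num_heads=32)
visual = torch.randn(576, 4096)
attn = torch.rand(32, 40, 576)
print(compress(visual, attn, scorer).shape)                 # torch.Size([144, 4096])
```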
arXiv Detail & Related papers (2025-06-01T17:44:16Z) - Liquid: Language Models are Scalable and Unified Multi-modal Generators [112.71734051183726]
Liquid is an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation.
Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration using a single large language model.
For the first time, Liquid uncovers a scaling law: the performance drop unavoidably brought by the unified training of visual and language tasks diminishes as the model size increases.
arXiv Detail & Related papers (2024-12-05T16:48:16Z) - VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models [63.27511432647797]
We propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes.
We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V.
arXiv Detail & Related papers (2024-12-02T18:58:25Z) - Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction [62.8375542401319]
Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone.
The number of vision tokens increases quadratically with image resolution, leading to huge computational costs.
We propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer from the shallow to the deep.
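A toy version of a greedy, shallow-to-deep search over per-layer vision-token budgets is sketched below; the halving schedule, the `evaluate` callback, and the accuracy tolerance are assumptions for illustration, not G-Search's exact procedure.

```python
# Illustrative greedy per-layer token-budget search in the spirit of G-Search.
def greedy_token_search(num_layers, full_budget, evaluate, tolerance=0.01):
    """Greedily shrink the vision-token budget layer by layer (shallow -> deep).

    evaluate(budgets) -> accuracy of the model when layer i keeps budgets[i] tokens.
    """
    budgets = [full_budget] * num_layers
    baseline = evaluate(budgets)
    for layer in range(num_layers):              # shallow to deep
        candidate = budgets[layer] // 2
        while candidate >= 1:
            trial = budgets.copy()
            trial[layer] = candidate
            if evaluate(trial) >= baseline - tolerance:
                budgets[layer] = candidate       # accept the smaller budget
                candidate //= 2                  # try to shrink further
            else:
                break                            # keep the last accepted budget
    return budgets

# Toy usage with a fake evaluator that only cares about the total token count.
fake_eval = lambda b: 1.0 if sum(b) > 2000 else 0.9
print(greedy_token_search(num_layers=32, full_budget=576, evaluate=fake_eval))
```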
arXiv Detail & Related papers (2024-11-30T18:54:32Z) - LLaVA-CoT: Let Vision Language Models Reason Step-by-Step [36.042551817732964]
We introduce LLaVA-CoT, a novel VLM designed to conduct autonomous multistage reasoning.
Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation.
With only 100k training samples and a simple yet effective inference-time scaling method, LLaVA-CoT outperforms its base model by 7.4% on a wide range of multimodal reasoning benchmarks.
arXiv Detail & Related papers (2024-11-15T18:58:31Z) - LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [72.02635550088546]
This work explores how large language models (LLMs) can enhance CLIP's capability, especially for processing longer and more complex image captions. We introduce a caption-to-caption contrastive fine-tuning framework, significantly enhancing the discriminative quality of LLM outputs. Our approach outperforms LoRA-based methods, achieving nearly fourfold faster training with superior performance.
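The snippet below sketches a generic caption-to-caption contrastive objective (symmetric InfoNCE over paired captions of the same image) to make the idea concrete; the temperature and loss form are standard defaults, not necessarily LLM2CLIP's exact recipe.

```python
# Generic symmetric contrastive loss between paired caption embeddings (an assumption,
# not LLM2CLIP's published objective).
import torch
import torch.nn.functional as F

def caption_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """emb_a, emb_b: (batch, dim) embeddings of two captions describing the same images."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.T / temperature             # (batch, batch) similarity matrix
    targets = torch.arange(a.shape[0])         # matched caption pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy usage with random embeddings.
print(caption_contrastive_loss(torch.randn(8, 1024), torch.randn(8, 1024)).item())
```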
arXiv Detail & Related papers (2024-11-07T18:59:16Z) - Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z) - LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from a large MLLM (l-MLLM) to a small MLLM (s-MLLM).
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of the l-MLLM and the s-MLLM (a minimal loss sketch is given below).
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
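Below is a minimal sketch of the kind of output-distribution distillation loss that MDist describes: a KL divergence between teacher and student next-token distributions over the visual-textual sequence. The temperature, reduction, and scaling are common distillation defaults rather than LLaVA-KD's published settings.

```python
# Hedged sketch of output-distribution distillation (KL between teacher and student).
import torch
import torch.nn.functional as F

def mdist_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over every visual-textual output position.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    """
    t = temperature
    teacher_prob = F.softmax(teacher_logits / t, dim=-1)
    student_logprob = F.log_softmax(student_logits / t, dim=-1)
    # batchmean reduction, scaled by t^2 as is conventional for distillation losses.
    return F.kl_div(student_logprob, teacher_prob, reduction="batchmean") * t * t

# Toy usage: teacher and student logits over 8 positions and a 32k vocabulary.
teacher = torch.randn(2, 8, 32000)
student = torch.randn(2, 8, 32000)
print(mdist_loss(student, teacher).item())
```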
arXiv Detail & Related papers (2024-10-21T17:41:28Z) - Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training [48.455597568212944]
We present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure.
In particular, its Endogenous Visual Pre-training (EViP) is designed as a progressive learning process for the visual experts, aiming to fully exploit visual knowledge from noisy data to high-quality data.
arXiv Detail & Related papers (2024-10-10T17:59:22Z) - Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See [37.7015406019386]
Multimodal Large Language Models (MLLMs) treat visual tokens from visual encoders as text tokens.
As token counts grow, the quadratic scaling of computation in LLMs introduces an efficiency bottleneck.
In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA.
arXiv Detail & Related papers (2024-10-08T16:13:24Z) - EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model [14.767055057048855]
We introduce the Data-Efficient and Compute-Efficient Multimodal Large Language Model (EE-MLLM).
EE-MLLM achieves both data and compute efficiency without introducing additional modules or learnable parameters.
Experimental results demonstrate the effectiveness of EE-MLLM across a range of benchmarks.
arXiv Detail & Related papers (2024-08-21T17:36:37Z) - MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [49.32669226551026]
We propose a simple yet effective training strategy, MoE-Tuning, for LVLMs. MoE-LLaVA, a MoE-based sparse LVLM architecture, uniquely activates only the top-k experts through routers. Experiments show the strong performance of MoE-LLaVA across a variety of visual understanding and object hallucination benchmarks.
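A generic top-k mixture-of-experts layer, sketched below, shows the routing behavior the summary refers to: a router scores experts per token and only the top-k experts are activated. The expert MLP shape, expert count, and renormalization are illustrative choices, not MoE-LLaVA's exact configuration.

```python
# Minimal top-k mixture-of-experts routing sketch (generic, not MoE-LLaVA's exact layer).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, dim, num_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                         # x: (num_tokens, dim)
        gate = F.softmax(self.router(x), dim=-1)  # (num_tokens, num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e             # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Toy usage: 10 tokens with hidden dim 64.
moe = TopKMoE(dim=64)
print(moe(torch.randn(10, 64)).shape)             # torch.Size([10, 64])
```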
arXiv Detail & Related papers (2024-01-29T08:13:40Z) - EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning [19.354515754130592]
We introduce a distilling then pruning framework to compress large vision-language models into smaller, faster, and more accurate ones.
We apply our framework to train EfficientVLM, a fast and accurate vision-language model consisting of 6 vision layers, 3 text layers, and 3 cross-modal fusion layers.
EfficientVLM retains 98.4% performance of the teacher model and accelerates its inference speed by 2.2x.
arXiv Detail & Related papers (2022-10-14T13:26:41Z)