Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
- URL: http://arxiv.org/abs/2409.10197v2
- Date: Wed, 25 Dec 2024 04:30:16 GMT
- Title: Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
- Authors: Weihao Ye, Qiong Wu, Wenhao Lin, Yiyi Zhou,
- Abstract summary: Token pruning is an effective solution for speeding up MLLMs, but when and how to drop tokens still remains a challenge.
We propose a novel and training-free approach for the effective visual token pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning recipe for MLLMs according to a pre-defined budget.
- Score: 10.740051410590553
- Abstract: Recent progress in Multimodal Large Language Models (MLLMs) often relies on large numbers of image tokens to compensate for the visual shortcomings of MLLMs, which not only exhibits obvious redundancy but also greatly exacerbates the already high computational cost. Token pruning is an effective solution for speeding up MLLMs, but when and how to drop tokens remain a challenge. In this paper, we propose a novel and training-free approach for effective visual token pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning recipe for an MLLM according to a pre-defined budget. Specifically, FitPrune casts token pruning as a statistical problem of the MLLM, whose objective is to find an optimal pruning scheme that minimizes the divergence between the attention distributions before and after pruning. In practice, FitPrune can be accomplished quickly based on attention statistics from a small batch of inference data, avoiding expensive trials of the MLLM. According to the pruning recipe, the MLLM can directly remove the redundant visual tokens of different examples during inference. To validate FitPrune, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct extensive experiments on a set of benchmarks. The experimental results show that FitPrune can substantially reduce computational complexity while retaining high performance, e.g., a 54.9% FLOPs reduction for LLaVA-NEXT with only a 0.5% accuracy drop. Notably, the pruning recipe can be obtained in about 5 minutes. Our code is available at https://github.com/ywh187/FitPrune.
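To make the recipe-fitting step concrete, here is a minimal, hypothetical sketch of the idea (not the authors' released implementation; the function name and the greedy attention-mass criterion are illustrative stand-ins for the divergence minimization described above):

```python
import torch

def fitprune_recipe(attn_stats, keep_ratio):
    """Derive a per-layer keep-count recipe from attention statistics.

    attn_stats: one tensor per layer, shape [seq_len, n_visual], holding the
        average attention each visual token receives on a small calibration batch.
    keep_ratio: fraction of visual-token slots to keep overall.
    """
    # Importance of visual token v at layer l: total attention mass it receives.
    importance = torch.stack([a.sum(dim=0) for a in attn_stats])  # [L, V]
    n_layers, n_visual = importance.shape
    budget = int(keep_ratio * n_layers * n_visual)  # token-slots kept in total
    # Keep the globally highest-scoring (layer, token) slots; the per-layer
    # counts that fall out of this are the "pruning recipe".
    threshold = importance.flatten().topk(budget).values.min()
    return (importance >= threshold).sum(dim=1).tolist()

# Toy demo: 4 layers, a 32-token sequence attending over 16 visual tokens.
stats = [torch.rand(32, 16).softmax(dim=-1) for _ in range(4)]
print(fitprune_recipe(stats, keep_ratio=0.5))
```

At inference time, each layer would then keep only its budgeted number of visual tokens, ranked by the same attention score, so a recipe computed once applies across examples.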
Related papers
- Towards Efficient Automatic Self-Pruning of Large Language Models [55.90119819642064]
Post-training structured pruning is a promising solution that prunes Large Language Models without the need for retraining.
We argue that the key to mitigating this issue lies in accurately determining the pruning rate for each layer.
We introduce Self-Pruner, an end-to-end automatic self-pruning framework for LLMs that efficiently searches for layer-wise pruning rates. A deliberately simple search sketch is given below.
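Everything in this sketch (the names, the random strategy, and the toy `evaluate`) is an illustrative assumption, not the paper's actual search algorithm; it only shows what "searching layer-wise pruning rates" against a proxy loss looks like:

```python
import random

def search_layer_rates(n_layers, target_rate, evaluate, n_trials=100):
    """Search per-layer pruning rates whose average hits a global target.

    evaluate(rates) -> float: proxy loss (e.g., calibration-set perplexity
    of the model pruned with the candidate rates); lower is better.
    """
    best_rates, best_loss = None, float("inf")
    for _ in range(n_trials):
        rates = [random.uniform(0.0, min(1.0, 2 * target_rate)) for _ in range(n_layers)]
        scale = target_rate * n_layers / (sum(rates) + 1e-8)  # enforce the budget
        rates = [min(1.0, r * scale) for r in rates]
        loss = evaluate(rates)
        if loss < best_loss:
            best_rates, best_loss = rates, loss
    return best_rates

# Toy proxy: pretend later layers tolerate more pruning than earlier ones.
proxy = lambda rates: sum(r * (len(rates) - i) for i, r in enumerate(rates))
print(search_layer_rates(n_layers=4, target_rate=0.3, evaluate=proxy))
```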
arXiv Detail & Related papers (2025-02-20T09:59:50Z)
- What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph [15.364317811275344]
We propose a graph-based method for training-free visual token pruning, termed G-Prune.
G-Prune regards visual tokens as nodes and constructs their connections based on their semantic similarities.
Experimental results show that G-Prune can greatly reduce computation overhead while retaining high performance on both coarse- and fine-grained tasks.
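A minimal sketch of that graph view, assuming cosine similarity for the edges and a PageRank-style propagation to score nodes; the propagation rule and hyper-parameters here are illustrative, not necessarily the paper's:

```python
import torch
import torch.nn.functional as F

def g_prune_sketch(tokens, keep_ratio=0.5, n_iters=10, tau=0.1):
    """tokens: [N, D] visual token features; returns the kept tokens."""
    x = F.normalize(tokens, dim=-1)
    sim = x @ x.t()                          # cosine similarity, [N, N]
    adj = torch.softmax(sim / tau, dim=-1)   # row-normalized edge weights
    score = torch.full((tokens.size(0),), 1.0 / tokens.size(0))
    for _ in range(n_iters):
        score = adj.t() @ score              # propagate importance along edges
    keep = score.topk(int(keep_ratio * tokens.size(0))).indices
    return tokens[keep], keep

feats = torch.randn(16, 64)                  # 16 visual tokens, 64-dim features
kept, idx = g_prune_sketch(feats, keep_ratio=0.5)
```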
arXiv Detail & Related papers (2025-01-04T12:14:42Z)
- Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
VLMs are often constrained by high latency during inference due to the substantial compute required to process the large number of input tokens.
We take some initial steps towards building approaches tailored for high token compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
- Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) due to their massive parameter counts and computational demands.
Post-training pruning methods have been proposed to prune LLMs in one shot without retraining.
Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for Large Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs in 2.7 hours with around 35GB of memory for 13B models on a single A100 GPU.
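The mechanics can be pictured as a tiny REINFORCE loop over Bernoulli masks; the toy loss, sizes, and hyper-parameters below are illustrative assumptions, not the paper's formulation:

```python
import torch

# Learnable logits over 64 hypothetical prunable structures (e.g., heads).
logits = torch.zeros(64, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.05)

def pruned_model_loss(mask):
    # Stand-in for a forward pass of the masked model. No gradients are
    # needed here, which is the point: the LLM is never back-propagated through.
    target = (torch.arange(64) < 32).float()  # pretend the first half matters
    return (mask != target).float().mean()

for step in range(200):
    dist = torch.distributions.Bernoulli(torch.sigmoid(logits))
    mask = dist.sample()                      # sample a pruning pattern
    with torch.no_grad():
        loss = pruned_model_loss(mask)
    # REINFORCE: scale the log-probability of the sampled mask by its loss,
    # so low-loss masks become more likely under the learned distribution.
    (loss * dist.log_prob(mask).sum()).backward()
    opt.step(); opt.zero_grad()
```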
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods [5.135352292810664]
We show that simple depth pruning can effectively compress large language models (LLMs).
Our pruning method boosts inference speeds, especially under memory-constrained conditions.
We hope this work can help build compact yet capable LLMs.
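In code, depth pruning amounts to dropping whole transformer blocks. A minimal sketch, assuming per-block importance scores are already available (e.g., the perplexity increase when a block is skipped, one common proxy, not necessarily the paper's exact criterion):

```python
import torch.nn as nn

def depth_prune(layers: nn.ModuleList, importance, n_keep: int) -> nn.ModuleList:
    """Keep the n_keep most important blocks, preserving their order."""
    ranked = sorted(range(len(layers)), key=lambda i: importance[i], reverse=True)
    kept_indices = sorted(ranked[:n_keep])
    return nn.ModuleList(layers[i] for i in kept_indices)

# Toy demo: 8 identity "blocks" with made-up importance scores.
blocks = nn.ModuleList(nn.Identity() for _ in range(8))
pruned = depth_prune(blocks, importance=[3, 1, 4, 1, 5, 9, 2, 6], n_keep=4)
print(len(pruned))  # 4
```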
arXiv Detail & Related papers (2024-02-05T09:44:49Z)
- A Simple and Effective Pruning Approach for Large Language Models [58.716255689941896]
Large Language Models (LLMs) are natural candidates for network pruning methods.
Existing methods, however, require either retraining or solving a weight reconstruction problem that relies on second-order information.
We introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs.
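The criterion itself is compact: each weight is scored by its magnitude times the L2 norm of its input activation (estimated on a small calibration set), and the lowest-scoring weights within each output row are zeroed, with no retraining or weight reconstruction. A short sketch with illustrative names:

```python
import torch

def wanda_prune(weight, acts, sparsity=0.5):
    """weight: [out, in] linear weight; acts: [n_samples, in] calibration inputs."""
    norm = acts.norm(p=2, dim=0)             # per-input-channel L2 norm
    score = weight.abs() * norm              # Wanda score |W_ij| * ||X_j||_2
    k = int(sparsity * weight.shape[1])
    prune_idx = score.argsort(dim=1)[:, :k]  # lowest scores per output row
    pruned = weight.clone()
    pruned.scatter_(1, prune_idx, 0.0)       # zero them out, no retraining
    return pruned

W = torch.randn(8, 16)
X = torch.randn(128, 16)
print((wanda_prune(W, X) == 0).float().mean())  # ~0.5 sparsity
```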
arXiv Detail & Related papers (2023-06-20T17:18:20Z)
- LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaptation (LoRA) has emerged as a popular approach for fine-tuning large language models (LLMs).
LoRAPrune is a new framework that delivers an accurate, structurally pruned model in a highly memory-efficient manner.
LoRAPrune achieves a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
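One way to read "structured pruning meets LoRA" in code: importance is estimated from gradients that flow only through the LoRA factors, so the frozen base weight's gradient is never materialized, which is where the memory savings come from. The first-order criterion below is a hedged sketch under our own naming, not the paper's verbatim algorithm:

```python
import torch

def lora_guided_importance(W, A, B):
    """W: [out, in] frozen base weight; A: [r, in], B: [out, r] LoRA factors
    whose .grad fields were populated by a backward pass."""
    merged = W + B @ A
    # Only A and B receive gradients, so B.grad @ A + B @ A.grad approximates
    # the merged weight's gradient; score each weight by |w * dw| (first-order Taylor).
    approx_grad = B.grad @ A + B @ A.grad
    return (merged * approx_grad).abs()

# Toy demo: random factors with fake gradients attached.
W = torch.randn(8, 16)
A = torch.randn(4, 16, requires_grad=True); A.grad = torch.randn_like(A)
B = torch.randn(8, 4, requires_grad=True);  B.grad = torch.randn_like(B)
scores = lora_guided_importance(W, A, B)  # [8, 16]; prune lowest-scoring groups
```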
arXiv Detail & Related papers (2023-05-28T15:15:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.