Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study
- URL: http://arxiv.org/abs/2507.20749v1
- Date: Mon, 28 Jul 2025 11:57:52 GMT
- Title: Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study
- Authors: Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata
- Abstract summary: Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs). We propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training.
- Score: 64.26593350748401
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose significant barriers to practical deployment. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs), but these methods offer limited flexibility and remain computationally intensive. To address this gap, we propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training. Specifically, we investigate two structural pruning paradigms, layerwise and widthwise pruning, applied to the language model backbone of MLLMs, alongside supervised finetuning and knowledge distillation. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios with limited computational resources or insufficient finetuning data. As for recovery training, finetuning only the multimodal projector is sufficient at small compression levels (< 20%). Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved with as little as 5% of the original training data while retaining over 95% of the original performance. Through an empirical study of two representative MLLMs, LLaVA-v1.5-7B and Bunny-v1.0-3B, this work offers actionable insights for practitioners aiming to compress MLLMs effectively without extensive computational resources or large amounts of data.
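Since the abstract spells out the recipe (structural pruning of the language backbone, then recovery via supervised finetuning plus hidden-state distillation from the unpruned model), a minimal sketch may help make it concrete. The code below is an illustration only, not the authors' implementation: it assumes a Hugging Face-style decoder layout (`model.model.layers`, `output_hidden_states`), and `drop_indices` and `alpha` are hypothetical inputs.

```python
import torch
import torch.nn.functional as F


def layerwise_prune(model, drop_indices):
    """Layerwise pruning sketch: drop whole decoder layers from the
    language backbone. In practice `drop_indices` would come from a
    layer-importance criterion; here it is simply an input."""
    keep = [layer for i, layer in enumerate(model.model.layers)
            if i not in set(drop_indices)]  # assumed HF-style module layout
    model.model.layers = torch.nn.ModuleList(keep)
    model.config.num_hidden_layers = len(keep)
    return model


def recovery_loss(student, teacher, batch, alpha=1.0):
    """Recovery objective sketch: supervised finetuning (next-token
    cross-entropy) combined with hidden-state distillation from the
    original, unpruned teacher. `alpha` is an assumed loss weight."""
    out_s = student(**batch, output_hidden_states=True)  # batch includes labels
    with torch.no_grad():
        out_t = teacher(**batch, output_hidden_states=True)

    # Match the final hidden states of student and teacher. With
    # widthwise pruning the hidden size can differ, in which case a
    # small learned projection on the student side would be required
    # (omitted here for brevity).
    distill = F.mse_loss(out_s.hidden_states[-1], out_t.hidden_states[-1])
    return out_s.loss + alpha * distill
```

At small compression levels (below 20%), the abstract indicates that freezing the pruned backbone and finetuning only the multimodal projector already suffices, which would remove most trainable parameters from the recovery step sketched above.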
Related papers
- SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z) - Efficient Multitask Learning in Small Language Models Through Upside-Down Reinforcement Learning [8.995427413172148]
Small language models (SLMs) can achieve competitive performance in multitask prompt generation tasks. We train an SLM that achieves relevance scores within 5% of state-of-the-art models, including Llama-3, Qwen2, and Mistral, despite being up to 80 times smaller.
arXiv Detail & Related papers (2025-02-14T01:39:45Z) - Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities. However, LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands. We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z) - Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs [22.177654792824896]
We focus on small-sized language models (3B to 7B parameters) for their cost-efficiency and accessibility. We explore various training configurations and strategies across four open-source pre-trained models. Key insights from our work include: (i) larger batch sizes paired with lower learning rates lead to improved model performance on benchmarks such as MMLU, MTBench, and Open LLM Leaderboard; (ii) early-stage training dynamics, such as lower gradient norms and higher loss values, are strong indicators of better final model performance; and (iv) we observed no significant difference in performance between phased and stacked training strategies.
arXiv Detail & Related papers (2024-12-17T21:16:59Z) - LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [72.68665884790002]
We propose a novel framework to transfer knowledge from large MLLMs (l-MLLMs) to small MLLMs (s-MLLMs). We introduce Multimodal Distillation (MDist) to transfer the teacher model's robust representations across both visual and linguistic modalities. We also propose a three-stage training scheme to fully exploit the potential of the proposed distillation strategy.
arXiv Detail & Related papers (2024-10-21T17:41:28Z) - Pruning Large Language Models with Semi-Structural Adaptive Sparse Training [17.381160429641316]
Adaptive Sparse Trainer (AST) is a novel and efficient retraining framework tailored for semi-structured sparse models. AST reduces the perplexity and zero-shot accuracy gap between dense and 2:4 semi-structured sparse models to 0.6 and 1.16%, respectively.
arXiv Detail & Related papers (2024-07-30T06:33:44Z) - MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies [85.57899012821211]
Small Language Models (SLMs) are a resource-efficient alternative to Large Language Models (LLMs).
We introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants.
We also introduce MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K.
arXiv Detail & Related papers (2024-04-09T15:36:50Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.