Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers
- URL: http://arxiv.org/abs/2507.23362v1
- Date: Thu, 31 Jul 2025 09:17:53 GMT
- Title: Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers
- Authors: Ji Ma, Wei Suo, Peng Wang, Yanning Zhang
- Abstract summary: Large vision-language models (LVLMs) have demonstrated impressive capabilities in multi-modal understanding and reasoning. Recent efforts from natural language processing (NLP) have shown the effectiveness of layer pruning, offering a training-free compression solution. However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques are still effective in LVLMs.
- Score: 45.233150828317164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although large vision-language models (LVLMs) have demonstrated impressive capabilities in multi-modal understanding and reasoning, their practical applications are still limited by massive model parameters and high computational costs. Recent efforts from natural language processing (NLP) have shown the effectiveness of layer pruning, offering a plausible training-free compression solution. However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques are still effective in LVLMs. In this paper, we empirically prove that directly applying these layer pruning methods to LVLMs is ineffective. Through extensive experiments, we find that non-essential vision-language (VL) tokens and inter-layer feature gaps pose critical challenges to pruning layers in LVLMs. Based on these insights, we propose a novel framework Short-LVLM (SVL) that can utilize important VL tokens and mitigate the layer-wise feature gaps. Notably, Short-LVLM not only achieves a superior trade-off between performance and efficiency but also exhibits several potential advantages, i.e., training-free, model-agnostic, and highly compatible. The code for this work is publicly available at https://github.com/ASGO-MM/Short-LVLM.
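To make the layer-pruning idea concrete, here is a minimal sketch in the spirit of the NLP methods the abstract refers to (e.g., ShortGPT-style pruning): a layer whose output is nearly identical to its input changes the representation little and is treated as redundant. The function names and scoring rule are illustrative assumptions, not the released Short-LVLM code; see the repository above for the actual implementation.

```python
import torch
import torch.nn.functional as F

def layer_redundancy_scores(hidden_states):
    """Score each layer by the cosine similarity between its input and output
    hidden states, averaged over tokens. Higher similarity = more redundant.

    hidden_states: list of [num_tokens, dim] tensors, one per layer boundary
    (e.g., from a HuggingFace model called with output_hidden_states=True).
    """
    scores = []
    for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:]):
        scores.append(F.cosine_similarity(h_in, h_out, dim=-1).mean().item())
    return scores

def prune_layers(layers, scores, num_to_prune):
    """Drop the num_to_prune most redundant layers (highest similarity)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    drop = set(ranked[:num_to_prune])
    return [layer for i, layer in enumerate(layers) if i not in drop]

# Toy usage with random features standing in for real hidden states:
hs = [torch.randn(10, 64) for _ in range(6)]  # boundaries of 5 "layers"
print(layer_redundancy_scores(hs))            # one score per layer
```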
Related papers
- MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation [24.200547898713126]
Multimodal Large Language Models (MLLMs) excel in understanding complex language and visual data. Their real-world deployment is hindered by substantial computational and storage demands. We propose a Mixture-of-Layers Vision-Language-Action model (MoLe) architecture for dynamic LLM layer activation.
arXiv Detail & Related papers (2025-03-26T10:05:38Z)
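For intuition, dynamic layer activation as summarized above might be realized roughly as follows: a lightweight router reads the hidden state and decides, per layer, whether to execute it or fall through to the identity. This is a hypothetical sketch of the mixture-of-layers idea, not the MoLe-VLA implementation; the class and routing rule are assumptions.

```python
import torch
import torch.nn as nn

class LayerSkippingBlock(nn.Module):
    """Wraps a transformer layer with a learned gate that can skip it
    (hypothetical sketch of dynamic layer activation, not MoLe-VLA code)."""

    def __init__(self, layer, dim):
        super().__init__()
        self.layer = layer                 # the wrapped transformer layer
        self.router = nn.Linear(dim, 1)    # scores whether to run the layer

    def forward(self, x):                  # x: [batch, seq, dim]
        gate = torch.sigmoid(self.router(x.mean(dim=1)))   # [batch, 1]
        if self.training:
            # Soft mixture keeps the skip decision differentiable in training.
            g = gate.unsqueeze(-1)                          # [batch, 1, 1]
            return g * self.layer(x) + (1 - g) * x
        # Hard skip at inference (assumes batch size 1 for simplicity).
        return self.layer(x) if gate.item() > 0.5 else x
```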
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [95.78870389271832]
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. We propose OLA-VLM, the first approach to distill knowledge into the LLM's hidden representations from a set of target visual representations. We show that OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
arXiv Detail & Related papers (2024-12-12T18:55:18Z)
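The auxiliary embedding distillation described above can be pictured as an extra loss that pulls selected LLM hidden states toward features from target visual encoders via a small projection head. The sketch below is an assumed formulation (cosine distance through a linear probe), not the exact OLA-VLM objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingDistillationHead(nn.Module):
    """Projects LLM hidden states into a target visual-feature space and
    penalizes their distance to a frozen teacher's features (sketch only)."""

    def __init__(self, llm_dim, teacher_dim):
        super().__init__()
        self.proj = nn.Linear(llm_dim, teacher_dim)

    def forward(self, llm_hidden, teacher_feats):
        # llm_hidden: [batch, tokens, llm_dim]
        # teacher_feats: [batch, tokens, teacher_dim], from a frozen encoder
        student = F.normalize(self.proj(llm_hidden), dim=-1)
        teacher = F.normalize(teacher_feats, dim=-1)
        # 1 - cosine similarity, averaged over tokens: an auxiliary loss added
        # to the usual next-token language-modeling loss.
        return (1 - (student * teacher).sum(dim=-1)).mean()
```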
- Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD) [13.430637580980164]
Large Vision-Language Models (LVLMs) are an extension of Large Language Models (LLMs) that facilitate processing both image and text inputs, expanding AI capabilities.
Our study introduces a Language-Contrastive Decoding (LCD) algorithm that adjusts LVLM outputs based on the confidence levels of the Large Language Model's output distribution.
Our method effectively improves LVLMs without needing complex post-processing or retraining, and is easily applicable to different models.
arXiv Detail & Related papers (2024-08-06T08:10:34Z)
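A contrastive-decoding step of this kind typically combines two next-token distributions: one conditioned on the image and one from the text-only LLM, penalizing tokens the language prior alone would favor. The sketch below uses a fixed contrast weight `alpha`; the paper instead derives the adjustment from the LLM's confidence, so treat this as an illustrative approximation.

```python
import torch
import torch.nn.functional as F

def language_contrastive_logits(vl_logits, lm_logits, alpha=1.0):
    """Suppress tokens favored purely by the language prior (sketch).

    vl_logits: [vocab] logits from the LVLM given image + text.
    lm_logits: [vocab] logits from the text-only LLM.
    alpha:     contrast strength; fixed here, confidence-adaptive in the paper.
    """
    log_p_vl = F.log_softmax(vl_logits, dim=-1)
    log_p_lm = F.log_softmax(lm_logits, dim=-1)
    return log_p_vl - alpha * log_p_lm

# One greedy decoding step under the contrastive objective:
# next_token = torch.argmax(language_contrastive_logits(vl, lm), dim=-1)
```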
- Large Visual-Language Models Are Also Good Classifiers: A Study of In-Context Multimodal Fake News Detection [0.18416014644193068]
This paper first assesses the fake news detection (FND) capabilities of two notable LVLMs, CogVLM and GPT4V, in comparison to a smaller yet adeptly trained CLIP model in a zero-shot context. Next, we integrate standard in-context learning (ICL) with LVLMs, noting improvements in FND performance, though limited in scope and consistency. We introduce the In-context Multimodal Fake News Detection framework.
arXiv Detail & Related papers (2024-07-16T09:28:23Z)
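Standard in-context learning here just means prepending a handful of labeled image-text posts to the query before asking the LVLM to classify it. The helper below is a hypothetical sketch of such prompt assembly; the `<image>` placeholder and the prompt wording are assumptions, since each model has its own image-token convention.

```python
def build_icl_prompt(demos, query_text, image_token="<image>"):
    """Assemble a few-shot fake-news-detection prompt (hypothetical sketch).

    demos: list of (caption, label) pairs; each is paired with an image the
           model receives separately. Labels are "real" or "fake".
    """
    parts = ["Decide whether each news post is real or fake."]
    for caption, label in demos:
        parts.append(f"{image_token} Post: {caption}\nAnswer: {label}")
    parts.append(f"{image_token} Post: {query_text}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_icl_prompt(
    demos=[("Aliens land in Paris", "fake"), ("Local team wins final", "real")],
    query_text="Government confirms new holiday",
)
print(prompt)
```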
- LM4LV: A Frozen Large Language Model for Low-level Vision Tasks [25.3601306724822]
LM4LV is a framework that enables a large language model to solve a range of low-level vision tasks without any multi-modal data or prior.
This showcases the LLM's strong potential in low-level vision and bridges the gap between MLLMs and low-level vision tasks.
arXiv Detail & Related papers (2024-05-24T17:25:00Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
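One plausible reading of "spatial-temporal sequence modeling inside LLM" is that per-frame visual tokens are kept in temporal order and concatenated into a single sequence the LLM attends over. The sketch below shows only that token layout, under that assumption; it is not the ST-LLM architecture.

```python
import torch

def flatten_video_tokens(frame_feats):
    """Arrange per-frame visual tokens into one temporal sequence for the LLM.

    frame_feats: [num_frames, tokens_per_frame, dim] features from a frozen
    image encoder. Returns [num_frames * tokens_per_frame, dim], with frame
    order preserved so temporal structure survives in the token positions.
    """
    t, n, d = frame_feats.shape
    return frame_feats.reshape(t * n, d)

video = torch.randn(8, 16, 1024)          # 8 frames, 16 tokens each (toy sizes)
llm_input = flatten_video_tokens(video)   # [128, 1024], prepended to text tokens
```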
- Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance [51.30560006045442]
Image-gRounded guIdaNcE (MARINE) is a framework that is both training-free and API-free. MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs. Our framework's flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance.
arXiv Detail & Related papers (2024-02-13T18:59:05Z)
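Training-free, image-grounded guidance is often implemented in the style of classifier-free guidance over output logits: contrast generation conditioned on object-level visual evidence with generation without it. The sketch below follows that generic pattern and is an assumption, not MARINE's exact formulation.

```python
import torch
import torch.nn.functional as F

def image_grounded_logits(cond_logits, uncond_logits, guidance=1.5):
    """Classifier-free-guidance-style logit combination (illustrative sketch).

    cond_logits:   [vocab] logits when the prompt includes object-level
                   evidence extracted by a vision model (e.g., detected tags).
    uncond_logits: [vocab] logits without that evidence.
    guidance:      values > 1 strengthen the pull toward grounded content.
    """
    log_p_cond = F.log_softmax(cond_logits, dim=-1)
    log_p_uncond = F.log_softmax(uncond_logits, dim=-1)
    return log_p_uncond + guidance * (log_p_cond - log_p_uncond)
```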
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [49.32669226551026]
We propose MoE-Tuning, a simple yet effective training strategy for LVLMs. MoE-LLaVA, a MoE-based sparse LVLM architecture, uniquely activates only the top-k experts through routers. Experiments show the significant performance of MoE-LLaVA on a variety of visual understanding and object hallucination benchmarks.
arXiv Detail & Related papers (2024-01-29T08:13:40Z)
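The top-k expert activation mentioned above follows the standard sparse mixture-of-experts pattern: a router scores all experts per token, only the k highest-scoring experts run, and their outputs are mixed by renormalized router weights. A minimal sketch under those assumptions (not the MoE-LLaVA code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparse MoE layer: each token is routed to its top-k experts and the
    expert outputs are mixed by the renormalized router weights."""

    def __init__(self, dim, num_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):                       # x: [tokens, dim]
        weights = F.softmax(self.router(x), dim=-1)          # [tokens, experts]
        top_w, top_idx = weights.topk(self.k, dim=-1)        # [tokens, k]
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)      # renormalize
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e    # tokens whose slot-th pick is e
                if mask.any():
                    out[mask] += top_w[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE(dim=64)
y = moe(torch.randn(32, 64))                    # [32, 64]
```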
- Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models [77.2078051555533]
We propose a novel and affordable solution for the effective VL adaptation of large language models (LLMs), termed Mixture-of-Modality Adaptation (MMA).
Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters.
MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions.
arXiv Detail & Related papers (2023-05-24T11:06:15Z)
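Lightweight adapter-based adaptation with a routing step might look like the following: small residual bottleneck adapters are inserted into the frozen LLM, and a learned router mixes a text-oriented and a multimodal adapter depending on the input. This is an illustrative guess at the design, not the paper's released MMA code; all class names are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck inserted alongside a frozen LLM layer."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class RoutedAdapters(nn.Module):
    """Mixes a text adapter and a multimodal adapter via a learned router
    (hypothetical sketch of MMA-style single/multi-modal switching)."""

    def __init__(self, dim):
        super().__init__()
        self.text_adapter = BottleneckAdapter(dim)
        self.mm_adapter = BottleneckAdapter(dim)
        self.router = nn.Linear(dim, 2)

    def forward(self, x):                       # x: [batch, tokens, dim]
        w = torch.softmax(self.router(x.mean(dim=1)), dim=-1)  # [batch, 2]
        w = w[:, :, None, None]                 # broadcast over tokens and dim
        return w[:, 0] * self.text_adapter(x) + w[:, 1] * self.mm_adapter(x)
```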
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.