INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models
- URL: http://arxiv.org/abs/2511.19676v1
- Date: Mon, 24 Nov 2025 20:24:22 GMT
- Title: INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models
- Authors: Parsa Madinei, Ryan Solgi, Ziqi Wen, Jonathan Skaza, Miguel Eckstein, Ramtin Pedarsani
- Abstract summary: We introduce INTERLACE, a novel framework that prunes redundant layers in VLMs while maintaining performance through sample-efficient finetuning. We analyze triplets of consecutive layers to identify local redundancy, removing the most redundant of the first two layers, finetuning the remaining layer to compensate for the lost capacity, and freezing the third layer to serve as a stable anchor during finetuning. By finetuning only a subset of layers on just 1% of the FineVision dataset for one epoch, INTERLACE achieves 88.9% average performance retention after dropping 25% of the network, setting a new state of the art.
- Score: 10.262304700896197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce INTERLACE, a novel framework that prunes redundant layers in VLMs while maintaining performance through sample-efficient finetuning. Existing layer pruning methods lead to a significant performance drop when applied to VLMs. Instead, we analyze triplets of consecutive layers to identify local redundancy, removing the most redundant of the first two layers, finetuning the remaining layer to compensate for the lost capacity, and freezing the third layer to serve as a stable anchor during finetuning. We find that this interleaved finetune-freeze design enables rapid convergence with minimal data after pruning. By finetuning only a subset of layers on just 1% of the FineVision dataset for one epoch, INTERLACE achieves 88.9% average performance retention after dropping 25% of the network, setting a new state of the art. Our code is available at: https://github.com/pmadinei/Interlace.git
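A minimal sketch of the triplet rule described in the abstract, assuming each layer maps hidden states of shape (B, D) to the same shape and using cosine similarity between a layer's input and output as the redundancy score; the score, helper names, and module handling are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def redundancy(layer, x):
    """Assumed redundancy score: cosine similarity between a layer's input and
    output hidden states (higher similarity = more redundant layer)."""
    y = layer(x)
    sim = F.cosine_similarity(x.flatten(1), y.flatten(1), dim=-1).mean().item()
    return sim, y

def prune_triplet(layers, i, x):
    """For the triplet (i, i+1, i+2): drop the more redundant of the first two
    layers, keep the survivor trainable so it can absorb the lost capacity,
    and freeze the third layer as a stable anchor during finetuning."""
    s0, h0 = redundancy(layers[i], x)
    s1, _ = redundancy(layers[i + 1], h0)
    drop, keep = (i, i + 1) if s0 >= s1 else (i + 1, i)
    for p in layers[keep].parameters():
        p.requires_grad = True           # finetuned to compensate for the removal
    for p in layers[i + 2].parameters():
        p.requires_grad = False          # frozen anchor
    return nn.ModuleList(m for j, m in enumerate(layers) if j != drop)
```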
Related papers
- GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs [10.61152477422108]
GradPruner prunes layers of Large Language Models guided by gradients from the early stages of fine-tuning.
Results demonstrate that GradPruner achieves a 40% parameter reduction with only a 0.99% decrease in accuracy.
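A rough sketch of the gradient-guided idea, under the assumption that layer importance is measured by aggregating gradient magnitudes per layer after a backward pass on an early finetuning batch; the scoring rule and the `layer_param_groups` argument are assumptions made for illustration.

```python
import torch

def layer_gradient_scores(loss, layer_param_groups):
    """Aggregate gradient magnitude per layer after backprop on a finetuning
    batch; layers with the smallest scores become pruning candidates."""
    loss.backward()
    scores = []
    for params in layer_param_groups:         # one list of parameter tensors per layer
        total = sum(p.grad.abs().sum().item() for p in params if p.grad is not None)
        count = sum(p.numel() for p in params)
        scores.append(total / max(count, 1))  # normalize by parameter count
    return scores
```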
arXiv Detail & Related papers (2026-01-27T11:41:26Z) - LitePT: Lighter Yet Stronger Point Transformer [50.6430530112838]
We analyse the role of different computational blocks in 3D point cloud networks.
We propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers.
The resulting LitePT model has $3.6\times$ fewer parameters, runs $2\times$ faster, and uses $2\times$ less memory than the state-of-the-art Point Transformer V3.
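The conv-early / attention-late staging can be pictured with the toy below, which uses dense (B, N, C) features and standard modules; the actual backbone relies on point-cloud-specific convolution and attention blocks, so treat this purely as an illustration of the layout.

```python
import torch
import torch.nn as nn

class HybridStage(nn.Module):
    """Toy stand-in for a stage: convolution-based mixing early on,
    attention-based mixing in deeper stages."""
    def __init__(self, dim, use_attention):
        super().__init__()
        self.use_attention = use_attention
        if use_attention:
            self.mix = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        else:
            self.mix = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x):                       # x: (B, N, C)
        if self.use_attention:
            out, _ = self.mix(x, x, x)
            return x + out
        return x + self.mix(x.transpose(1, 2)).transpose(1, 2)

# convolutions in the first stages, attention only in the deeper ones
backbone = nn.Sequential(*[HybridStage(64, use_attention=(i >= 2)) for i in range(4)])
```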
arXiv Detail & Related papers (2025-12-15T18:59:57Z) - Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation [43.822941944402544]
Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands.
Recent works have sought to reduce their model size through layer-wise structured pruning.
We re-examine structured pruning paradigms and uncover several key limitations.
arXiv Detail & Related papers (2025-10-17T04:27:06Z) - FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction [16.84400858871298]
We propose FiRST, an algorithm that reduces latency by using layer-specific routers to adaptively select a subset of transformer layers for each input sequence.
FiRST preserves compatibility with KV caching, enabling faster inference while remaining quality-aware.
Our approach reveals that input adaptivity is critical: different task-specific middle layers play a crucial role in evolving hidden representations depending on the task.
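One plausible shape of a layer-specific router, sketched under the assumption that a small gate over the mean-pooled hidden state makes a per-sequence execute/skip decision; the gate design and threshold are illustrative, not the paper's exact router.

```python
import torch
import torch.nn as nn

class RoutedLayer(nn.Module):
    """Wrap a transformer layer with a per-sequence router (inference-style
    hard gating; a soft relaxation would be used while training the router)."""
    def __init__(self, layer, dim):
        super().__init__()
        self.layer = layer                    # assumed to map (B, T, C) -> (B, T, C)
        self.router = nn.Linear(dim, 1)

    def forward(self, x):                                  # x: (B, T, C)
        gate = torch.sigmoid(self.router(x.mean(dim=1)))   # (B, 1), one decision per sequence
        keep = gate >= 0.5
        if not keep.any():
            return x                                       # every sequence skips this layer
        out = self.layer(x)                                # computed batch-wide for simplicity
        return torch.where(keep.unsqueeze(-1), out, x)
```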
arXiv Detail & Related papers (2024-10-16T12:45:35Z) - LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging [20.774060844559838]
Existing depth compression methods remove redundant non-linear activation functions and merge consecutive convolution layers into a single layer.
These methods suffer from a critical drawback: the kernel size of the merged layers becomes larger.
We show that this problem can be addressed by jointly pruning convolution layers and activation functions.
We propose LayerMerge, a novel depth compression method that selects which activation layers and convolution layers to remove.
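The kernel-growth drawback mentioned above comes from a standard identity: two stride-1 convolutions with no nonlinearity in between compose into a single convolution of kernel size k1 + k2 - 1. The sketch below shows that merge for plain bias-free `Conv2d` weights; it illustrates only the underlying identity, not LayerMerge's selection procedure for which activations and convolutions to remove.

```python
import torch
import torch.nn.functional as F

def merge_convs(conv1, conv2):
    """Compose two stride-1, bias-free Conv2d layers (no activation between
    them) into one equivalent kernel of size k1 + k2 - 1."""
    w1 = conv1.weight                                  # (C_mid, C_in,  k1, k1)
    w2 = conv2.weight                                  # (C_out, C_mid, k2, k2)
    merged = F.conv2d(
        w1.permute(1, 0, 2, 3),                        # treat w1 as the "input": (C_in, C_mid, k1, k1)
        torch.flip(w2, dims=(-1, -2)),                 # flipping turns cross-correlation into true convolution
        padding=w2.shape[-1] - 1,                      # full padding -> kernel size k1 + k2 - 1
    ).permute(1, 0, 2, 3)                              # (C_out, C_in, k1 + k2 - 1, k1 + k2 - 1)
    return merged
```

Applying the merged kernel reproduces conv2(conv1(x)) exactly in the interior for stride-1 convolutions (boundary rows and columns can differ when zero padding is used), which makes the larger-kernel drawback explicit.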
arXiv Detail & Related papers (2024-06-18T17:55:15Z) - FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models [54.787308652357794]
FinerCut is a new form of fine-grained layer pruning for transformer networks.
Our approach retains 90% of Llama3-8B's performance with 25% of layers removed, and 95% of Llama3-70B's performance with 30% of layers removed, all without fine-tuning or post-pruning reconstruction.
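A sketch of what a fine-grained, finetuning-free criterion could look like, assuming pruning decisions are made per residual sublayer (attention or FFN branch) based on how little its output perturbs the residual stream; the scoring rule below is an illustrative assumption rather than the paper's exact metric.

```python
import torch

@torch.no_grad()
def rank_sublayers(hidden, sublayers):
    """Score each residual sublayer (attention or FFN branch) by the relative
    size of its update to the hidden states; the lowest-scoring sublayers are
    the first candidates for removal, with no post-pruning reconstruction."""
    scores = []
    for sub in sublayers:                    # each maps hidden -> residual update
        delta = sub(hidden)
        scores.append((delta.norm() / hidden.norm()).item())
        hidden = hidden + delta              # advance the residual stream
    return sorted(range(len(scores)), key=lambda i: scores[i])
```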
arXiv Detail & Related papers (2024-05-28T14:21:15Z) - Streamlining Redundant Layers to Compress Large Language Models [21.27944103424621]
This paper introduces LLM-Streamline, a pioneering work on layer pruning for large language models (LLMs).
It is based on the observation that different layers have varying impacts on hidden states, enabling the identification of less important layers to prune.
Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency.
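A minimal sketch of the "varying impact on hidden states" observation, assuming impact is measured as one minus the cosine similarity between a layer's input and output on calibration data; the paper's exact importance metric and its replacement step are not given in the summary, so this is only an illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_impacts(layers, hidden):
    """Layers whose output is nearly identical to their input (impact close to
    zero) are the least important and become pruning candidates."""
    impacts = []
    for layer in layers:                         # each maps (B, T, C) -> (B, T, C)
        out = layer(hidden)
        sim = F.cosine_similarity(hidden.flatten(1), out.flatten(1), dim=-1).mean()
        impacts.append(1.0 - sim.item())
        hidden = out
    return impacts
```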
arXiv Detail & Related papers (2024-03-28T04:12:13Z) - Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth-2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with quadratic loss function, fully connected feedforward architecture, ReLU activations, Gaussian data instances, and adversarial labels.
Our results strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the NTK regime.
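Spelled out, the setting above corresponds to the following standard formulation (a sketch; the exact parameterization and scaling used in the paper may differ).

```latex
% Depth-2 fully connected ReLU network with m hidden units, trained by
% gradient descent on n Gaussian inputs x_i ~ N(0, I_d) with (possibly
% adversarial) labels y_i, under the quadratic loss.
f(x; W, a) = \sum_{j=1}^{m} a_j \,\mathrm{ReLU}\!\left(w_j^{\top} x\right),
\qquad
L(W, a) = \frac{1}{2} \sum_{i=1}^{n} \left( f(x_i; W, a) - y_i \right)^{2}.
```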
arXiv Detail & Related papers (2022-12-05T14:47:52Z) - Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% of the FLOPs of DeiT-B while simultaneously gaining an impressive 0.6% in top-1 accuracy.
arXiv Detail & Related papers (2021-11-23T11:35:54Z) - HANT: Hardware-Aware Network Transformation [82.54824188745887]
We propose hardware-aware network transformation (HANT).
HANT replaces inefficient operations with more efficient alternatives using a neural-architecture-search-like approach.
Our results on accelerating the EfficientNet family show that HANT can accelerate them by up to 3.6x with a 0.4% drop in top-1 accuracy on the ImageNet dataset.
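The operation-replacement search can be pictured with the toy loop below, which assumes candidate replacement ops have already been trained to mimic each original layer and simply keeps the fastest candidate whose calibration error stays within a tolerance; the selection rule, latency measurement, and `tol` threshold are illustrative assumptions, not the paper's procedure.

```python
import time
import torch

@torch.no_grad()
def measure_latency(module, x, iters=50):
    """Crude wall-clock latency proxy; a real hardware-aware search would
    profile on the target device."""
    start = time.perf_counter()
    for _ in range(iters):
        module(x)
    return (time.perf_counter() - start) / iters

@torch.no_grad()
def pick_replacement(layer, candidates, calib_input, tol=1e-2):
    """Keep the fastest candidate op whose output error on calibration data
    stays within `tol`; otherwise keep the original layer."""
    target = layer(calib_input)
    best, best_time = layer, measure_latency(layer, calib_input)
    for cand in candidates:
        err = (cand(calib_input) - target).abs().mean().item()
        lat = measure_latency(cand, calib_input)
        if err <= tol and lat < best_time:
            best, best_time = cand, lat
    return best
```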
arXiv Detail & Related papers (2021-07-12T18:46:34Z) - Convolutional Networks with Dense Connectivity [59.30634544498946]
We introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion.
For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers.
We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks.
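The dense connectivity pattern fits in a few lines; the sketch below is a bare-bones dense block in which each layer consumes the concatenation of all preceding feature-maps (the full DenseNet additionally uses bottleneck 1x1 convolutions and transition layers between blocks).

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer takes the concatenation of all preceding feature-maps as
    input, and its own feature-maps are passed on to all subsequent layers."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            )
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```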
arXiv Detail & Related papers (2020-01-08T06:54:53Z)