IntraSlice: Towards High-Performance Structural Pruning with Block-Intra PCA for LLMs
- URL: http://arxiv.org/abs/2602.01975v1
- Date: Mon, 02 Feb 2026 11:28:56 GMT
- Title: IntraSlice: Towards High-Performance Structural Pruning with Block-Intra PCA for LLMs
- Authors: Meng Li, Peisong Wang, Yuantian Shao, Qinghao Hu, Hongjian Fang, Yifan Zhang, Zhihui Wei, Jian Cheng
- Abstract summary: Large Language Models (LLMs) achieve strong performance across diverse tasks but face deployment challenges due to their massive size. Recent PCA-based pruning methods have alleviated this issue by retaining key activation components. We propose IntraSlice, a framework that applies block-wise intra-module PCA compression for pruning.
- Score: 37.1665041786606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) achieve strong performance across diverse tasks but face deployment challenges due to their massive size. Structured pruning offers acceleration benefits but leads to significant performance degradation. Recent PCA-based pruning methods alleviate this issue by retaining the key components of activations, but they apply PCA only between modules in order to fuse the transformation matrix, which introduces extra parameters and, due to residual connections, severely disrupts activation distributions. To address these issues, we propose IntraSlice, a framework that applies block-wise intra-module PCA compression for pruning. By leveraging the structural characteristics of Transformer modules, we design an approximate PCA method whose transformation matrices can be fully fused into the model without additional parameters. We also introduce a PCA-based global pruning-ratio estimator that considers the distribution of compressed activations in addition to conventional module importance. We validate our method on the Llama2, Llama3, and Phi series across various language benchmarks. Experimental results demonstrate that our approach achieves superior compression performance compared to recent baselines at the same compression ratio or inference speed.
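The fusion idea at the heart of PCA-based pruning is easy to illustrate in isolation. Below is a minimal sketch, not the authors' implementation: it assumes a bare pair of linear layers (an up-projection U followed by a down-projection W, as in an MLP block with no residual connection), uses uncentered PCA, and treats all names, shapes, and hyperparameters as illustrative. It shows how the top-k principal directions of the intermediate activations can be absorbed into both weight matrices, shrinking the hidden width from d to k with no extra transformation parameters left at inference time.

```python
import torch

def pca_projection(acts: torch.Tensor, k: int) -> torch.Tensor:
    """Top-k principal directions of calibration activations: (n, d) -> (d, k).

    Uncentered PCA (right singular vectors of the raw activation matrix),
    so x @ Q @ Q.T is a low-rank approximation of x with no mean term.
    """
    _, _, vT = torch.linalg.svd(acts, full_matrices=False)
    return vT[:k].T  # (d, k), orthonormal columns

@torch.no_grad()
def fuse_pca_into_pair(U: torch.Tensor, W: torch.Tensor,
                       calib_acts: torch.Tensor, k: int):
    """Compress the hidden dimension between U (d_in, d) and W (d, d_out).

    With Q the top-k directions of x = h @ U, the output y = x @ W is
    approximated by (h @ (U @ Q)) @ (Q.T @ W): both factors fuse into
    the existing weights, so the hidden width drops from d to k.
    """
    Q = pca_projection(calib_acts, k)  # (d, k)
    return U @ Q, Q.T @ W              # (d_in, k), (k, d_out)

# Toy usage: x = h @ U has rank at most d_in = 64, so keeping k = 64
# principal components is lossless up to numerical error.
torch.manual_seed(0)
d_in, d, d_out, n, k = 64, 256, 64, 1024, 64
U = torch.randn(d_in, d) / d_in ** 0.5
W = torch.randn(d, d_out) / d ** 0.5
h = torch.randn(n, d_in)
x = h @ U                              # calibration activations
U_f, W_f = fuse_pca_into_pair(U, W, x, k)
err = (h @ U_f @ W_f - x @ W).norm() / (x @ W).norm()
print(f"relative reconstruction error at k={k}: {err:.2e}")
```

This toy pair omits exactly what makes the problem hard in practice: residual connections and normalization inside Transformer blocks prevent such clean fusion at module boundaries, which is the gap the block-intra formulation targets.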
Related papers
- RPCANet++: Deep Interpretable Robust PCA for Sparse Object Segmentation [51.37553739930992]
RPCANet++ is a sparse object segmentation framework that fuses the interpretability of RPCA with efficient deep architectures. Our approach unfolds a relaxed RPCA model into a structured network comprising a Background Approximation Module (BAM), an Object Extraction Module (OEM), and an Image Restoration Module (IRM); a minimal sketch of the underlying RPCA model appears after this list. Experiments on diverse datasets demonstrate that RPCANet++ achieves state-of-the-art performance under various imaging scenarios.
arXiv Detail & Related papers (2025-08-06T08:19:37Z) - Constrained Edge AI Deployment: Fine-Tuning vs Distillation for LLM Compression [1.85373927927491]
Modern models are often compressed via a combination of structured pruning and re-training to meet the strict compute, memory, and connectivity constraints of edge deployments. Our focus is not on achieving maximal compression, but on isolating the impact of the re-training loss function. We evaluate both pipelines on the OLMo2-7B-SFT model on CommonsenseQA, a setting suitable for the intermittent or denied connectivity scenarios typical of edge networks.
arXiv Detail & Related papers (2025-05-13T19:06:32Z) - Optimization of Module Transferability in Single Image Super-Resolution: Universality Assessment and Cycle Residual Blocks [4.937699452538975]
We introduce the concept of "Universality" and its associated definitions, which extend the traditional notion of "Generalization". We then propose the Universality Assessment Equation (UAE), a metric that quantifies how readily a given module can be transplanted across models. We demonstrate that networks embedded with the proposed plug-and-play modules outperform several state-of-the-art methods.
arXiv Detail & Related papers (2025-05-06T13:35:59Z) - Adaptive Pruning of Pretrained Transformer via Differential Inclusions [48.47890215458465]
Current compression algorithms prune transformers at fixed compression ratios, requiring a unique pruning process for each ratio. We propose pruning of pretrained transformers at any desired ratio within a single pruning stage, based on a differential inclusion for a mask parameter. This dynamic can generate the whole regularization solution path of the mask parameter, whose support set identifies the network structure.
arXiv Detail & Related papers (2025-01-06T06:34:52Z) - FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers [30.88764351013966]
Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains. Recent works have observed redundancy within transformer blocks and developed compression methods by structured pruning of less important blocks. We propose FuseGPT, a novel methodology designed to recycle pruned transformer blocks, thereby recovering the model's performance.
arXiv Detail & Related papers (2024-11-21T09:49:28Z) - DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models [62.98273649512654]
Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks.
Increased memory and computational costs associated with these models pose significant challenges for deployment on resource-limited devices.
We propose a novel approach that relaxes the constraint imposed by regular structural pruning methods.
arXiv Detail & Related papers (2024-10-15T18:51:18Z) - MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework. MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions. Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z) - Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations [21.229296254354878]
We introduce a task-agnostic structured pruning approach coupled with a compact Transformer architecture design.
The proposed approach, named TransAct, reduces transitional activations inside multi-head attention (MHA) and multi-layer perceptron (MLP) modules.
Results verify the optimality of our approach at high compression with respect to both efficiency and performance.
arXiv Detail & Related papers (2024-07-08T07:45:38Z) - Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model [81.55141188169621]
We equip PEFT with a cross-block orchestration mechanism to enable the adaptation of the Segment Anything Model (SAM) to various downstream scenarios.
We propose an intra-block enhancement module, which introduces a linear projection head whose weights are generated from a hyper-complex layer.
Our proposed approach consistently improves the segmentation performance significantly on novel scenarios with only around 1K additional parameters.
arXiv Detail & Related papers (2023-11-28T11:23:34Z)
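As referenced in the RPCANet++ entry above, the relaxed robust PCA model that such deep-unfolding networks build on can be summarized in a few lines. The sketch below is not RPCANet++ itself: it is plain alternating minimization for the relaxed objective min_{L,S} tau*||L||_* + lam*||S||_1 + 0.5*||D - L - S||_F^2, with illustrative thresholds; unfolding methods essentially replace these fixed updates with learned, layer-wise modules (the BAM/OEM/IRM structure).

```python
import torch

def soft_threshold(x: torch.Tensor, lam: float) -> torch.Tensor:
    """Elementwise soft-thresholding, the proximal operator of lam * ||.||_1."""
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

def rpca_decompose(D: torch.Tensor, tau: float = 1.0, lam: float = 0.5,
                   iters: int = 100):
    """Split D into low-rank L plus sparse S by alternating minimization."""
    L = torch.zeros_like(D)
    S = torch.zeros_like(D)
    for _ in range(iters):
        # L-update: singular value thresholding of the sparse-free residual.
        U, sig, Vh = torch.linalg.svd(D - S, full_matrices=False)
        L = U @ torch.diag(soft_threshold(sig, tau)) @ Vh
        # S-update: soft-thresholding of the low-rank-free residual.
        S = soft_threshold(D - L, lam)
    return L, S

# Toy usage: a rank-2 "background" plus two bright "objects"; the spikes
# should land mostly in S while L captures the low-rank background.
torch.manual_seed(0)
background = torch.randn(64, 2) @ torch.randn(2, 64)
objects = torch.zeros(64, 64)
objects[10, 20] = objects[30, 40] = 8.0
L, S = rpca_decompose(background + objects)
print("nonzeros in S:", int((S.abs() > 1e-6).sum()))
```

In the deep-unfolded version, each iteration of this loop becomes a network stage whose thresholds (and richer transforms) are learned end to end.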