Týr-the-Pruner: Unlocking Accurate 50% Structural Pruning for LLMs via Global Sparsity Distribution Optimization
- URL: http://arxiv.org/abs/2503.09657v2
- Date: Tue, 18 Mar 2025 01:51:05 GMT
- Title: Týr-the-Pruner: Unlocking Accurate 50% Structural Pruning for LLMs via Global Sparsity Distribution Optimization
- Authors: Guanchen Li, Yixing Xu, Zeping Li, Ji Liu, Xuanwu Yin, Dong Li, Emad Barsoum
- Abstract summary: Týr-the-Pruner is an efficient end-to-end search-based global structural pruning framework. We introduce an effective local pruning method and an expectation error accumulation approach to improve supernet construction. Results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters.
- Score: 15.027017826182659
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Structural pruning enhances hardware-agnostic inference efficiency for large language models (LLMs) but often struggles to maintain performance. Local pruning performs efficient layer-by-layer compression but ignores global topology. Global pruning has the potential to find the optimal solution, although it is resource-intensive. However, existing methods tend to rank structural saliency uniformly, ignoring inter-structure dependencies and failing to achieve end-to-end optimization. To address these limitations, we propose Týr-the-Pruner, an efficient end-to-end search-based global structural pruning framework. This framework constructs a supernet by repeatedly applying local pruning across a range of sparsity ratios to each layer in an LLM, with the core goal of determining the optimal sparsity distribution under a target overall sparsity ratio. Concretely, we introduce an effective local pruning method and an expectation error accumulation approach to improve supernet construction. Furthermore, we employ an iterative prune-and-search strategy with coarse-to-fine sparsity granularity to ensure efficient search convergence. Experimental results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters.
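The abstract describes a search over per-layer sparsity ratios subject to a global sparsity budget. As a concrete illustration only, the Python sketch below frames that search as choosing one ratio per layer from a discrete candidate set so the 50% overall budget is met while an accumulated pruning-error proxy is minimized. The candidate ratios, layer count, toy error table, and evolutionary-search hyperparameters are all assumptions for illustration; the paper's actual supernet construction, expectation error accumulation, and coarse-to-fine iterative prune-and-search are not reproduced here.

```python
# Hypothetical sketch of a global sparsity-distribution search: pick one
# sparsity ratio per layer from a discrete candidate set so the overall
# parameter budget hits a 50% target while total pruning error is minimized.
# All constants and the error table are illustrative assumptions.
import random

CANDIDATE_RATIOS = [0.0, 0.25, 0.5, 0.75]  # per-layer sparsity choices in the supernet
NUM_LAYERS = 80                            # assumed layer count for illustration
TARGET_SPARSITY = 0.5                      # overall parameter-removal target

# error[l][i]: estimated output error when layer l is pruned at ratio i.
# In the paper this would come from supernet construction with expectation
# error accumulation; here it is a toy convex proxy.
error = [[(r ** 2) * (1.0 + 0.01 * l) for r in CANDIDATE_RATIOS]
         for l in range(NUM_LAYERS)]

def total_error(assignment):
    return sum(error[l][i] for l, i in enumerate(assignment))

def overall_sparsity(assignment):
    # Assumes every layer holds an equal parameter count, for simplicity.
    return sum(CANDIDATE_RATIOS[i] for i in assignment) / NUM_LAYERS

def repair(assignment):
    # Nudge random layers up/down until the global budget is met exactly.
    while abs(overall_sparsity(assignment) - TARGET_SPARSITY) > 1e-9:
        l = random.randrange(NUM_LAYERS)
        if overall_sparsity(assignment) < TARGET_SPARSITY:
            assignment[l] = min(assignment[l] + 1, len(CANDIDATE_RATIOS) - 1)
        else:
            assignment[l] = max(assignment[l] - 1, 0)
    return assignment

def evolutionary_search(pop_size=64, generations=100, mutation_rate=0.05):
    # Start from a uniform distribution that already satisfies the budget.
    base = CANDIDATE_RATIOS.index(TARGET_SPARSITY)
    population = [[base] * NUM_LAYERS for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=total_error)
        parents = population[: pop_size // 4]  # keep the fittest quarter
        children = []
        while len(children) < pop_size - len(parents):
            child = list(random.choice(parents))
            for l in range(NUM_LAYERS):        # mutate per-layer ratios
                if random.random() < mutation_rate:
                    child[l] = random.randrange(len(CANDIDATE_RATIOS))
            children.append(repair(child))
        population = parents + children
    return min(population, key=total_error)

best = evolutionary_search()
print(f"sparsity={overall_sparsity(best):.3f} error={total_error(best):.3f}")
```

Under this toy error model the search favors pruning later (higher-error) layers less aggressively, which is the kind of non-uniform per-layer sparsity distribution the paper's end-to-end search is designed to discover.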
Related papers
- Progressive Binarization with Semi-Structured Pruning for LLMs [36.32239429974179]
Large language models (LLMs) have achieved remarkable success in natural language processing tasks. Their high computational and memory demands pose challenges for deployment on resource-constrained devices. We propose a Progressive Binarization with Semi-Structured Pruning (PBS$^2$P) method for LLM compression.
arXiv Detail & Related papers (2025-02-03T13:30:29Z) - SlimGPT: Layer-wise Structured Pruning for Large Language Models [15.252798256418279]
SlimGPT introduces Batched Greedy Pruning for rapid and near-optimal pruning, and Incremental Pruning Ratio, a non-uniform pruning strategy that reduces performance degradation. Experimental results on the LLaMA benchmark show that SlimGPT outperforms other methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-12-24T02:49:50Z) - RL-Pruner: Structured Pruning Using Reinforcement Learning for CNN Compression and Acceleration [0.0]
We propose RL-Pruner, which uses reinforcement learning to learn the optimal pruning distribution.
RL-Pruner can automatically extract dependencies between filters in the input model and perform pruning, without requiring model-specific pruning implementations.
arXiv Detail & Related papers (2024-11-10T13:35:10Z) - CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information [33.01180010689081]
We introduce an efficient structured pruning framework named CFSP. We first allocate the sparsity budget across blocks based on their importance and then retain important weights within each block. Results demonstrate that CFSP outperforms existing methods on diverse models across various sparsity budgets.
arXiv Detail & Related papers (2024-09-20T04:03:27Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for large language models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models [14.310720048047136]
ALPS is an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step.
Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency.
On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
arXiv Detail & Related papers (2024-06-12T02:57:41Z) - Dynamic Regularized Sharpness Aware Minimization in Federated Learning: Approaching Global Consistency and Smooth Landscape [59.841889495864386]
In federated learning (FL), a cluster of local clients is coordinated by a global server.
Clients are prone to overfitting to their own optima, which can deviate significantly from the global objective.
FedSMOO adopts a dynamic regularizer to guarantee that the local optima approach the global objective.
Our theoretical analysis indicates that FedSMOO achieves a fast $\mathcal{O}(1/T)$ convergence rate with a low generalization bound.
arXiv Detail & Related papers (2023-05-19T10:47:44Z) - Pruning-as-Search: Efficient Neural Architecture Search via Channel Pruning and Structural Reparameterization [50.50023451369742]
Pruning-as-Search (PaS) is an end-to-end channel pruning method that automatically and efficiently searches for the desired sub-network.
Our proposed architecture outperforms prior art by around $1.0\%$ top-1 accuracy on the ImageNet-1000 classification task.
arXiv Detail & Related papers (2022-06-02T17:58:54Z) - Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially on resource-limited devices.
Previous unstructured or structured weight pruning methods can hardly deliver real inference acceleration.
We propose a generalized weight unification framework at a hardware-compatible micro-structured level to achieve a high degree of compression and acceleration.
arXiv Detail & Related papers (2021-06-15T17:22:59Z) - Dynamic Probabilistic Pruning: A general framework for hardware-constrained pruning at different granularities [80.06422693778141]
We propose a flexible new pruning mechanism that facilitates pruning at different granularities (weights, kernels, filters/feature maps).
We refer to this algorithm as Dynamic Probabilistic Pruning (DPP).
We show that DPP achieves competitive compression rates and classification accuracy when pruning common deep learning models trained on different benchmark datasets for image classification.
arXiv Detail & Related papers (2021-05-26T17:01:52Z) - Discretization-Aware Architecture Search [81.35557425784026]
This paper presents discretization-aware architecture search (DA$^2$S).
The core idea is to push the super-network towards the configuration of desired topology, so that the accuracy loss brought by discretization is largely alleviated.
Experiments on standard image classification benchmarks demonstrate the superiority of our approach.
arXiv Detail & Related papers (2020-07-07T01:18:58Z)