PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models
- URL: http://arxiv.org/abs/2510.10136v1
- Date: Sat, 11 Oct 2025 09:40:27 GMT
- Title: PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models
- Authors: Lancheng Zou, Shuo Yin, Zehua Pei, Tsung-Yi Ho, Farzan Farnia, Bei Yu
- Abstract summary: Channel permutation is a powerful technique for enhancing the accuracy of N:M sparse models. We propose PermLLM, a novel post-training pruning framework that introduces learnable channel permutation. We show that PermLLM achieves superior performance in optimizing N:M sparse models.
- Score: 44.32585496684303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Channel permutation is a powerful technique for enhancing the accuracy of N:M sparse models by reordering the channels of weight matrices to prioritize the retention of important weights. However, traditional channel permutation methods rely on handcrafted quality metrics, which often fail to accurately capture the true impact of pruning on model performance. To address this limitation, we propose PermLLM, a novel post-training pruning framework that introduces learnable channel permutation (LCP) for N:M sparsity. LCP leverages Sinkhorn normalization to transform discrete permutation matrices into differentiable soft permutation matrices, enabling end-to-end optimization. Additionally, PermLLM incorporates an efficient block-wise channel permutation strategy, which significantly reduces the number of learnable parameters and computational complexity. PermLLM seamlessly integrates with existing one-shot pruning methods to adaptively optimize channel permutations, effectively mitigating pruning-induced errors. Extensive experiments on the LLaMA series, Qwen, and OPT models demonstrate that PermLLM achieves superior performance in optimizing N:M sparse models. The code is available at https://github.com/lanchengzou/PermLLM.
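Since the abstract describes the mechanism only at a high level, the following is a minimal, self-contained sketch of the general recipe it names: Sinkhorn normalization relaxes a learnable score matrix into a doubly stochastic soft permutation that can be trained end-to-end and rounded to a hard permutation afterwards. All names, hyperparameters, the reconstruction loss, and the magnitude-based 2:4 mask below are illustrative assumptions, not details taken from the paper (which additionally uses a block-wise permutation scheme omitted here).

```python
import torch

def sinkhorn(log_scores: torch.Tensor, n_iters: int = 20, tau: float = 0.1) -> torch.Tensor:
    """Relax a score matrix into an (approximately) doubly stochastic soft
    permutation by alternating row/column normalization in log space."""
    log_p = log_scores / tau
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows sum to 1
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # cols sum to 1
    return log_p.exp()

def two_four_mask(w: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in each group of 4 consecutive
    input channels (standard 2:4 sparsity along the last dimension)."""
    groups = w.abs().reshape(w.shape[0], -1, 4)
    idx = groups.topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return mask.reshape_as(w)

# Toy setup: learn a permutation of 8 input channels for a 16x8 weight so
# that 2:4 pruning after permutation best preserves the layer's output.
torch.manual_seed(0)
w, x = torch.randn(16, 8), torch.randn(32, 8)
log_scores = torch.zeros(8, 8, requires_grad=True)  # learnable permutation logits
opt = torch.optim.Adam([log_scores], lr=0.1)

for _ in range(100):
    p = sinkhorn(log_scores)   # soft permutation matrix
    w_perm = w @ p             # reorder input channels of the weight
    # The mask is piecewise constant, so gradients flow to the kept weights.
    w_sparse = w_perm * two_four_mask(w_perm)
    # Permuting x and w consistently leaves the dense product unchanged,
    # so the loss measures only the error introduced by pruning.
    loss = ((x @ p) @ w_sparse.T - x @ w.T).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Round to a hard permutation (Hungarian matching would guarantee validity).
hard_perm = sinkhorn(log_scores).argmax(dim=1)
```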
Related papers
- Learnable Permutation for Structured Sparsity on Transformer Models [17.777454274409912]
Structured sparsity has emerged as a popular model pruning technique. Weight permutation is a promising direction to further improve post-pruning performance. We propose a novel end-to-end learnable permutation framework.
arXiv Detail & Related papers (2026-01-30T13:44:00Z)
- TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks [12.33715367032615]
Network pruning reduces the computational requirements of large neural networks. N:M sparsity retains only N out of every M consecutive weights, but a row-wise N:M mask is generally not N:M along columns, so the transposed weight used in the backward pass cannot be accelerated. Transposable N:M sparsity has been proposed to address this limitation (a small mask illustration follows this entry).
arXiv Detail & Related papers (2025-05-29T18:59:43Z)
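To make the transposability constraint concrete: the hand-built tile below satisfies the 2:4 constraint in both directions, so one mask serves both W and W^T. The checker function and the pattern are our own illustration, not TSENOR's algorithm.

```python
import torch

def is_nm_sparse(mask: torch.Tensor, n: int = 2, m: int = 4) -> bool:
    """True if every group of m consecutive entries along the last
    dimension contains at most n nonzeros."""
    groups = mask.reshape(mask.shape[0], -1, m)
    return bool((groups.sum(dim=-1) <= n).all())

# A 4x4 tile that is 2:4 sparse along BOTH rows and columns, so one mask
# can accelerate the forward pass (W) and the backward pass (W^T).
mask = torch.tensor([[1., 1., 0., 0.],
                     [0., 0., 1., 1.],
                     [1., 0., 1., 0.],
                     [0., 1., 0., 1.]])
print(is_nm_sparse(mask))    # True (row constraint)
print(is_nm_sparse(mask.T))  # True (column constraint)
```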
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration (a sketch of the rotation idea follows this entry).
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
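For intuition only, here is a generic illustration of why rotations help quantization (the broad idea behind rotation-based methods, not RoSTE's specific QA-SFT procedure): an orthogonal rotation Q leaves the layer's function unchanged, since (xQ)(Q^T W^T) = xW^T, while spreading outlier coordinates across dimensions so uniform quantization wastes less range.

```python
import torch

def quantize(t: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor uniform quantization (round-to-nearest)."""
    scale = t.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(t / scale) * scale

torch.manual_seed(0)
x = torch.randn(64, 128)
x[:, 0] *= 50.0  # inject an outlier channel, as often seen in LLM activations

# Random orthogonal matrix via QR, standing in for an optimized rotation.
q, _ = torch.linalg.qr(torch.randn(128, 128))

err_plain = (quantize(x) - x).pow(2).mean()
err_rotated = (quantize(x @ q) @ q.T - x).pow(2).mean()  # rotate, quantize, undo
print(err_plain.item(), err_rotated.item())  # rotation typically shrinks the error
```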
- Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We propose a new family of algorithms that uses the linear minimization oracle (LMO) to adapt to the geometry of the problem. We demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam (a generic LMO step is sketched after this entry).
arXiv Detail & Related papers (2025-02-11T13:10:34Z)
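For a flavor of what an LMO-based update looks like (a generic Frank-Wolfe-style step under an assumed L∞ constraint, not Scion's actual norms or update rule): the oracle returns the feasible point most aligned with the negative gradient, which for the L∞ ball is simply a sign step.

```python
import torch

def lmo_linf(grad: torch.Tensor, radius: float = 1.0) -> torch.Tensor:
    # argmin_{||d||_inf <= radius} <grad, d> = -radius * sign(grad)
    return -radius * torch.sign(grad)

w = torch.randn(10, requires_grad=True)
loss = (w ** 2).sum()        # stand-in objective
loss.backward()
with torch.no_grad():
    gamma = 0.1              # Frank-Wolfe interpolation step size
    w.mul_(1 - gamma).add_(gamma * lmo_linf(w.grad))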
- Expanding Sparse Tuning for Low Memory Usage [103.43560327427647]
We propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage.
To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices.
A competition-based sparsification mechanism is further proposed to avoid storing tunable weight indices (a rough sketch follows this entry).
arXiv Detail & Related papers (2024-11-04T04:58:20Z)
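A rough sketch of the two ideas in this entry, with all names and the top-k rule ours: the tunable update lives in two low-rank factors (LoRA-style), and sparsification keeps only the largest-magnitude entries of their product at use time, recomputing rather than storing indices. (The actual method reportedly avoids materializing the full matrix via a kernel formulation; this sketch materializes it for clarity.)

```python
import torch

d, r, density = 512, 8, 0.05        # weight dim, rank, fraction of entries kept
a = torch.randn(d, r) * 0.01        # learnable low-rank factors
b = torch.randn(r, d) * 0.01

def sparse_update(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Materialize the low-rank update and keep only its largest-magnitude
    entries ("competition"); indices are recomputed, never stored."""
    delta = a @ b
    k = int(density * delta.numel())
    thresh = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
    return delta * (delta.abs() >= thresh)

w_frozen = torch.randn(d, d)        # pretrained weight stays untouched
w_effective = w_frozen + sparse_update(a, b)
```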
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [63.10833446782114]
As language models grow in size, memory demands for backpropagation increase. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative. In this paper, we propose Subspace Zero-order optimization to address the challenges posed by high-dimensional perturbations (a minimal estimator is sketched after this entry).
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
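The memory saving comes from estimating gradients with forward passes alone. Below is a minimal two-point (SPSA-style) estimator confined to a random subspace; the function names and the explicit projection matrix are our illustration (at LLM scale the projection would be applied implicitly), not the paper's exact algorithm.

```python
import torch

def zo_grad(loss_fn, w: torch.Tensor, subspace_dim: int = 16, eps: float = 1e-3):
    """Two-point zeroth-order gradient estimate along a random subspace
    direction u = P z; no backward pass (and no activation storage) needed."""
    p = torch.randn(w.numel(), subspace_dim) / subspace_dim ** 0.5
    u = (p @ torch.randn(subspace_dim)).reshape_as(w)
    g = (loss_fn(w + eps * u) - loss_fn(w - eps * u)) / (2 * eps)
    return g * u                     # directional estimate of the gradient

w = torch.randn(100)
loss_fn = lambda t: (t ** 2).sum()   # stand-in objective
w = w - 0.01 * zo_grad(loss_fn, w)   # one memory-light update step
```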
- Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs [1.3124513975412255]
N:M sparsity pruning is a powerful technique for compressing deep neural networks.
We introduce a channel permutation method designed specifically for hierarchical N:M (HiNM) sparsity, named gyro-permutation.
arXiv Detail & Related papers (2024-07-30T01:40:50Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths (a toy bit-allocation sketch follows this entry).
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
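A bare-bones illustration of group-wise bit allocation (the salience proxy, group size, and bit split are our assumptions; SliM-LLM's actual salience metric and allocation are in the paper): groups whose weights appear more important receive more bits.

```python
import torch

def quantize_group(g: torch.Tensor, bits: int) -> torch.Tensor:
    scale = g.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(g / scale) * scale

def mixed_precision(w: torch.Tensor, group_size: int = 64) -> torch.Tensor:
    """Quantize each group of consecutive weights, spending 4 bits on the
    more salient half of the groups and 2 bits on the rest."""
    groups = w.reshape(-1, group_size)
    salience = groups.abs().mean(dim=1)          # crude salience proxy
    generous = salience >= salience.median()
    out = torch.empty_like(groups)
    for i, g in enumerate(groups):
        out[i] = quantize_group(g, bits=4 if generous[i] else 2)
    return out.reshape_as(w)

w_q = mixed_precision(torch.randn(4096))
```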
- Improving generalization in large language models by learning prefix subspaces [5.911540700785975]
This article focuses on fine-tuning large language models (LLMs) in the scarce-data regime (also known as the "few-shot" learning setting).
We propose a method to increase the generalization capabilities of LLMs based on neural network subspaces.
arXiv Detail & Related papers (2023-10-24T12:44:09Z)
- QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models [44.515165695546614]
Quantization-Aware Training (QAT) offers a solution, but its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for Large Language Models (LLMs).
We propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs.
arXiv Detail & Related papers (2023-10-12T05:25:49Z)