Learnable Permutation for Structured Sparsity on Transformer Models
- URL: http://arxiv.org/abs/2601.22980v1
- Date: Fri, 30 Jan 2026 13:44:00 GMT
- Title: Learnable Permutation for Structured Sparsity on Transformer Models
- Authors: Zekai Li, Ji Liu, Guanchen Li, Yixing Xu, Ziqiong Liu, Xuanwu Yin, Dong Li, Emad Barsoum
- Abstract summary: Structured sparsity has emerged as a popular model pruning technique. Weight permutation is a promising direction to further improve post-pruning performance. We propose a novel end-to-end learnable permutation framework.
- Score: 17.777454274409912
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Structured sparsity has emerged as a popular model pruning technique, widely adopted in various architectures, including CNNs, Transformer models, and especially large language models (LLMs) in recent years. A promising direction to further improve post-pruning performance is weight permutation, which reorders model weights into patterns more amenable to pruning. However, the exponential growth of the permutation search space with the scale of Transformer architectures forces most methods to rely on greedy or heuristic algorithms, limiting the effectiveness of reordering. In this work, we propose a novel end-to-end learnable permutation framework. Our method introduces a learnable permutation cost matrix to quantify the cost of swapping any two input channels of a given weight matrix, a differentiable bipartite matching solver to obtain the optimal binary permutation matrix given a cost matrix, and a sparsity optimization loss function to directly optimize the permutation operator. We extensively validate our approach on vision and language Transformers, demonstrating that our method achieves state-of-the-art permutation results for structured sparsity.
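The three components the abstract names (a learnable cost matrix, a differentiable bipartite matching solver, and a loss that drives the permutation) follow a pattern common in differentiable-permutation work. As a minimal sketch of that pattern, not the authors' implementation: relax the cost matrix into a doubly-stochastic "soft permutation" with Sinkhorn iterations so gradients can flow into the cost matrix, and recover the exact binary permutation with a bipartite matching (Hungarian) solver for the forward pass.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.special import logsumexp

def sinkhorn(cost, tau=0.1, n_iters=200):
    """Doubly-stochastic relaxation of the permutation implied by `cost`.

    Lower cost[i, j] -> higher probability of mapping channel i to slot j.
    The output is differentiable in `cost`, so a sparsity loss on the
    permuted weights can back-propagate into the cost matrix.
    """
    log_p = -cost / tau
    for _ in range(n_iters):
        log_p -= logsumexp(log_p, axis=1, keepdims=True)  # normalize rows
        log_p -= logsumexp(log_p, axis=0, keepdims=True)  # normalize columns
    return np.exp(log_p)

def hard_permutation(cost):
    """Exact binary permutation via minimum-cost bipartite matching."""
    rows, cols = linear_sum_assignment(cost)
    P = np.zeros_like(cost)
    P[rows, cols] = 1.0
    return P

rng = np.random.default_rng(0)
cost = rng.normal(size=(6, 6))
soft = sinkhorn(cost)          # soft matrix for the backward pass
hard = hard_permutation(cost)  # binary matrix for the forward pass
```

In straight-through training schemes of this kind, the hard matrix is used in the forward computation while gradients are routed through the soft relaxation.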
Related papers
- PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models [44.32585496684303]
Channel permutation is a powerful technique for enhancing the accuracy of N:M sparse models. We propose PermLLM, a novel post-training pruning framework that introduces learnable channel permutation. We show that PermLLM achieves superior performance in optimizing N:M sparse models.
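For reference, N:M sparsity keeps N out of every M consecutive weights along the input dimension; channel permutation matters because reordering columns changes which magnitudes compete inside each group of M. A small magnitude-pruning illustration of 2:4 sparsity (hypothetical helper names, not PermLLM's code):

```python
import numpy as np

def prune_2_4(w):
    """Zero the 2 smallest-magnitude weights in every group of 4 columns."""
    rows, cols = w.shape
    assert cols % 4 == 0
    groups = w.reshape(rows, cols // 4, 4)
    smallest = np.argsort(np.abs(groups), axis=-1)[..., :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, smallest, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

def retained_magnitude(w, perm):
    """Total |weight| surviving 2:4 pruning after permuting input channels."""
    return np.abs(prune_2_4(w[:, perm])).sum()

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 8))
identity = np.arange(8)
# A good permutation spreads large-magnitude channels across groups, so
# fewer large weights are forced out by the 2-of-4 constraint.
candidates = [identity] + [rng.permutation(8) for _ in range(200)]
best = max(candidates, key=lambda p: retained_magnitude(w, p))
```

The random search over `candidates` is only for illustration; the papers above replace it with learned permutations.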
arXiv Detail & Related papers (2025-10-11T09:40:27Z)
- Structure As Search: Unsupervised Permutation Learning for Combinatorial Optimization [22.51828421737954]
We propose a non-autoregressive framework for the Travelling Salesman Problem. By applying a similarity transformation to Hamiltonian cycles, the model learns to approximate permutations via continuous relaxations.
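The cycle-permutation connection this summary relies on can be made concrete: any Hamiltonian cycle is the canonical cycle 0→1→…→n-1→0 conjugated by a permutation matrix, so tour cost is linear in that matrix, which is what makes continuous relaxations of the permutation meaningful. An illustrative sketch (not the paper's model):

```python
import numpy as np

def tour_cost(perm, dist):
    """Cost of visiting cities in `perm` order and returning to the start,
    written as <dist, P^T A P> where A is the canonical cycle's adjacency."""
    n = len(perm)
    P = np.eye(n)[perm]                 # P[i, perm[i]] = 1
    A = np.roll(np.eye(n), 1, axis=1)   # A[i, (i + 1) % n] = 1
    return float(np.sum(dist * (P.T @ A @ P)))
```

Replacing the binary `P` with a doubly-stochastic matrix keeps the same cost expression but makes it differentiable, so the tour can be optimized by gradient descent and rounded back to a permutation.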
arXiv Detail & Related papers (2025-07-05T21:17:38Z)
- MatRL: Provably Generalizable Iterative Algorithm Discovery via Monte-Carlo Tree Search [37.24058519921229]
MatRL is a reinforcement learning framework that automatically discovers iterative algorithms for computing matrix functions. We show that MatRL produces algorithms that outperform various baselines in the literature.
arXiv Detail & Related papers (2025-07-04T22:57:33Z)
- REOrdering Patches Improves Vision Models [58.8295093799148]
We show that patch order significantly affects model performance in such settings. We propose REOrder, a framework for discovering task-optimal patch orderings. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and on Functional Map of the World by 13.35%.
arXiv Detail & Related papers (2025-05-29T17:59:30Z)
- New Permutation Decomposition Techniques for Efficient Homomorphic Permutation [33.30918602217202]
Homomorphic permutation is fundamental to privacy-preserving computations based on batch-encoding homomorphic encryption. We propose novel decomposition techniques to optimize homomorphic permutations, advancing homomorphic encryption-based privacy-preserving computations. We design a network structure that deviates from the conventional scope of decomposition and outperforms the state-of-the-art technique with a speed-up of up to $1.69\times$ under a minimal rotation key requirement.
arXiv Detail & Related papers (2024-10-29T08:07:22Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Discovering Non-monotonic Autoregressive Orderings with Variational Inference [67.27561153666211]
We develop an unsupervised parallelizable learner that discovers high-quality generation orders purely from training data.
We implement the encoder as a Transformer with non-causal attention that outputs permutations in one forward pass.
Empirical results in language modeling tasks demonstrate that our method is context-aware and discovers orderings that are competitive with or even better than fixed orders.
arXiv Detail & Related papers (2021-10-27T16:08:09Z)
- Structured Reordering for Modeling Latent Alignments in Sequence Transduction [86.94309120789396]
We present an efficient dynamic programming algorithm performing exact marginal inference of separable permutations.
The resulting seq2seq model exhibits better systematic generalization than standard models on synthetic problems and NLP tasks.
arXiv Detail & Related papers (2021-06-06T21:53:54Z)
- Batch Bayesian Optimization on Permutations using Acquisition Weighted Kernels [86.11176756341114]
We introduce LAW, a new efficient batch acquisition method based on the determinantal point process.
We provide a regret analysis for our method to gain insight in its theoretical properties.
We evaluate the method on several standard problems involving permutations such as quadratic assignment.
arXiv Detail & Related papers (2021-02-26T10:15:57Z)
- Multi-View Spectral Clustering with High-Order Optimal Neighborhood Laplacian Matrix [57.11971786407279]
Multi-view spectral clustering can effectively reveal the intrinsic cluster structure among data.
This paper proposes a multi-view spectral clustering algorithm that learns a high-order optimal neighborhood Laplacian matrix.
Our proposed algorithm generates the optimal Laplacian matrix by searching the neighborhood of a linear combination of both first-order and high-order base Laplacians.
arXiv Detail & Related papers (2020-08-31T12:28:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.