MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on Large Language Models
- URL: http://arxiv.org/abs/2506.12876v1
- Date: Sun, 15 Jun 2025 15:02:59 GMT
- Title: MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on Large Language Models
- Authors: Yan Sun, Qixin Zhang, Zhiyuan Yu, Xikun Zhang, Li Shen, Dacheng Tao
- Abstract summary: Semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights. Existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven learning, which incurs prohibitive training costs. We propose a novel linear-space probabilistic framework named MaskPro, which aims to learn a prior categorical distribution for every $M$ consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity through an $N$-way sampling without replacement.
- Score: 53.36415620647177
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid scaling of large language models (LLMs) has made inference efficiency a primary bottleneck in practical deployment. To address this, semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights, thereby enabling hardware-friendly acceleration and reduced memory usage. However, existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven combinatorial learning, which incurs prohibitive training costs. To tackle these challenges, we propose a novel linear-space probabilistic framework named MaskPro, which aims to learn a prior categorical distribution for every $M$ consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity through an $N$-way sampling without replacement. Furthermore, to mitigate the training instability induced by the high variance of policy gradients in the extremely large combinatorial space, we propose a novel update method that uses a moving-average tracker of loss residuals instead of the vanilla loss. Finally, we conduct a comprehensive theoretical analysis and extensive experiments to validate the superior performance of MaskPro, as well as its excellent scalability in memory efficiency and exceptional robustness to data samples. Our code is available at https://github.com/woodenchild95/Maskpro.git.
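The sampling step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see their repository for that); it only shows, under stated assumptions, what "one categorical distribution per group of $M$ weights, sampled $N$ times without replacement" might look like. The names `softmax`, `sample_nm_mask`, and `MovingAverageBaseline` are hypothetical, and the baseline class is a generic running-mean tracker of the kind commonly used to reduce policy-gradient variance, which may differ from the update rule the paper actually proposes.

```python
import math
import random


def softmax(logits):
    """Turn one group's logits into categorical probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_nm_mask(logits, n, rng):
    """Draw a 0/1 mask that keeps exactly n of the M = len(logits) weights.

    The n kept positions are drawn one at a time, without replacement,
    from the categorical distribution softmax(logits).
    """
    candidates = list(range(len(logits)))
    weights = softmax(logits)
    mask = [0] * len(logits)
    for _ in range(n):
        # choices() accepts unnormalised weights, so no renormalisation is needed.
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        mask[candidates.pop(pick)] = 1
        weights.pop(pick)
    return mask


class MovingAverageBaseline:
    """Hypothetical helper: tracks a running mean of observed losses."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.value = None

    def residual(self, loss):
        """Return loss minus the running mean, then update the mean."""
        if self.value is None:
            self.value = loss
        diff = loss - self.value
        self.value = self.momentum * self.value + (1.0 - self.momentum) * loss
        return diff
```

For a 2:4 pattern, `sample_nm_mask([0.0, 0.0, 0.0, 0.0], 2, random.Random(0))` returns a length-4 list containing exactly two ones. Storing only $M$ logits per group, rather than one score per possible mask, is one plausible reading of "linear-space" in the title; that interpretation is an assumption here.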
Related papers
- Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update [60.414548453838506]
We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function. GLBs are widely applicable to real-world scenarios, but their non-linear nature introduces significant challenges in achieving both computational and statistical efficiency. We propose a jointly efficient algorithm that attains a nearly optimal regret bound with $\mathcal{O}(1)$ time and space complexities per round.
arXiv Detail & Related papers (2025-07-16T02:24:21Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Sparse Linear Bandits with Blocking Constraints [22.01704171400845]
We investigate the high-dimensional sparse linear bandits problem in a data-poor regime. We show novel offline statistical guarantees of the lasso estimator for the linear model. We propose a meta-algorithm based on corralling that does not need knowledge of optimal sparsity parameter $k$ at minimal cost to regret.
arXiv Detail & Related papers (2024-10-26T01:42:03Z) - Self-Supervised Scalable Deep Compressed Sensing [24.854496459622787]
Compressed sensing is a promising tool for reducing sampling costs.
Current deep neural network (NN)-based CS methods face the challenges of collecting labeled measurement-ground truth (GT) data.
This paper proposes a novel $\mathbf{S}$elf-supervised s$\mathbf{C}$alable deep CS method.
arXiv Detail & Related papers (2023-08-26T06:03:06Z) - One-Shot Online Testing of Deep Neural Networks Based on Distribution Shift Detection [0.6091702876917281]
We propose a one-shot testing approach that can test NNs accelerated on memristive crossbars with only one test vector.
Our approach can consistently achieve 100% fault coverage across several large topologies.
arXiv Detail & Related papers (2023-05-16T11:06:09Z) - Supervised Contrastive Prototype Learning: Augmentation Free Robust Neural Network [17.10753224600936]
Transformations in the input space of Deep Neural Networks (DNN) lead to unintended changes in the feature space.
We propose a training framework, Supervised Contrastive Prototype Learning (SCPL).
We use N-pair contrastive loss with prototypes of the same and opposite classes and replace a categorical classification head with a Prototype Classification Head (PCH).
Our approach is sample efficient, does not require sample mining, can be implemented on any existing DNN without modification to their
arXiv Detail & Related papers (2022-11-26T01:17:15Z) - When are Local Queries Useful for Robust Learning? [25.832511407411637]
We study learning models where the learner is given more power through the use of local queries.
We give the first distribution-free algorithms that perform robust empirical risk minimization.
We finish by giving robust learning algorithms for halfspaces on $\{0,1\}^n$ and then obtaining robustness guarantees for halfspaces in $\mathbb{R}^n$ against precision-bounded adversaries.
arXiv Detail & Related papers (2022-10-12T11:04:22Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z) - Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator)
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z) - Towards Deep Learning Models Resistant to Large Perturbations [0.0]
Adversarial robustness has proven to be a required property of machine learning algorithms.
We show that the well-established algorithm called "adversarial training" fails to train a deep neural network given a large, but reasonable, perturbation magnitude.
arXiv Detail & Related papers (2020-03-30T12:03:09Z)