MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models
- URL: http://arxiv.org/abs/2409.17481v1
- Date: Thu, 26 Sep 2024 02:37:41 GMT
- Title: MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models
- Authors: Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff
Pool, Jan Kautz, Pavlo Molchanov, Xinchao Wang
- Abstract summary: This work introduces MaskLLM, a learnable pruning method that establishes Semi-structured (or ``N:M'') Sparsity in Large Language Models.
MaskLLM explicitly models N:M patterns as a learnable distribution through Gumbel Softmax sampling.
- Score: 91.4190318047519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are distinguished by their massive parameter
counts, which typically result in significant redundancy. This work introduces
MaskLLM, a learnable pruning method that establishes Semi-structured (or
``N:M'') Sparsity in LLMs, aimed at reducing computational overhead during
inference. Instead of developing a new importance criterion, MaskLLM explicitly
models N:M patterns as a learnable distribution through Gumbel Softmax
sampling. This approach facilitates end-to-end training on large-scale datasets
and offers two notable advantages: 1) High-quality Masks - our method
effectively scales to large datasets and learns accurate masks; 2)
Transferability - the probabilistic modeling of mask distribution enables the
transfer learning of sparsity across domains or tasks. We assessed MaskLLM
using 2:4 sparsity on various LLMs, including LLaMA-2, Nemotron-4, and GPT-3,
with sizes ranging from 843M to 15B parameters, and our empirical results show
substantial improvements over state-of-the-art methods. For instance, leading
approaches achieve a perplexity (PPL) of 10 or greater on Wikitext compared to
the dense model's 5.12 PPL, but MaskLLM achieves a significantly lower 6.72 PPL
solely by learning the masks with frozen weights. Furthermore, MaskLLM's
learnable nature allows customized masks for lossless application of 2:4
sparsity to downstream tasks or domains. Code is available at
\url{https://github.com/NVlabs/MaskLLM}.
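To make the core idea concrete, below is a minimal PyTorch-style sketch of the kind of mechanism the abstract describes: a learnable categorical distribution over the six valid 2:4 patterns per group of four weights, sampled with straight-through Gumbel-Softmax. This is an assumption-laden illustration, not the official NVlabs/MaskLLM code; the class name `Learnable24Mask`, the temperature `tau`, and the consecutive-grouping choice are all invented for the sketch.

```python
# Sketch only (not the official NVlabs/MaskLLM implementation):
# learn a categorical distribution over the 6 valid 2:4 patterns for each
# group of 4 weights and sample masks via Gumbel-Softmax (straight-through).
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

# All binary patterns of length 4 with exactly two nonzeros -> 6 candidates.
PATTERNS = torch.tensor(
    [p for p in itertools.product([0.0, 1.0], repeat=4) if sum(p) == 2]
)  # shape (6, 4)

class Learnable24Mask(nn.Module):
    """Hypothetical module: one learnable pattern distribution per 4-weight group."""
    def __init__(self, weight: torch.Tensor, tau: float = 1.0):
        super().__init__()
        assert weight.numel() % 4 == 0
        self.shape = weight.shape
        self.tau = tau
        # One logit vector over the 6 candidate patterns per group of 4 weights.
        self.logits = nn.Parameter(torch.zeros(weight.numel() // 4, 6))

    def forward(self, weight: torch.Tensor) -> torch.Tensor:
        # hard=True gives a discrete one-hot sample with straight-through gradients.
        onehot = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)  # (G, 6)
        mask = onehot @ PATTERNS.to(weight.device)                       # (G, 4)
        return weight * mask.reshape(self.shape)

# Usage: freeze the weights and train only the mask logits through the task loss.
linear = nn.Linear(16, 8)
masker = Learnable24Mask(linear.weight.detach())
sparse_w = masker(linear.weight.detach())  # exactly 2 of every 4 weights kept
```

At inference one would simply pick the most probable pattern per group to obtain a fixed 2:4 mask; per the abstract, MaskLLM learns such a mask distribution end-to-end on large corpora while keeping the original weights frozen.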
Related papers
- Triple Point Masking [49.39218611030084]
Existing 3D mask learning methods encounter performance bottlenecks under limited data.
We introduce a triple point masking scheme, named TPM, which serves as a scalable framework for pre-training of masked autoencoders.
Extensive experiments show that the four baselines equipped with the proposed TPM achieve comprehensive performance improvements on various downstream tasks.
arXiv Detail & Related papers (2024-09-26T05:33:30Z)
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs in about 2.7 hours and uses around 35GB of memory for 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- Random Masking Finds Winning Tickets for Parameter Efficient Fine-tuning [17.638387297838936]
Fine-tuning large language models (LLMs) can be costly.
Parameter-efficient fine-tuning (PEFT) addresses this problem by training only a fraction of the parameters, and its success reveals the expressiveness and flexibility of pretrained models.
This paper studies the limit of PEFT, by further simplifying its design and reducing the number of trainable parameters beyond standard setups.
We show that Random Masking is surprisingly effective: with a larger-than-expected learning rate, Random Masking can match the performance of standard PEFT algorithms on various tasks, using fewer trainable parameters.
arXiv Detail & Related papers (2024-05-04T07:44:18Z)
- SLM: End-to-end Feature Selection via Sparse Learnable Masks [12.081877372552606]
We propose a canonical approach for end-to-end feature selection that scales well with respect to both the feature dimension and the number of samples.
At the heart of SLM lies a simple but effective learnable sparse mask, which learns which features to select.
We derive a scaling mechanism that allows SLM to precisely control the number of features selected.
arXiv Detail & Related papers (2023-04-06T16:25:43Z)
- Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z)
- Mask Transfiner for High-Quality Instance Segmentation [95.74244714914052]
We present Mask Transfiner for high-quality and efficient instance segmentation.
Our approach only processes detected error-prone tree nodes and self-corrects their errors in parallel.
Our code and trained models will be available at http://vis.xyz/pub/transfiner.
arXiv Detail & Related papers (2021-11-26T18:58:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.