MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models
- URL: http://arxiv.org/abs/2409.17481v1
- Date: Thu, 26 Sep 2024 02:37:41 GMT
- Title: MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models
- Authors: Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff
Pool, Jan Kautz, Pavlo Molchanov, Xinchao Wang
- Abstract summary: This work introduces MaskLLM, a learnable pruning method that establishes Semi-structured (or ``N:M'') Sparsity in Large Language Models.
MaskLLM explicitly models N:M patterns as a learnable distribution through Gumbel Softmax sampling.
- Score: 91.4190318047519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are distinguished by their massive parameter
counts, which typically result in significant redundancy. This work introduces
MaskLLM, a learnable pruning method that establishes Semi-structured (or
``N:M'') Sparsity in LLMs, aimed at reducing computational overhead during
inference. Instead of developing a new importance criterion, MaskLLM explicitly
models N:M patterns as a learnable distribution through Gumbel Softmax
sampling. This approach facilitates end-to-end training on large-scale datasets
and offers two notable advantages: 1) High-quality Masks - our method
effectively scales to large datasets and learns accurate masks; 2)
Transferability - the probabilistic modeling of mask distribution enables the
transfer learning of sparsity across domains or tasks. We assessed MaskLLM
using 2:4 sparsity on various LLMs, including LLaMA-2, Nemotron-4, and GPT-3,
with sizes ranging from 843M to 15B parameters, and our empirical results show
substantial improvements over state-of-the-art methods. For instance, leading
approaches achieve a perplexity (PPL) of 10 or greater on Wikitext compared to
the dense model's 5.12 PPL, but MaskLLM achieves a significantly lower 6.72 PPL
solely by learning the masks with frozen weights. Furthermore, MaskLLM's
learnable nature allows customized masks for lossless application of 2:4
sparsity to downstream tasks or domains. Code is available at
\url{https://github.com/NVlabs/MaskLLM}.
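To make the core idea concrete, below is a minimal PyTorch-style sketch of the kind of mechanism the abstract describes: a learnable categorical distribution over the six valid 2:4 patterns per group of four weights, sampled with straight-through Gumbel-Softmax. This is an assumption-laden illustration, not the official NVlabs/MaskLLM code; the class name `Learnable24Mask`, the temperature `tau`, and the consecutive-grouping choice are all invented for the sketch.

```python
# Sketch only (not the official NVlabs/MaskLLM implementation):
# learn a categorical distribution over the 6 valid 2:4 patterns for each
# group of 4 weights and sample masks via Gumbel-Softmax (straight-through).
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

# All binary patterns of length 4 with exactly two nonzeros -> 6 candidates.
PATTERNS = torch.tensor(
    [p for p in itertools.product([0.0, 1.0], repeat=4) if sum(p) == 2]
)  # shape (6, 4)

class Learnable24Mask(nn.Module):
    """Hypothetical module: one learnable pattern distribution per 4-weight group."""
    def __init__(self, weight: torch.Tensor, tau: float = 1.0):
        super().__init__()
        assert weight.numel() % 4 == 0
        self.shape = weight.shape
        self.tau = tau
        # One logit vector over the 6 candidate patterns per group of 4 weights.
        self.logits = nn.Parameter(torch.zeros(weight.numel() // 4, 6))

    def forward(self, weight: torch.Tensor) -> torch.Tensor:
        # hard=True gives a discrete one-hot sample with straight-through gradients.
        onehot = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)  # (G, 6)
        mask = onehot @ PATTERNS.to(weight.device)                       # (G, 4)
        return weight * mask.reshape(self.shape)

# Usage: freeze the weights and train only the mask logits through the task loss.
linear = nn.Linear(16, 8)
masker = Learnable24Mask(linear.weight.detach())
sparse_w = masker(linear.weight.detach())  # exactly 2 of every 4 weights kept
```

At inference one would simply pick the most probable pattern per group to obtain a fixed 2:4 mask; per the abstract, MaskLLM learns such a mask distribution end-to-end on large corpora while keeping the original weights frozen.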
Related papers
- Triple Point Masking [49.39218611030084]
Existing 3D mask learning methods encounter performance bottlenecks under limited data.
We introduce a triple point masking scheme, named TPM, which serves as a scalable framework for pre-training of masked autoencoders.
Extensive experiments show that the four baselines equipped with the proposed TPM achieve comprehensive performance improvements on various downstream tasks.
arXiv Detail & Related papers (2024-09-26T05:33:30Z)
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs in about 2.7 hours and uses around 35GB of memory for 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- Random Masking Finds Winning Tickets for Parameter Efficient Fine-tuning [17.638387297838936]
Fine-tuning large language models (LLMs) can be costly.
Parameter-efficient fine-tuning (PEFT) addresses this problem by training only a fraction of the parameters, and its success reveals the expressiveness and flexibility of pretrained models.
This paper studies the limit of PEFT, by further simplifying its design and reducing the number of trainable parameters beyond standard setups.
We show that Random Masking is surprisingly effective: with a larger-than-expected learning rate, Random Masking can match the performance of standard PEFT algorithms on various tasks, using fewer trainable parameters.
arXiv Detail & Related papers (2024-05-04T07:44:18Z)
- SLM: End-to-end Feature Selection via Sparse Learnable Masks [12.081877372552606]
We propose a canonical approach for end-to-end feature selection that scales well with respect to both the feature dimension and the number of samples.
At the heart of SLM lies a simple but effective learnable sparse mask, which learns which features to select.
We derive a scaling mechanism that allows SLM to precisely control the number of features selected.
arXiv Detail & Related papers (2023-04-06T16:25:43Z)
- Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z)
- Mask Transfiner for High-Quality Instance Segmentation [95.74244714914052]
We present Mask Transfiner for high-quality and efficient instance segmentation.
Our approach only processes detected error-prone tree nodes and self-corrects their errors in parallel.
Our code and trained models will be available at http://vis.xyz/pub/transfiner.
arXiv Detail & Related papers (2021-11-26T18:58:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.