SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
- URL: http://arxiv.org/abs/2301.00774v3
- Date: Wed, 22 Mar 2023 12:33:46 GMT
- Title: SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
- Authors: Elias Frantar, Dan Alistarh
- Abstract summary: We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot.
This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models.
We can execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours.
- Score: 29.284147465251685
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We show for the first time that large-scale generative pretrained transformer
(GPT) family models can be pruned to at least 50% sparsity in one-shot, without
any retraining, at minimal loss of accuracy. This is achieved via a new pruning
method called SparseGPT, specifically designed to work efficiently and
accurately on massive GPT-family models. We can execute SparseGPT on the
largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5
hours, and can reach 60% unstructured sparsity with negligible increase in
perplexity: remarkably, more than 100 billion weights from these models can be
ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and
4:8) patterns, and is compatible with weight quantization approaches. The code
is available at: https://github.com/IST-DASLab/sparsegpt.
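The 2:4 and 4:8 patterns mentioned above are semi-structured sparsity constraints: within every group of four (or eight) consecutive weights, only two (or four) may be non-zero, a layout that GPU sparse kernels can accelerate. The sketch below only illustrates what the 2:4 constraint looks like, using plain magnitude selection within each group; it is not the SparseGPT algorithm itself, which selects and reconstructs the surviving weights with approximate second-order information from calibration data.

```python
# Illustrative only: enforce a 2:4 semi-structured sparsity pattern by keeping
# the two largest-magnitude weights in every contiguous group of four per row.
# SparseGPT itself additionally reconstructs the surviving weights against
# calibration data using approximate second-order information (omitted here).
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    rows, cols = weight.shape
    assert cols % 4 == 0, "2:4 pruning expects row length divisible by 4"
    groups = weight.reshape(rows, cols // 4, 4)                  # (rows, n_groups, 4)
    drop = groups.abs().topk(2, dim=-1, largest=False).indices   # 2 smallest per group
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop, 0.0)                                 # zero the two smallest
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w_sparse = prune_2_4(w)
print((w_sparse == 0).float().mean().item())  # 0.5 -> exactly 50% sparsity in 2:4 layout
```

On Ampere-class GPUs, weights stored in this 2:4 layout can be multiplied with hardware support for sparsity, which is why the pattern matters beyond the raw 50% sparsity level.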
Related papers
- Optimizing Large Model Training through Overlapped Activation Recomputation [24.461674158317578]
Existing recomputation approaches may incur up to 40% overhead when training real-world models.
This is because they are executed on demand in the critical training path.
We design a new recomputation framework, Lynx, to reduce the overhead by overlapping the recomputation with communication occurring in training pipelines.
arXiv Detail & Related papers (2024-06-13T02:31:36Z)
- Non-Vacuous Generalization Bounds for Large Language Models [78.42762571499061]
We provide the first non-vacuous generalization bounds for pretrained large language models.
We show that larger models have better generalization bounds and are more compressible than smaller models.
arXiv Detail & Related papers (2023-12-28T17:58:42Z)
- Pruning Large Language Models via Accuracy Predictor [0.0]
Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks.
We propose a novel pruning approach: first, a training set of architecture-accuracy pairs is built, and then a non-neural model is trained as an accuracy predictor; a rough illustrative sketch follows below.
arXiv Detail & Related papers (2023-09-18T06:38:24Z)
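As a rough illustration of the accuracy-predictor idea in the entry above: the architecture descriptors, synthetic data, and choice of a gradient-boosted regressor below are assumptions for demonstration, not details taken from the paper.

```python
# Illustrative sketch of an accuracy predictor: fit a non-neural regressor on
# (architecture-descriptor, accuracy) pairs, then query it to rank candidate
# pruned architectures without evaluating each one. Features, data, and model
# choice here are assumptions, not the paper's setup.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical descriptors: per-layer sparsity ratios for a 12-layer model.
X_train = rng.uniform(0.0, 0.8, size=(200, 12))
# Hypothetical measured accuracies for those configurations (synthetic here).
y_train = 0.75 - 0.2 * X_train.mean(axis=1) + rng.normal(0, 0.01, size=200)

predictor = GradientBoostingRegressor(n_estimators=200, max_depth=3)
predictor.fit(X_train, y_train)

# Rank new candidate sparsity configurations by predicted accuracy.
candidates = rng.uniform(0.0, 0.8, size=(5, 12))
print(predictor.predict(candidates))
```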
- Efficient GPT Model Pre-training using Tensor Train Matrix Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure; a rough sketch of this idea appears below.
The resulting GPT-based model stores up to 40% fewer parameters while showing perplexity comparable to the original model.
arXiv Detail & Related papers (2023-06-05T08:38:25Z)
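As a rough illustration of the Tensor Train Matrix idea in the entry above (the mode sizes and TT-rank below are hypothetical; the paper's actual factorization of GPT-2's layers differs in its details):

```python
# Illustrative two-core TT-matrix: a dense weight W of shape (m1*m2, n1*n2) is
# represented by cores G1 (m1, n1, r) and G2 (r, m2, n2), storing
# m1*n1*r + r*m2*n2 numbers instead of m1*m2*n1*n2. Shapes and rank are
# hypothetical, chosen only to show the parameter saving.
import torch

m1, n1, m2, n2, r = 32, 32, 24, 96, 8      # hypothetical mode sizes and TT-rank
G1 = torch.randn(m1, n1, r)
G2 = torch.randn(r, m2, n2)

# Reconstruct the full matrix: W[i1*m2+i2, j1*n2+j2] = sum_a G1[i1,j1,a] * G2[a,i2,j2]
W = torch.einsum('ija,akl->ikjl', G1, G2).reshape(m1 * m2, n1 * n2)

dense_params = W.numel()                   # 2,359,296 for a 768 x 3072 layer
ttm_params = G1.numel() + G2.numel()       # 8,192 + 18,432 = 26,624
print(W.shape, dense_params, ttm_params)
```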
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers [34.91478831993398]
GPTQ is a new one-shot weight quantization method based on approximate second-order information.
It can quantize GPT models with 175 billion parameters in approximately four GPU hours.
Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods.
arXiv Detail & Related papers (2022-10-31T13:42:40Z)
- CrAM: A Compression-Aware Minimizer [103.29159003723815]
We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way.
CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning.
CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
arXiv Detail & Related papers (2022-07-28T16:13:28Z)
- OPT: Open Pre-trained Transformer Language Models [99.60254017109551]
We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters.
We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop.
arXiv Detail & Related papers (2022-05-02T17:49:50Z)
- ByT5 model for massively multilingual grapheme-to-phoneme conversion [13.672109728462663]
We tackle massively multilingual grapheme-to-phoneme conversion by implementing G2P models based on ByT5.
We found that ByT5 operating on byte-level inputs significantly outperformed the token-based mT5 model in terms of multilingual G2P.
arXiv Detail & Related papers (2022-04-06T20:03:38Z)
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
It consumes only 1/3 of the energy used to train GPT-3 and requires half of the flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
arXiv Detail & Related papers (2021-12-13T18:58:19Z)
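The GLaM entry above relies on sparse activation: each token is routed to only a small number of experts, so total capacity grows with the number of experts while per-token compute stays roughly flat. The sketch below is a generic top-2 gated mixture-of-experts layer with assumed dimensions, not GLaM's actual architecture.

```python
# Generic top-2 gated mixture-of-experts layer (illustrative; dimensions and
# routing details are assumptions, not GLaM's design). Each token runs through
# only 2 of the n_experts feed-forward blocks, so adding experts grows capacity
# without growing per-token compute proportionally.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                            # x: (n_tokens, d_model)
        top_scores, top_idx = self.gate(x).topk(self.k, dim=-1)
        weights = torch.softmax(top_scores, dim=-1)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = top_idx[:, slot] == e           # tokens routed to expert e in this slot
                if sel.any():
                    out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)   # torch.Size([10, 64])
```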
- DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z)
- Kronecker Decomposition for GPT Compression [8.60086973058282]
GPT is an auto-regressive Transformer-based pre-trained language model which has attracted a lot of attention in the natural language processing (NLP) domain.
Despite its superior performance, GPT can be prohibitively expensive to deploy on devices with limited computational power or memory.
In this work, we use Kronecker decomposition to compress the linear mappings of the GPT-2 model; a rough sketch of the idea appears below.
arXiv Detail & Related papers (2021-10-15T15:28:39Z)
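As a minimal sketch of the Kronecker idea in the entry above (the factor shapes are hypothetical, and the fit shown is the generic nearest-Kronecker-product construction rather than necessarily the paper's exact procedure):

```python
# Illustrative Kronecker-factored compression of a linear layer: approximate a
# dense W of shape (m1*m2, n1*n2) by A ⊗ B with A (m1 x n1) and B (m2 x n2),
# so only m1*n1 + m2*n2 numbers are stored. Shapes here are hypothetical.
import torch

m1, n1, m2, n2 = 32, 32, 24, 24
W = torch.randn(m1 * m2, n1 * n2)           # dense 768 x 768 layer: 589,824 params

# Nearest Kronecker product (Van Loan-Pitsianis): rearrange W so that the best
# A ⊗ B corresponds to the best rank-1 approximation of the rearranged matrix.
R = W.reshape(m1, m2, n1, n2).permute(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
U, S, Vh = torch.linalg.svd(R, full_matrices=False)
A = (S[0].sqrt() * U[:, 0]).reshape(m1, n1)
B = (S[0].sqrt() * Vh[0, :]).reshape(m2, n2)

W_hat = torch.kron(A, B)                     # reconstructed 768 x 768 matrix
print(W.numel(), A.numel() + B.numel())      # 589,824 dense params vs 1,600 factored
# For a random W the approximation error is large; in practice the factors are
# fit to (or trained in place of) real layer weights.
```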
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.