SliceGPT: Compress Large Language Models by Deleting Rows and Columns
- URL: http://arxiv.org/abs/2401.15024v2
- Date: Fri, 9 Feb 2024 17:59:40 GMT
- Title: SliceGPT: Compress Large Language Models by Deleting Rows and Columns
- Authors: Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento,
Torsten Hoefler, James Hensman
- Abstract summary: We present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network.
We show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B, and Phi-2 models while maintaining 99%, 99%, and 90% of the dense models' zero-shot task performance, respectively.
- Score: 27.004657436024853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have become the cornerstone of natural language
processing, but their use comes with substantial costs in terms of compute and
memory resources. Sparsification provides a solution to alleviate these
resource constraints, and recent works have shown that trained models can be
sparsified post-hoc. Existing sparsification techniques face challenges as they
need additional data structures and offer constrained speedup with current
hardware. In this paper we present SliceGPT, a new post-training sparsification
scheme which replaces each weight matrix with a smaller (dense) matrix,
reducing the embedding dimension of the network. Through extensive
experimentation, we show that SliceGPT can remove up to 25% of the model
parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models
while maintaining 99%, 99%, and 90% of the dense model's zero-shot task
performance, respectively. Our sliced models run on fewer GPUs and run faster without
any additional code optimization: on 24GB consumer GPUs we reduce the total
compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB
A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance
in transformer networks, which enables SliceGPT and we hope it will inspire and
enable future avenues to reduce memory and computation demands for pre-trained
models. Code is available at:
https://github.com/microsoft/TransformerCompression
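As a rough illustration of the computational-invariance idea, the following numpy toy (a sketch under simplifying assumptions, not the released implementation) inserts an orthogonal PCA rotation Q between two adjacent linear maps without changing their output, then truncates Q to its top-k columns so that both weight matrices shrink to width k:

```python
import numpy as np

# Toy setting: two adjacent linear maps exchange a d-dimensional signal.
d, k, n_cal = 64, 48, 512                   # signal width, sliced width, calibration samples
rng = np.random.default_rng(0)

X = rng.normal(size=(n_cal, 96))            # calibration inputs
W_out = rng.normal(size=(96, d)) * 0.1      # writes the d-dimensional signal
W_in = rng.normal(size=(d, 32)) * 0.1       # reads the d-dimensional signal

S = X @ W_out                               # the signal between the two maps
y_ref = S @ W_in

# Orthogonal Q from PCA of the signal covariance, highest variance first.
cov = S.T @ S / n_cal
_, Q = np.linalg.eigh(cov)                  # eigh returns ascending eigenvalues
Q = Q[:, ::-1]                              # reorder to descending variance

# Computational invariance: inserting Q Q^T between the maps changes nothing.
assert np.allclose(y_ref, (S @ Q) @ (Q.T @ W_in))

# Slicing: keep only the top-k principal directions; both matrices shrink,
# and the signal passed between them is now k-dimensional.
Qk = Q[:, :k]
W_out_sliced = W_out @ Qk                   # 96 x k
W_in_sliced = Qk.T @ W_in                   # k x 32
y_sliced = (X @ W_out_sliced) @ W_in_sliced

rel_err = np.linalg.norm(y_sliced - y_ref) / np.linalg.norm(y_ref)
print(f"relative error after slicing to {k}/{d} dimensions: {rel_err:.3f}")
```

The toy only shows why deleting the lowest-variance rows and columns after such a rotation is comparatively safe; the paper's procedure additionally absorbs the rotations into each transformer block's weights and handles LayerNorm/RMSNorm and the residual stream, which this sketch omits.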
Related papers
- Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients [24.58231358634904]
Large language model (LLM) training and finetuning are often bottlenecked by limited GPU memory.
We propose Grass (GRAdient Structured Sparsification), a novel approach that leverages sparse projections to transform gradients into structured sparse updates.
arXiv Detail & Related papers (2024-06-25T15:50:32Z)
- Scalable MatMul-free Language Modeling [8.672867887354977]
We show that MatMul operations can be completely eliminated from large language models.
Our proposed MatMul-free models achieve performance on par with state-of-the-art Transformers.
arXiv Detail & Related papers (2024-06-04T17:50:34Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing [10.47214968497857]
We present high-performance methods that exploit low-rank structures to pretrain and finetune large language models.
Our methods achieve a speedup of 1.3X and a model compression ratio of 2.64X for pretraining without accuracy drop.
For finetuning, our methods achieve an average accuracy increase of 6.3% on general tasks and 24.0% on financial tasks.
arXiv Detail & Related papers (2024-02-21T05:03:17Z)
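A generic sketch of the low-rank idea behind methods like the one above (illustrative only; the paper's specific pretraining and finetuning scheme is not reproduced here): a dense weight is replaced by the product of two thin factors, cutting both parameter count and matmul cost.

```python
import numpy as np

d_out, d_in, r = 1024, 1024, 32             # layer shape and an assumed rank
rng = np.random.default_rng(0)

# Rank-r factors trained in place of a dense (d_out x d_in) weight.
A = rng.normal(size=(d_out, r)) / np.sqrt(r)
B = rng.normal(size=(r, d_in)) / np.sqrt(d_in)

x = rng.normal(size=(d_in,))
y = A @ (B @ x)                             # two thin matmuls replace one dense matmul

dense_params = d_out * d_in                 # 1,048,576
low_rank_params = A.size + B.size           # 65,536 (16x fewer parameters)
print(y.shape, dense_params, low_rank_params)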
- Reusing Pretrained Models by Multi-linear Operators for Efficient Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model.
arXiv Detail & Related papers (2023-10-16T06:16:47Z)
- Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training.
We propose a new computation, LOw-Memory Optimization (LOMO), which fuses the gradient and the parameter update in one step to reduce memory usage.
arXiv Detail & Related papers (2023-06-16T11:37:15Z)
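A minimal sketch of the fused gradient/update idea from the entry above, assuming PyTorch 2.1+ (for register_post_accumulate_grad_hook) and plain SGD; the paper's LOMO implementation additionally handles gradient-norm scaling, loss scaling, and mixed precision.

```python
import torch
import torch.nn as nn

lr = 1e-3
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

def sgd_and_free(param: torch.Tensor) -> None:
    # Fused step: update the parameter as soon as its gradient is ready,
    # then drop the gradient so at most one full gradient tensor is alive.
    with torch.no_grad():
        param.add_(param.grad, alpha=-lr)
    param.grad = None

for p in model.parameters():
    # Fires during backward(), right after this parameter's gradient
    # has been accumulated (requires PyTorch >= 2.1).
    p.register_post_accumulate_grad_hook(sgd_and_free)

x, y = torch.randn(8, 512), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()   # parameters are updated inside backward(); no optimizer.step()
```

Because no full set of gradients or optimizer states is ever materialized, peak memory scales with the largest single parameter tensor rather than with the whole model.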
- EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [57.354304637367555]
We present EVEREST, a surprisingly efficient masked video autoencoder (MVA) approach for video representation learning.
It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning.
Our method significantly reduces the computation and memory requirements of MVA.
arXiv Detail & Related papers (2022-11-19T09:57:01Z)
- Nimble GNN Embedding with Tensor-Train Decomposition [10.726368002799765]
This paper describes a new method for representing embedding tables of graph neural networks (GNNs) more compactly via tensor-train (TT) decomposition.
In some cases, our model without explicit node features on input can even match the accuracy of models that use node features.
arXiv Detail & Related papers (2022-06-21T17:57:35Z)
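A two-core numpy sketch of the tensor-train idea from the entry above, under assumed factorizations of the node-index and embedding dimensions (the paper uses general TT ranks and index mappings tailored to graphs).

```python
import numpy as np

# Factor a (num_nodes x emb_dim) embedding table with num_nodes = I1 * I2
# and emb_dim = J1 * J2 into two small tensor-train cores.
I1, I2, J1, J2, r = 100, 100, 8, 8, 4       # 10,000 nodes, 64-dim embeddings, TT rank 4
rng = np.random.default_rng(0)

G1 = rng.normal(size=(I1, J1, r)) * 0.1     # first core
G2 = rng.normal(size=(r, I2, J2)) * 0.1     # second core

def tt_embedding(node_id: int) -> np.ndarray:
    """Materialize one row of the implicit embedding table on demand."""
    i1, i2 = divmod(node_id, I2)
    block = G1[i1] @ G2[:, i2, :]           # (J1 x r) @ (r x J2) -> (J1 x J2)
    return block.reshape(J1 * J2)

full_params = (I1 * I2) * (J1 * J2)         # 640,000 for the dense table
tt_params = G1.size + G2.size               # 6,400 for the two cores (100x smaller)
print(tt_embedding(1234).shape, full_params, tt_params)
```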
- Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
- Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
- Efficient Large-Scale Language Model Training on GPU Clusters [19.00915720435389]
Large language models have led to state-of-the-art accuracies across a range of tasks.
GPU memory capacity is limited, making it impossible to fit these large models on a single GPU.
The number of compute operations required to train these models can result in unrealistically long training times.
arXiv Detail & Related papers (2021-04-09T16:43:11Z)