Just CHOP: Embarrassingly Simple LLM Compression
- URL: http://arxiv.org/abs/2305.14864v3
- Date: Tue, 9 Jul 2024 21:09:38 GMT
- Title: Just CHOP: Embarrassingly Simple LLM Compression
- Authors: Ananya Harsh Jha, Tom Sherborne, Evan Pete Walsh, Dirk Groeneveld, Emma Strubell, Iz Beltagy,
- Abstract summary: Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint.
We show that simple layer pruning coupled with an extended language model pretraining produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale.
We also show how distillation, which has been super effective in task-agnostic compression of smaller BERT-style models, becomes inefficient against our simple pruning technique.
- Score: 27.64461490974072
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint. A growing assortment of methods for compression promises to reduce the computational burden of LLMs in deployment, but so far, only quantization approaches have been demonstrated to be effective for LLM compression while maintaining zero-shot performance. A critical step in the compression process, the pretrain-then-finetune paradigm, has largely been overlooked when adapting existing pruning strategies to LLMs or proposing new ones. In this work, we show that embarrassingly simple layer pruning coupled with an extended language model pretraining as the finetuning phase produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale while being more inference efficient. We call this method LayerChop, where we deterministically remove layers from a model followed by task-agnostic finetuning of the remaining weights by continued self-supervised pretraining. At this scale, we also show how distillation, which has been super effective in task-agnostic compression of smaller BERT-style models, becomes inefficient against our simple pruning technique.
Related papers
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs)
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Language Models as Zero-shot Lossless Gradient Compressors: Towards
General Neural Parameter Prior Models [66.1595537904019]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
arXiv Detail & Related papers (2024-09-26T13:38:33Z) - Pruning Large Language Models with Semi-Structural Adaptive Sparse Training [17.381160429641316]
We propose a pruning pipeline for semi-structured sparse models via retraining, termed Adaptive Sparse Trainer (AST)
AST transforms dense models into sparse ones by applying decay to masked weights while allowing the model to adaptively select masks throughout the training process.
Our work demonstrates the feasibility of deploying semi-structured sparse large language models and introduces a novel method for achieving highly compressed models.
arXiv Detail & Related papers (2024-07-30T06:33:44Z) - In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs)
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z) - SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
Sparsity-Preserved efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z) - Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization [40.15915011575071]
Low-rank compression is a promising technique to reduce non-essential parameters in large language models.
We conduct empirical research on the low-rank characteristics of large models.
We propose a low-rank compression method suitable for large language models.
arXiv Detail & Related papers (2024-05-17T08:27:12Z) - MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with
Module-wise Pruning Error Metric [57.3330687266266]
We find that using smaller pre-trained models and applying magnitude-based pruning on CLIP models leads to inflexibility and inferior performance.
Using the Module-wise Pruning Error (MoPE) metric, we introduce a unified pruning framework applicable to both pre-training and task-specific fine-tuning compression stages.
arXiv Detail & Related papers (2024-03-12T17:24:26Z) - Rethinking Compression: Reduced Order Modelling of Latent Features in
Large Language Models [9.91972450276408]
This paper introduces an innovative approach for the parametric and practical compression of Large Language Models (LLMs) based on reduced order modelling.
Our method represents a significant advancement in model compression by leveraging matrix decomposition, demonstrating superior efficacy compared to the prevailing state-of-the-art structured pruning method.
arXiv Detail & Related papers (2023-12-12T07:56:57Z) - Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the textithomogeneous word embeddings
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.