ZipLM: Inference-Aware Structured Pruning of Language Models
- URL: http://arxiv.org/abs/2302.04089v2
- Date: Thu, 26 Oct 2023 06:42:40 GMT
- Title: ZipLM: Inference-Aware Structured Pruning of Language Models
- Authors: Eldar Kurtic, Elias Frantar, Dan Alistarh
- Abstract summary: We propose a novel structured compression approach for large language models (LLMs) called ZipLM.
ZipLM achieves state-of-the-art accuracy-vs-speedup trade-offs while matching a set of desired target runtime speedups.
ZipLM produces state-of-the-art compressed models across all settings.
- Score: 56.52030193434863
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The breakthrough performance of large language models (LLMs) comes with major
computational footprints and high deployment costs. In this paper, we progress
towards resolving this problem by proposing a novel structured compression
approach for LLMs, called ZipLM. ZipLM achieves state-of-the-art
accuracy-vs-speedup, while matching a set of desired target runtime speedups in
any given inference environment. Specifically, given a model, a dataset, an
inference environment, as well as a set of speedup targets, ZipLM iteratively
identifies and removes components with the worst loss-runtime trade-off. Unlike
prior methods that specialize in either the post-training/one-shot or the
gradual compression setting, and only for specific families of models such as
BERT (encoder) or GPT (decoder), ZipLM produces state-of-the-art compressed
models across all these settings. Furthermore, ZipLM achieves superior results
for a fraction of the computational cost relative to prior distillation and
pruning techniques, making it a cost-effective approach for generating an
entire family of smaller, faster, and highly accurate models, guaranteed to
meet the desired inference specifications. In particular, ZipLM outperforms all
prior BERT-base distillation and pruning techniques, such as CoFi, MiniLM, and
TinyBERT. Moreover, it matches the performance of the heavily optimized
MobileBERT model, obtained via extensive architecture search, by simply pruning
the baseline BERT-large model. When compressing GPT2, ZipLM outperforms
DistilGPT2 while being 60% smaller and 30% faster. Our code is available at:
https://github.com/IST-DASLab/ZipLM.
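The abstract describes ZipLM's selection rule only in words: given per-component runtimes measured in the target inference environment and a set of speedup targets, it iteratively removes the component with the worst loss-runtime trade-off. The following is a minimal sketch of one natural reading of that greedy loop; the names (Component, evaluate_loss, prune_to_speedup) and the toy loss model are illustrative assumptions, not the actual ZipLM API from the repository linked above.

```python
# Minimal sketch of an inference-aware structured pruning loop in the spirit of
# the abstract: repeatedly drop the structural unit (attention head, FFN block, ...)
# whose removal adds the least loss per millisecond of runtime saved, until the
# measured speedup target is reached. Placeholder names; not the ZipLM codebase.
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class Component:
    name: str          # e.g. "layer3.head7"
    runtime_ms: float  # latency contribution measured in the target environment


def prune_to_speedup(
    components: List[Component],
    evaluate_loss: Callable[[Set[str]], float],  # loss with the given components removed
    target_speedup: float,
) -> Set[str]:
    base_runtime = sum(c.runtime_ms for c in components)
    current_runtime = base_runtime
    current_loss = evaluate_loss(set())
    removed: Set[str] = set()
    remaining = list(components)

    while base_runtime / current_runtime < target_speedup and remaining:
        best, best_score, best_loss = None, float("inf"), None
        for c in remaining:
            trial_loss = evaluate_loss(removed | {c.name})
            # loss increase per unit of runtime saved: lower is a better candidate
            score = (trial_loss - current_loss) / max(c.runtime_ms, 1e-6)
            if score < best_score:
                best, best_score, best_loss = c, score, trial_loss
        removed.add(best.name)
        remaining.remove(best)
        current_runtime -= best.runtime_ms
        current_loss = best_loss

    return removed


# Toy usage with made-up runtimes and a synthetic additive loss penalty.
comps = [Component("head_0", 2.0), Component("head_1", 1.0), Component("ffn_0", 5.0)]
penalty = {"head_0": 0.30, "head_1": 0.05, "ffn_0": 0.50}
print(prune_to_speedup(comps, lambda r: 1.0 + sum(penalty[n] for n in r), 1.5))
```

Normalizing the loss increase by the measured runtime saving is what makes the choice inference-aware: the same component can be cheap to remove on one GPU and expensive on another, which is why the abstract ties the procedure to a specific inference environment and why one run can yield a whole family of models, one per speedup target.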
Related papers
- Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging [14.123313596780726]
We propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA).
MKA uses manifold learning and the Normalized Pairwise Information Bottleneck measure to merge similar layers, reducing model size while preserving essential performance.
Our findings show that MKA not only preserves model performance but also achieves substantial compression ratios, outperforming traditional pruning methods.
arXiv Detail & Related papers (2024-06-24T05:57:55Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
AQLM is the first scheme that is optimal in terms of accuracy-vs-model-size when compressing to less than 3 bits per parameter.
We provide fast GPU and CPU implementations of AQLM for token generation.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models [22.06402870816756]
Large language models (LLMs) have been applied in various applications due to their astonishing capabilities.
This paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity.
We show that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss.
arXiv Detail & Related papers (2023-10-09T14:10:21Z)
- Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
The Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK) aims to redefine the evaluation protocol for compressed Large Language Models.
LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods.
LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z)
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without any performance degradation, while providing a 15% speedup.
arXiv Detail & Related papers (2023-06-05T17:53:28Z)
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method where we expose the compressed model to the prompt learning process.
Our experimental analysis suggests this soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model (a generic soft-prompt sketch appears after this list).
arXiv Detail & Related papers (2023-05-17T20:45:13Z)
- You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient [88.58536093633167]
Existing model compression approaches require re-compression or fine-tuning across diverse constraints to accommodate various hardware deployments.
We propose a novel approach, YOCO-BERT, to compress once and deploy everywhere.
Compared with state-of-the-art algorithms, YOCO-BERT provides more compact models while achieving a 2.1%-4.5% average accuracy improvement on the GLUE benchmark.
arXiv Detail & Related papers (2021-06-04T12:17:44Z)
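The "Compress, Then Prompt" entry above recovers accuracy by learning a soft prompt while the compressed model stays fixed. Below is a minimal, generic soft-prompt tuning sketch using PyTorch and Hugging Face transformers; it is not the paper's code, and the model name, prompt length, and optimizer settings are placeholder assumptions (an uncompressed GPT-2 stands in for a compressed checkpoint).

```python
# Generic soft-prompt tuning sketch: freeze the (compressed) model and train only a
# short block of prompt embeddings prepended to the input embeddings.
# Assumptions: "gpt2" stands in for a compressed checkpoint; hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token                       # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)
for p in model.parameters():                        # the compressed model stays frozen
    p.requires_grad = False

embed = model.get_input_embeddings()
n_prompt = 16                                       # number of soft-prompt vectors (assumption)
soft_prompt = torch.nn.Parameter(torch.randn(n_prompt, embed.embedding_dim) * 0.02)
opt = torch.optim.AdamW([soft_prompt], lr=1e-3)     # only the prompt is optimized


def train_step(texts):
    batch = tok(texts, return_tensors="pt", padding=True)
    tok_emb = embed(batch["input_ids"])                                # (B, T, D)
    prompt = soft_prompt.unsqueeze(0).expand(tok_emb.size(0), -1, -1)  # (B, P, D)
    inputs_embeds = torch.cat([prompt, tok_emb], dim=1)                # prepend prompt
    # ignore loss on prompt positions and padding (-100 is the ignore index)
    labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)
    labels = torch.cat(
        [torch.full((tok_emb.size(0), n_prompt), -100, dtype=torch.long), labels], dim=1)
    attn = torch.cat(
        [torch.ones(tok_emb.size(0), n_prompt, dtype=torch.long), batch["attention_mask"]], dim=1)
    loss = model(inputs_embeds=inputs_embeds, attention_mask=attn, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()


print(train_step(["Soft prompts can recover accuracy lost to compression."]))
```

Only the n_prompt x embedding-dim prompt matrix is trained, so this recovery step is far cheaper than fine-tuning the compressed model itself.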
This list is automatically generated from the titles and abstracts of the papers on this site.