ZipLM: Inference-Aware Structured Pruning of Language Models
- URL: http://arxiv.org/abs/2302.04089v2
- Date: Thu, 26 Oct 2023 06:42:40 GMT
- Title: ZipLM: Inference-Aware Structured Pruning of Language Models
- Authors: Eldar Kurtic, Elias Frantar, Dan Alistarh
- Abstract summary: We propose a novel structured compression approach for large language models (LLMs) called ZipLM.
ZipLM achieves state-of-the-art accuracy-vs-speedup trade-offs while matching a set of desired target runtime speedups.
ZipLM produces state-of-the-art compressed models across all settings.
- Score: 56.52030193434863
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The breakthrough performance of large language models (LLMs) comes with major
computational footprints and high deployment costs. In this paper, we progress
towards resolving this problem by proposing a novel structured compression
approach for LLMs, called ZipLM. ZipLM achieves state-of-the-art
accuracy-vs-speedup, while matching a set of desired target runtime speedups in
any given inference environment. Specifically, given a model, a dataset, an
inference environment, as well as a set of speedup targets, ZipLM iteratively
identifies and removes components with the worst loss-runtime trade-off. Unlike
prior methods that specialize in either the post-training/one-shot or the
gradual compression setting, and only for specific families of models such as
BERT (encoder) or GPT (decoder), ZipLM produces state-of-the-art compressed
models across all these settings. Furthermore, ZipLM achieves superior results
for a fraction of the computational cost relative to prior distillation and
pruning techniques, making it a cost-effective approach for generating an
entire family of smaller, faster, and highly accurate models, guaranteed to
meet the desired inference specifications. In particular, ZipLM outperforms all
prior BERT-base distillation and pruning techniques, such as CoFi, MiniLM, and
TinyBERT. Moreover, it matches the performance of the heavily optimized
MobileBERT model, obtained via extensive architecture search, by simply pruning
the baseline BERT-large model. When compressing GPT2, ZipLM outperforms
DistilGPT2 while being 60% smaller and 30% faster. Our code is available at:
https://github.com/IST-DASLab/ZipLM.
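The abstract describes an iterative, inference-aware loop: measure each prunable component's loss penalty and runtime saving in the target environment, remove the component with the worst loss-runtime trade-off, and stop once each requested speedup is reached. The sketch below only illustrates that outer loop; the helper callables (`loss_increase`, `runtime_gain`) and the greedy ratio score are assumptions for this example, not the authors' implementation, which is available in the linked repository.

```python
# Illustrative sketch of an inference-aware structured pruning loop in the
# spirit of the abstract: repeatedly drop the component (e.g. an attention
# head or FFN block) with the worst loss-increase / runtime-gain ratio until
# every target speedup is met. All helper names are assumptions.

def inference_aware_prune(components, loss_increase, runtime_gain,
                          base_runtime, target_speedups):
    """components: prunable units still in the model.
    loss_increase(c): estimated loss penalty for removing c.
    runtime_gain(c): runtime saved by removing c, measured in the
                     target inference environment.
    Returns one pruned component set per requested speedup target."""
    remaining = list(components)
    runtime = base_runtime
    snapshots = {}
    for target in sorted(target_speedups):            # e.g. [1.5, 2.0, 3.0]
        while base_runtime / runtime < target and remaining:
            # Greedy choice: smallest loss penalty per unit of runtime saved.
            victim = min(remaining,
                         key=lambda c: loss_increase(c) / max(runtime_gain(c), 1e-9))
            remaining.remove(victim)
            runtime = max(runtime - runtime_gain(victim), 1e-9)
        snapshots[target] = list(remaining)            # model meeting this target
    return snapshots


if __name__ == "__main__":
    # Toy example: 8 components with made-up loss/runtime profiles.
    comps = [f"block{i}" for i in range(8)]
    loss = {c: 0.1 * (i + 1) for i, c in enumerate(comps)}
    gain = {c: 1.0 for c in comps}
    print(inference_aware_prune(comps, loss.get, gain.get,
                                base_runtime=8.0, target_speedups=[1.5, 2.0]))
```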
Related papers
- LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment [36.958867918858296]
Large language models (LLMs) have demonstrated strong capabilities, but their high demand for computation and storage hinders practical application.
We present the Large Language Model Compression Benchmark (LLMCBench), a rigorously designed benchmark with an in-depth analysis for LLM compression algorithms.
arXiv Detail & Related papers (2024-10-28T14:45:01Z)
- Two are better than one: Context window extension with multi-grained self-injection [111.1376461868317]
SharedLLM is a novel approach grounded in the design philosophy of multi-grained context compression and query-aware information retrieval.
We introduce a specialized tree-style data structure to efficiently encode, store, and retrieve multi-grained contextual information for text chunks; a toy sketch of such a tree appears after this list.
arXiv Detail & Related papers (2024-10-25T06:08:59Z)
- Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations.
Post-training pruning methods have been proposed to prune LLMs in one shot without retraining.
Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z)
- EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search [33.86918407429272]
We propose a new and general approach for dynamic compression that is provably optimal in a given input range.
We show that these theoretical guarantees lead to highly competitive practical performance for dynamic compression of Llama, Mistral and Phi models.
arXiv Detail & Related papers (2024-10-18T17:46:37Z)
- SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching [32.4599581528901]
"Two-tower" architecture is used for compressing pre-trained LLM parameters into compact representations and fine-tuning the additive full-precision adapter.
We propose SpaLLM (Sketched Adapting of LLMs), a novel compressive adaptation approach for LLMs.
SpaLLM sketches pre-trained LLM weights into lookup tables and directly fine-tunes the values in these tables; a shared-lookup-table toy example is sketched after this list.
arXiv Detail & Related papers (2024-10-08T20:58:24Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but its memory footprint grows with batch size and sequence length.
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of the KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining; a generic truncated-SVD factorization sketch is given after this list.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging [14.123313596780726]
We propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA).
MKA uses manifold learning and the Normalized Pairwise Information Bottleneck measure to merge similar layers, reducing model size while preserving essential performance (see the simplified layer-merging sketch after this list).
Our findings show that MKA not only preserves model performance but also achieves substantial compression ratios, outperforming traditional pruning methods.
arXiv Detail & Related papers (2024-06-24T05:57:55Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval; a toy residual-quantization example appears after this list.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method in which the compressed model is exposed to the prompt learning process (a minimal soft-prompt training sketch follows this list).
Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
arXiv Detail & Related papers (2023-05-17T20:45:13Z)
- You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient [88.58536093633167]
Existing model compression approaches require re-compression or fine-tuning across diverse constraints to accommodate various hardware deployments.
We propose a novel approach, YOCO-BERT, to achieve compress once and deploy everywhere.
Compared with state-of-the-art algorithms, YOCO-BERT provides more compact models while achieving a 2.1%-4.5% average accuracy improvement on the GLUE benchmark.
arXiv Detail & Related papers (2021-06-04T12:17:44Z)
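SharedLLM (second entry above) mentions a tree-style data structure for multi-grained context. The toy class below is only a guess at what such a structure could look like, with word-overlap scoring standing in for the paper's query-aware retrieval; it is not the authors' implementation.

```python
# Toy tree of text chunks at coarse-to-fine granularity, loosely inspired by
# the "tree-style data structure" in the SharedLLM summary above. Word-overlap
# scoring is a stand-in for query-aware retrieval; not the authors' code.

from dataclasses import dataclass, field


@dataclass
class ChunkNode:
    text: str                                        # summary at this granularity
    children: list = field(default_factory=list)     # finer-grained chunks

    def retrieve(self, query, budget=2):
        """Descend into the children whose text overlaps the query most and
        return up to `budget` fine-grained chunks."""
        if not self.children:
            return [self.text]
        q = set(query.lower().split())
        ranked = sorted(self.children,
                        key=lambda c: -len(q & set(c.text.lower().split())))
        hits = []
        for child in ranked[:budget]:
            hits.extend(child.retrieve(query, budget))
        return hits[:budget]


if __name__ == "__main__":
    doc = ChunkNode("paper on model compression", [
        ChunkNode("section on pruning",
                  [ChunkNode("structured pruning removes whole heads or layers")]),
        ChunkNode("section on quantization",
                  [ChunkNode("4-bit weights stored with shared codebooks")]),
    ])
    print(doc.retrieve("how does pruning work"))
```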
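SpaLLM is summarized above as sketching pre-trained weights into lookup tables whose values are then fine-tuned. The snippet below shows a generic shared-lookup-table layout using a random index hash, an assumption made for illustration; it is not SpaLLM's actual sketching algorithm.

```python
# Shared-lookup-table illustration: every position of a weight matrix is
# mapped (here via a random index hash, an assumption for this toy) to one of
# a small number of table slots; training would then update only the table
# values. This shows the data layout, not SpaLLM's sketching scheme.

import numpy as np


def sketch_to_table(W, table_size=512, seed=0):
    """Map each weight position to a slot; initialize each slot with the mean
    of the original weights assigned to it."""
    rng = np.random.default_rng(seed)
    slots = rng.integers(0, table_size, size=W.shape)   # slot shared by each weight
    table = np.zeros(table_size)
    counts = np.zeros(table_size)
    np.add.at(table, slots, W)
    np.add.at(counts, slots, 1.0)
    return table / np.maximum(counts, 1.0), slots


def expand(table, slots):
    """Rebuild a full-shape weight matrix from the shared table values."""
    return table[slots]


if __name__ == "__main__":
    W = np.random.default_rng(1).standard_normal((256, 256))
    table, slots = sketch_to_table(W)
    W_shared = expand(table, slots)
    print(f"{W.size} weights now share {table.size} trainable values "
          f"({W.size / table.size:.0f}x fewer parameters); shape kept: {W_shared.shape}")
```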
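LoRC's summary mentions a low-rank approximation of the KV weight matrices that plugs into a trained model without retraining. The sketch below uses a plain truncated SVD with made-up shapes and rank to show how such a retraining-free low-rank replacement is generally constructed; the paper's progressive compression strategy is not reproduced here.

```python
# Generic truncated-SVD factorization of a key projection matrix, illustrating
# the kind of low-rank, retraining-free replacement the LoRC summary above
# refers to. Shapes and rank are made up for the example.

import numpy as np


def low_rank_factor(W, rank):
    """Return (A, B) with A @ B ~= W, A: (d_in, rank), B: (rank, d_out)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # fold singular values into A
    B = Vt[:rank, :]
    return A, B


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_head, rank = 512, 64, 16
    W_k = rng.standard_normal((d_model, d_head))      # toy key projection
    A, B = low_rank_factor(W_k, rank)
    x = rng.standard_normal((8, d_model))             # 8 token activations
    # Cache the rank-16 intermediate instead of the full key vectors, then
    # finish the projection with B when the keys are actually needed.
    cached = x @ A                                    # (8, rank)
    approx_keys = cached @ B                          # (8, d_head)
    exact_keys = x @ W_k
    err = np.linalg.norm(approx_keys - exact_keys) / np.linalg.norm(exact_keys)
    print(f"relative error at rank {rank}: {err:.3f}")
```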
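MKA's summary describes merging similar layers. The sketch below substitutes cosine similarity of layer outputs on probe data for the paper's manifold-learning and Normalized Pairwise Information Bottleneck measure, and merges by naive weight averaging, so it conveys only the overall shape of such a procedure.

```python
# Minimal layer-merging sketch: find the most similar pair of adjacent layers
# (judged here by cosine similarity of their outputs on probe data, a crude
# stand-in for MKA's manifold/information-bottleneck measure) and replace the
# pair with the average of their weights. Purely illustrative.

import numpy as np


def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def merge_most_similar(layers, probe):
    """layers: list of weight matrices applied as x @ W (all the same shape).
    probe: activations used to compare layer behaviour."""
    outputs = [probe @ W for W in layers]
    # Score adjacent pairs by output similarity.
    scores = [cosine(outputs[i], outputs[i + 1]) for i in range(len(layers) - 1)]
    i = int(np.argmax(scores))
    merged = (layers[i] + layers[i + 1]) / 2.0        # naive weight averaging
    return layers[:i] + [merged] + layers[i + 2:], i


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    base = rng.standard_normal((32, 32))
    layers = [base + 0.01 * rng.standard_normal((32, 32)) for _ in range(4)]
    probe = rng.standard_normal((16, 32))
    smaller, merged_at = merge_most_similar(layers, probe)
    print(f"merged layers {merged_at} and {merged_at + 1}; now {len(smaller)} layers")
```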
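AQLM's summary refers to Additive Quantization, in which each group of weights is represented as a sum of codewords, one from each of several codebooks. The toy below fits codebooks greedily on residuals with a naive k-means; AQLM optimizes codes and codebooks jointly, so this only demonstrates the representation, not the paper's algorithm.

```python
# Toy residual/additive quantization: each weight group is encoded as a sum of
# one codeword per codebook, a simplified stand-in for the Additive
# Quantization scheme the AQLM entry refers to.

import numpy as np


def kmeans(X, k, iters=10, seed=0):
    """Very small k-means used only to build toy codebooks."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return centers


def fit_additive(X, num_codebooks=4, k=16):
    """Greedy residual fit: codebook m quantizes what the first m missed."""
    residual = X.copy()
    codebooks, codes = [], []
    for m in range(num_codebooks):
        C = kmeans(residual, k, seed=m)
        idx = ((residual[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        codebooks.append(C)
        codes.append(idx)
        residual = residual - C[idx]
    return codebooks, np.stack(codes, 1)


def decode(codebooks, codes):
    return sum(C[codes[:, m]] for m, C in enumerate(codebooks))


if __name__ == "__main__":
    groups = np.random.default_rng(0).standard_normal((1024, 8))  # weight groups
    cbs, codes = fit_additive(groups)
    rec = decode(cbs, codes)
    err = np.linalg.norm(groups - rec) / np.linalg.norm(groups)
    print(f"codes per group: {codes.shape[1]}, relative error {err:.2f}")
```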
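The "Compress, Then Prompt" summary proposes learning a soft prompt on top of a compressed model. The PyTorch toy below freezes a small stand-in network and trains only a prepended prompt embedding; the network, dimensions, and objective are placeholders, not the paper's setup.

```python
# Toy soft-prompt setup: a learnable prompt is prepended to the input
# embeddings of a frozen "compressed" model and only the prompt is trained,
# mirroring the idea in the "Compress, Then Prompt" summary above.

import torch
import torch.nn as nn


class SoftPrompted(nn.Module):
    def __init__(self, frozen_model, embed_dim, prompt_len=8):
        super().__init__()
        self.model = frozen_model
        for p in self.model.parameters():
            p.requires_grad_(False)                    # compressed model stays fixed
        self.prompt = nn.Parameter(0.01 * torch.randn(prompt_len, embed_dim))

    def forward(self, token_embeds):                   # (batch, seq, embed_dim)
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.model(torch.cat([prompt, token_embeds], dim=1))


if __name__ == "__main__":
    embed_dim = 32
    frozen = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                           nn.Linear(embed_dim, 1))     # stand-in for a compressed LLM
    model = SoftPrompted(frozen, embed_dim)
    opt = torch.optim.Adam([model.prompt], lr=1e-2)     # train the prompt only
    x = torch.randn(4, 10, embed_dim)
    target = torch.zeros(4, 18, 1)                      # 8 prompt + 10 input positions
    for _ in range(20):
        opt.zero_grad()
        loss = ((model(x) - target) ** 2).mean()
        loss.backward()
        opt.step()
    print(f"final toy loss: {loss.item():.4f}")
```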
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.