ZipLM: Inference-Aware Structured Pruning of Language Models
- URL: http://arxiv.org/abs/2302.04089v2
- Date: Thu, 26 Oct 2023 06:42:40 GMT
- Title: ZipLM: Inference-Aware Structured Pruning of Language Models
- Authors: Eldar Kurtic, Elias Frantar, Dan Alistarh
- Abstract summary: We propose a novel structured compression approach for large language models (LLMs) called ZipLM.
ZipLM achieves state-of-the-art accuracy-vs-speedup trade-offs while matching a set of desired target runtime speedups.
ZipLM produces state-of-the-art compressed models across all settings.
- Score: 56.52030193434863
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The breakthrough performance of large language models (LLMs) comes with major
computational footprints and high deployment costs. In this paper, we progress
towards resolving this problem by proposing a novel structured compression
approach for LLMs, called ZipLM. ZipLM achieves state-of-the-art
accuracy-vs-speedup, while matching a set of desired target runtime speedups in
any given inference environment. Specifically, given a model, a dataset, an
inference environment, as well as a set of speedup targets, ZipLM iteratively
identifies and removes components with the worst loss-runtime trade-off. Unlike
prior methods that specialize in either the post-training/one-shot or the
gradual compression setting, and only for specific families of models such as
BERT (encoder) or GPT (decoder), ZipLM produces state-of-the-art compressed
models across all these settings. Furthermore, ZipLM achieves superior results
for a fraction of the computational cost relative to prior distillation and
pruning techniques, making it a cost-effective approach for generating an
entire family of smaller, faster, and highly accurate models, guaranteed to
meet the desired inference specifications. In particular, ZipLM outperforms all
prior BERT-base distillation and pruning techniques, such as CoFi, MiniLM, and
TinyBERT. Moreover, it matches the performance of the heavily optimized
MobileBERT model, obtained via extensive architecture search, by simply pruning
the baseline BERT-large model. When compressing GPT2, ZipLM outperforms
DistilGPT2 while being 60% smaller and 30% faster. Our code is available at:
https://github.com/IST-DASLab/ZipLM.
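The abstract describes an iterative, inference-aware loop: measure each prunable component's loss penalty and runtime saving in the target environment, remove the component with the worst loss-runtime trade-off, and stop once each requested speedup is reached. The sketch below only illustrates that outer loop; the helper callables (`loss_increase`, `runtime_gain`) and the greedy ratio score are assumptions for this example, not the authors' implementation, which is available in the linked repository.

```python
# Illustrative sketch of an inference-aware structured pruning loop in the
# spirit of the abstract: repeatedly drop the component (e.g. an attention
# head or FFN block) with the worst loss-increase / runtime-gain ratio until
# every target speedup is met. All helper names are assumptions.

def inference_aware_prune(components, loss_increase, runtime_gain,
                          base_runtime, target_speedups):
    """components: prunable units still in the model.
    loss_increase(c): estimated loss penalty for removing c.
    runtime_gain(c): runtime saved by removing c, measured in the
                     target inference environment.
    Returns one pruned component set per requested speedup target."""
    remaining = list(components)
    runtime = base_runtime
    snapshots = {}
    for target in sorted(target_speedups):            # e.g. [1.5, 2.0, 3.0]
        while base_runtime / runtime < target and remaining:
            # Greedy choice: smallest loss penalty per unit of runtime saved.
            victim = min(remaining,
                         key=lambda c: loss_increase(c) / max(runtime_gain(c), 1e-9))
            remaining.remove(victim)
            runtime = max(runtime - runtime_gain(victim), 1e-9)
        snapshots[target] = list(remaining)            # model meeting this target
    return snapshots


if __name__ == "__main__":
    # Toy example: 8 components with made-up loss/runtime profiles.
    comps = [f"block{i}" for i in range(8)]
    loss = {c: 0.1 * (i + 1) for i, c in enumerate(comps)}
    gain = {c: 1.0 for c in comps}
    print(inference_aware_prune(comps, loss.get, gain.get,
                                base_runtime=8.0, target_speedups=[1.5, 2.0]))
```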
Related papers
- LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment [36.958867918858296]
Large language models (LLMs) have demonstrated strong capabilities, but their high demand for computation and storage hinders practical application.
We present the Large Language Model Compression Benchmark (LLMCBench), a rigorously designed benchmark with an in-depth analysis for LLM compression algorithms.
arXiv Detail & Related papers (2024-10-28T14:45:01Z)
- Two are better than one: Context window extension with multi-grained self-injection [111.1376461868317]
SharedLLM is a novel approach grounded in the design philosophy of multi-grained context compression and query-aware information retrieval.
We introduce a specialized tree-style data structure to efficiently encode, store, and retrieve multi-grained contextual information for text chunks; a toy sketch of such a tree appears after this list.
arXiv Detail & Related papers (2024-10-25T06:08:59Z)
- Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations.
Post-training pruning methods have been proposed to prune LLMs in one shot without retraining.
Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z)
- EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search [33.86918407429272]
We propose a new and general approach for dynamic compression that is provably optimal in a given input range.
We show that these theoretical guarantees lead to highly competitive practical performance for dynamic compression of Llama, Mistral and Phi models.
arXiv Detail & Related papers (2024-10-18T17:46:37Z)
- SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching [32.4599581528901]
"Two-tower" architecture is used for compressing pre-trained LLM parameters into compact representations and fine-tuning the additive full-precision adapter.
We propose SpaLLM (Sketched Adapting of LLMs), a novel compressive adaptation approach for LLMs.
SpaLLM sketches pre-trained LLM weights into lookup tables and directly fine-tunes the values in these tables; a shared-lookup-table toy example is sketched after this list.
arXiv Detail & Related papers (2024-10-08T20:58:24Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but its memory footprint grows with batch size and sequence length.
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of the KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining; a generic truncated-SVD factorization sketch is given after this list.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging [14.123313596780726]
We propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA).
MKA uses manifold learning and the Normalized Pairwise Information Bottleneck measure to merge similar layers, reducing model size while preserving essential performance (see the simplified layer-merging sketch after this list).
Our findings show that MKA not only preserves model performance but also achieves substantial compression ratios, outperforming traditional pruning methods.
arXiv Detail & Related papers (2024-06-24T05:57:55Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval; a toy residual-quantization example appears after this list.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method in which the compressed model is exposed to the prompt learning process (a minimal soft-prompt training sketch follows this list).
Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
arXiv Detail & Related papers (2023-05-17T20:45:13Z)
- You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient [88.58536093633167]
Existing model compression approaches require re-compression or fine-tuning across diverse constraints to accommodate various hardware deployments.
We propose a novel approach, YOCO-BERT, to achieve compress once and deploy everywhere.
Compared with state-of-the-art algorithms, YOCO-BERT provides more compact models while achieving a 2.1%-4.5% average accuracy improvement on the GLUE benchmark.
arXiv Detail & Related papers (2021-06-04T12:17:44Z)
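SharedLLM (second entry above) mentions a tree-style data structure for multi-grained context. The toy class below is only a guess at what such a structure could look like, with word-overlap scoring standing in for the paper's query-aware retrieval; it is not the authors' implementation.

```python
# Toy tree of text chunks at coarse-to-fine granularity, loosely inspired by
# the "tree-style data structure" in the SharedLLM summary above. Word-overlap
# scoring is a stand-in for query-aware retrieval; not the authors' code.

from dataclasses import dataclass, field


@dataclass
class ChunkNode:
    text: str                                        # summary at this granularity
    children: list = field(default_factory=list)     # finer-grained chunks

    def retrieve(self, query, budget=2):
        """Descend into the children whose text overlaps the query most and
        return up to `budget` fine-grained chunks."""
        if not self.children:
            return [self.text]
        q = set(query.lower().split())
        ranked = sorted(self.children,
                        key=lambda c: -len(q & set(c.text.lower().split())))
        hits = []
        for child in ranked[:budget]:
            hits.extend(child.retrieve(query, budget))
        return hits[:budget]


if __name__ == "__main__":
    doc = ChunkNode("paper on model compression", [
        ChunkNode("section on pruning",
                  [ChunkNode("structured pruning removes whole heads or layers")]),
        ChunkNode("section on quantization",
                  [ChunkNode("4-bit weights stored with shared codebooks")]),
    ])
    print(doc.retrieve("how does pruning work"))
```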
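SpaLLM is summarized above as sketching pre-trained weights into lookup tables whose values are then fine-tuned. The snippet below shows a generic shared-lookup-table layout using a random index hash, an assumption made for illustration; it is not SpaLLM's actual sketching algorithm.

```python
# Shared-lookup-table illustration: every position of a weight matrix is
# mapped (here via a random index hash, an assumption for this toy) to one of
# a small number of table slots; training would then update only the table
# values. This shows the data layout, not SpaLLM's sketching scheme.

import numpy as np


def sketch_to_table(W, table_size=512, seed=0):
    """Map each weight position to a slot; initialize each slot with the mean
    of the original weights assigned to it."""
    rng = np.random.default_rng(seed)
    slots = rng.integers(0, table_size, size=W.shape)   # slot shared by each weight
    table = np.zeros(table_size)
    counts = np.zeros(table_size)
    np.add.at(table, slots, W)
    np.add.at(counts, slots, 1.0)
    return table / np.maximum(counts, 1.0), slots


def expand(table, slots):
    """Rebuild a full-shape weight matrix from the shared table values."""
    return table[slots]


if __name__ == "__main__":
    W = np.random.default_rng(1).standard_normal((256, 256))
    table, slots = sketch_to_table(W)
    W_shared = expand(table, slots)
    print(f"{W.size} weights now share {table.size} trainable values "
          f"({W.size / table.size:.0f}x fewer parameters); shape kept: {W_shared.shape}")
```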
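LoRC's summary mentions a low-rank approximation of the KV weight matrices that plugs into a trained model without retraining. The sketch below uses a plain truncated SVD with made-up shapes and rank to show how such a retraining-free low-rank replacement is generally constructed; the paper's progressive compression strategy is not reproduced here.

```python
# Generic truncated-SVD factorization of a key projection matrix, illustrating
# the kind of low-rank, retraining-free replacement the LoRC summary above
# refers to. Shapes and rank are made up for the example.

import numpy as np


def low_rank_factor(W, rank):
    """Return (A, B) with A @ B ~= W, A: (d_in, rank), B: (rank, d_out)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # fold singular values into A
    B = Vt[:rank, :]
    return A, B


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_head, rank = 512, 64, 16
    W_k = rng.standard_normal((d_model, d_head))      # toy key projection
    A, B = low_rank_factor(W_k, rank)
    x = rng.standard_normal((8, d_model))             # 8 token activations
    # Cache the rank-16 intermediate instead of the full key vectors, then
    # finish the projection with B when the keys are actually needed.
    cached = x @ A                                    # (8, rank)
    approx_keys = cached @ B                          # (8, d_head)
    exact_keys = x @ W_k
    err = np.linalg.norm(approx_keys - exact_keys) / np.linalg.norm(exact_keys)
    print(f"relative error at rank {rank}: {err:.3f}")
```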
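MKA's summary describes merging similar layers. The sketch below substitutes cosine similarity of layer outputs on probe data for the paper's manifold-learning and Normalized Pairwise Information Bottleneck measure, and merges by naive weight averaging, so it conveys only the overall shape of such a procedure.

```python
# Minimal layer-merging sketch: find the most similar pair of adjacent layers
# (judged here by cosine similarity of their outputs on probe data, a crude
# stand-in for MKA's manifold/information-bottleneck measure) and replace the
# pair with the average of their weights. Purely illustrative.

import numpy as np


def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def merge_most_similar(layers, probe):
    """layers: list of weight matrices applied as x @ W (all the same shape).
    probe: activations used to compare layer behaviour."""
    outputs = [probe @ W for W in layers]
    # Score adjacent pairs by output similarity.
    scores = [cosine(outputs[i], outputs[i + 1]) for i in range(len(layers) - 1)]
    i = int(np.argmax(scores))
    merged = (layers[i] + layers[i + 1]) / 2.0        # naive weight averaging
    return layers[:i] + [merged] + layers[i + 2:], i


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    base = rng.standard_normal((32, 32))
    layers = [base + 0.01 * rng.standard_normal((32, 32)) for _ in range(4)]
    probe = rng.standard_normal((16, 32))
    smaller, merged_at = merge_most_similar(layers, probe)
    print(f"merged layers {merged_at} and {merged_at + 1}; now {len(smaller)} layers")
```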
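AQLM's summary refers to Additive Quantization, in which each group of weights is represented as a sum of codewords, one from each of several codebooks. The toy below fits codebooks greedily on residuals with a naive k-means; AQLM optimizes codes and codebooks jointly, so this only demonstrates the representation, not the paper's algorithm.

```python
# Toy residual/additive quantization: each weight group is encoded as a sum of
# one codeword per codebook, a simplified stand-in for the Additive
# Quantization scheme the AQLM entry refers to.

import numpy as np


def kmeans(X, k, iters=10, seed=0):
    """Very small k-means used only to build toy codebooks."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return centers


def fit_additive(X, num_codebooks=4, k=16):
    """Greedy residual fit: codebook m quantizes what the first m missed."""
    residual = X.copy()
    codebooks, codes = [], []
    for m in range(num_codebooks):
        C = kmeans(residual, k, seed=m)
        idx = ((residual[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        codebooks.append(C)
        codes.append(idx)
        residual = residual - C[idx]
    return codebooks, np.stack(codes, 1)


def decode(codebooks, codes):
    return sum(C[codes[:, m]] for m, C in enumerate(codebooks))


if __name__ == "__main__":
    groups = np.random.default_rng(0).standard_normal((1024, 8))  # weight groups
    cbs, codes = fit_additive(groups)
    rec = decode(cbs, codes)
    err = np.linalg.norm(groups - rec) / np.linalg.norm(groups)
    print(f"codes per group: {codes.shape[1]}, relative error {err:.2f}")
```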
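The "Compress, Then Prompt" summary proposes learning a soft prompt on top of a compressed model. The PyTorch toy below freezes a small stand-in network and trains only a prepended prompt embedding; the network, dimensions, and objective are placeholders, not the paper's setup.

```python
# Toy soft-prompt setup: a learnable prompt is prepended to the input
# embeddings of a frozen "compressed" model and only the prompt is trained,
# mirroring the idea in the "Compress, Then Prompt" summary above.

import torch
import torch.nn as nn


class SoftPrompted(nn.Module):
    def __init__(self, frozen_model, embed_dim, prompt_len=8):
        super().__init__()
        self.model = frozen_model
        for p in self.model.parameters():
            p.requires_grad_(False)                    # compressed model stays fixed
        self.prompt = nn.Parameter(0.01 * torch.randn(prompt_len, embed_dim))

    def forward(self, token_embeds):                   # (batch, seq, embed_dim)
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.model(torch.cat([prompt, token_embeds], dim=1))


if __name__ == "__main__":
    embed_dim = 32
    frozen = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                           nn.Linear(embed_dim, 1))     # stand-in for a compressed LLM
    model = SoftPrompted(frozen, embed_dim)
    opt = torch.optim.Adam([model.prompt], lr=1e-2)     # train the prompt only
    x = torch.randn(4, 10, embed_dim)
    target = torch.zeros(4, 18, 1)                      # 8 prompt + 10 input positions
    for _ in range(20):
        opt.zero_grad()
        loss = ((model(x) - target) ** 2).mean()
        loss.backward()
        opt.step()
    print(f"final toy loss: {loss.item():.4f}")
```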
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.