EvoPress: Accurate Dynamic Model Compression via Evolutionary Search
- URL: http://arxiv.org/abs/2410.14649v2
- Date: Tue, 01 Jul 2025 14:41:08 GMT
- Title: EvoPress: Accurate Dynamic Model Compression via Evolutionary Search
- Authors: Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh
- Abstract summary: EvoPress is an evolutionary framework for dynamic compression of large language models. It identifies optimal compression profiles in a highly efficient manner. We set new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths.
- Score: 33.86918407429272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on estimating the importance of a given layer, implicitly assuming that layers contribute independently to the overall compression error. We begin from the motivating observation that this independence assumption does not generally hold for LLM compression: pruning a model further may even significantly recover performance. To address this, we propose EvoPress, a novel evolutionary framework for dynamic LLM compression. By formulating dynamic compression as a general optimization problem, EvoPress identifies optimal compression profiles in a highly efficient manner, and generalizes across diverse models and compression techniques. Via EvoPress, we achieve state-of-the-art performance for dynamic compression of Llama, Mistral, and Phi models, setting new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths. Our code is available at https://github.com/IST-DASLab/EvoPress.
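As an illustration of the search idea, the following is a minimal, hypothetical sketch of budget-preserving evolutionary search over per-layer compression levels, in the spirit of EvoPress but not the authors' implementation. The layer count, level set, and fitness function are stand-ins; the paper evaluates real compressed candidates on calibration data.

```python
# Hypothetical sketch: evolutionary search over per-layer compression levels
# under a fixed global budget. Illustrative only, not the EvoPress code.
import random

NUM_LAYERS = 32
LEVELS = [0, 1, 2, 3]          # e.g., increasing sparsity / lower bitwidth
BUDGET = 48                    # total level budget (mean level 1.5)

def random_profile():
    """Sample a per-layer profile that exactly meets the total budget."""
    while True:
        p = [random.choice(LEVELS) for _ in range(NUM_LAYERS)]
        if sum(p) == BUDGET:
            return p

def mutate(profile):
    """Move one unit of compression between two layers, preserving the budget."""
    child = profile[:]
    i, j = random.sample(range(NUM_LAYERS), 2)
    if child[i] < max(LEVELS) and child[j] > min(LEVELS):
        child[i] += 1
        child[j] -= 1
    return child

def fitness(profile):
    """Stand-in objective; in practice, accuracy/perplexity of the model
    compressed according to `profile`, measured on calibration data."""
    return -sum((lvl - BUDGET / NUM_LAYERS) ** 2 for lvl in profile)

best = random_profile()
for _ in range(2000):
    child = mutate(best)
    if fitness(child) > fitness(best):     # simple (1+1)-style selection
        best = child
print(best, fitness(best))
```

Because the mutation preserves the total budget, every candidate satisfies the global compression constraint, and the search only explores how to distribute compression across layers.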
Related papers
- UniPCGC: Towards Practical Point Cloud Geometry Compression via an Efficient Unified Approach [4.754973569457509]
We propose an efficient unified point cloud geometry compression framework, dubbed UniPCGC. It supports lossy compression, lossless compression, variable rate and variable complexity. Our method achieves a compression ratio (CR) gain of 8.1% on lossless compression, and a Bjontegaard Delta Rate (BD-Rate) gain of 14.02% on lossy compression.
arXiv Detail & Related papers (2025-03-24T10:51:28Z)
- A General Error-Theoretical Analysis Framework for Constructing Compression Strategies [3.1316260533944007]
We propose a Compression Error Theory (CET) framework to determine the optimal compression level for each layer.
Specifically, on the ResNet-34 model, CET achieves nearly 11× parameter compression while matching or even surpassing the performance of the original model.
arXiv Detail & Related papers (2025-02-19T06:12:43Z)
- Choose Your Model Size: Any Compression by a Single Gradient Descent [9.074689052563878]
We present Any Compression via Iterative Pruning (ACIP).
ACIP is an algorithmic approach to determine a compression-performance trade-off from a single gradient descent run.
We show that ACIP seamlessly complements common quantization-based compression techniques.
arXiv Detail & Related papers (2025-02-03T18:40:58Z)
- Compression for Better: A General and Stable Lossless Compression Framework [7.356622397575378]
The key challenge is effectively leveraging compression errors to minimize model loss.
We propose a general LossLess Compression theoretical framework (LLC).
We apply various compression techniques, including quantization and decomposition.
arXiv Detail & Related papers (2024-12-09T09:55:54Z)
- Large Language Models for Lossless Image Compression: Next-Pixel Prediction in Language Space is All You Need [53.584140947828004]
A large language model (LLM) with unprecedented intelligence is a general-purpose lossless compressor for various data modalities.
We propose P$^2$-LLM, a next-pixel prediction-based LLM, which integrates various elaborated insights and methodologies.
Experiments on benchmark datasets demonstrate that P$2$-LLM can beat SOTA classical and learned codecs.
arXiv Detail & Related papers (2024-11-19T12:15:40Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
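The core operation behind such low-rank KV compression can be sketched generically with a truncated SVD of a pretrained key/value projection matrix. The shapes, names, and error check below are illustrative assumptions, not the LoRC implementation.

```python
# Hypothetical sketch: truncated-SVD low-rank factorization of a key/value
# projection matrix, the generic operation behind low-rank KV-cache methods.
import numpy as np

d_model, d_head, rank = 1024, 128, 16
W_k = np.random.randn(d_model, d_head).astype(np.float32)  # pretrained weight

U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
A = U[:, :rank] * S[:rank]        # (d_model, rank)
B = Vt[:rank, :]                  # (rank, d_head)

# At inference, x @ A yields a rank-`rank` latent that can be cached in place
# of the full key, and B reconstructs the key on the fly: x @ A @ B ~ x @ W_k.
x = np.random.randn(1, d_model).astype(np.float32)
err = np.linalg.norm(x @ A @ B - x @ W_k) / np.linalg.norm(x @ W_k)
print(f"relative reconstruction error: {err:.3f}")
```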
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- Hyper-Compression: Model Compression via Hyperfunction [20.47369296713829]
We propose the so-called hyper-compression, which turns model compression into the issue of parameter representation via a hyperfunction. This suggests a novel mechanism for model compression, substantially different from the existing pruning, quantization, distillation, and decomposition. We show that hyper-compression enjoys the following PNAS merits: 1) Preferable compression ratio; 2) No post-hoc retraining; 3) Affordable inference time; and 4) Short compression time.
arXiv Detail & Related papers (2024-09-01T02:57:41Z)
- End-to-end learned Lossy Dynamic Point Cloud Attribute Compression [5.717288278431968]
This study introduces an end-to-end learned dynamic lossy attribute coding approach.
We employ a context model that leverages the previous latent space in conjunction with an auto-regressive context model for encoding the latent tensor into a bitstream.
arXiv Detail & Related papers (2024-08-20T09:06:59Z)
- Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models [21.025001473355996]
We formalize the problem of prompt compression for large language models (LLMs).
We present a framework to unify token-level prompt compression methods which create hard prompts for black-box models.
We show that there is a large gap between the performance of current prompt compression methods and the optimal strategy.
arXiv Detail & Related papers (2024-07-22T09:40:13Z)
- LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression [43.048684907893104]
This paper focuses on task-agnostic prompt compression for better generalizability and efficiency.
We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one.
Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT.
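A minimal sketch of the token-classification view of prompt compression: a small encoder scores each token as keep/drop, and the kept tokens form the compressed prompt. The classification head below is randomly initialized and untrained, so this only illustrates the mechanics; the model name, `keep_ratio`, and the `compress` helper are assumptions, not the released LLMLingua-2 code.

```python
# Hypothetical sketch of prompt compression as per-token keep/drop
# classification. The head is untrained: outputs are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

name = "xlm-roberta-base"  # stand-in for the larger encoders in the paper
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=2)

def compress(prompt: str, keep_ratio: float = 0.5) -> str:
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]          # (seq_len, 2)
    keep_scores = logits[:, 1]                   # score of the "keep" label
    k = max(1, int(keep_ratio * keep_scores.numel()))
    keep_idx = keep_scores.topk(k).indices.sort().values  # preserve order
    kept_ids = enc["input_ids"][0][keep_idx]
    return tok.decode(kept_ids, skip_special_tokens=True)

print(compress("Please summarize the following meeting notes in two sentences."))
```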
arXiv Detail & Related papers (2024-03-19T17:59:56Z)
- Compression of Structured Data with Autoencoders: Provable Benefit of Nonlinearities and Depth [83.15263499262824]
We prove that gradient descent converges to a solution that completely disregards the sparse structure of the input.
We show how to improve upon Gaussian performance for the compression of sparse data by adding a denoising function to a shallow architecture.
We validate our findings on image datasets, such as CIFAR-10 and MNIST.
arXiv Detail & Related papers (2024-02-07T16:32:29Z)
- A Survey on Transformer Compression [84.18094368700379]
The Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformer.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
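For reference, a generic Top-K sparsification routine of the kind studied here might look as follows; this is an illustrative sketch, not the paper's implementation.

```python
# Generic Top-K sparsification sketch: keep only the largest-magnitude
# entries of a tensor (e.g., a gradient) and scatter them back when needed.
import math
import torch

def topk_compress(t: torch.Tensor, ratio: float = 0.1):
    """Keep the `ratio` fraction of entries with largest magnitude."""
    flat = t.flatten()
    k = max(1, int(ratio * flat.numel()))
    idx = flat.abs().topk(k).indices
    return idx, flat[idx], t.shape            # sparse representation

def topk_decompress(idx, vals, shape):
    """Scatter the kept values back into a dense zero tensor."""
    flat = torch.zeros(math.prod(shape), dtype=vals.dtype)
    flat[idx] = vals
    return flat.reshape(shape)

grad = torch.randn(256, 256)
idx, vals, shape = topk_compress(grad, ratio=0.05)
approx = topk_decompress(idx, vals, shape)
print("relative error:", ((approx - grad).norm() / grad.norm()).item())
```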
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- DiffRate: Differentiable Compression Rate for Efficient Vision Transformers [98.33906104846386]
Token compression aims to speed up large-scale vision transformers (e.g. ViTs) by pruning (dropping) or merging tokens.
DiffRate is a novel token compression method that has several appealing properties prior arts do not have.
arXiv Detail & Related papers (2023-05-29T10:15:19Z)
- Just CHOP: Embarrassingly Simple LLM Compression [27.64461490974072]
Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint.
We show that simple layer pruning coupled with an extended language model pretraining produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale.
We also show how distillation, which has been highly effective in task-agnostic compression of smaller BERT-style models, becomes inefficient against our simple pruning technique.
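As a rough illustration of how simple such layer pruning can be, the sketch below drops the top k transformer blocks of a HuggingFace GPT-2 model before continued pretraining. The choice of GPT-2, the module path, and which blocks to drop are assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical sketch of simple layer pruning ("chopping"): remove the top k
# transformer blocks, then continue pretraining. Module paths assume the
# HuggingFace GPT-2 layout (model.transformer.h).
import torch.nn as nn
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
k = 4                                      # number of blocks to remove
blocks = model.transformer.h               # nn.ModuleList of transformer blocks
model.transformer.h = nn.ModuleList(blocks[: len(blocks) - k])
model.config.n_layer = len(model.transformer.h)
print(f"kept {model.config.n_layer} of {len(blocks)} blocks")
# ...followed by extended language model pretraining, per the paper.
```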
arXiv Detail & Related papers (2023-05-24T08:18:35Z)
- OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization [32.60139548889592]
We propose a novel One-shot Pruning-Quantization (OPQ) method in this paper.
OPQ analytically solves the compression allocation with pre-trained weight parameters only.
We propose a unified channel-wise quantization method that enforces all channels of each layer to share a common codebook.
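To illustrate the shared-codebook idea, the sketch below quantizes all weights of one layer against a single k-means codebook, so every channel shares the same codewords. The layer shape, codebook size, and use of scikit-learn are illustrative assumptions, not the OPQ implementation.

```python
# Illustrative sketch: per-layer scalar quantization with one codebook shared
# by all channels, learned with k-means over the layer's weights.
import numpy as np
from sklearn.cluster import KMeans

W = np.random.randn(64, 3, 3, 3).astype(np.float32)  # conv layer (out,in,kh,kw)
n_codewords = 16

km = KMeans(n_clusters=n_codewords, n_init=10, random_state=0)
codes = km.fit_predict(W.reshape(-1, 1))              # codeword index per weight
codebook = km.cluster_centers_.ravel()                # shared across channels

W_q = codebook[codes].reshape(W.shape)                # dequantized weights
print("quantization MSE:", float(np.mean((W - W_q) ** 2)))
```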
arXiv Detail & Related papers (2022-05-23T09:05:25Z)
- Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
- Compressing Neural Networks: Towards Determining the Optimal Layer-wise Decomposition [62.41259783906452]
We present a novel global compression framework for deep neural networks.
It automatically analyzes each layer to identify the optimal per-layer compression ratio.
Our results open up new avenues for future research into the global performance-size trade-offs of modern neural networks.
arXiv Detail & Related papers (2021-07-23T20:01:30Z)
- Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which combines channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z)
- Successive Pruning for Model Compression via Rate Distortion Theory [15.598364403631528]
We study NN compression from an information-theoretic perspective and show that rate distortion theory suggests pruning to achieve the theoretical limits of NN compression.
Our derivation also provides an end-to-end compression pipeline involving a novel pruning strategy.
Our method consistently outperforms the existing pruning strategies and reduces the pruned model's size by 2.5 times.
arXiv Detail & Related papers (2021-02-16T18:17:57Z)
- A flexible, extensible software framework for model compression based on the LC algorithm [10.787390511207683]
We propose a software framework that allows a user to compress a neural network or other machine learning model with minimal effort.
The library is written in Python and PyTorch and is available on GitHub.
arXiv Detail & Related papers (2020-05-15T21:14:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.