Compression Scaling Laws: Unifying Sparsity and Quantization
- URL: http://arxiv.org/abs/2502.16440v1
- Date: Sun, 23 Feb 2025 04:47:36 GMT
- Title: Compression Scaling Laws: Unifying Sparsity and Quantization
- Authors: Elias Frantar, Utku Evci, Wonpyo Park, Neil Houlsby, Dan Alistarh
- Abstract summary: We investigate how different compression techniques affect the scaling behavior of large language models (LLMs) during pretraining. We show that weight-only quantization achieves strong parameter efficiency multipliers, while full quantization of both weights and activations shows diminishing returns at lower bitwidths. Our results suggest that different compression techniques can be unified under a common scaling law framework.
- Score: 65.05818215339498
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate how different compression techniques -- such as weight and activation quantization, and weight sparsity -- affect the scaling behavior of large language models (LLMs) during pretraining. Building on previous work showing that weight sparsity acts as a constant multiplier on model size in scaling laws, we demonstrate that this "effective parameter" scaling pattern extends to quantization as well. Specifically, we establish that weight-only quantization achieves strong parameter efficiency multipliers, while full quantization of both weights and activations shows diminishing returns at lower bitwidths. Our results suggest that different compression techniques can be unified under a common scaling law framework, enabling principled comparison and combination of these methods.
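As a rough illustration of the abstract's "effective parameter" framing, the sketch below plugs a compression-dependent multiplier into a Chinchilla-style loss formula. The functional form, coefficient values, and multiplier numbers are placeholder assumptions for illustration, not the paper's fitted law.

```python
# Minimal sketch (not the paper's fitted law): a Chinchilla-style loss
# L(N, D) = E + A / N_eff^alpha + B / D^beta, where compression enters only
# through an "effective parameter" multiplier m, i.e. N_eff = m * N.
# Coefficients are Chinchilla-style placeholders; multipliers are invented.

def loss(n_params: float, n_tokens: float, multiplier: float = 1.0,
         E: float = 1.69, A: float = 406.4, B: float = 410.7,
         alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted pretraining loss for a compressed model with n_params weights."""
    n_eff = multiplier * n_params          # compression as a constant multiplier on model size
    return E + A / n_eff**alpha + B / n_tokens**beta

# Hypothetical multipliers: e.g. weight-only 4-bit quantization retaining most
# of the dense model's capacity, 50% sparsity retaining somewhat less.
configs = {"dense fp16": 1.0, "w4 (weight-only)": 0.9, "50% sparse": 0.7}
for name, m in configs.items():
    print(f"{name:18s} -> predicted loss {loss(1e9, 2e10, m):.3f}")
```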
Related papers
- DilateQuant: Accurate and Efficient Diffusion Quantization via Weight Dilation [3.78219736760145]
Quantization of diffusion models is a promising way to compress and accelerate models.
Existing methods cannot maintain both accuracy and efficiency simultaneously for low-bit quantization.
We propose DilateQuant, a novel quantization framework for diffusion models that offers comparable accuracy and high efficiency.
arXiv Detail & Related papers (2024-09-22T04:21:29Z)
- Data-free Weight Compress and Denoise for Large Language Models [96.68582094536032]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices. We achieve pruning of 80% of the parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
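As a loose illustration of rank-k compression of a parameter matrix, the sketch below uses a plain truncated SVD; it is not the paper's data-free joint procedure, just the generic low-rank idea the blurb refers to.

```python
import numpy as np

def rank_k_approx(W: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
    """Return factors (A, B) with A @ B ~= W, keeping only the top-k singular values.
    Storage drops from m*n to k*(m+n) numbers."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * S[:k]          # (m, k), columns scaled by singular values
    B = Vt[:k, :]                 # (k, n)
    return A, B

W = np.random.randn(1024, 4096).astype(np.float32)
A, B = rank_k_approx(W, k=128)
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")
```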
- AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models [0.18416014644193066]
AWEQ excels in both ultra-low-bit quantization and 8-bit weight and activation (W8A8) quantization.
We have further refined the equalization method to mitigate quantization bias error, ensuring the robustness of the model.
arXiv Detail & Related papers (2023-11-02T15:18:22Z)
- Probabilistic Weight Fixing: Large-scale training of neural network weight uncertainties for quantization [7.2282857478457805]
Weight-sharing quantization has emerged as a technique to reduce energy expenditure during inference in large neural networks.
This paper proposes a probabilistic framework based on Bayesian neural networks (BNNs) and a variational relaxation to identify which weights can be moved to which cluster centre.
Our method outperforms the state-of-the-art quantization method by 1.6% top-1 accuracy on ImageNet using DeiT-Tiny.
arXiv Detail & Related papers (2023-09-24T08:04:28Z)
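For context, a bare-bones weight-sharing quantizer that snaps weights to cluster centres via k-means is sketched below; the paper's probabilistic BNN/variational formulation is considerably more involved and is not reproduced here.

```python
import numpy as np

def weight_sharing_quantize(w: np.ndarray, n_clusters: int = 16, iters: int = 20):
    """Plain k-means weight sharing: every weight is replaced by its nearest
    cluster centre, so only the centres plus per-weight indices need storing."""
    flat = w.ravel()
    centres = np.linspace(flat.min(), flat.max(), n_clusters)  # evenly spaced init
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centres[None, :]).argmin(axis=1)
        for c in range(n_clusters):
            members = flat[idx == c]
            if members.size:
                centres[c] = members.mean()
    return centres[idx].reshape(w.shape), centres, idx.reshape(w.shape)

w = np.random.randn(256, 256).astype(np.float32)
w_q, centres, assignment = weight_sharing_quantize(w)
print("quantization MSE:", float(np.mean((w - w_q) ** 2)))
```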
- Quantized Sparse Weight Decomposition for Neural Network Compression [12.24566619983231]
We show that this approach can be seen as a unification of weight SVD, vector quantization, and sparse PCA.
Unlike vector quantization, our method is applicable to both moderate and extreme compression regimes.
arXiv Detail & Related papers (2022-07-22T12:40:03Z)
- BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration.
This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
- Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression [151.3826781154146]
Modeling latent variables with priors and hyperpriors is an essential problem in variational image compression.
We find that inter-correlations and intra-correlations exist when latent variables are observed from a vectorized perspective.
Our model has better rate-distortion performance and an impressive $3.18\times$ compression speedup.
arXiv Detail & Related papers (2022-03-21T11:44:17Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight reparameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but their weight distribution has markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
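Below is a minimal sketch of a sign-preserving power reparameterisation in the spirit of Powerpropagation, assuming the form w = θ·|θ|^(α−1) with α > 1; the exact formulation and training setup in the paper may differ.

```python
import torch
import torch.nn as nn

class PowerpropLinear(nn.Module):
    """Linear layer whose weights are a sign-preserving power of the stored
    parameters: w = theta * |theta|**(alpha - 1), alpha > 1. Gradients w.r.t.
    theta are scaled by a factor proportional to |theta|**(alpha - 1), so small
    weights learn ever more slowly and mass accumulates near zero (easy to prune)."""
    def __init__(self, in_features: int, out_features: int, alpha: float = 2.0):
        super().__init__()
        self.theta = nn.Parameter(torch.randn(out_features, in_features) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.theta * self.theta.abs().pow(self.alpha - 1.0)
        return nn.functional.linear(x, w, self.bias)

layer = PowerpropLinear(16, 4)
out = layer(torch.randn(8, 16))
out.sum().backward()  # gradients flow through the reparameterised weights
```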
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
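The blurb above describes plain Quantization Aware Training with the Straight-Through Estimator; below is a minimal fake-quantization sketch of that baseline mechanism (not the quantization-noise scheme the paper proposes), assuming symmetric per-tensor scaling.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round to the nearest integer in the forward pass, pass gradients through
    unchanged in the backward pass (the Straight-Through Estimator)."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization as used in QAT: weights are
    quantized in the forward pass, but gradients still reach the fp32 weights."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return RoundSTE.apply(w / scale).clamp(-qmax - 1, qmax) * scale

w = torch.randn(64, 64, requires_grad=True)
loss = (fake_quantize(w, num_bits=4) ** 2).mean()
loss.backward()   # gradient is defined thanks to the STE
print(w.grad.abs().mean())
```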