Effective Interplay between Sparsity and Quantization: From Theory to Practice
- URL: http://arxiv.org/abs/2405.20935v2
- Date: Tue, 28 Jan 2025 12:26:45 GMT
- Title: Effective Interplay between Sparsity and Quantization: From Theory to Practice
- Authors: Simla Burcu Harma, Ayan Chakraborty, Elizaveta Kostenok, Danila Mishin, Dongho Ha, Babak Falsafi, Martin Jaggi, Ming Liu, Yunho Oh, Suvinay Subramanian, Amir Yazdanbakhsh,
- Abstract summary: We show how sparsity and quantization interact when combined together.
We show that even if applied in the correct order, the compounded errors from sparsity and quantization can significantly harm accuracy.
Our findings extend to the efficient deployment of large models in resource-constrained compute platforms.
- Score: 33.697590845745815
- License:
- Abstract: The increasing size of deep neural networks (DNNs) necessitates effective model compression to reduce their computational and memory footprints. Sparsity and quantization are two prominent compression methods that have been shown to reduce DNNs' computational and memory footprints significantly while preserving model accuracy. However, how these two methods interact when combined together remains a key question for developers, as many tacitly assume that they are orthogonal, meaning that their combined use does not introduce additional errors beyond those introduced by each method independently. In this paper, we provide the first mathematical proof that sparsity and quantization are non-orthogonal. We corroborate these results with experiments spanning a range of large language models, including the OPT and LLaMA model families (with 125M to 8B parameters), and vision models like ViT and ResNet. We show that the order in which we apply these methods matters because applying quantization before sparsity may disrupt the relative importance of tensor elements, which may inadvertently remove significant elements from a tensor. More importantly, we show that even if applied in the correct order, the compounded errors from sparsity and quantization can significantly harm accuracy. Our findings extend to the efficient deployment of large models in resource-constrained compute platforms to reduce serving cost, offering insights into best practices for applying these compression methods to maximize hardware resource efficiency without compromising accuracy.
Related papers
- Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "line theoremarity" establishing a direct relationship between the layer-wise $ell$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
arXiv Detail & Related papers (2024-11-26T15:35:44Z) - Q-VLM: Post-training Quantization for Large Vision-Language Models [73.19871905102545]
We propose a post-training quantization framework of large vision-language models (LVLMs) for efficient multi-modal inference.
We mine the cross-layer dependency that significantly influences discretization errors of the entire vision-language model, and embed this dependency into optimal quantization strategy.
Experimental results demonstrate that our method compresses the memory by 2.78x and increase generate speed by 1.44x about 13B LLaVA model without performance degradation.
arXiv Detail & Related papers (2024-10-10T17:02:48Z) - Towards Efficient Pareto Set Approximation via Mixture of Experts Based Model Fusion [53.33473557562837]
Solving multi-objective optimization problems for large deep neural networks is a challenging task due to the complexity of the loss landscape and the expensive computational cost.
We propose a practical and scalable approach to solve this problem via mixture of experts (MoE) based model fusion.
By ensembling the weights of specialized single-task models, the MoE module can effectively capture the trade-offs between multiple objectives.
arXiv Detail & Related papers (2024-06-14T07:16:18Z) - CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks [1.5199992713356987]
This paper introduces CompactifAI, an innovative compression approach using quantum-inspired networks.
Our method is versatile and can be implemented with - or on top of - other compression techniques.
As a benchmark, we demonstrate that a combination of CompactifAI with quantization allows to reduce a 93% memory size of LlaMA 7B.
arXiv Detail & Related papers (2024-01-25T11:45:21Z) - FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only
Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z) - Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST)
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z) - Combining Compressions for Multiplicative Size Scaling on Natural
Language Tasks [7.813460653362095]
Quantization, knowledge distillation, and magnitude pruning are among the most popular methods for neural network compression in NLP.
We compare accuracy vs. model size tradeoffs across six BERT architecture sizes and eight GLUE tasks.
We find that quantization and distillation consistently provide greater benefit than pruning.
arXiv Detail & Related papers (2022-08-20T14:01:56Z) - Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning in a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z) - ECQ$^{\text{x}}$: Explainability-Driven Quantization for Low-Bit and
Sparse DNNs [13.446502051609036]
We develop and describe a novel quantization paradigm for deep neural networks (DNNs)
Our method leverages concepts of explainable AI (XAI) and concepts of information theory.
The ultimate goal is to preserve the most relevant weights in quantization clusters of highest information content.
arXiv Detail & Related papers (2021-09-09T12:57:06Z) - Exact Backpropagation in Binary Weighted Networks with Group Weight
Transformations [0.0]
Quantization based model compression serves as high performing and fast approach for inference.
Models that constrain the weights to binary values enable efficient implementation of the ubiquitous dot product.
arXiv Detail & Related papers (2021-07-03T10:29:34Z) - Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613]
We take one of the simplest inference methods, a truncated max-product Belief propagation, and add what is necessary to make it a proper component of a deep learning model.
This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs)
The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, optical flow and semantic segmentation.
arXiv Detail & Related papers (2020-03-13T13:11:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.