Related papers: Training with Quantization Noise for Extreme Model Compression

URL: http://arxiv.org/abs/2004.07320v3
Date: Sun, 28 Feb 2021 21:43:34 GMT
Title: Training with Quantization Noise for Extreme Model Compression
Authors: Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Remi Gribonval, Herve Jegou, Armand Joulin
Score: 57.51832088938618
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator. In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods where the approximations introduced by STE are severe, such as Product Quantization. Our proposal is to only quantize a different random subset of weights during each forward, allowing for unbiased gradients to flow through the other weights. Controlling the amount of noise and its form allows for extreme compression rates while maintaining the performance of the original model. As a result we establish new state-of-the-art compromises between accuracy and model size both in natural language processing and image classification. For example, applying our method to state-of-the-art Transformer and ConvNet architectures, we can achieve 82.5% accuracy on MNLI by compressing RoBERTa to 14MB and 80.0 top-1 accuracy on ImageNet by compressing an EfficientNet-B3 to 3.3MB.
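The core idea in the abstract, quantizing only a random subset of weights on each forward pass, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it substitutes simple symmetric fixed-point (int8-style) scalar quantization for the Product Quantization the paper targets, it shows the forward pass only (in a real training loop the Straight-Through Estimator would be applied to the quantized subset), and the names `quantize_uniform`, `quant_noise`, and `rate` are hypothetical.

```python
import random


def quantize_uniform(w, scale, num_bits=8):
    """Snap w to the nearest level of a symmetric fixed-point grid."""
    qmax = 2 ** (num_bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(w / scale)))
    return q * scale


def quant_noise(weights, rate, rng, num_bits=8):
    """Quantize a random subset of weights; leave the others untouched.

    Each weight is selected independently with probability `rate`
    (an assumed selection scheme). Returns the noisy weights and the mask.
    """
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    noisy, mask = [], []
    for w in weights:
        selected = rng.random() < rate
        mask.append(selected)
        noisy.append(quantize_uniform(w, scale, num_bits) if selected else w)
    return noisy, mask


rng = random.Random(0)
weights = [0.5, -1.0, 0.25, 0.1]
noisy, mask = quant_noise(weights, rate=0.5, rng=rng)
# Weights where mask is False are returned unchanged, so their gradients
# would not pass through the quantizer during training.
```

With `rate=1.0` every weight is quantized, which corresponds to ordinary Quantization Aware Training in this simplified setting; with `rate=0.0` the weights pass through unchanged.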

Related papers

CondiQuant: Condition Number Based Low-Bit Quantization for Image Super-Resolution [59.91470739501034]
We propose CondiQuant, a condition number based low-bit post-training quantization for image super-resolution. We show that CondiQuant outperforms existing state-of-the-art post-training quantization methods in accuracy without computation overhead.
arXiv Detail & Related papers (2025-02-21T14:04:30Z)
SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression [7.6131620435684875]
SLiM is a new one-shot compression framework that holistically integrates hardware-friendly quantization, sparsity, and low-rank approximation. SLiM improves model accuracy by up to 5.66% (LLaMA-2-7B) for 2:4 sparsity with 4-bit weight quantization, outperforming prior methods. We also propose an optional PEFT recipe that further improves accuracy by up to 1.66% (LLaMA-2-13B) compared to SLiM without fine-tuning.
arXiv Detail & Related papers (2024-10-12T18:36:07Z)
Retraining-free Model Quantization via One-Shot Weight-Coupling Learning [41.299675080384]
Mixed-precision quantization (MPQ) is advocated to compress the model effectively by allocating heterogeneous bit-width for layers. MPQ is typically organized into a searching-retraining two-stage process. In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression.
arXiv Detail & Related papers (2024-01-03T05:26:57Z)
Hyperspherical Quantization: Toward Smaller and More Accurate Models [17.154801913113566]
Vector quantization aims at reducing the model size by indexing model weights with full-precision embeddings. Binary and other low-precision quantization methods can reduce the model size by up to 32$\times$, but at the cost of a considerable accuracy drop. We propose an efficient framework for ternary quantization to produce smaller and more accurate compressed models.
arXiv Detail & Related papers (2022-12-24T04:42:15Z)
Vertical Layering of Quantized Neural Networks for Heterogeneous Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one. We can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model.
arXiv Detail & Related papers (2022-12-10T15:57:38Z)
Deep learning model compression using network sensitivity and gradients [3.52359746858894]
We present model compression algorithms for both non-retraining and retraining conditions. In the first case, we propose the Bin & Quant algorithm for compression of deep learning models using the sensitivity of the network parameters. In the second case, we propose our novel gradient-weighted k-means clustering algorithm (GWK).
arXiv Detail & Related papers (2022-10-11T03:02:40Z)
CrAM: A Compression-Aware Minimizer [103.29159003723815]
We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way. CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning. CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
arXiv Detail & Related papers (2022-07-28T16:13:28Z)
OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization [32.60139548889592]
We propose a novel One-shot Pruning-Quantization (OPQ) in this paper. OPQ analytically solves the compression allocation with pre-trained weight parameters only. We propose a unified channel-wise quantization method that enforces all channels of each layer to share a common codebook.
arXiv Detail & Related papers (2022-05-23T09:05:25Z)
Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks. These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices. We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning in a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
Variable-Rate Deep Image Compression through Spatially-Adaptive Feature Transform [58.60004238261117]
We propose a versatile deep image compression network based on Spatial Feature Transform (SFT, arXiv:1804.02815). Our model covers a wide range of compression rates using a single model, which is controlled by arbitrary pixel-wise quality maps. The proposed framework allows us to perform task-aware image compressions for various tasks.
arXiv Detail & Related papers (2021-08-21T17:30:06Z)
Pruning Ternary Quantization [32.32812780843498]
Inference time, model size, and accuracy are three key factors in deep model compression. We propose pruning ternary quantization (PTQ): a simple, effective, symmetric ternary quantization method. Our method is verified on image classification, object detection/segmentation tasks with different network structures.
arXiv Detail & Related papers (2021-07-23T02:18:00Z)
Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator. We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
arXiv Detail & Related papers (2021-04-20T14:14:03Z)
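The pseudo quantization noise idea in the last entry above can be illustrated with a small sketch. This is an assumption-laden toy, not that paper's method: it assumes the noise is uniform over one quantization step of a min-max grid, and the function name `pseudo_quant_noise` is hypothetical.

```python
import random


def pseudo_quant_noise(weights, num_bits, rng):
    """Add independent uniform noise of one quantization step's width.

    The step is derived from the min-max range of the weights (an assumed
    grid). No rounding is performed, so the perturbation stays additive.
    Returns the noisy weights and the step size.
    """
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (2 ** num_bits - 1) if hi > lo else 0.0
    noisy = [w + rng.uniform(-step / 2, step / 2) for w in weights]
    return noisy, step


rng = random.Random(0)
noisy, step = pseudo_quant_noise([0.0, 1.0, 0.5], num_bits=4, rng=rng)
# Each noisy weight lies within step / 2 of its original value, which
# mimics the error a rounding quantizer on the same grid would introduce.
```

Because the perturbation is additive rather than a rounding operation, it does not require a Straight-Through Estimator in this simplified picture.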
