Differentiable Model Compression via Pseudo Quantization Noise
- URL: http://arxiv.org/abs/2104.09987v1
- Date: Tue, 20 Apr 2021 14:14:03 GMT
- Title: Differentiable Model Compression via Pseudo Quantization Noise
- Authors: Alexandre Défossez, Yossi Adi, Gabriel Synnaeve
- Abstract summary: We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator.
We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
- Score: 99.89011673907814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose to add independent pseudo quantization noise to model parameters
during training to approximate the effect of a quantization operator. This
method, DiffQ, is differentiable both with respect to the unquantized
parameters, and the number of bits used. Given a single hyper-parameter
expressing the desired balance between the quantized model size and accuracy,
DiffQ can optimize the number of bits used per individual weight or groups of
weights, in a single training. We experimentally verify that our method
outperforms state-of-the-art quantization techniques on several benchmarks and
architectures for image classification, language modeling, and audio source
separation. For instance, on the Wikitext-103 language modeling benchmark,
DiffQ compresses a 16-layer transformer model by a factor of 8, equivalent to
4-bit precision, while losing only 0.5 points of perplexity. Code is available
at: https://github.com/facebookresearch/diffq
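
To make the mechanism concrete, the following is a minimal PyTorch-style sketch of training with pseudo quantization noise. It is a hedged illustration, not the DiffQ API: the class name PseudoQuantNoise, the symmetric per-tensor range, and the clamped bit-width range are assumptions made for brevity; see the repository above for the actual implementation.

```python
import torch
from torch import nn


class PseudoQuantNoise(nn.Module):
    """Differentiable stand-in for a uniform quantizer: at train time it adds
    independent uniform noise matching the quantization step, so gradients
    flow to both the weights and the (soft) number of bits."""

    def __init__(self, init_bits: float = 8.0):
        super().__init__()
        # Learnable bit-width for one weight tensor (could be per group of weights).
        self.bits = nn.Parameter(torch.tensor(init_bits))

    def forward(self, weight: torch.Tensor) -> torch.Tensor:
        bits = self.bits.clamp(2.0, 15.0)
        scale = weight.detach().abs().max().clamp(min=1e-8)
        step = 2 * scale / (2.0 ** bits - 1)  # width of one quantization level
        if self.training:
            # Uniform noise in [-step/2, step/2] mimics the rounding error while
            # remaining differentiable with respect to `weight` and `bits`.
            return weight + (torch.rand_like(weight) - 0.5) * step
        # At inference, apply true rounding to the nearest level.
        return torch.round(weight / step) * step

    def size_penalty(self, weight: torch.Tensor) -> torch.Tensor:
        # Model-size term, in bits, for this tensor.
        return weight.numel() * self.bits.clamp(2.0, 15.0)
```

In a training loop, the total objective would be the task loss plus the single trade-off hyper-parameter times the sum of size_penalty over all weight tensors, which is how one hyper-parameter controls the balance between quantized model size and accuracy.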
Related papers
- Low-Bitwidth Floating Point Quantization for Efficient High-Quality Diffusion Models [2.926259075657424]
Diffusion models generate images by iteratively denoising random Gaussian noise using deep neural networks.
Recent works propose low-bitwidth (e.g., 8-bit or 4-bit) quantization for diffusion models; however, 4-bit integer quantization typically results in low-quality images.
We propose an effective floating-point quantization method for diffusion models that provides better image quality compared to integer quantization methods.
arXiv Detail & Related papers (2024-08-13T15:56:20Z)
- FrameQuant: Flexible Low-Bit Quantization for Transformers [25.569106620123346]
Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks.
Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower.
We show, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains.
arXiv Detail & Related papers (2024-03-10T04:01:49Z)
- The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- One Model for All Quantization: A Quantized Network Supporting Hot-Swap Bit-Width Adjustment [36.75157407486302]
We propose a method to train a single "model for all quantization" that supports diverse bit-widths.
We use wavelet decomposition and reconstruction to increase the diversity of weights.
Our method can achieve accuracy comparable to dedicated models trained at the same precision.
arXiv Detail & Related papers (2021-05-04T08:10:50Z)
- Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
Q-ASR exhibits a large compression rate of more than 4x with only a small WER degradation compared to the full-precision baseline models.
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
- Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables, and utilize a differential method to search for them accurately.
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator (a minimal sketch of this estimator follows this list).
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
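
For contrast with DiffQ's additive noise, the Quant-Noise entry above relies on Quantization Aware Training with the Straight-Through Estimator (STE). Below is a minimal sketch of STE fake quantization under the assumption of a symmetric per-tensor uniform quantizer; the function name ste_quantize is illustrative and not taken from either paper's code.

```python
import torch


def ste_quantize(weight: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Fake-quantize `weight` to 2**bits - 1 uniform levels; the rounding is
    bypassed in the backward pass (Straight-Through Estimator)."""
    scale = weight.detach().abs().max().clamp(min=1e-8)
    step = 2 * scale / (2 ** bits - 1)
    quantized = torch.round(weight / step) * step
    # Forward pass uses the rounded values; detaching the difference makes the
    # backward pass treat the whole operation as the identity.
    return weight + (quantized - weight).detach()
```

Because the rounding is hard and the bit-width is a fixed integer here, the number of bits cannot itself be optimized by gradient descent; DiffQ's pseudo quantization noise keeps the bit-width differentiable, which is what enables the per-weight bit allocation described in the abstract above.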