FrameQuant: Flexible Low-Bit Quantization for Transformers
- URL: http://arxiv.org/abs/2403.06082v2
- Date: Wed, 31 Jul 2024 05:59:31 GMT
- Title: FrameQuant: Flexible Low-Bit Quantization for Transformers
- Authors: Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh,
- Abstract summary: Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks.
Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower.
We show, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains.
- Score: 25.569106620123346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks. But their compute and memory/storage footprint is large, and so, serving such models is expensive often requiring high-end hardware. To mitigate this difficulty, Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower, significantly boosting compute/memory/latency efficiency. Such models have been successfully quantized to four bits with some performance loss. In this work, we outline a simple scheme to quantize Transformer-based models to just two bits (plus some overhead) with only a small drop in accuracy. Key to our formulation is a concept borrowed from Harmonic analysis called Fusion Frames. Our main finding is that the quantization must take place not in the original weight space, but instead in the Fusion Frame representations. If quantization is interpreted as the addition of noise, our casting of the problem allows invoking an extensive body of known consistent recovery and noise robustness guarantees. Further, if desired, de-noising filters are known in closed form. We show empirically, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains. The code is available at https://github.com/vsingh-group/FrameQuant
Related papers
- SVDQunat: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models [58.5019443418822]
Diffusion models have been proven highly effective at generating high-quality images.
As these models grow larger, they require significantly more memory and suffer from higher latency.
In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits.
arXiv Detail & Related papers (2024-11-07T18:59:58Z) - decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points [10.238677144792279]
decoupleQ abandons the traditional quantization paradigm and decouples the model parameters into integer and floating-point parts.
Our method has achieved well on-line accuracy near fp16/bf16 on the 2-bit quantization of large speech models in ByteDance.
arXiv Detail & Related papers (2024-04-19T10:02:53Z) - NUPES : Non-Uniform Post-Training Quantization via Power Exponent Search [7.971065005161565]
quantization is a technique to convert floating point representations to low bit-width fixed point representations.
We show how to learn new quantized weights over the entire quantized space.
We show the ability of the method to achieve state-of-the-art compression rates in both, data-free and data-driven configurations.
arXiv Detail & Related papers (2023-08-10T14:19:58Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision
Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Statefuls maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop first gradients that use 8-bit statistics while maintaining the performance levels of using 32-bit gradient states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - One Model for All Quantization: A Quantized Network Supporting Hot-Swap
Bit-Width Adjustment [36.75157407486302]
We propose a method to train a model for all quantization that supports diverse bit-widths.
We use wavelet decomposition and reconstruction to increase the diversity of weights.
Our method can achieve accuracy comparable to dedicated models trained at the same precision.
arXiv Detail & Related papers (2021-05-04T08:10:50Z) - Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator.
We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
arXiv Detail & Related papers (2021-04-20T14:14:03Z) - Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.