Related papers: FrameQuant: Flexible Low-Bit Quantization for Transformers

URL: http://arxiv.org/abs/2403.06082v1
Date: Sun, 10 Mar 2024 04:01:49 GMT
Title: FrameQuant: Flexible Low-Bit Quantization for Transformers
Authors: Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh
Abstract summary: Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower. We show that (almost) two-bit quantization for Transformer models promises sizable efficiency gains.
Score: 27.93241211038938
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks. But their compute and memory/storage footprint is large, so serving such models is expensive, often requiring high-end hardware. To mitigate this difficulty, Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower, significantly boosting compute/memory/latency efficiency. Such models have been successfully quantized to four bits with some performance loss. In this work, we outline a simple scheme to quantize Transformer-based models to just two bits (plus some overhead) with only a small drop in accuracy. Key to our formulation is a concept borrowed from Harmonic analysis called Fusion Frames. Our main finding is that the quantization must take place not in the original weight space, but instead in the Fusion Frame representations. If quantization is interpreted as the addition of noise, our casting of the problem allows invoking an extensive body of known consistent recovery and noise robustness guarantees. Further, if desired, de-noising filters are known in closed form. We show empirically, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains.
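
Illustration: the abstract's central idea, quantizing a redundant frame representation of the weights rather than the weights themselves, can be sketched roughly as follows. This is not the paper's algorithm. It assumes a generic tight frame built from a random matrix via QR (not a Fusion Frame), a plain min-max uniform quantizer, and a redundancy of 72/64 chosen only for illustration; the names tight_frame and quantize_uniform are hypothetical.

```python
import numpy as np


def tight_frame(n, k, rng):
    """Return a k x n matrix P with orthonormal columns (P.T @ P = I_n), k >= n.

    Assumption: a random tight frame stands in for the paper's Fusion Frames.
    """
    q, _ = np.linalg.qr(rng.standard_normal((k, n)))  # reduced QR: q is k x n
    return q


def quantize_uniform(x, bits=2):
    """Min-max uniform quantizer: snap x to 2**bits evenly spaced levels."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    return lo + step * np.round((x - lo) / step)


rng = np.random.default_rng(0)
n, k = 64, 72                      # k / n is the redundancy (illustrative)
w = rng.standard_normal(n)         # stand-in for one row of a weight matrix
P = tight_frame(n, k, rng)

coeffs = P @ w                     # redundant representation, k coefficients
w_direct = quantize_uniform(w)     # quantize in the original weight space
w_frame = P.T @ quantize_uniform(coeffs)  # quantize coefficients, map back

# Storage cost of the frame route: 2 bits for each of the k coefficients.
bits_per_weight = 2 * k / n

print("bits per original weight:", bits_per_weight)
print("direct quantization error:", np.linalg.norm(w - w_direct))
print("frame quantization error: ", np.linalg.norm(w - w_frame))
```

Because P has orthonormal columns, P.T @ (P @ w) recovers w exactly when nothing is quantized, and mapping quantized coefficients back through P.T projects part of the quantization noise away. The redundancy is also where the "plus some overhead" comes from: in this sketch, 2 bits on 72 coefficients costs 2.25 bits per original weight. The printed errors depend on the random seed and are not meant to reproduce any result from the paper.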

Related papers

QuantFace: Low-Bit Post-Training Quantization for One-Step Diffusion Face Restoration [109.89807858620242]
Diffusion models have been achieving remarkable performance in face restoration. The heavy computations of diffusion models make it difficult to deploy them on devices like smartphones. We propose QuantFace, a novel low-bit quantization for one-step diffusion face restoration models.
arXiv Detail & Related papers (2025-06-01T03:52:59Z)
Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design [34.04231165571518]
Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Quantization achieves this by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. Experiments show that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding.
arXiv Detail & Related papers (2025-05-28T09:55:08Z)
Low-bit Model Quantization for Deep Neural Networks: A Survey [123.89598730307208]
This article surveys the recent five-year progress towards low-bit quantization on deep neural networks (DNNs). We discuss and compare the state-of-the-art quantization methods and classify them into 8 main categories and 24 sub-categories according to their core techniques. We shed light on the potential research opportunities in the field of model quantization.
arXiv Detail & Related papers (2025-05-08T13:26:19Z)
CondiQuant: Condition Number Based Low-Bit Quantization for Image Super-Resolution [59.91470739501034]
We propose CondiQuant, a condition number based low-bit post-training quantization for image super-resolution. We show that CondiQuant outperforms existing state-of-the-art post-training quantization methods in accuracy without computation overhead.
arXiv Detail & Related papers (2025-02-21T14:04:30Z)
Masked Generative Nested Transformers with Decode Time Scaling [21.34984197218021]
In this work, we aim to address the bottleneck of inference computational efficiency in visual generation algorithms. We design a decode time model scaling schedule to utilize compute effectively, and we can cache and reuse some of the computation. Our experiments show that with almost $3\times$ less compute than baseline, our model obtains competitive performance.
arXiv Detail & Related papers (2025-02-01T09:41:01Z)
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models [58.5019443418822]
Diffusion models have been proven highly effective at generating high-quality images. As these models grow larger, they require significantly more memory and suffer from higher latency. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits.
arXiv Detail & Related papers (2024-11-07T18:59:58Z)
decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points [10.238677144792279]
decoupleQ abandons the traditional quantization paradigm and decouples the model parameters into integer and floating-point parts. Our method has achieved online accuracy near fp16/bf16 on the 2-bit quantization of large speech models in ByteDance.
arXiv Detail & Related papers (2024-04-19T10:02:53Z)
NUPES : Non-Uniform Post-Training Quantization via Power Exponent Search [7.971065005161565]
Quantization is a technique to convert floating point representations to low bit-width fixed point representations. We show how to learn new quantized weights over the entire quantized space. We show the ability of the method to achieve state-of-the-art compression rates in both data-free and data-driven configurations.
arXiv Detail & Related papers (2023-08-10T14:19:58Z)
SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference. We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention. Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
One Model for All Quantization: A Quantized Network Supporting Hot-Swap Bit-Width Adjustment [36.75157407486302]
We propose a method to train a model for all quantization that supports diverse bit-widths. We use wavelet decomposition and reconstruction to increase the diversity of weights. Our method can achieve accuracy comparable to dedicated models trained at the same precision.
arXiv Detail & Related papers (2021-05-04T08:10:50Z)
Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator. We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
arXiv Detail & Related papers (2021-04-20T14:14:03Z)
Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator. In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.