NUPES : Non-Uniform Post-Training Quantization via Power Exponent Search
- URL: http://arxiv.org/abs/2308.05600v1
- Date: Thu, 10 Aug 2023 14:19:58 GMT
- Title: NUPES : Non-Uniform Post-Training Quantization via Power Exponent Search
- Authors: Edouard Yvinec, Arnaud Dapogny and Kevin Bailly
- Abstract summary: Quantization is a technique to convert floating point representations to low bit-width fixed point representations.
We show how to learn new quantized weights over the entire quantized space.
We show the ability of the method to achieve state-of-the-art compression rates in both data-free and data-driven configurations.
- Score: 7.971065005161565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural network (DNN) deployment has been confined to larger hardware
devices due to their expensive computational requirements. This challenge has
recently reached another scale with the emergence of large language models
(LLMs). In order to reduce both their memory footprint and latency, a promising
technique is quantization. It consists in converting floating point
representations to low bit-width fixed point representations, usually by
assuming a uniform mapping onto a regular grid. This process, referred to in
the literature as uniform quantization, may however be ill-suited as most DNN
weights and activations follow a bell-shaped distribution. This is even worse
on LLMs, whose weight distributions are known to exhibit large, high-impact outlier values. In this work, we propose an improvement over the most commonly adopted way to tackle this limitation in deep learning model quantization, namely non-uniform quantization. NUPES leverages automorphisms to preserve the scalar multiplications. Such transformations are derived from power functions. However, the optimization of the exponent parameter and of the weight values remains a challenging and novel problem that could not be solved with previous post-training optimization techniques, which only learn to round weight values up or down in order to preserve the predictive function. We circumvent this limitation with a new paradigm: learning new quantized weights over the entire quantized space. Similarly, we enable the optimization of the power exponent, i.e., of the quantization operator itself, during training by alleviating the numerical instabilities. The resulting predictive function is compatible with integer-only low-bit inference. We show the ability of the method to achieve state-of-the-art compression rates in both data-free and data-driven configurations.
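
To make the core idea concrete, here is a minimal NumPy sketch of power-based non-uniform quantization as described at a high level in the abstract: weights are passed through a signed power function, rounded on a regular integer grid, and mapped back through the inverse exponent at dequantization. The bit-width, exponent values and function names below are illustrative assumptions rather than the paper's implementation, and the float simulation ignores the integer-only inference path.

import numpy as np

def power_quantize(w, a, n_bits=4):
    """Non-uniform quantization via a signed power map (illustrative sketch).

    phi_a(x) = sign(x) * |x|**a reshapes the bell-shaped weight distribution so
    that a uniform grid in the transformed space becomes a non-uniform grid in
    the original space."""
    q_max = 2 ** (n_bits - 1) - 1
    w_t = np.sign(w) * np.abs(w) ** a                   # transform to power space
    scale = np.abs(w_t).max() / q_max                   # symmetric uniform scale
    q = np.clip(np.round(w_t / scale), -q_max, q_max)   # integer codes
    return q.astype(np.int8), scale

def power_dequantize(q, scale, a):
    """Map the integer codes back through the inverse exponent 1/a."""
    w_t = q.astype(np.float32) * scale
    return np.sign(w_t) * np.abs(w_t) ** (1.0 / a)

# Toy comparison: uniform quantization (a = 1) vs. a power exponent (a = 0.5).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(256, 256)).astype(np.float32)
for a in (1.0, 0.5):
    q, s = power_quantize(w, a)
    err = np.abs(w - power_dequantize(q, s, a)).mean()
    print(f"a={a}: mean abs reconstruction error = {err:.6f}")

With bell-shaped weights, exponents below 1 allocate more grid points near zero, where most of the weight mass lies, which is the intuition behind searching over the power exponent.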
Related papers
- IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models [68.55148272295916]
We propose IntLoRA to push the efficiency limits by using integer type (INT) low-rank parameters to adapt the quantized diffusion models.
IntLoRA offers three key advantages: (i) for fine-tuning, the pre-trained weights are quantized, reducing memory usage; (ii) for storage, both pre-trained and low-rank weights are in INT which consumes less disk space; (iii) for inference, IntLoRA weights can be naturally merged into quantized pre-trained weights through efficient integer multiplication or bit-shifting.
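
As a rough illustration of the integer-only merge, here is a toy NumPy sketch (not IntLoRA's actual algorithm) that assumes the product of the low-rank scales and the base-weight scale differ by a power of two, so that merging reduces to an integer matrix multiplication followed by a bit-shift; all shapes, value ranges and the shift amount are made up for the example.

import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4

w_q = rng.integers(-127, 128, size=(d, d), dtype=np.int32)  # quantized base weights
a_q = rng.integers(-7, 8, size=(d, r), dtype=np.int32)      # quantized low-rank factor A
b_q = rng.integers(-7, 8, size=(r, d), dtype=np.int32)      # quantized low-rank factor B
shift = 6  # assumption: scale_A * scale_B == scale_W / 2**shift

update_q = (a_q @ b_q) >> shift                  # integer matmul + bit-shift rescaling
merged_q = np.clip(w_q + update_q, -127, 127)    # merged weights stay in the int8 range

print(merged_q.min(), merged_q.max())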
arXiv Detail & Related papers (2024-10-29T05:50:17Z)
- Magic for the Age of Quantized DNNs [0.6008132390640294]
We introduce a novel normalization (Layer-Batch Normalization) that is independent of the mini-batch size and does not require any additional cost during inference.
We also quantize activation functions using the same function and apply surrogate gradients to train the model with both quantized weights and the quantized activation functions.
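
A surrogate gradient for a quantized activation is commonly implemented as a straight-through estimator; the PyTorch sketch below shows that generic recipe. The bit-width, the [0, 1] clipping range and the class name are assumptions, not the paper's exact formulation.

import torch

class QuantActSTE(torch.autograd.Function):
    """Quantized activation with a straight-through surrogate gradient."""

    @staticmethod
    def forward(ctx, x, n_bits=4):
        ctx.save_for_backward(x)
        q_max = 2 ** n_bits - 1
        x_c = x.clamp(0.0, 1.0)                    # assume activations live in [0, 1]
        return torch.round(x_c * q_max) / q_max    # uniform quantization of the activation

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Straight-through estimator: pass the gradient where the input was inside
        # the clipping range, zero it elsewhere.
        mask = (x >= 0.0) & (x <= 1.0)
        return grad_out * mask.to(grad_out.dtype), None

x = torch.randn(8, requires_grad=True)
QuantActSTE.apply(x).sum().backward()
print(x.grad)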
arXiv Detail & Related papers (2024-03-22T07:21:09Z)
- Post-Training Quantization for Re-parameterization via Coarse & Fine Weight Splitting [13.270381125055275]
We propose a coarse & fine weight splitting (CFWS) method to reduce the quantization error of weights.
We develop an improved KL metric to determine optimal quantization scales for activations.
For example, the quantized RepVGG-A1 model exhibits a mere 0.3% accuracy loss.
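
For context, here is a simplified NumPy/SciPy sketch of the standard KL-divergence calibration that the paper's improved KL metric builds on: sweep candidate clipping thresholds for an activation tensor and keep the one whose coarsely quantized histogram stays closest to the original. Bin counts, the candidate grid and the handling of clipped values are simplifying assumptions.

import numpy as np
from scipy.stats import entropy

def kl_calibrate_scale(activations, n_bits=8, n_candidates=100, n_hist_bins=2048):
    """Return a per-step activation scale chosen by KL-divergence calibration."""
    acts = np.abs(np.asarray(activations).ravel())
    n_levels = 2 ** (n_bits - 1)          # assumes n_hist_bins is a multiple of this
    t_max = acts.max()
    best_t, best_kl = t_max, np.inf
    for t in np.linspace(t_max / n_candidates, t_max, n_candidates):
        # Values above t are simply ignored here; real calibrators fold them
        # into the last histogram bin.
        ref, _ = np.histogram(acts, bins=n_hist_bins, range=(0.0, t), density=True)
        coarse = ref.reshape(n_levels, -1).mean(axis=1)         # merge into quantized bins
        quantized = np.repeat(coarse, n_hist_bins // n_levels)  # expand back for comparison
        kl = entropy(ref + 1e-10, quantized + 1e-10)            # KL(reference || quantized)
        if kl < best_kl:
            best_kl, best_t = kl, t
    return best_t / (n_levels - 1)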
arXiv Detail & Related papers (2023-12-17T02:31:20Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
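
A minimal sketch of the dense-and-sparse idea, under the simplifying assumption that "outliers" are just the largest-magnitude weights above a quantile threshold (the actual method also uses sensitivity information): the outliers are kept in full precision in a sparse matrix, and the remaining dense residual is what gets quantized.

import numpy as np
from scipy import sparse

def dense_and_sparse_split(w, outlier_quantile=0.999):
    """Split w into a dense part (to be quantized) and a sparse full-precision part."""
    threshold = np.quantile(np.abs(w), outlier_quantile)
    outlier_mask = np.abs(w) > threshold
    sparse_part = sparse.csr_matrix(np.where(outlier_mask, w, 0.0))  # fp outliers
    dense_part = np.where(outlier_mask, 0.0, w)                      # easy to quantize
    return dense_part, sparse_part

rng = np.random.default_rng(0)
w = rng.standard_t(df=3, size=(512, 512)).astype(np.float32)  # heavy-tailed toy weights
dense, outliers = dense_and_sparse_split(w)
print(outliers.nnz, "outliers kept in sparse format")
print("max |w| before/after split:", float(np.abs(w).max()), float(np.abs(dense).max()))

At inference the layer output is recovered as the dequantized dense matmul plus a sparse matmul with the full-precision outliers.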
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- Neural Functional Transformers [99.98750156515437]
This paper uses the attention mechanism to define a novel set of permutation equivariant weight-space layers called neural functional Transformers (NFTs).
NFTs respect weight-space permutation symmetries while incorporating the advantages of attention, which have exhibited remarkable success across multiple domains.
We also leverage NFTs to develop Inr2Array, a novel method for computing permutation invariant representations from the weights of implicit neural representations (INRs).
arXiv Detail & Related papers (2023-05-22T23:38:27Z)
- PowerQuant: Automorphism Search for Non-Uniform Quantization [37.82255888371488]
We identify the uniformity of the quantization operator as a limitation of existing approaches, and propose a data-free non-uniform method.
We show that our approach, dubbed PowerQuant, only requires simple modifications of the quantized DNN activation functions.
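
The property that makes the power transform act as an automorphism for scalar multiplication is, as we read it, the multiplicativity of the signed power map, which also lets the rescaling by the quantization step commute with the transform:

\phi_a(x) = \operatorname{sign}(x)\,|x|^a, \qquad
\phi_a(xy) = \operatorname{sign}(xy)\,|xy|^a
           = \big(\operatorname{sign}(x)\,|x|^a\big)\big(\operatorname{sign}(y)\,|y|^a\big)
           = \phi_a(x)\,\phi_a(y).

In particular \phi_a(s\,q) = \phi_a(s)\,\phi_a(q) for a quantization scale s, which is consistent with the claim that only the activation functions need modifying.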
arXiv Detail & Related papers (2023-01-24T08:30:14Z)
- Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation [48.838691414561694]
Nonuniform-to-Uniform Quantization (N2UQ) is a method that can maintain the strong representation ability of nonuniform methods while being hardware-friendly and efficient.
N2UQ outperforms state-of-the-art nonuniform quantization methods by 0.7-1.8% on ImageNet.
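
In code, the nonuniform-to-uniform idea can be sketched as follows: non-uniformly spaced (in practice, learned) input thresholds decide which level an activation falls into, while the output levels themselves remain uniformly spaced integers, so downstream matrix multiplications stay hardware-friendly. The threshold values below are arbitrary examples, not learned ones.

import numpy as np

def n2uq_forward(x, thresholds):
    """Map inputs through non-uniform thresholds onto uniformly spaced integer levels."""
    # np.digitize returns the index of the bin each value falls into, i.e. a
    # uniformly spaced integer code, even though the thresholds are not uniform.
    return np.digitize(x, thresholds)

x = np.array([-0.9, -0.2, 0.05, 0.3, 1.4])
thresholds = np.array([-0.5, -0.1, 0.1, 0.6])   # non-uniform spacing
print(n2uq_forward(x, thresholds))              # -> [0 1 2 3 4], uniform output levels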
arXiv Detail & Related papers (2021-11-29T18:59:55Z)
- Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
- Exact Backpropagation in Binary Weighted Networks with Group Weight Transformations [0.0]
Quantization-based model compression serves as a high-performing and fast approach for inference.
Models that constrain the weights to binary values enable efficient implementation of the ubiquitous dot product.
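
The efficiency argument is that a dot product between two {-1, +1} vectors reduces to an XNOR followed by a popcount. The toy Python sketch below shows that equivalence; the packing scheme and helper names are ours, for illustration only.

import numpy as np

def pack(signs):
    """Pack a {-1, +1} vector into an integer bitmask (+1 -> bit 1, -1 -> bit 0)."""
    return sum(1 << i for i, s in enumerate(signs) if s > 0)

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n via XNOR + popcount."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # bit is 1 where the signs agree
    matches = bin(xnor).count("1")              # popcount
    return 2 * matches - n                      # matches - mismatches

a = [1, -1, 1, 1, -1]
b = [1, 1, 1, -1, -1]
print(binary_dot(pack(a), pack(b), len(a)))     # -> 1
print(int(np.dot(a, b)))                        # reference result: also 1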
arXiv Detail & Related papers (2021-07-03T10:29:34Z)
- Robust Implicit Networks via Non-Euclidean Contractions [63.91638306025768]
Implicit neural networks show improved accuracy and a significant reduction in memory consumption.
They can suffer from ill-posedness and convergence instability.
This paper provides a new framework to design well-posed and robust implicit neural networks.
arXiv Detail & Related papers (2021-06-06T18:05:02Z)