NUPES : Non-Uniform Post-Training Quantization via Power Exponent Search
- URL: http://arxiv.org/abs/2308.05600v1
- Date: Thu, 10 Aug 2023 14:19:58 GMT
- Title: NUPES : Non-Uniform Post-Training Quantization via Power Exponent Search
- Authors: Edouard Yvinec, Arnaud Dapogny and Kevin Bailly
- Abstract summary: Quantization is a technique to convert floating point representations to low bit-width fixed point representations.
We show how to learn new quantized weights over the entire quantized space.
We show the ability of the method to achieve state-of-the-art compression rates in both data-free and data-driven configurations.
- Score: 7.971065005161565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural network (DNN) deployment has been confined to larger hardware
devices due to their expensive computational requirements. This challenge has
recently reached another scale with the emergence of large language models
(LLMs). In order to reduce both their memory footprint and latency, a promising
technique is quantization. It consists in converting floating point
representations to low bit-width fixed point representations, usually by
assuming a uniform mapping onto a regular grid. This process, referred to in
the literature as uniform quantization, may however be ill-suited as most DNN
weights and activations follow a bell-shaped distribution. This is even worse
on LLMs, whose weight distributions are known to exhibit large, high-impact outlier values. In this work, we propose an improvement over the most commonly adopted way to tackle this limitation in deep learning model quantization, namely non-uniform quantization. NUPES leverages automorphisms to preserve the scalar multiplications. Such transformations are derived from power functions. However, the optimization of the exponent parameter and of the weight values remains a challenging and novel problem that could not be solved with previous post-training optimization techniques, which only learn to round weight values up or down in order to preserve the predictive function. We circumvent this limitation with a new paradigm: learning new quantized weights over the entire quantized space. Similarly, we enable the optimization of the power exponent, i.e., of the quantization operator itself, during training by alleviating the numerical instabilities. The resulting predictive function is compatible with integer-only low-bit inference. We show the ability of the method to achieve state-of-the-art compression rates in both data-free and data-driven configurations.
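
To make the core idea concrete, here is a minimal NumPy sketch of power-based non-uniform quantization as described at a high level in the abstract: weights are passed through a signed power function, rounded on a regular integer grid, and mapped back through the inverse exponent at dequantization. The bit-width, exponent values and function names below are illustrative assumptions rather than the paper's implementation, and the float simulation ignores the integer-only inference path.

import numpy as np

def power_quantize(w, a, n_bits=4):
    """Non-uniform quantization via a signed power map (illustrative sketch).

    phi_a(x) = sign(x) * |x|**a reshapes the bell-shaped weight distribution so
    that a uniform grid in the transformed space becomes a non-uniform grid in
    the original space."""
    q_max = 2 ** (n_bits - 1) - 1
    w_t = np.sign(w) * np.abs(w) ** a                   # transform to power space
    scale = np.abs(w_t).max() / q_max                   # symmetric uniform scale
    q = np.clip(np.round(w_t / scale), -q_max, q_max)   # integer codes
    return q.astype(np.int8), scale

def power_dequantize(q, scale, a):
    """Map the integer codes back through the inverse exponent 1/a."""
    w_t = q.astype(np.float32) * scale
    return np.sign(w_t) * np.abs(w_t) ** (1.0 / a)

# Toy comparison: uniform quantization (a = 1) vs. a power exponent (a = 0.5).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(256, 256)).astype(np.float32)
for a in (1.0, 0.5):
    q, s = power_quantize(w, a)
    err = np.abs(w - power_dequantize(q, s, a)).mean()
    print(f"a={a}: mean abs reconstruction error = {err:.6f}")

With bell-shaped weights, exponents below 1 allocate more grid points near zero, where most of the weight mass lies, which is the intuition behind searching over the power exponent.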
Related papers
- IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models [68.55148272295916]
We propose IntLoRA to push the efficiency limits by using integer type (INT) low-rank parameters to adapt the quantized diffusion models.
IntLoRA offers three key advantages: (i) for fine-tuning, the pre-trained weights are quantized, reducing memory usage; (ii) for storage, both pre-trained and low-rank weights are in INT which consumes less disk space; (iii) for inference, IntLoRA weights can be naturally merged into quantized pre-trained weights through efficient integer multiplication or bit-shifting.
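
As a rough illustration of the integer-only merge, here is a toy NumPy sketch (not IntLoRA's actual algorithm) that assumes the product of the low-rank scales and the base-weight scale differ by a power of two, so that merging reduces to an integer matrix multiplication followed by a bit-shift; all shapes, value ranges and the shift amount are made up for the example.

import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4

w_q = rng.integers(-127, 128, size=(d, d), dtype=np.int32)  # quantized base weights
a_q = rng.integers(-7, 8, size=(d, r), dtype=np.int32)      # quantized low-rank factor A
b_q = rng.integers(-7, 8, size=(r, d), dtype=np.int32)      # quantized low-rank factor B
shift = 6  # assumption: scale_A * scale_B == scale_W / 2**shift

update_q = (a_q @ b_q) >> shift                  # integer matmul + bit-shift rescaling
merged_q = np.clip(w_q + update_q, -127, 127)    # merged weights stay in the int8 range

print(merged_q.min(), merged_q.max())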
arXiv Detail & Related papers (2024-10-29T05:50:17Z)
- Magic for the Age of Quantized DNNs [0.6008132390640294]
We introduce a novel normalization (Layer-Batch Normalization) that is independent of the mini-batch size and does not require any additional cost during inference.
We also quantize activation functions using the same function and apply surrogate gradients to train the model with both quantized weights and the quantized activation functions.
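
A surrogate gradient for a quantized activation is commonly implemented as a straight-through estimator; the PyTorch sketch below shows that generic recipe. The bit-width, the [0, 1] clipping range and the class name are assumptions, not the paper's exact formulation.

import torch

class QuantActSTE(torch.autograd.Function):
    """Quantized activation with a straight-through surrogate gradient."""

    @staticmethod
    def forward(ctx, x, n_bits=4):
        ctx.save_for_backward(x)
        q_max = 2 ** n_bits - 1
        x_c = x.clamp(0.0, 1.0)                    # assume activations live in [0, 1]
        return torch.round(x_c * q_max) / q_max    # uniform quantization of the activation

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Straight-through estimator: pass the gradient where the input was inside
        # the clipping range, zero it elsewhere.
        mask = (x >= 0.0) & (x <= 1.0)
        return grad_out * mask.to(grad_out.dtype), None

x = torch.randn(8, requires_grad=True)
QuantActSTE.apply(x).sum().backward()
print(x.grad)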
arXiv Detail & Related papers (2024-03-22T07:21:09Z)
- Post-Training Quantization for Re-parameterization via Coarse & Fine Weight Splitting [13.270381125055275]
We propose a coarse & fine weight splitting (CFWS) method to reduce the quantization error of weights.
We develop an improved KL metric to determine optimal quantization scales for activations.
For example, the quantized RepVGG-A1 model exhibits a mere 0.3% accuracy loss.
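
For context, here is a simplified NumPy/SciPy sketch of the standard KL-divergence calibration that the paper's improved KL metric builds on: sweep candidate clipping thresholds for an activation tensor and keep the one whose coarsely quantized histogram stays closest to the original. Bin counts, the candidate grid and the handling of clipped values are simplifying assumptions.

import numpy as np
from scipy.stats import entropy

def kl_calibrate_scale(activations, n_bits=8, n_candidates=100, n_hist_bins=2048):
    """Return a per-step activation scale chosen by KL-divergence calibration."""
    acts = np.abs(np.asarray(activations).ravel())
    n_levels = 2 ** (n_bits - 1)          # assumes n_hist_bins is a multiple of this
    t_max = acts.max()
    best_t, best_kl = t_max, np.inf
    for t in np.linspace(t_max / n_candidates, t_max, n_candidates):
        # Values above t are simply ignored here; real calibrators fold them
        # into the last histogram bin.
        ref, _ = np.histogram(acts, bins=n_hist_bins, range=(0.0, t), density=True)
        coarse = ref.reshape(n_levels, -1).mean(axis=1)         # merge into quantized bins
        quantized = np.repeat(coarse, n_hist_bins // n_levels)  # expand back for comparison
        kl = entropy(ref + 1e-10, quantized + 1e-10)            # KL(reference || quantized)
        if kl < best_kl:
            best_kl, best_t = kl, t
    return best_t / (n_levels - 1)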
arXiv Detail & Related papers (2023-12-17T02:31:20Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
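
A minimal sketch of the dense-and-sparse idea, under the simplifying assumption that "outliers" are just the largest-magnitude weights above a quantile threshold (the actual method also uses sensitivity information): the outliers are kept in full precision in a sparse matrix, and the remaining dense residual is what gets quantized.

import numpy as np
from scipy import sparse

def dense_and_sparse_split(w, outlier_quantile=0.999):
    """Split w into a dense part (to be quantized) and a sparse full-precision part."""
    threshold = np.quantile(np.abs(w), outlier_quantile)
    outlier_mask = np.abs(w) > threshold
    sparse_part = sparse.csr_matrix(np.where(outlier_mask, w, 0.0))  # fp outliers
    dense_part = np.where(outlier_mask, 0.0, w)                      # easy to quantize
    return dense_part, sparse_part

rng = np.random.default_rng(0)
w = rng.standard_t(df=3, size=(512, 512)).astype(np.float32)  # heavy-tailed toy weights
dense, outliers = dense_and_sparse_split(w)
print(outliers.nnz, "outliers kept in sparse format")
print("max |w| before/after split:", float(np.abs(w).max()), float(np.abs(dense).max()))

At inference the layer output is recovered as the dequantized dense matmul plus a sparse matmul with the full-precision outliers.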
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- Neural Functional Transformers [99.98750156515437]
This paper uses the attention mechanism to define a novel set of permutation equivariant weight-space layers called neural functional Transformers (NFTs).
NFTs respect weight-space permutation symmetries while incorporating the advantages of attention, which have exhibited remarkable success across multiple domains.
We also leverage NFTs to develop Inr2Array, a novel method for computing permutation invariant representations from the weights of implicit neural representations (INRs).
arXiv Detail & Related papers (2023-05-22T23:38:27Z)
- PowerQuant: Automorphism Search for Non-Uniform Quantization [37.82255888371488]
We identify the uniformity of the quantization operator as a limitation of existing approaches, and propose a data-free non-uniform method.
We show that our approach, dubbed PowerQuant, only requires simple modifications of the quantized DNN activation functions.
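
The property that makes the power transform act as an automorphism for scalar multiplication is, as we read it, the multiplicativity of the signed power map, which also lets the rescaling by the quantization step commute with the transform:

\phi_a(x) = \operatorname{sign}(x)\,|x|^a, \qquad
\phi_a(xy) = \operatorname{sign}(xy)\,|xy|^a
           = \big(\operatorname{sign}(x)\,|x|^a\big)\big(\operatorname{sign}(y)\,|y|^a\big)
           = \phi_a(x)\,\phi_a(y).

In particular \phi_a(s\,q) = \phi_a(s)\,\phi_a(q) for a quantization scale s, which is consistent with the claim that only the activation functions need modifying.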
arXiv Detail & Related papers (2023-01-24T08:30:14Z)
- Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation [48.838691414561694]
Nonuniform-to-Uniform Quantization (N2UQ) is a method that can maintain the strong representation ability of nonuniform methods while being hardware-friendly and efficient.
N2UQ outperforms state-of-the-art nonuniform quantization methods by 0.7-1.8% on ImageNet.
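
In code, the nonuniform-to-uniform idea can be sketched as follows: non-uniformly spaced (in practice, learned) input thresholds decide which level an activation falls into, while the output levels themselves remain uniformly spaced integers, so downstream matrix multiplications stay hardware-friendly. The threshold values below are arbitrary examples, not learned ones.

import numpy as np

def n2uq_forward(x, thresholds):
    """Map inputs through non-uniform thresholds onto uniformly spaced integer levels."""
    # np.digitize returns the index of the bin each value falls into, i.e. a
    # uniformly spaced integer code, even though the thresholds are not uniform.
    return np.digitize(x, thresholds)

x = np.array([-0.9, -0.2, 0.05, 0.3, 1.4])
thresholds = np.array([-0.5, -0.1, 0.1, 0.6])   # non-uniform spacing
print(n2uq_forward(x, thresholds))              # -> [0 1 2 3 4], uniform output levels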
arXiv Detail & Related papers (2021-11-29T18:59:55Z)
- Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
- Exact Backpropagation in Binary Weighted Networks with Group Weight Transformations [0.0]
Quantization-based model compression serves as a high-performing and fast approach for inference.
Models that constrain the weights to binary values enable efficient implementation of the ubiquitous dot product.
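
The efficiency argument is that a dot product between two {-1, +1} vectors reduces to an XNOR followed by a popcount. The toy Python sketch below shows that equivalence; the packing scheme and helper names are ours, for illustration only.

import numpy as np

def pack(signs):
    """Pack a {-1, +1} vector into an integer bitmask (+1 -> bit 1, -1 -> bit 0)."""
    return sum(1 << i for i, s in enumerate(signs) if s > 0)

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n via XNOR + popcount."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # bit is 1 where the signs agree
    matches = bin(xnor).count("1")              # popcount
    return 2 * matches - n                      # matches - mismatches

a = [1, -1, 1, 1, -1]
b = [1, 1, 1, -1, -1]
print(binary_dot(pack(a), pack(b), len(a)))     # -> 1
print(int(np.dot(a, b)))                        # reference result: also 1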
arXiv Detail & Related papers (2021-07-03T10:29:34Z)
- Robust Implicit Networks via Non-Euclidean Contractions [63.91638306025768]
Implicit neural networks show improved accuracy and a significant reduction in memory consumption.
They can suffer from ill-posedness and convergence instability.
This paper provides a new framework to design well-posed and robust implicit neural networks.
arXiv Detail & Related papers (2021-06-06T18:05:02Z)