Quantized Neural Networks for Low-Precision Accumulation with Guaranteed
Overflow Avoidance
- URL: http://arxiv.org/abs/2301.13376v1
- Date: Tue, 31 Jan 2023 02:46:57 GMT
- Title: Quantized Neural Networks for Low-Precision Accumulation with Guaranteed
Overflow Avoidance
- Authors: Ian Colbert, Alessandro Pappalardo, Jakoba Petri-Koenig
- Abstract summary: We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
- Score: 68.8204255655161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a quantization-aware training algorithm that guarantees avoiding
numerical overflow when reducing the precision of accumulators during
inference. We leverage weight normalization as a means of constraining
parameters during training using accumulator bit width bounds that we derive.
We evaluate our algorithm across multiple quantized models that we train for
different tasks, showing that our approach can reduce the precision of
accumulators while maintaining model accuracy with respect to a floating-point
baseline. We then show that this reduction translates to increased design
efficiency for custom FPGA-based accelerators. Finally, we show that our
algorithm not only constrains weights to fit into an accumulator of
user-defined bit width, but also increases the sparsity and compressibility of
the resulting weights. Across all of our benchmark models trained with 8-bit
weights and activations, we observe that constraining the hidden layers of
quantized neural networks to fit into 16-bit accumulators yields an average
98.2% sparsity with an estimated compression rate of 46.5x, all while
maintaining 99.2% of the floating-point performance.
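The overflow-avoidance argument above can be made concrete with a little integer arithmetic. The sketch below is a simplification, assuming unsigned N-bit activations, signed integer weights, and a signed P-bit accumulator (the paper's derived bounds may differ in detail): it computes the smallest accumulator width a given dot product can require and, conversely, the L1 budget on the integer weights that a 16-bit accumulator tolerates.

```python
# Minimal sketch of the overflow-avoidance arithmetic described in the abstract.
# Assumptions (not the paper's exact derivation): unsigned N-bit activations,
# signed integer weights, signed P-bit two's-complement accumulator.
import math

import numpy as np


def accumulator_bits_needed(weights_int: np.ndarray, act_bits: int) -> int:
    """Smallest signed accumulator width that cannot overflow for this dot product.

    Worst-case magnitude of sum_i w_i * x_i with x_i in [0, 2**act_bits - 1]
    is ||w||_1 * (2**act_bits - 1).
    """
    worst_case = int(np.abs(weights_int).sum()) * (2 ** act_bits - 1)
    # A signed P-bit accumulator holds magnitudes up to 2**(P-1) - 1.
    return max(1, math.ceil(math.log2(worst_case + 1)) + 1)


def l1_limit(acc_bits: int, act_bits: int) -> float:
    """Largest L1-norm of the integer weights that a signed acc_bits accumulator tolerates."""
    return (2 ** (acc_bits - 1) - 1) / (2 ** act_bits - 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.integers(-8, 8, size=512)          # toy low-precision integer weights
    print("accumulator bits needed:", accumulator_bits_needed(w, act_bits=8))
    print("L1 budget for a 16-bit accumulator:", l1_limit(16, act_bits=8))
```

Keeping the weights' L1-norm under such a budget is also what tends to drive many weights to exactly zero, which is consistent with the sparsity figures reported above.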
Related papers
- MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search [7.564770908909927]
Quantization is a technique for creating efficient Deep Neural Networks (DNNs).
We propose MixQuant, a search algorithm that finds the optimal custom quantization bit-width for each layer weight based on roundoff error.
We show that combining MixQuant with BRECQ, a state-of-the-art quantization method, yields better quantized model accuracy than BRECQ alone.
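A toy illustration of the idea, not the paper's actual search: pick the smallest per-layer bit width whose quantization roundoff error stays under a tolerance. The candidate set, error metric, and threshold below are assumptions made for the sketch.

```python
# Toy per-layer bit-width selection driven by quantization roundoff error,
# in the spirit of the MixQuant summary above (the real search is more involved).
import numpy as np


def roundoff_error(w: np.ndarray, bits: int) -> float:
    """Mean squared error of symmetric uniform quantization at the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(w).max()), 1e-12) / qmax
    w_q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    return float(np.mean((w - w_q) ** 2))


def choose_bit_widths(layers: dict, candidates=(2, 3, 4, 6, 8), tol: float = 1e-4) -> dict:
    """Smallest candidate bit width per layer whose roundoff error stays under tol."""
    chosen = {}
    for name, w in layers.items():
        chosen[name] = next((b for b in candidates if roundoff_error(w, b) <= tol),
                            candidates[-1])
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = {"conv1": rng.normal(0, 0.05, (64, 27)), "fc": rng.normal(0, 0.2, (10, 256))}
    print(choose_bit_widths(layers))
```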
arXiv Detail & Related papers (2023-09-29T15:49:54Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance [49.1574468325115]
Accumulator-aware quantization (A2Q) is a novel weight quantization method designed to train quantized neural networks (QNNs) to avoid overflow during inference.
A2Q introduces a unique formulation inspired by weight normalization that constrains the L1-norm of model weights according to accumulator bit width bounds.
We show A2Q can train QNNs for low-precision accumulators while maintaining model accuracy competitive with a floating-point baseline.
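A minimal PyTorch-style sketch of the constraint the summary describes, assuming a weight-normalization parameterization w = g * v / ||v||_1 with g clamped to an accumulator-derived L1 budget; the paper's exact parameterization and quantizer are not reproduced here.

```python
# Sketch of an L1-norm-constrained, weight-normalization-style layer, assuming
# PyTorch. l1_budget would come from the accumulator bit-width bound; the exact
# A2Q formulation may differ.
import torch
import torch.nn as nn


class L1ConstrainedLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, l1_budget: float):
        super().__init__()
        self.v = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.g = nn.Parameter(torch.ones(out_features))   # learned per-channel norm
        self.l1_budget = l1_budget                        # from the accumulator bound

    def weight(self) -> torch.Tensor:
        # Clamp the learned norm so ||w_row||_1 = |g| <= l1_budget for every output row.
        g = self.g.clamp(-self.l1_budget, self.l1_budget)
        v_norm = self.v.abs().sum(dim=1, keepdim=True).clamp_min(1e-12)
        return g.unsqueeze(1) * self.v / v_norm

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight().t()


if __name__ == "__main__":
    layer = L1ConstrainedLinear(128, 64, l1_budget=127.0)
    y = layer(torch.randn(4, 128))
    print(y.shape, layer.weight().abs().sum(dim=1).max().item())
```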
arXiv Detail & Related papers (2023-08-25T17:28:58Z)
- MINT: Multiplier-less INTeger Quantization for Energy Efficient Spiking Neural Networks [20.473852621915956]
We propose a uniform quantization scheme that efficiently compresses weights and membrane potentials in spiking neural networks (SNNs).
MINT quantizes membrane potentials to an extremely low precision (2-bit), significantly reducing the memory footprint.
Experimental results show that our method matches the accuracy of full-precision models and other state-of-the-art SNN quantization techniques.
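To see why a 2-bit membrane potential shrinks the memory footprint, here is a toy integrate-and-fire step with integer weights and a clipped low-precision potential. MINT's actual multiplier-less arithmetic and scale sharing are not modeled; all names below are illustrative.

```python
# Toy integrate-and-fire update with a 2-bit membrane potential, illustrating the
# low-precision-state idea in the MINT summary (not the paper's exact scheme).
import numpy as np


def lif_step(u_q: np.ndarray, w_q: np.ndarray, spikes_in: np.ndarray,
             u_bits: int = 2, threshold: int = 1):
    """One integrate-and-fire step with the membrane potential clipped to u_bits."""
    qmax = 2 ** (u_bits - 1) - 1
    u_new = np.clip(u_q + w_q @ spikes_in, -qmax - 1, qmax)   # integer accumulate, then clip
    spikes_out = (u_new >= threshold).astype(np.int8)         # fire where threshold reached
    u_new = np.where(spikes_out == 1, 0, u_new)               # reset fired neurons
    return u_new.astype(np.int8), spikes_out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_q = rng.integers(-1, 2, size=(16, 32))        # toy ternary integer weights
    u = np.zeros(16, dtype=np.int8)                 # 2-bit membrane potentials in {-2..1}
    s = rng.integers(0, 2, size=32)                 # binary input spikes
    u, out = lif_step(u, w_q, s)
    print(u, out)
```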
arXiv Detail & Related papers (2023-05-16T23:38:35Z)
- Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets [7.5195830365852085]
We propose a novel sub-8-bit quantization-aware training algorithm for all components of a 250K-parameter feedforward, streaming, state-free keyword spotting model.
We conduct large scale experiments, training on 26,000 hours of de-identified production, far-field and near-field audio data.
arXiv Detail & Related papers (2022-07-13T17:46:08Z)
- VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [7.886868529510128]
Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors.
Excessive quantization, reducing precision too aggressively, results in accuracy degradation.
Per-vector scale factors can be implemented with low-bitwidth integers when using a two-level quantization scheme.
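A minimal sketch of that two-level, per-vector scheme: each small vector of weights gets its own fine scale, and the fine scales are themselves stored as low-bit integers under a single coarse floating-point scale. The vector size and bit widths below are assumptions, not the paper's chosen configuration.

```python
# Two-level per-vector scaled quantization sketch in the spirit of the VS-Quant
# summary above: fine per-vector scales stored as low-bit integers under one
# coarse float scale.
import numpy as np


def per_vector_quantize(w: np.ndarray, vec_size: int = 16,
                        w_bits: int = 4, scale_bits: int = 4):
    w = w.reshape(-1, vec_size)
    qmax = 2 ** (w_bits - 1) - 1

    # Level 1: one fine scale per vector.
    fine = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12) / qmax

    # Level 2: store the fine scales as low-bit integers under a single coarse scale.
    smax = 2 ** scale_bits - 1
    coarse = fine.max() / smax
    fine_int = np.clip(np.round(fine / coarse), 1, smax)

    w_int = np.clip(np.round(w / (fine_int * coarse)), -qmax - 1, qmax)
    return w_int.astype(np.int8), fine_int.astype(np.uint8), float(coarse)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0, 0.05, 1024)
    w_int, s_int, coarse = per_vector_quantize(w)
    w_hat = (w_int * s_int * coarse).reshape(-1)    # dequantize to check error
    print("MSE:", float(np.mean((w - w_hat) ** 2)))
```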
arXiv Detail & Related papers (2021-02-08T19:56:04Z)
- Non-Parametric Adaptive Network Pruning [125.4414216272874]
We introduce non-parametric modeling to simplify the algorithm design.
Inspired by the face recognition community, we use a message passing algorithm to obtain an adaptive number of exemplars.
EPruner breaks the dependency on training data when determining the "important" filters.
arXiv Detail & Related papers (2021-01-20T06:18:38Z)
- DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer a severe performance drop at ultra-low precision (bit-widths of 4 or lower) or require a heavy fine-tuning process to recover performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
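A loose, training-free sketch in the spirit of that summary: derive each channel's quantization range from its activation statistics rather than the raw min/max. The k-sigma clipping rule below is an assumption for illustration, not DAQ's exact formulation.

```python
# Distribution-based, training-free per-channel activation quantization sketch
# (illustrative only; DAQ's actual rule for adapting the interval differs).
import numpy as np


def distribution_aware_quantize(x: np.ndarray, bits: int = 4, k: float = 3.0):
    """Per-channel quantization of activations shaped (channels, n) without fine-tuning."""
    qmax = 2 ** (bits - 1) - 1
    # Channel-wise range from the distribution: mean +/- k * std rather than min/max,
    # so a few outliers do not blow up the step size.
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True)
    scale = np.maximum(k * sigma, 1e-12) / qmax
    x_int = np.clip(np.round((x - mu) / scale), -qmax - 1, qmax)
    return x_int.astype(np.int8), mu, scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.normal(0.2, 1.0, (8, 4096)) + rng.standard_t(2, (8, 4096)) * 0.05  # heavy tails
    x_int, mu, scale = distribution_aware_quantize(acts)
    x_hat = x_int * scale + mu
    print("4-bit MSE:", float(np.mean((acts - x_hat) ** 2)))
```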
arXiv Detail & Related papers (2020-12-21T10:19:42Z)
- WrapNet: Neural Net Inference with Ultra-Low-Resolution Arithmetic [57.07483440807549]
We propose a method that adapts neural networks to use low-resolution (8-bit) additions in the accumulators, achieving classification accuracy comparable to their 32-bit counterparts.
We demonstrate the efficacy of our approach on both software and hardware platforms.
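What "low-resolution additions in the accumulators" means can be shown with a simulated 8-bit two's-complement accumulator. WrapNet trains the network to tolerate the resulting wraparound; the toy code below only exhibits the wraparound itself.

```python
# Dot product accumulated in a simulated 8-bit two's-complement register,
# illustrating the low-resolution accumulation the WrapNet summary refers to.
import numpy as np


def wrapping_dot(w_int: np.ndarray, x_int: np.ndarray, acc_bits: int = 8) -> int:
    """Dot product accumulated modulo 2**acc_bits, reinterpreted as a signed value."""
    mod = 1 << acc_bits
    acc = 0
    for wi, xi in zip(w_int.tolist(), x_int.tolist()):
        acc = (acc + wi * xi) % mod                 # low-resolution accumulate
    return acc - mod if acc >= mod // 2 else acc    # two's-complement reinterpretation


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.integers(-8, 8, 256)
    x = rng.integers(0, 16, 256)
    print("wide accumulator :", int(w @ x))
    print("8-bit accumulator:", wrapping_dot(w, x, acc_bits=8))
```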
arXiv Detail & Related papers (2020-07-26T23:18:38Z)