MGS: Markov Greedy Sums for Accurate Low-Bitwidth Floating-Point Accumulation
- URL: http://arxiv.org/abs/2504.09072v1
- Date: Sat, 12 Apr 2025 04:19:03 GMT
- Title: MGS: Markov Greedy Sums for Accurate Low-Bitwidth Floating-Point Accumulation
- Authors: Vikas Natesh, H. T. Kung, David Kong
- Abstract summary: MGS (Markov Greedy Sums) is a novel approach to improve the accuracy of low-bitwidth floating-point dot products in neural network computations. We design, analyze, and implement the algorithm to minimize 8-bit floating point error at inference time for several neural networks.
- Score: 3.638431342539701
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We offer a novel approach, MGS (Markov Greedy Sums), to improve the accuracy of low-bitwidth floating-point dot products in neural network computations. In conventional 32-bit floating-point summation, adding values with different exponents may lead to loss of precision in the mantissa of the smaller term, which is right-shifted to align with the larger term's exponent. Such shifting (a.k.a. 'swamping') is a significant source of numerical error in accumulation when implementing low-bitwidth dot products (e.g., 8-bit floating point), as the mantissa has only a small number of bits. We avoid most swamping errors by arranging the terms in dot product summation based on their exponents and summing the mantissas without overflowing the low-bitwidth accumulator. We design, analyze, and implement the algorithm to minimize 8-bit floating-point error at inference time for several neural networks. In contrast to traditional sequential summation, our method significantly lowers numerical errors, achieving classification accuracy on par with high-precision floating-point baselines for multiple image classification tasks. Our dMAC hardware units can reduce power consumption by up to 34.1% relative to conventional MAC units.
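To make swamping and the benefit of exponent-aware ordering concrete, here is a minimal Python sketch. It emulates a tiny 3-bit mantissa and compares naive sequential accumulation against adding terms in ascending exponent order; the function names and parameters are illustrative only, and the sketch captures just the reordering idea, not the full MGS algorithm or the dMAC hardware.

```python
import math

def round_to_mantissa(x: float, mant_bits: int) -> float:
    """Round x to the nearest value representable with `mant_bits` fractional
    mantissa bits (a crude emulation of a low-precision float)."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))    # exponent of x
    step = 2.0 ** (e - mant_bits)        # spacing of representable values near x
    return round(x / step) * step

def sequential_sum(vals, mant_bits=3):
    """Naive running sum: every partial sum is rounded, so once the accumulator
    is large, small addends are shifted out of the mantissa ('swamping')."""
    acc = 0.0
    for v in vals:
        acc = round_to_mantissa(acc + v, mant_bits)
    return acc

def exponent_ordered_sum(vals, mant_bits=3):
    """Illustrative reordering: add terms in ascending exponent order, so small
    values combine with each other before they ever meet a large term."""
    ordered = sorted(vals, key=lambda v: math.frexp(v)[1] if v != 0.0 else -(10**9))
    acc = 0.0
    for v in ordered:
        acc = round_to_mantissa(acc + v, mant_bits)
    return acc

vals = [100.0] + [0.5] * 64         # exact sum is 132.0
print(sequential_sum(vals))         # far below 132: every 0.5 is swamped by the 100
print(exponent_ordered_sum(vals))   # noticeably closer to 132: small terms add up first
```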
Related papers
- PQS (Prune, Quantize, and Sort): Low-Bitwidth Accumulation of Dot Products in Neural Network Computations [4.089232204089156]
We present PQS, which uses three techniques together - Prune, Quantize, and Sort - to achieve low-bitwidth accumulation of dot products in neural network computations. Our method offers a 2.5x reduction in accumulator bitwidth while achieving model accuracy on par with floating-point baselines for multiple image classification tasks.
arXiv Detail & Related papers (2025-04-12T03:51:42Z)
- Addition is All You Need for Energy-efficient Language Models [13.063639073834906]
A floating point multiplier can be approximated by one integer adder with high precision.
We propose the linear-complexity multiplication L-Mul algorithm that approximates floating point number multiplication with integer addition operations.
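The basic observation can be illustrated with IEEE-754 bit patterns: adding the integer encodings of two positive floats (and subtracting the encoding of 1.0) adds their exponents exactly and adds their mantissa fractions in place of multiplying them, dropping only a small cross term. The sketch below shows that generic approximation; it is not the L-Mul algorithm itself, which is designed to be substantially more accurate.

```python
import struct

def f32_bits(x: float) -> int:
    """Reinterpret a float32 as its 32-bit unsigned integer encoding."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def bits_f32(b: int) -> float:
    """Reinterpret a 32-bit unsigned integer as a float32."""
    return struct.unpack('<f', struct.pack('<I', b & 0xFFFFFFFF))[0]

ONE = f32_bits(1.0)  # 0x3F800000, cancels the doubled exponent bias

def approx_mul(x: float, y: float) -> float:
    """Approximate x * y for positive floats with a single integer addition on
    their bit patterns: exponents add exactly, mantissa fractions add instead
    of multiplying (the small cross term is dropped)."""
    return bits_f32(f32_bits(x) + f32_bits(y) - ONE)

print(approx_mul(1.5, 2.5), 1.5 * 2.5)   # approximation vs. exact product
print(approx_mul(3.1, 0.7), 3.1 * 0.7)
```

The naive version here has a worst-case relative error of roughly 11% (Mitchell's classic logarithmic multiplication).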
arXiv Detail & Related papers (2024-10-01T17:53:28Z)
- Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs [39.410068572891475]
Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead.
Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference.
We present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model.
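As a rough picture of what casting to a minifloat involves, the toy function below rounds a value to a configurable exponent/mantissa budget with saturation. It ignores subnormals, NaN/Inf encodings, and the specific FP8/minifloat variants evaluated in the paper, so it is only a sketch of the idea.

```python
import math

def to_minifloat(x: float, exp_bits: int = 4, mant_bits: int = 3) -> float:
    """Round x to a toy minifloat with the given exponent/mantissa widths.
    Saturates on overflow; ignores subnormals, NaN, and Inf (illustration only)."""
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    e = math.floor(math.log2(abs(x)))
    e = max(min(e, bias), 1 - bias)              # clamp exponent to the normal range
    step = 2.0 ** (e - mant_bits)                # value spacing at this exponent
    q = round(x / step) * step
    max_val = (2 - 2.0 ** -mant_bits) * 2.0 ** bias
    return max(min(q, max_val), -max_val)        # saturate instead of overflowing

print(to_minifloat(0.1337))   # rounded to the nearest representable value
print(to_minifloat(300.0))    # clipped to the largest representable magnitude
```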
arXiv Detail & Related papers (2023-11-21T05:27:16Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
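The core trick of a lookup-table kernel fits in a few lines: with 2-bit codes there are only 16 possible weight-activation products, so they can be precomputed once and a dot product reduces to lookups and additions. The codebooks below are hypothetical, and the real implementation uses SIMD shuffle/table instructions rather than Python dictionaries.

```python
import itertools

# Hypothetical 2-bit codebooks (illustration only).
W_LEVELS = [-1.5, -0.5, 0.5, 1.5]   # 2-bit weight levels
A_LEVELS = [0.0, 1.0, 2.0, 3.0]     # 2-bit activation levels

# Precompute all 16 possible weight-activation products once.
LUT = {(wi, ai): W_LEVELS[wi] * A_LEVELS[ai]
       for wi, ai in itertools.product(range(4), range(4))}

def lut_dot(w_codes, a_codes):
    """Dot product computed purely with table lookups and additions."""
    return sum(LUT[w, a] for w, a in zip(w_codes, a_codes))

# -1.5*1 + 1.5*2 + 0.5*0 + (-0.5)*3 = 0.0
print(lut_dot([0, 3, 2, 1], [1, 2, 0, 3]))
```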
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
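One common way to reason about overflow: the worst-case magnitude of a dot product is bounded by the L1 norm of the quantized weights times the largest activation value, which fixes a minimum safe accumulator width. The helper below computes that generic bound; it is not the paper's exact criterion, but constraining weights during training so the bound fits a narrow accumulator is the underlying idea.

```python
import math

def min_accumulator_bits(weight_l1_norm: int, act_bits: int, act_signed: bool = False) -> int:
    """Worst-case signed accumulator width for a dot product whose quantized
    weights have the given integer L1 norm and whose activations use act_bits bits.
    Generic bound for illustration, not the paper's exact criterion."""
    act_max = (2 ** (act_bits - 1) - 1) if act_signed else (2 ** act_bits - 1)
    max_abs = weight_l1_norm * act_max               # largest possible |dot product|
    return math.ceil(math.log2(max_abs + 1)) + 1     # +1 for the sign bit

# e.g. 512 int8 weights of maximum magnitude 127 against uint8 activations
print(min_accumulator_bits(weight_l1_norm=512 * 127, act_bits=8))  # -> 25
```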
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- AMED: Automatic Mixed-Precision Quantization for Edge Devices [3.5223695602582614]
Quantized neural networks are well known for reducing latency, power consumption, and model size without significantly harming performance.
Mixed-precision quantization offers better utilization of customized hardware that supports arithmetic operations at different bitwidths.
arXiv Detail & Related papers (2022-05-30T21:23:22Z)
- Deep Neural Networks to Correct Sub-Precision Errors in CFD [0.0]
Several machine learning techniques have been successful in correcting the errors arising from spatial discretization.
We employ a Convolutional Neural Network together with a fully differentiable numerical solver performing 16-bit arithmetic to learn a tightly-coupled ML-CFD hybrid solver.
Compared to the 16-bit solver, we demonstrate that the ML-CFD hybrid solver reduces error accumulation in the velocity field and improves the kinetic energy spectrum at higher frequencies.
arXiv Detail & Related papers (2022-02-09T02:32:40Z)
- SignalNet: A Low Resolution Sinusoid Decomposition and Estimation Network [79.04274563889548]
We propose SignalNet, a neural network architecture that detects the number of sinusoids and estimates their parameters from quantized in-phase and quadrature samples.
We introduce a worst-case learning threshold for comparing the results of our network against the underlying data distributions.
In simulation, we find that our algorithm is always able to surpass the threshold for three-bit data but often cannot exceed the threshold for one-bit data.
arXiv Detail & Related papers (2021-06-10T04:21:20Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
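Dyadic (integer over power-of-two) scale factors let an integer accumulator be rescaled with just an integer multiply and a shift, with no floating-point operations at inference time. The sketch below shows that generic conversion and requantization step as used in integer-only pipelines; the function names and the 15-bit multiplier width are illustrative, and this is not HAWQV3's specific procedure.

```python
def to_dyadic(scale: float, bits: int = 15) -> tuple:
    """Approximate a positive float scale as m / 2**e with an integer m of at
    most `bits` bits, so rescaling needs only an integer multiply and a shift."""
    e = 0
    while scale * (1 << e) < (1 << (bits - 1)) and e < 62:
        e += 1
    m = round(scale * (1 << e))
    return m, e

def requantize(acc: int, m: int, e: int) -> int:
    """Integer-only rescale of an integer accumulator: round((acc * m) / 2**e)."""
    rounding = (1 << (e - 1)) if e > 0 else 0
    return (acc * m + rounding) >> e

m, e = to_dyadic(0.0123)
print(m, e, m / (1 << e))                       # dyadic approximation of the scale 0.0123
print(requantize(10000, m, e), 10000 * 0.0123)  # integer path vs. float reference
```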
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- Bayesian Bits: Unifying Quantization and Pruning [73.27732135853243]
We introduce Bayesian Bits, a practical method for joint mixed precision quantization and pruning through gradient based optimization.
We experimentally validate our proposed method on several benchmark datasets and show that we can learn pruned, mixed precision networks.
arXiv Detail & Related papers (2020-05-14T16:00:34Z)