Addition is All You Need for Energy-efficient Language Models
- URL: http://arxiv.org/abs/2410.00907v2
- Date: Wed, 2 Oct 2024 15:34:12 GMT
- Title: Addition is All You Need for Energy-efficient Language Models
- Authors: Hongyin Luo, Wei Sun
- Abstract summary: A floating point multiplier can be approximated by one integer adder with high precision.
We propose the linear-complexity multiplication L-Mul algorithm that approximates floating point number multiplication with integer addition operations.
- Score: 13.063639073834906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large neural networks spend most computation on floating point tensor multiplications. In this work, we find that a floating point multiplier can be approximated by one integer adder with high precision. We propose the linear-complexity multiplication L-Mul algorithm that approximates floating point number multiplication with integer addition operations. Compared to 8-bit floating point multiplication, the proposed method achieves higher precision while consuming significantly less bit-level computation. Since multiplying floating point numbers requires substantially more energy than adding integers, applying the L-Mul operation in tensor processing hardware can potentially reduce the energy cost of element-wise floating point tensor multiplications by 95% and of dot products by 80%. We calculated the theoretical error expectation of L-Mul and evaluated the algorithm on a wide range of textual, visual, and symbolic tasks, including natural language understanding, structural reasoning, mathematics, and commonsense question answering. Our numerical analysis experiments agree with the theoretical error estimation: L-Mul with a 4-bit mantissa achieves precision comparable to float8_e4m3 multiplication, and L-Mul with a 3-bit mantissa outperforms float8_e5m2. Evaluation results on popular benchmarks show that directly applying L-Mul to the attention mechanism is almost lossless. We further show that replacing all floating point multiplications in a transformer model with 3-bit mantissa L-Mul achieves precision equivalent to using float8_e4m3 as the accumulation precision in both fine-tuning and inference.
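The core trick can be sketched in a few lines of Python. This is our own fp32 illustration, not the paper's implementation: the helper names are invented, the paper targets low-bit formats and hardware adders, and we hard-code its correction term 2^{-l(m)} with l(m) = 4, the value the paper assigns to mantissas wider than 4 bits.

```python
import struct

def f32_to_bits(x: float) -> int:
    """Reinterpret a float as its raw IEEE-754 float32 bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_f32(b: int) -> float:
    """Reinterpret a 32-bit pattern as a float32."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def l_mul(x: float, y: float) -> float:
    """Approximate x * y with one integer addition on the bit patterns.

    The mantissa product (1 + xm) * (1 + ym) is replaced by
    1 + xm + ym + 2**-4, so multiplying reduces to adding the operands'
    exponent-and-mantissa fields as plain integers. A mantissa-sum
    overflow carries into the exponent field, which is exactly the
    renormalization the approximation needs. Assumes normal, nonzero
    inputs whose product stays within float32 range.
    """
    xb, yb = f32_to_bits(x), f32_to_bits(y)
    sign = (xb ^ yb) & 0x80000000              # sign of the product
    # Add exponent|mantissa fields, remove one exponent bias (127 << 23),
    # and add the correction 2**-4 (1 << 19 in the 23-bit mantissa field).
    body = (xb & 0x7FFFFFFF) + (yb & 0x7FFFFFFF) - (127 << 23) + (1 << 19)
    return bits_to_f32(sign | body)

if __name__ == "__main__":
    for a, b in [(1.5, 2.25), (3.14159, -2.71828), (0.007, 123.4)]:
        print(f"{a} * {b}: exact = {a * b:.6f}, l_mul = {l_mul(a, b):.6f}")
```

The error is largest when both mantissas are near 0.5 (the second pair above) and vanishes when the true mantissa product happens to equal the 2^{-4} correction (the first pair); per the abstract, the average precision with a 4-bit mantissa is comparable to float8_e4m3 multiplication.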
Related papers
- MGS: Markov Greedy Sums for Accurate Low-Bitwidth Floating-Point Accumulation [3.638431342539701]
MGS (Markov Greedy Sums) is a novel approach to improve the accuracy of low-bitwidth floating-point dot products in neural network computations.
We design, analyze, and implement the algorithm to minimize 8-bit floating point error at inference time for several neural networks.
arXiv Detail & Related papers (2025-04-12T04:19:03Z) - Speeding up and reducing memory usage for scientific machine learning via mixed precision [3.746841257785099]
Training neural networks for partial differential equations requires large amounts of memory and computational resources.
In search of computational efficiency, training neural networks using half precision (float16) has gained substantial interest.
We explore mixed precision, which combines the float16 and float32 numerical formats to reduce memory usage and increase computational speed.
Our experiments show that mixed-precision training not only substantially decreases training time and memory demands but also maintains model accuracy.
arXiv Detail & Related papers (2024-01-30T00:37:57Z) - Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs [39.410068572891475]
Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead.
Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference.
We present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model.
arXiv Detail & Related papers (2023-11-21T05:27:16Z) - Guaranteed Approximation Bounds for Mixed-Precision Neural Operators [83.64404557466528]
We build on the intuition that neural operator learning inherently induces an approximation error.
We show that our approach reduces GPU memory usage by up to 50% and improves throughput by 58% with little or no reduction in accuracy.
arXiv Detail & Related papers (2023-07-27T17:42:06Z) - Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z) - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z) - An Efficient Summation Algorithm for the Accuracy, Convergence and Reproducibility of Parallel Numerical Methods [0.0]
We have introduced a new parallel algorithm for summing a sequence of floating-point numbers.
This algorithm, which scales up easily with the number of processors, adds numbers with the same exponent first (a sequential sketch of this idea appears after this list).
In this article, our main contribution is an extensive analysis of its efficiency with respect to several properties.
arXiv Detail & Related papers (2022-05-11T08:31:48Z) - I-BERT: Integer-only BERT Quantization [78.43819756382103]
We propose I-BERT, a novel quantization scheme for Transformer based models.
I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation.
We show that I-BERT achieves accuracy similar to (and slightly higher than) the full-precision baseline.
arXiv Detail & Related papers (2021-01-05T02:42:58Z) - Deep Neural Network Training without Multiplications [0.0]
We show that ResNet can be trained with a multiplication-free substitute operation at competitive classification accuracy.
This method enables eliminating multiplications from deep neural network training and inference.
arXiv Detail & Related papers (2020-12-07T05:40:50Z) - HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z) - NITI: Training Integer Neural Networks Using Integer-only Arithmetic [4.361357921751159]
We present NITI, an efficient deep neural network training framework that computes exclusively with integer arithmetic.
A proof-of-concept open-source software implementation of NITI that utilizes native 8-bit integer operations is presented.
NITI achieves negligible accuracy degradation on the MNIST and CIFAR10 datasets using 8-bit integer storage and computation.
arXiv Detail & Related papers (2020-09-28T07:41:36Z) - Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We find that the accuracy decline is due to activation quantization and address it by replacing conventional ReLU with Bounded ReLU.
Our integer networks achieve performance equivalent to the corresponding floating-point networks (FPNs), but with only 1/4 the memory cost, and run 2x faster on modern GPUs.
arXiv Detail & Related papers (2020-06-21T08:23:03Z)
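To make the "same exponents first" idea from the summation paper above concrete, here is a minimal sequential sketch in Python. bucket_sum and the test harness are our own illustration, not the paper's algorithm, which additionally distributes the bucketing across processors:

```python
import math
import random
from collections import defaultdict

def bucket_sum(values):
    """Sum floats by first accumulating values that share an exponent.

    Within one bucket all addends have the same magnitude scale, so
    each addition loses fewer low-order bits than a naive running sum
    over mixed magnitudes. Bucket totals are then combined from the
    smallest exponent to the largest, so small contributions are not
    absorbed by an already-large partial sum.
    """
    buckets = defaultdict(float)
    for v in values:
        _, e = math.frexp(v)       # v == m * 2**e with 0.5 <= |m| < 1
        buckets[e] += v
    return sum(buckets[e] for e in sorted(buckets))

if __name__ == "__main__":
    random.seed(0)
    # 100k positive values spanning ~16 orders of magnitude.
    data = [random.uniform(0, 1) * 10.0 ** random.randint(-8, 8)
            for _ in range(100_000)]
    exact = math.fsum(data)        # correctly rounded reference sum
    print("naive  rel. error:", abs(sum(data) - exact) / exact)
    print("bucket rel. error:", abs(bucket_sum(data) - exact) / exact)
```

This sequential version only illustrates the accuracy aspect; the paper's parallel algorithm also addresses convergence and reproducibility across processor counts.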