Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs
- URL: http://arxiv.org/abs/2311.12359v3
- Date: Fri, 5 Jul 2024 06:26:45 GMT
- Title: Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs
- Authors: Shivam Aggarwal, Hans Jakob Damsgaard, Alessandro Pappalardo, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra,
- Abstract summary: Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead.
Recent works have investigated adopting 8-bit floating-point formats(FP8) in the context of PTQ for model inference.
We present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model.
- Score: 39.410068572891475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats(FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits and their relative comparison in terms of accuracy-hardware cost with integers remains unexplored on FPGAs. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. We implement a custom FPGA-based multiply-accumulate operator library and explore the vast design space, comparing minifloat and integer representations across 3 to 8 bits for both weights and activations. We also examine the applicability of various integerbased quantization techniques to minifloats. Our experiments show that minifloats offer a promising alternative for emerging workloads such as vision transformers.
Related papers
- Integer or Floating Point? New Outlooks for Low-Bit Quantization on
Large Language Models [17.055400141733124]
Low-bit integer formats (e.g., INT8/INT4) have been the conventional choice for large language models (LLMs)
Low-bit floating-point formats (e.g., FP8/FP4) offer a compelling alternative and are gaining support from cutting-edge hardware, such as NVIDIA's H100 GPU.
We propose the Mixture of Formats Quantization (MoFQ), which selects the optimal format on a layer-wise basis.
arXiv Detail & Related papers (2023-05-21T05:28:37Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures
using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
arXiv Detail & Related papers (2022-12-19T18:48:33Z) - FP8 Quantization: The Power of the Exponent [19.179749424362686]
This paper investigates the benefit of the floating point format for neural network inference.
We detail the choices that can be made for the FP8 format, including the important choice of the number of bits for the mantissa and exponent.
We show how these findings translate to real networks, provide an efficient implementation for FP8 simulation, and a new algorithm.
arXiv Detail & Related papers (2022-08-19T09:03:00Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Statefuls maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop first gradients that use 8-bit statistics while maintaining the performance levels of using 32-bit gradient states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - All-You-Can-Fit 8-Bit Flexible Floating-Point Format for Accurate and
Memory-Efficient Inference of Deep Neural Networks [2.294014185517203]
This paper introduces an extremely flexible 8-bit floating-point (FFP8) format.
It achieves an extremely low accuracy loss of $0.1%sim 0.3%$ for several representative image classification models.
It is easy to turn a classical floating-point processing unit into an FFP8-compliant one, and the extra hardware cost is minor.
arXiv Detail & Related papers (2021-04-15T09:37:23Z) - EFloat: Entropy-coded Floating Point Format for Deep Learning [2.3204178451683264]
EFloat format encodes frequent exponent values with Huffman codes to minimize the average exponent field width.
The proposed encoding concept may be beneficial to low-precision formats including 8-bit floats.
arXiv Detail & Related papers (2021-02-04T15:58:01Z) - I-BERT: Integer-only BERT Quantization [78.43819756382103]
We propose I-BERT, a novel quantization scheme for Transformer based models.
I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation.
We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to the full-precision baseline.
arXiv Detail & Related papers (2021-01-05T02:42:58Z) - HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z) - Subtensor Quantization for Mobilenets [5.735035463793008]
Quantization for deep neural networks (DNN) have enabled developers to deploy models with less memory and more efficient low-power inference.
In this paper, we analyzed several root causes of quantization loss and proposed alternatives that do not rely on per-channel or training-aware approaches.
We evaluate the image classification task on ImageNet dataset, and our post-training quantized 8-bit inference top-1 accuracy in within 0.7% of the floating point version.
arXiv Detail & Related papers (2020-11-04T15:41:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.