Rethinking Floating Point Overheads for Mixed Precision DNN Accelerators
- URL: http://arxiv.org/abs/2101.11748v1
- Date: Wed, 27 Jan 2021 23:57:43 GMT
- Title: Rethinking Floating Point Overheads for Mixed Precision DNN Accelerators
- Authors: Hamzah Abdel-Aziz, Ali Shafiee, Jong Hoon Shin, Ardavan Pedram and
Joseph H. Hassoun
- Abstract summary: We propose a mixed-precision convolution unit architecture which supports different integer and floating point (FP) precisions.
We show how to integrate FP computations on integer-based architecture and evaluate overheads incurred by FP arithmetic support.
- Score: 2.6487352458568507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a mixed-precision convolution unit architecture
which supports different integer and floating point (FP) precisions. The
proposed architecture is based on low-bit inner product units and realizes
higher precision based on temporal decomposition. We illustrate how to
integrate FP computations on integer-based architecture and evaluate overheads
incurred by FP arithmetic support. We argue that alignment and addition
overhead for FP inner product can be significant since the maximum exponent
difference could be up to 58 bits, which results in large alignment logic.
To address this issue, we illustrate empirically that no more than 26 product
bits are required and that up to 8 bits of alignment are sufficient in
most inference cases. We present novel optimizations based on the above
observations to reduce the FP arithmetic hardware overheads. Our empirical
results, based on simulation and hardware implementation, show significant
reduction in FP16 overhead. Compared to a typical mixed-precision implementation,
the proposed architecture improves area efficiency by up to 25% in TFLOPS/mm² and
up to 46% in TOPS/mm², and power efficiency by up to 40% in TFLOPS/W and up to
63% in TOPS/W.
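As a back-of-the-envelope illustration of the alignment argument above (a NumPy model, not the paper's hardware), the sketch below measures the exponent spread of the FP16 products in an inner product and mimics an aligner limited to an 8-bit shift. The vector length, input distribution, and truncation model are assumptions.

```python
# Minimal sketch: how far do FP16 product exponents spread, and what happens
# if the aligner can only shift 8 bits below the largest product exponent?
import numpy as np

def aligned_dot(a_fp16, w_fp16, max_shift=8):
    """Accumulate FP16 products, keeping only bits within `max_shift`
    positions of the largest product exponent (a coarse truncation model)."""
    prods = a_fp16.astype(np.float64) * w_fp16.astype(np.float64)
    nz = prods[prods != 0.0]
    if nz.size == 0:
        return 0.0, 0
    _, exps = np.frexp(nz)                    # binary exponent of each product
    spread = int(exps.max() - exps.min())     # worst-case alignment distance
    ulp = 2.0 ** (int(exps.max()) - max_shift)
    aligned = np.round(nz / ulp) * ulp        # drop bits below the shifter's reach
    return float(aligned.sum()), spread

rng = np.random.default_rng(0)
a = rng.standard_normal(64).astype(np.float16)
w = rng.standard_normal(64).astype(np.float16)
approx, spread = aligned_dot(a, w)
exact = float(np.dot(a.astype(np.float64), w.astype(np.float64)))
print(f"exponent spread = {spread} bits, exact = {exact:.6f}, 8-bit aligned = {approx:.6f}")
```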
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z) - ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization
Using Floating-Point Formats [25.543571445739936]
This study explores the viability of floating-point (FP) quantization for large language models (LLMs).
For LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models with more than one billion parameters.
For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
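To give a rough sense of why FP4 can track INT4 for weights, here is a small NumPy comparison of round-to-nearest quantization error on Gaussian weights. The per-tensor symmetric scaling and the E2M1 value set {0, 0.5, 1, 1.5, 2, 3, 4, 6} are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: quantization MSE of FP4 (E2M1-style grid) vs. symmetric INT4.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_GRID[::-1], FP4_GRID])     # signed code values
INT4_GRID = np.arange(-7, 8, dtype=float)                  # symmetric INT4, [-7, 7]

def quantize(w, grid, max_code):
    scale = np.abs(w).max() / max_code                      # map largest weight to largest code
    codes = grid[np.abs(w[:, None] / scale - grid[None, :]).argmin(axis=1)]
    return codes * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)
err_fp4 = np.mean((w - quantize(w, FP4_GRID, 6.0)) ** 2)
err_int4 = np.mean((w - quantize(w, INT4_GRID, 7.0)) ** 2)
print(f"MSE  FP4 = {err_fp4:.6f}   INT4 = {err_int4:.6f}")
```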
arXiv Detail & Related papers (2023-07-19T06:58:03Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures
using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
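The core idea can be sketched in a few lines (an illustration of lookup-table execution, not the DeepGEMM SIMD kernels): with 2-bit weights and 2-bit activations there are only 16 possible products, so a small table replaces the multiplier in the inner loop. The codebooks below are assumptions.

```python
# Minimal sketch: replace the multiply in a low-precision dot product with a 4x4 table lookup.
import numpy as np

W_LEVELS = np.array([-2, -1, 1, 2], dtype=np.int32)   # assumed 2-bit weight codebook
A_LEVELS = np.array([0, 1, 2, 3], dtype=np.int32)     # assumed 2-bit activation levels
LUT = W_LEVELS[:, None] * A_LEVELS[None, :]           # all 4x4 possible products

def lut_dot(w_codes, a_codes):
    """Dot product via table lookups: index = (weight code, activation code)."""
    return int(LUT[w_codes, a_codes].sum())

rng = np.random.default_rng(0)
w_codes = rng.integers(0, 4, size=256)
a_codes = rng.integers(0, 4, size=256)
assert lut_dot(w_codes, a_codes) == int(np.dot(W_LEVELS[w_codes], A_LEVELS[a_codes]))
```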
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
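For concreteness, a small decoder for the E4M3 encoding described above (4 exponent bits with bias 7, 3 mantissa bits): the all-ones exponent is kept for ordinary values and only the mantissa pattern 0b111 under it encodes NaN, which is how the format reaches a maximum magnitude of 448.

```python
# Minimal sketch of E4M3 decoding; not a production codec.
def decode_e4m3(byte):
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")                       # the single NaN bit-pattern
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6     # subnormals
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

assert decode_e4m3(0b0_1111_110) == 448.0         # largest finite magnitude
assert decode_e4m3(0b0_0000_001) == 2.0 ** -9     # smallest subnormal
```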
arXiv Detail & Related papers (2022-09-12T17:39:55Z) - Transformer-based Context Condensation for Boosting Feature Pyramids in
Object Detection [77.50110439560152]
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF).
We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results.
In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
arXiv Detail & Related papers (2022-07-14T01:45:03Z) - FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference [1.1292678337479967]
FBGEMM is a high-performance kernel library for quantized inference on current-generation CPUs.
It achieves efficiency by fusing common quantization operations with a high-performance GEMM implementation and by shape- and size-specific kernel code generation at runtime.
The library has been deployed at Facebook, where it delivers greater than 2x performance gains with respect to our current production baseline.
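A minimal NumPy sketch of the fusion idea (not the fbgemm API or its generated code): the int8 GEMM accumulates in int32, and requantization to uint8 is applied in the same pass rather than as a separate kernel. Zero points, scale, and shapes are illustrative assumptions.

```python
# Minimal sketch: int8 GEMM with fused requantization to uint8.
import numpy as np

def int8_gemm_requant(A_u8, B_s8, a_zp, out_scale, out_zp):
    acc = (A_u8.astype(np.int32) - a_zp) @ B_s8.astype(np.int32)   # int32 accumulate
    out = np.rint(acc * out_scale) + out_zp                        # fused requantization
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
A = rng.integers(0, 256, size=(4, 64), dtype=np.uint8)
B = rng.integers(-128, 128, size=(64, 8), dtype=np.int8)
print(int8_gemm_requant(A, B, a_zp=128, out_scale=0.01, out_zp=128))
```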
arXiv Detail & Related papers (2021-01-13T00:34:04Z) - I-BERT: Integer-only BERT Quantization [78.43819756382103]
We propose I-BERT, a novel quantization scheme for Transformer-based models.
I-BERT performs end-to-end integer-only BERT inference without any floating-point calculation.
We show that I-BERT achieves accuracy similar to (and slightly higher than) the full-precision baseline.
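As a flavor of the integer-only primitives such a scheme relies on, here is an integer square root computed by Newton's iteration using integer arithmetic only, the kind of routine needed to evaluate LayerNorm without floating point. This is an illustrative routine, not I-BERT's exact kernels.

```python
# Minimal sketch: integer-only square root via Newton's iteration.
def isqrt(n: int) -> int:
    """Largest integer x with x*x <= n, using integer operations only."""
    if n < 2:
        return n
    x = n
    y = (x + 1) // 2
    while y < x:                 # Newton's iteration converges monotonically
        x = y
        y = (x + n // x) // 2
    return x

assert isqrt(10**12 + 1) == 10**6
assert all(isqrt(k) == int(k ** 0.5) for k in range(10_000))
```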
arXiv Detail & Related papers (2021-01-05T02:42:58Z) - Dynamic Feature Pyramid Networks for Object Detection [40.24111664691307]
We introduce an inception FPN in which each layer contains convolution filters with different kernel sizes to enlarge the receptive field.
We propose a new dynamic FPN (DyFPN) which consists of multiple branches with different computational costs.
Experiments conducted on benchmarks demonstrate that the proposed DyFPN significantly improves performance with the optimal allocation of computation resources.
arXiv Detail & Related papers (2020-12-01T19:03:55Z) - HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of converting back and forth between floating-point and quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
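A minimal sketch of the dyadic-arithmetic idea behind such integer-only pipelines: the real-valued requantization scale is approximated as b / 2**c, so rescaling an int32 accumulator needs only an integer multiply and a shift, avoiding the float/int conversions mentioned above. The bit width and rounding below are assumptions.

```python
# Minimal sketch: dyadic requantization (integer multiply + right shift).
import numpy as np

def to_dyadic(scale: float, bits: int = 16):
    """Approximate `scale` as b / 2**c (assumes scale < 1, as is typical)."""
    c = bits - 1
    b = int(round(scale * (1 << c)))
    return b, c

def requantize(acc_i32, scale):
    b, c = to_dyadic(scale)
    out = (acc_i32.astype(np.int64) * b) >> c          # integer-only rescale
    return np.clip(out, -128, 127).astype(np.int8)

acc = np.array([12345, -6789, 40000], dtype=np.int32)
print(requantize(acc, scale=0.0021))                   # compare with np.round(acc * 0.0021)
```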
arXiv Detail & Related papers (2020-11-20T23:51:43Z) - SIMDive: Approximate SIMD Soft Multiplier-Divider for FPGAs with Tunable
Accuracy [3.4154033825543055]
This paper presents, for the first time, a SIMD architecture based on novel multipliers and dividers with tunable accuracy.
The proposed hybrid architecture implements Mitchell's algorithms and supports precision variability from 8 to 32 bits.
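A small software model of Mitchell's logarithmic multiplication, the approximation named above (the divider and the SIMD packing are not modeled): log2 of each operand is approximated as its leading-bit position plus a linear fraction, the logs are added, and a linear antilog recovers the product.

```python
# Minimal sketch: Mitchell's approximate multiplication for unsigned integers.
def mitchell_mul(a: int, b: int) -> int:
    """Approximate a*b via log2(x) ~= k + f, where x = 2**k * (1 + f)."""
    if a == 0 or b == 0:
        return 0
    ka, kb = a.bit_length() - 1, b.bit_length() - 1   # integer parts of log2
    fa = a / (1 << ka) - 1.0                          # fractional parts, in [0, 1)
    fb = b / (1 << kb) - 1.0
    log_sum = ka + kb + fa + fb
    k, f = int(log_sum), log_sum - int(log_sum)       # split back into int + frac
    return round((1 << k) * (1.0 + f))                # linear antilog approximation

for a, b in [(100, 200), (255, 255), (7, 9)]:
    approx = mitchell_mul(a, b)
    print(a, b, approx, a * b, f"err = {100 * (a * b - approx) / (a * b):.1f}%")
```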
arXiv Detail & Related papers (2020-11-02T17:40:44Z) - HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point
Operations for Convolutional Neural Networks [0.2148535041822524]
Convolutional neural networks (CNNs) are typically trained using 16- or 32-bit floating-point (FP) arithmetic.
Low-precision FP can be highly effective for inference.
Existing processors do not generally support custom-precision FP.
We propose hardware-optimized bitslice-parallel floating-point operators (HOBFLOPS).
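A minimal sketch of the bitslice idea behind such operators: bit j of many operands is packed into one machine word, so each AND/XOR/OR on words advances all of the additions by one gate level at once. Plain Python ints stand in for SIMD registers here; the width and packing order are assumptions.

```python
# Minimal sketch: a bitsliced ripple-carry adder over packed operands.
def bitslice(vals, width):
    """slice[j] holds bit j of every value (value i occupies lane/bit i)."""
    return [sum(((v >> j) & 1) << i for i, v in enumerate(vals)) for j in range(width)]

def bitsliced_add(xs, ys, width):
    out, carry = [], 0
    for j in range(width):                       # one full-adder level per bit position
        out.append(xs[j] ^ ys[j] ^ carry)
        carry = (xs[j] & ys[j]) | (carry & (xs[j] ^ ys[j]))
    return out

def unslice(slices, count):
    return [sum(((s >> i) & 1) << j for j, s in enumerate(slices)) for i in range(count)]

a, b, width = [3, 10, 250], [4, 22, 5], 9
s = bitsliced_add(bitslice(a, width), bitslice(b, width), width)
assert unslice(s, len(a)) == [x + y for x, y in zip(a, b)]   # 7, 32, 255
```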
arXiv Detail & Related papers (2020-07-11T00:37:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.