Rethinking Floating Point Overheads for Mixed Precision DNN Accelerators
- URL: http://arxiv.org/abs/2101.11748v1
- Date: Wed, 27 Jan 2021 23:57:43 GMT
- Title: Rethinking Floating Point Overheads for Mixed Precision DNN Accelerators
- Authors: Hamzah Abdel-Aziz, Ali Shafiee, Jong Hoon Shin, Ardavan Pedram and
Joseph H. Hassoun
- Abstract summary: We propose a mixed-precision convolution unit architecture which supports different integer and floating point (FP) precisions.
We show how to integrate FP computations on integer-based architecture and evaluate overheads incurred by FP arithmetic support.
- Score: 2.6487352458568507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a mixed-precision convolution unit architecture
which supports different integer and floating point (FP) precisions. The
proposed architecture is based on low-bit inner product units and realizes
higher precision based on temporal decomposition. We illustrate how to
integrate FP computations on integer-based architecture and evaluate overheads
incurred by FP arithmetic support. We argue that alignment and addition
overhead for FP inner product can be significant since the maximum exponent
difference could be up to 58 bits, which results in large alignment logic.
To address this issue, we show empirically that no more than
26 product bits are required and that up to 8 bits of alignment are sufficient in
most inference cases. We present novel optimizations based on the above
observations to reduce the FP arithmetic hardware overheads. Our empirical
results, based on simulation and hardware implementation, show significant
reduction in FP16 overhead. Compared to a typical mixed-precision implementation, the
proposed architecture achieves area efficiency improvements of up to 25% in TFLOPS/mm² and
up to 46% in TOPS/mm², with power efficiency improvements of up to 40% in
TFLOPS/W and up to 63% in TOPS/W.
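To make the bounded-alignment idea concrete, here is a small NumPy sketch of an FP16 inner product in which each product is turned into an integer significand, aligned to the largest exponent with the shift clamped to the 8-bit budget, truncated to 26 bits, and accumulated as an integer. This is an illustrative software model only, not the authors' hardware datapath; ALIGN_BITS, PRODUCT_BITS, and bounded_align_dot are hypothetical names.

    import numpy as np

    ALIGN_BITS = 8      # empirical alignment budget cited in the abstract
    PRODUCT_BITS = 26   # empirical product width cited in the abstract

    def bounded_align_dot(a_fp16, b_fp16):
        # Illustrative model only: not the paper's RTL.
        prods = a_fp16.astype(np.float64) * b_fp16.astype(np.float64)  # exact FP16 products
        prods = prods[prods != 0.0]
        if prods.size == 0:
            return 0.0
        mants, exps = np.frexp(prods)                 # prod = mant * 2**exp, 0.5 <= |mant| < 1
        sig = np.round(mants * (1 << PRODUCT_BITS)).astype(np.int64)   # fixed-point significands
        exp_max = int(exps.max())
        shift = exp_max - exps                        # alignment shift for each product
        keep = shift <= ALIGN_BITS                    # products needing a larger shift are dropped
        acc = int(np.sum(sig[keep] >> shift[keep]))   # integer accumulation after alignment
        return float(acc) * 2.0 ** (exp_max - PRODUCT_BITS)

    rng = np.random.default_rng(0)
    x = rng.standard_normal(64).astype(np.float16)
    w = rng.standard_normal(64).astype(np.float16)
    print(bounded_align_dot(x, w), float(x.astype(np.float32) @ w.astype(np.float32)))

The keep mask models the observation that products requiring more than 8 bits of alignment lie so far below the running maximum that discarding them rarely affects inference results.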
Related papers
- Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning [54.970571745690634]
This work presents the first systematic investigation into how numerical precision affects Large Language Model inference. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32.
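A rough NumPy sketch of the weights-in-16-bit, compute-in-FP32 recipe described above (my own illustration under stated assumptions, not the paper's pipeline; linear_layercast is a hypothetical name):

    import numpy as np

    def linear_layercast(x_fp32, w_fp16):
        # Weights stay resident in 16-bit storage; upcast just before the GEMM
        # so that all arithmetic happens in FP32.
        return x_fp32 @ w_fp16.astype(np.float32)

    w_stored = np.random.randn(256, 64).astype(np.float16)   # 16-bit resident weights
    x = np.random.randn(8, 256).astype(np.float32)
    y = linear_layercast(x, w_stored)
    print(y.dtype, y.shape)   # float32 (8, 64)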
arXiv Detail & Related papers (2025-06-11T08:23:53Z) - Towards Fully FP8 GEMM LLM Training at Scale [77.39425361120466]
Existing approaches often rely on suboptimal fine-grained FP8 kernels or fall back to higher-precision matrix multiplications. We introduce a new class of LLM architectures that, for the first time, support FP8 computation for all GEMMs within transformer blocks during both forward and backward passes. This enables unprecedented throughput gains, particularly at scale, while matching the downstream performance of standard BF16 training.
arXiv Detail & Related papers (2025-05-26T21:04:14Z) - FPQVAR: Floating Point Quantization for Visual Autoregressive Model with FPGA Hardware Co-design [5.4815337424005355]
Visual autoregressive (VAR) modeling has marked a paradigm shift in image generation from next-token prediction to next-scale prediction. To reduce the memory and computation cost, we propose FPQVAR, an efficient post-training floating-point (FP) quantization framework for VAR. Our accelerator on the AMD-Xilinx VCK190 FPGA achieves a throughput of 1.1 images/s, which is 3.1x higher than the integer-based accelerator.
arXiv Detail & Related papers (2025-05-22T07:47:51Z) - Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs.
This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z) - Scaling Laws for Floating Point Quantization Training [47.174957621592775]
This paper explores the effects of FP quantization targets, exponent bits, mantissa bits, and the calculation of the scaling factor on the FP quantization training performance of LLMs. We provide the optimal exponent-mantissa bit ratio for different bit widths, which is available for future reference by hardware manufacturers.
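For intuition about the exponent-mantissa trade-off, a back-of-the-envelope helper assuming an IEEE-like layout (bias 2^(E-1)-1, implicit leading one, all-ones exponent reserved). This is my own illustration, not the paper's scaling-law formula, and real low-bit formats often tweak these conventions.

    def fp_format_stats(e_bits, m_bits):
        """Dynamic range and relative step size of an IEEE-like FP format (assumption)."""
        bias = 2 ** (e_bits - 1) - 1
        max_normal = 2.0 ** (2 ** e_bits - 2 - bias) * (2.0 - 2.0 ** -m_bits)
        min_normal = 2.0 ** (1 - bias)
        ulp_rel = 2.0 ** -m_bits          # relative spacing within one binade
        return max_normal, min_normal, ulp_rel

    for e, m in [(4, 3), (5, 2), (2, 1)]:   # e.g. E4M3, E5M2, E2M1 splits
        print(f"E{e}M{m}:", fp_format_stats(e, m))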
arXiv Detail & Related papers (2025-01-05T02:30:41Z) - "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, as well as for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z) - ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization
Using Floating-Point Formats [25.543571445739936]
This study explores the viability of floating-point (FP) quantization for large language models (LLMs).
For LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models with more than one billion parameters.
For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
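A toy comparison of the two 4-bit weight grids contrasted above: the uniform symmetric INT4 grid versus a nonuniform FP4-style (E2M1) grid whose representable magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6}. The per-tensor scaling and grid details are my own simplification, not ZeroQuant-FP's actual scheme.

    import numpy as np

    # FP4-style E2M1 magnitudes (assumed grid) vs. symmetric INT4 levels.
    FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    FP4_GRID = np.concatenate([-FP4_E2M1[::-1], FP4_E2M1])
    INT4_GRID = np.arange(-7, 8, dtype=np.float64)

    def quantize_to_grid(w, grid, max_level):
        scale = np.abs(w).max() / max_level                  # simple per-tensor scale
        idx = np.abs(w[:, None] / scale - grid[None, :]).argmin(axis=1)
        return grid[idx] * scale

    w = np.random.randn(4096)
    for name, grid, top in [("INT4", INT4_GRID, 7.0), ("FP4", FP4_GRID, 6.0)]:
        err = np.mean((w - quantize_to_grid(w, grid, top)) ** 2)
        print(name, "MSE:", err)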
arXiv Detail & Related papers (2023-07-19T06:58:03Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures
using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
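A toy illustration of the lookup-table idea described above: with 2-bit weights and 2-bit activations there are only 16 possible products, so the inner loop can index a precomputed table instead of multiplying. This is a conceptual sketch with hypothetical codebooks, not DeepGEMM's SIMD implementation.

    import numpy as np

    W_LEVELS = np.array([-2, -1, 1, 2])           # hypothetical 2-bit weight codebook
    A_LEVELS = np.array([0, 1, 2, 3])             # hypothetical 2-bit activation codebook
    LUT = W_LEVELS[:, None] * A_LEVELS[None, :]   # 4x4 table of all possible products

    def lut_dot(w_codes, a_codes):
        # w_codes/a_codes hold 2-bit indices; every "multiply" is a table lookup
        return int(LUT[w_codes, a_codes].sum())

    rng = np.random.default_rng(0)
    w_codes = rng.integers(0, 4, size=128)
    a_codes = rng.integers(0, 4, size=128)
    assert lut_dot(w_codes, a_codes) == int((W_LEVELS[w_codes] * A_LEVELS[a_codes]).sum())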
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
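A minimal E4M3 decoder written from the format description summarized above (bias 7, no infinities, a single NaN bit pattern with exponent and mantissa all ones); treat it as an illustrative sketch rather than a reference implementation.

    def decode_e4m3(byte):
        # Illustrative decoder, not a reference implementation.
        s = (byte >> 7) & 0x1
        e = (byte >> 3) & 0xF
        m = byte & 0x7
        sign = -1.0 if s else 1.0
        if e == 0xF and m == 0x7:
            return float("nan")                    # the only NaN pattern; no infinities
        if e == 0:
            return sign * (m / 8.0) * 2.0 ** -6    # subnormals
        return sign * (1.0 + m / 8.0) * 2.0 ** (e - 7)

    print(decode_e4m3(0b0_1111_110))   # 448.0, the largest finite E4M3 value
    print(decode_e4m3(0b0_0000_001))   # 2**-9, the smallest positive subnormal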
arXiv Detail & Related papers (2022-09-12T17:39:55Z) - Transformer-based Context Condensation for Boosting Feature Pyramids in
Object Detection [77.50110439560152]
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF).
We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results.
In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
arXiv Detail & Related papers (2022-07-14T01:45:03Z) - FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference [1.1292678337479967]
FBGEMM is a high-performance kernel library for quantized inference on current-generation CPUs.
FBGEMM achieves efficiency by fusing common quantization operations with a high-performance GEMM implementation and by generating shape- and size-specific kernel code at runtime.
The library has been deployed at Facebook, where it delivers greater than 2x performance gains with respect to our current production baseline.
arXiv Detail & Related papers (2021-01-13T00:34:04Z) - I-BERT: Integer-only BERT Quantization [78.43819756382103]
We propose I-BERT, a novel quantization scheme for Transformer based models.
I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation.
We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to the full-precision baseline.
arXiv Detail & Related papers (2021-01-05T02:42:58Z) - Dynamic Feature Pyramid Networks for Object Detection [40.24111664691307]
We introduce an inception FPN in which each layer contains convolution filters with different kernel sizes to enlarge the receptive field.
We propose a new dynamic FPN (DyFPN) which consists of multiple branches with different computational costs.
Experiments conducted on benchmarks demonstrate that the proposed DyFPN significantly improves performance with the optimal allocation of computation resources.
arXiv Detail & Related papers (2020-12-01T19:03:55Z) - HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
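The "dyadic" part refers to representing requantization scales as b / 2^c with integer b and c, so rescaling an INT32 accumulator needs only an integer multiply and shift, never a float. A minimal sketch of that trick under my own rounding choices (not HAWQV3's exact scheme):

    def to_dyadic(scale, shift_bits=16):
        b = round(scale * (1 << shift_bits))      # integer numerator
        return b, shift_bits                      # scale ~= b / 2**shift_bits

    def requantize(acc_int32, b, c):
        # Integer-only rescaling with round-to-nearest via a +0.5 ulp offset.
        return (acc_int32 * b + (1 << (c - 1))) >> c

    scale = 0.0123                                # e.g. s_x * s_w / s_y from quantization
    b, c = to_dyadic(scale)
    acc = 98765
    print(requantize(acc, b, c), round(acc * scale))   # dyadic vs. float reference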
arXiv Detail & Related papers (2020-11-20T23:51:43Z) - SIMDive: Approximate SIMD Soft Multiplier-Divider for FPGAs with Tunable
Accuracy [3.4154033825543055]
This paper presents, for the first time, a SIMD architecture based on a novel multiplier and divider with tunable accuracy.
The proposed hybrid architecture implements Mitchell's algorithms and supports precision variability from 8 to 32 bits.
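For reference, Mitchell's algorithm approximates log2(2^k * (1 + f)) as k + f and multiplies by adding the approximate logs. The following behavioural sketch (positive operands only) illustrates the classic algorithm, not SIMDive's tunable SIMD hardware.

    import math

    def mitchell_mul(x, y):
        # Behavioural model of Mitchell's logarithmic multiplication (positive inputs).
        k1, k2 = int(math.floor(math.log2(x))), int(math.floor(math.log2(y)))
        f1, f2 = x / 2.0 ** k1 - 1.0, y / 2.0 ** k2 - 1.0   # fractional parts in [0, 1)
        s = f1 + f2
        if s < 1.0:
            return 2.0 ** (k1 + k2) * (1.0 + s)
        return 2.0 ** (k1 + k2 + 1) * s                     # carry into the exponent

    for a, b in [(13, 27), (100, 3), (5.5, 6.25)]:
        approx, exact = mitchell_mul(a, b), a * b
        print(a, b, approx, f"error {100 * (exact - approx) / exact:.1f}%")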
arXiv Detail & Related papers (2020-11-02T17:40:44Z) - HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point
Operations for Convolutional Neural Networks [0.2148535041822524]
Convolutional neural networks (CNNs) are typically trained using 16- or 32-bit floating point (FP).
Low-precision FP can be highly effective for inference.
Existing processors do not generally support custom-precision FP.
We propose hardware-optimized bitslice-parallel floating-point operators (HOBFLOPS).
arXiv Detail & Related papers (2020-07-11T00:37:35Z)