FAST: DNN Training Under Variable Precision Block Floating Point with
Stochastic Rounding
- URL: http://arxiv.org/abs/2110.15456v1
- Date: Thu, 28 Oct 2021 22:24:33 GMT
- Title: FAST: DNN Training Under Variable Precision Block Floating Point with
Stochastic Rounding
- Authors: Sai Qian Zhang, Bradley McDanel, H.T. Kung
- Abstract summary: Block Floating Point (BFP) can efficiently support quantization for Deep Neural Network (DNN) training.
We propose a Fast First, Accurate Second Training (FAST) system for DNNs, where the weights, activations, and gradients are represented in BFP.
- Score: 11.820523621760255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Block Floating Point (BFP) can efficiently support quantization for Deep
Neural Network (DNN) training by providing a wide dynamic range via a shared
exponent across a group of values. In this paper, we propose a Fast First,
Accurate Second Training (FAST) system for DNNs, where the weights,
activations, and gradients are represented in BFP. FAST supports matrix
multiplication with variable precision BFP input operands, enabling incremental
increases in DNN precision throughout training. By increasing the BFP precision
across both training iterations and DNN layers, FAST can greatly shorten the
training time while reducing overall hardware resource usage. Our FAST
Multiplier-Accumulator (fMAC) supports dot product computations under multiple
BFP precisions. We validate our FAST system on multiple DNNs with different
datasets, demonstrating a 2-6$\times$ speedup in training on a single-chip
platform over prior work based on mixed-precision or block floating
point number systems while achieving similar performance in validation
accuracy.
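To make the shared-exponent idea concrete, below is a minimal PyTorch sketch of BFP quantization with stochastic rounding. The group size of 16, the exponent-selection rule, and the clamping behavior are illustrative assumptions, not the FAST system's exact specification, and the fMAC hardware path is not modeled.

```python
import torch

def bfp_quantize_stochastic(x: torch.Tensor, mantissa_bits: int, group_size: int = 16) -> torch.Tensor:
    """Quantize x to block floating point (BFP) with stochastic rounding.

    Each contiguous group of `group_size` values shares one exponent, taken
    from the largest magnitude in the group; every value then keeps a signed
    mantissa of `mantissa_bits` bits. Illustrative sketch only -- group size,
    exponent selection, and clamping details are assumptions, not the FAST spec.
    """
    orig_shape = x.shape
    flat = x.flatten()
    pad = (-flat.numel()) % group_size
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    groups = flat.view(-1, group_size)

    # Shared exponent per group: exponent of the largest-magnitude element.
    max_abs = groups.abs().amax(dim=1, keepdim=True).clamp_min(1e-30)
    shared_exp = torch.floor(torch.log2(max_abs))

    # Scale so that each value becomes an integer mantissa of the given width.
    scale = torch.exp2(shared_exp - (mantissa_bits - 1))
    scaled = groups / scale

    # Stochastic rounding: round down, then round up with probability equal
    # to the fractional part, so the rounding error is zero-mean.
    floor_val = torch.floor(scaled)
    prob_up = scaled - floor_val
    mantissa = floor_val + (torch.rand_like(scaled) < prob_up).to(scaled.dtype)

    # Clamp mantissas to the representable signed range and dequantize.
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissa = mantissa.clamp(-limit - 1, limit)
    deq = (mantissa * scale).flatten()[: flat.numel() - pad]
    return deq.view(orig_shape)

# Example: low precision early in training, higher precision later.
w = torch.randn(4, 32)
w_2bit = bfp_quantize_stochastic(w, mantissa_bits=2)
w_4bit = bfp_quantize_stochastic(w, mantissa_bits=4)
```

Raising `mantissa_bits` over the course of training mirrors the "fast first, accurate second" schedule described in the abstract.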
Related papers
- BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices [14.536949788395837]
Block floating point (BFP) quantization is one of the representative compression approaches for reducing the memory and computational burden.
We develop a BFP-based bitwidth-aware analytical modeling framework (called BitQ) to identify the best BFP implementation of DNN inference on embedded platforms.
arXiv Detail & Related papers (2024-09-25T17:03:49Z)
- Trainable Fixed-Point Quantization for Deep Learning Acceleration on FPGAs [30.325651150798915]
Quantization is a crucial technique for deploying deep learning models on resource-constrained devices, such as embedded FPGAs.
We present QFX, a trainable fixed-point quantization approach that automatically learns the binary-point position during model training.
QFX is implemented as a PyTorch-based library that efficiently emulates fixed-point arithmetic, supported by FPGA HLS.
arXiv Detail & Related papers (2024-01-31T02:18:27Z)
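Since the QFX entry above centers on learning the binary-point position during training, here is a hedged PyTorch sketch of that general idea using a straight-through estimator (STE); the class name, parameterization, and rounding details are illustrative assumptions, not the QFX library's actual API.

```python
import torch
import torch.nn as nn

class LearnableFixedPoint(nn.Module):
    """Fake-quantize to fixed point with a learnable binary-point position.

    The fractional bit count is kept as a continuous parameter and rounded at
    use time; gradients flow through a straight-through estimator (STE).
    Illustrative sketch only -- not the QFX library's actual API.
    """

    def __init__(self, total_bits: int = 8, init_frac_bits: float = 4.0):
        super().__init__()
        self.total_bits = total_bits
        self.frac_bits = nn.Parameter(torch.tensor(init_frac_bits))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Round the learnable fractional bit count with an STE so the
        # binary-point position still receives a gradient signal.
        frac = self.frac_bits + (torch.round(self.frac_bits) - self.frac_bits).detach()
        scale = torch.exp2(frac)
        qmax = 2 ** (self.total_bits - 1) - 1

        # Quantize-dequantize with an STE around the rounding step.
        scaled = x * scale
        q = torch.clamp(torch.round(scaled), -qmax - 1, qmax)
        q = scaled + (q - scaled).detach()
        return q / scale

# Usage: wrap activations (or weights) of a layer during training.
quant = LearnableFixedPoint(total_bits=8)
y = quant(torch.randn(2, 16, requires_grad=True))
y.sum().backward()  # gradients reach both the input and frac_bits
```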
- Projected Stochastic Gradient Descent with Quantum Annealed Binary Gradients [51.82488018573326]
We present QP-SBGD, a novel layer-wise optimiser tailored towards training neural networks with binary weights.
BNNs reduce the computational requirements and energy consumption of deep learning models with minimal loss in accuracy.
Our algorithm is implemented layer-wise, making it suitable to train larger networks on resource-limited quantum hardware.
arXiv Detail & Related papers (2023-10-23T17:32:38Z)
- Efficient N:M Sparse DNN Training Using Algorithm, Architecture, and Dataflow Co-Design [15.47240906902083]
This paper presents a computation-efficient training scheme for N:M sparse DNNs using algorithm, architecture, and dataflow co-design.
At the algorithm level, a bidirectional weight pruning method, dubbed BDWP, is proposed to leverage the N:M sparsity of weights.
At the architecture level, a sparse accelerator for DNN training, namely SAT, is developed to support both the regular dense operations and the computation-efficient N:M sparse operations.
arXiv Detail & Related papers (2023-09-22T17:26:19Z)
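As background for the N:M sparsity used in the entry above: within every group of M consecutive weights, at most N are kept nonzero. Below is a standard magnitude-based 2:4 masking sketch for illustration; the paper's BDWP method prunes bidirectionally and is not reproduced here.

```python
import torch

def nm_sparsify(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every group of m (N:M sparsity).

    Standard magnitude-based masking for illustration; the BDWP method in the
    paper above is not reproduced here.
    """
    assert weight.numel() % m == 0, "weight count must be divisible by m"
    groups = weight.reshape(-1, m)
    # Indices of the n largest-magnitude entries within each group of m.
    topk = groups.abs().topk(n, dim=1).indices
    mask = torch.zeros_like(groups).scatter_(1, topk, 1.0)
    return (groups * mask).reshape(weight.shape)

w = torch.randn(8, 16)
w_sparse = nm_sparsify(w)  # 2:4 sparsity: at most 2 nonzeros per group of 4
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=1).max() <= 2
```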
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- ApproxTrain: Fast Simulation of Approximate Multipliers for DNN Training and Inference [4.386709201336175]
Hardware approximation techniques have shown their effectiveness in gaining resource efficiency in inference accelerators.
This paper presents ApproxTrain, an open-source framework that allows fast evaluation of DNN training and inference using simulated approximate multipliers (a toy stand-in for such a multiplier is sketched below).
arXiv Detail & Related papers (2022-09-09T07:42:05Z)
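To give a feel for what a "simulated approximate multiplier" does, the toy sketch below truncates low mantissa bits of each elementwise product inside a matrix multiply. This is only a stand-in: ApproxTrain itself simulates specific approximate multiplier designs on GPU, which this code does not reproduce.

```python
import torch

def truncate_mantissa(x: torch.Tensor, kept_bits: int = 12) -> torch.Tensor:
    """Zero out the low (23 - kept_bits) mantissa bits of float32 values.

    A toy model of an approximate multiplier's reduced-precision datapath;
    real approximate multiplier designs behave differently.
    """
    assert x.dtype == torch.float32
    bits = x.view(torch.int32)
    mask = torch.tensor(-1 << (23 - kept_bits), dtype=torch.int32)
    return (bits & mask).view(torch.float32)

def approx_matmul(a: torch.Tensor, b: torch.Tensor, kept_bits: int = 12) -> torch.Tensor:
    """Matrix multiply whose elementwise products pass through the approximate
    multiplier model; accumulation stays in full float32 precision."""
    prods = truncate_mantissa(a.unsqueeze(-1) * b.unsqueeze(0), kept_bits)
    return prods.sum(dim=1)

a = torch.randn(4, 8)
b = torch.randn(8, 3)
err = (approx_matmul(a, b) - a @ b).abs().max()  # error introduced by approximation
```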
- Recurrent Bilinear Optimization for Binary Neural Networks [58.972212365275595]
Existing BNN optimization methods neglect the intrinsic bilinear relationship between real-valued weights and scale factors.
Our work is the first attempt to optimize BNNs from the bilinear perspective.
We obtain robust RBONNs, which show impressive performance over state-of-the-art BNNs on various models and datasets.
arXiv Detail & Related papers (2022-09-04T06:45:33Z)
- Receptive Field-based Segmentation for Distributed CNN Inference Acceleration in Collaborative Edge Computing [93.67044879636093]
We study inference acceleration using distributed convolutional neural networks (CNNs) in a collaborative edge computing network.
We propose a novel collaborative edge computing scheme that uses fused-layer parallelization to partition a CNN model into multiple blocks of convolutional layers.
arXiv Detail & Related papers (2022-07-22T18:38:11Z)
- FlexBlock: A Flexible DNN Training Accelerator with Multi-Mode Block Floating Point Support [8.596477111386083]
This paper builds upon the algorithmic observation that DNN training can be accelerated by leveraging multiple BFP precisions.
We develop a flexible DNN training accelerator, dubbed FlexBlock, which supports three different BFP precision modes.
We evaluate the effectiveness of FlexBlock architecture using well-known DNNs on CIFAR, ImageNet and WMT14 datasets.
arXiv Detail & Related papers (2022-03-13T15:05:34Z)
- Two-Timescale End-to-End Learning for Channel Acquisition and Hybrid Precoding [94.40747235081466]
We propose an end-to-end deep learning-based joint transceiver design algorithm for millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) systems.
We develop a DNN architecture that maps the received pilots into feedback bits at the receiver, and then further maps the feedback bits into the hybrid precoder at the transmitter.
arXiv Detail & Related papers (2021-10-22T20:49:02Z)
- Distillation Guided Residual Learning for Binary Convolutional Neural Networks [83.6169936912264]
It is challenging to bridge the performance gap between a Binary CNN (BCNN) and a Floating-point CNN (FCNN).
We observe that this performance gap leads to substantial residuals between the intermediate feature maps of the BCNN and the FCNN.
To minimize the performance gap, we enforce the BCNN to produce intermediate feature maps similar to those of the FCNN.
This training strategy, i.e., optimizing each binary convolutional block with a block-wise distillation loss derived from the FCNN, leads to a more effective optimization of the BCNN (a simplified sketch of this loss follows below).
arXiv Detail & Related papers (2020-07-10T07:55:39Z)
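A hedged sketch of the block-wise distillation term mentioned in the entry above: an L2 penalty between each binary block's output feature map and the matching full-precision block's output, added on top of the task loss. The pairing of blocks and the loss weighting are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def blockwise_distillation_loss(bcnn_feats, fcnn_feats, weight: float = 1.0):
    """Sum of MSE losses between matching intermediate feature maps of the
    binary CNN (student) and the full-precision CNN (teacher).

    `bcnn_feats` / `fcnn_feats` are lists of same-shaped tensors collected
    block by block (e.g. via forward hooks); the teacher maps are detached
    so no gradient flows into the FCNN. Pairing and weighting here are
    illustrative assumptions.
    """
    loss = sum(F.mse_loss(b, f.detach()) for b, f in zip(bcnn_feats, fcnn_feats))
    return weight * loss

# Usage: total objective = task loss + block-wise distillation term.
# total_loss = F.cross_entropy(logits, labels) + blockwise_distillation_loss(b_feats, f_feats)
```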
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.