FlexBlock: A Flexible DNN Training Accelerator with Multi-Mode Block
Floating Point Support
- URL: http://arxiv.org/abs/2203.06673v1
- Date: Sun, 13 Mar 2022 15:05:34 GMT
- Title: FlexBlock: A Flexible DNN Training Accelerator with Multi-Mode Block
Floating Point Support
- Authors: Seock-Hwan Noh, Jahyun Koo, Seunghyun Lee, Jongse Park, Jaeha Kung
- Abstract summary: This paper builds upon an algorithmic observation that we can accelerate the training by leveraging multiple BFP precisions.
We develop a flexible DNN training accelerator, dubbed FlexBlock, which supports three different BFP precision modes.
We evaluate the effectiveness of the FlexBlock architecture using well-known DNNs on the CIFAR, ImageNet, and WMT14 datasets.
- Score: 8.596477111386083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training deep neural networks (DNNs) is a computationally expensive job,
which can take weeks or months even with high-performance GPUs. As a remedy for
this challenge, the community has started exploring the use of more efficient data
representations in the training process, e.g., block floating point (BFP).
However, prior work on BFP-based DNN accelerators relies on a specific BFP
representation, making it less versatile. This paper builds upon an
algorithmic observation that we can accelerate training by leveraging
multiple BFP precisions without compromising the final accuracy.
Backed by this algorithmic opportunity, we develop a flexible DNN training
accelerator, dubbed FlexBlock, which supports three different BFP precision
modes that can differ among activation, weight, and gradient tensors. While
several prior works proposed such multi-precision support for DNN accelerators,
they not only focus solely on inference but also suffer from suboptimal core
utilization at a fixed precision and for specific layer types when training
is considered. Instead, FlexBlock is designed such that high core
utilization is achievable for i) various layer types and ii) three BFP
precisions by mapping data in a hierarchical manner to its compute units. We
evaluate the effectiveness of the FlexBlock architecture using well-known DNNs on
the CIFAR, ImageNet, and WMT14 datasets. As a result, training with FlexBlock
significantly improves training speed by 1.5-5.3x and energy efficiency
by 2.4-7.0x on average compared to other training accelerators, while incurring
marginal accuracy loss compared to full-precision training.
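For context, block floating point groups values into blocks that share a single exponent while each value keeps only a low-bit mantissa, so multiply-accumulate hardware can operate on cheap fixed-point mantissas. Below is a minimal NumPy sketch of this generic quantization step; the block size, mantissa width, and rounding choice are illustrative and do not describe FlexBlock's actual hardware datapath.

```python
import numpy as np

def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block of values to block floating point: one shared
    exponent per block plus a low-bit signed mantissa per value.
    Illustrative sketch only, not the FlexBlock datapath."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0.0:
        return np.zeros_like(block)
    # Shared exponent chosen so the largest magnitude fits the mantissa range.
    shared_exp = np.ceil(np.log2(max_abs))
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(block / scale), lo, hi)
    return mantissas * scale

x = np.random.randn(16).astype(np.float32)
print(bfp_quantize(x, mantissa_bits=4))  # coarser mantissas, one shared exponent
```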
Related papers
- DCP: Learning Accelerator Dataflow for Neural Network via Propagation [52.06154296196845]
This work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort.
DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives.
For example, without using additional training data, DCP surpasses the GAMMA method that performs a full search using thousands of samples.
arXiv Detail & Related papers (2024-10-09T05:16:44Z) - BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices [14.536949788395837]
Block floating point (BFP) quantization is one of the representative compression approaches for reducing the memory and computational burden.
We develop a BFP-based bitwidth-aware analytical modeling framework (called "BitQ") for the best BFP implementation of DNN inference on embedded platforms.
arXiv Detail & Related papers (2024-09-25T17:03:49Z) - Enhancing Fast Feed Forward Networks with Load Balancing and a Master Leaf Node [49.08777822540483]
Fast feedforward networks (FFFs) exploit the observation that different regions of the input space activate distinct subsets of neurons in wide networks.
We propose the incorporation of load balancing and Master Leaf techniques into the FFF architecture to improve performance and simplify the training process.
arXiv Detail & Related papers (2024-05-27T05:06:24Z) - Efficient N:M Sparse DNN Training Using Algorithm, Architecture, and
Dataflow Co-Design [15.47240906902083]
This paper presents a computation-efficient training scheme for N:M sparse DNNs using algorithm, architecture, and dataflow co-design.
At the algorithm level, a bidirectional weight pruning method, dubbed BDWP, is proposed to leverage the N:M sparsity of weights.
At the architecture level, a sparse accelerator for DNN training, namely SAT, is developed to support both the regular dense operations and the computation-efficient N:M sparse operations.
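For reference, N:M sparsity keeps N nonzero weights in every group of M consecutive weights. The NumPy sketch below shows plain one-directional magnitude-based N:M pruning; BDWP's bidirectional scheme from the paper is not reproduced here.

```python
import numpy as np

def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights and zero out the rest (e.g., the common 2:4 pattern).
    One-directional illustration; BDWP in the paper prunes bidirectionally."""
    w = weights.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]
    mask = np.ones_like(w)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (w * mask).reshape(weights.shape)

w = np.random.randn(8, 8).astype(np.float32)
w_24 = nm_prune(w, n=2, m=4)  # every 4 consecutive weights keep 2 nonzeros
```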
arXiv Detail & Related papers (2023-09-22T17:26:19Z) - Tensor-Compressed Back-Propagation-Free Training for (Physics-Informed)
Neural Networks [15.188785164091987]
Backward propagation (BP) is widely used to compute the gradients in neural network training.
It is hard to implement BP on edge devices due to the lack of hardware and software resources to support automatic differentiation.
This paper presents a completely BP-free framework that only requires forward propagation to train realistic neural networks.
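As a generic illustration of estimating gradients from forward evaluations alone, the sketch below uses simultaneous-perturbation (SPSA) finite differences; this is not the paper's tensor-compressed estimator, and the step sizes are assumed for the toy example.

```python
import numpy as np

def spsa_gradient(loss_fn, params, eps=1e-3, rng=np.random.default_rng()):
    """Estimate the gradient from two forward evaluations via simultaneous
    perturbation (SPSA); no backpropagation or autodiff is required.
    Generic illustration, not the paper's tensor-compressed estimator."""
    delta = rng.choice([-1.0, 1.0], size=params.shape)
    l_plus = loss_fn(params + eps * delta)
    l_minus = loss_fn(params - eps * delta)
    # For +/-1 perturbations, dividing by delta equals multiplying by it.
    return (l_plus - l_minus) / (2.0 * eps) * delta

# Toy usage: minimize ||p - 3||^2 using forward passes only.
p = np.zeros(4)
for _ in range(200):
    p -= 0.1 * spsa_gradient(lambda q: np.sum((q - 3.0) ** 2), p)
print(p)  # approaches [3, 3, 3, 3]
```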
arXiv Detail & Related papers (2023-08-18T23:56:50Z) - Recurrent Bilinear Optimization for Binary Neural Networks [58.972212365275595]
Existing BNNs neglect the intrinsic bilinear relationship between real-valued weights and scale factors.
Our work is the first attempt to optimize BNNs from the bilinear perspective.
We obtain robust RBONNs, which show impressive performance over state-of-the-art BNNs on various models and datasets.
arXiv Detail & Related papers (2022-09-04T06:45:33Z) - FAST: DNN Training Under Variable Precision Block Floating Point with
Stochastic Rounding [11.820523621760255]
Block Floating Point (BFP) can efficiently support quantization for Deep Neural Network (DNN) training.
We propose a Fast First, Accurate Second Training (FAST) system for DNNs, where the weights, activations, and gradients are represented in BFP.
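Stochastic rounding rounds a value up or down with probability given by its distance to each neighbor, so the quantization error is zero in expectation, which helps preserve small gradient updates at low precision. A minimal NumPy sketch of the generic operation (not the FAST system itself):

```python
import numpy as np

def stochastic_round(x, rng=None):
    """Round down or up at random, with probability given by the fractional
    part, so the rounding error is zero in expectation. Generic sketch of
    the stochastic rounding step used when quantizing to low-precision BFP."""
    rng = rng or np.random.default_rng()
    floor = np.floor(x)
    frac = x - floor
    return floor + (rng.random(x.shape) < frac)

x = np.full(10_000, 0.3)
print(stochastic_round(x).mean())  # ~0.3, whereas round-to-nearest gives 0.0
```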
arXiv Detail & Related papers (2021-10-28T22:24:33Z) - Low-Precision Training in Logarithmic Number System using Multiplicative
Weight Update [49.948082497688404]
Training large-scale deep neural networks (DNNs) currently requires a significant amount of energy, leading to serious environmental impacts.
One promising approach to reduce the energy costs is representing DNNs with low-precision numbers.
We jointly design a low-precision training framework involving a logarithmic number system (LNS) and a multiplicative weight update training method, termed LNS-Madam.
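In a logarithmic number system, a value is stored as a sign plus the logarithm of its magnitude, so multiplications reduce to additions of the stored exponents. The toy sketch below shows only this LNS arithmetic; the multiplicative weight update (Madam) and the full LNS-Madam training framework are not reproduced here.

```python
import numpy as np

def to_lns(x):
    """Encode values as (sign, log2 of magnitude): a toy logarithmic
    number system (LNS) representation for nonzero inputs."""
    return np.sign(x), np.log2(np.abs(x))

def lns_mul(a, b):
    """In LNS, multiplication reduces to adding the stored exponents,
    which is far cheaper in hardware than a floating-point multiply."""
    (sa, ea), (sb, eb) = a, b
    return sa * sb, ea + eb

def from_lns(v):
    sign, exp = v
    return sign * 2.0 ** exp

a, b = to_lns(np.array([1.5])), to_lns(np.array([-4.0]))
print(from_lns(lns_mul(a, b)))  # [-6.]
```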
arXiv Detail & Related papers (2021-06-26T00:32:17Z) - FracTrain: Fractionally Squeezing Bit Savings Both Temporally and
Spatially for Efficient DNN Training [81.85361544720885]
We propose FracTrain, which integrates progressive fractional quantization that gradually increases the precision of activations, weights, and gradients.
FracTrain reduces the computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better (-0.12% to +1.87%) accuracy.
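As a toy illustration of progressive precision, the sketch below grows the bitwidth linearly over training; FracTrain's actual schedules are adaptive per layer and per input, and the bit range here is an assumption for illustration.

```python
def precision_schedule(epoch, total_epochs, min_bits=4, max_bits=8):
    """Toy progressive schedule: start training with few bits and grow the
    bitwidth linearly toward the maximum. FracTrain's actual schedules are
    adaptive per layer and per input; the bit range here is assumed."""
    frac = epoch / max(total_epochs - 1, 1)
    return int(round(min_bits + frac * (max_bits - min_bits)))

for epoch in (0, 3, 6, 9):
    print(epoch, precision_schedule(epoch, total_epochs=10))  # 4, 5, 7, 8 bits
```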
arXiv Detail & Related papers (2020-12-24T05:24:10Z) - Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network
Training [0.5219568203653523]
We develop a sparse DNN training accelerator that produces pruned models with the same accuracy as dense models, without first training, then pruning, and finally retraining a dense model.
Compared to training the equivalent unpruned models using a state-of-the-art DNN accelerator without sparse training support, Procrustes consumes up to 3.26x less energy and offers up to 4x speedup across a range of models, while pruning weights by an order of magnitude and maintaining unpruned accuracy.
arXiv Detail & Related papers (2020-09-23T07:39:55Z) - Distillation Guided Residual Learning for Binary Convolutional Neural
Networks [83.6169936912264]
It is challenging to bridge the performance gap between a binary CNN (BCNN) and a floating-point CNN (FCNN).
We observe that this performance gap leads to substantial residuals between the intermediate feature maps of the BCNN and FCNN.
To minimize the performance gap, we enforce the BCNN to produce intermediate feature maps similar to those of the FCNN.
This training strategy, i.e., optimizing each binary convolutional block with a block-wise distillation loss derived from the FCNN, leads to more effective optimization of the BCNN.
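A minimal sketch of such a block-wise distillation objective, assuming per-block intermediate feature maps are available from both networks; the paper's exact loss formulation may differ.

```python
import numpy as np

def blockwise_distillation_loss(bcnn_feats, fcnn_feats):
    """Sum of per-block mean-squared errors between intermediate feature maps
    of the binary network and its full-precision counterpart. A minimal
    sketch of a block-wise distillation objective; the paper's exact loss
    may differ."""
    return sum(np.mean((b - f) ** 2) for b, f in zip(bcnn_feats, fcnn_feats))

# Toy usage with random stand-ins for three blocks' feature maps.
bcnn = [np.random.randn(1, 16, 8, 8) for _ in range(3)]
fcnn = [np.random.randn(1, 16, 8, 8) for _ in range(3)]
print(blockwise_distillation_loss(bcnn, fcnn))
```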
arXiv Detail & Related papers (2020-07-10T07:55:39Z)