FAST: DNN Training Under Variable Precision Block Floating Point with
Stochastic Rounding
- URL: http://arxiv.org/abs/2110.15456v1
- Date: Thu, 28 Oct 2021 22:24:33 GMT
- Title: FAST: DNN Training Under Variable Precision Block Floating Point with
Stochastic Rounding
- Authors: Sai Qian Zhang, Bradley McDanel, H.T. Kung
- Abstract summary: Block Floating Point (BFP) can efficiently support quantization for Deep Neural Network (DNN) training.
We propose a Fast First, Accurate Second Training (FAST) system for DNNs, where the weights, activations, and gradients are represented in BFP.
- Score: 11.820523621760255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Block Floating Point (BFP) can efficiently support quantization for Deep
Neural Network (DNN) training by providing a wide dynamic range via a shared
exponent across a group of values. In this paper, we propose a Fast First,
Accurate Second Training (FAST) system for DNNs, where the weights,
activations, and gradients are represented in BFP. FAST supports matrix
multiplication with variable precision BFP input operands, enabling incremental
increases in DNN precision throughout training. By increasing the BFP precision
across both training iterations and DNN layers, FAST can greatly shorten the
training time while reducing overall hardware resource usage. Our FAST
Multiplier-Accumulator (fMAC) supports dot product computations under multiple
BFP precisions. We validate our FAST system on multiple DNNs with different
datasets, demonstrating a 2-6$\times$ speedup in training on a single-chip
platform over prior work based on mixed-precision or block floating
point number systems while achieving similar performance in validation
accuracy.
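To make the shared-exponent idea concrete, below is a minimal PyTorch sketch of BFP quantization with stochastic rounding. The group size of 16, the exponent-selection rule, and the clamping behavior are illustrative assumptions, not the FAST system's exact specification, and the fMAC hardware path is not modeled.

```python
import torch

def bfp_quantize_stochastic(x: torch.Tensor, mantissa_bits: int, group_size: int = 16) -> torch.Tensor:
    """Quantize x to block floating point (BFP) with stochastic rounding.

    Each contiguous group of `group_size` values shares one exponent, taken
    from the largest magnitude in the group; every value then keeps a signed
    mantissa of `mantissa_bits` bits. Illustrative sketch only -- group size,
    exponent selection, and clamping details are assumptions, not the FAST spec.
    """
    orig_shape = x.shape
    flat = x.flatten()
    pad = (-flat.numel()) % group_size
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    groups = flat.view(-1, group_size)

    # Shared exponent per group: exponent of the largest-magnitude element.
    max_abs = groups.abs().amax(dim=1, keepdim=True).clamp_min(1e-30)
    shared_exp = torch.floor(torch.log2(max_abs))

    # Scale so that each value becomes an integer mantissa of the given width.
    scale = torch.exp2(shared_exp - (mantissa_bits - 1))
    scaled = groups / scale

    # Stochastic rounding: round down, then round up with probability equal
    # to the fractional part, so the rounding error is zero-mean.
    floor_val = torch.floor(scaled)
    prob_up = scaled - floor_val
    mantissa = floor_val + (torch.rand_like(scaled) < prob_up).to(scaled.dtype)

    # Clamp mantissas to the representable signed range and dequantize.
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissa = mantissa.clamp(-limit - 1, limit)
    deq = (mantissa * scale).flatten()[: flat.numel() - pad]
    return deq.view(orig_shape)

# Example: low precision early in training, higher precision later.
w = torch.randn(4, 32)
w_2bit = bfp_quantize_stochastic(w, mantissa_bits=2)
w_4bit = bfp_quantize_stochastic(w, mantissa_bits=4)
```

Raising `mantissa_bits` over the course of training mirrors the "fast first, accurate second" schedule described in the abstract.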
Related papers
- BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices [14.536949788395837]
Block floating point (BFP) quantization is one of the representative compression approaches for reducing the memory and computational burden.
We develop a BFP-based bitwidth-aware analytical modeling framework (called BitQ) to identify the best BFP implementation of DNN inference on embedded platforms.
arXiv Detail & Related papers (2024-09-25T17:03:49Z)
- Trainable Fixed-Point Quantization for Deep Learning Acceleration on FPGAs [30.325651150798915]
Quantization is a crucial technique for deploying deep learning models on resource-constrained devices, such as embedded FPGAs.
We present QFX, a trainable fixed-point quantization approach that automatically learns the binary-point position during model training.
QFX is implemented as a PyTorch-based library that efficiently emulates fixed-point arithmetic, supported by FPGA HLS.
arXiv Detail & Related papers (2024-01-31T02:18:27Z)
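Since the QFX entry above centers on learning the binary-point position during training, here is a hedged PyTorch sketch of that general idea using a straight-through estimator (STE); the class name, parameterization, and rounding details are illustrative assumptions, not the QFX library's actual API.

```python
import torch
import torch.nn as nn

class LearnableFixedPoint(nn.Module):
    """Fake-quantize to fixed point with a learnable binary-point position.

    The fractional bit count is kept as a continuous parameter and rounded at
    use time; gradients flow through a straight-through estimator (STE).
    Illustrative sketch only -- not the QFX library's actual API.
    """

    def __init__(self, total_bits: int = 8, init_frac_bits: float = 4.0):
        super().__init__()
        self.total_bits = total_bits
        self.frac_bits = nn.Parameter(torch.tensor(init_frac_bits))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Round the learnable fractional bit count with an STE so the
        # binary-point position still receives a gradient signal.
        frac = self.frac_bits + (torch.round(self.frac_bits) - self.frac_bits).detach()
        scale = torch.exp2(frac)
        qmax = 2 ** (self.total_bits - 1) - 1

        # Quantize-dequantize with an STE around the rounding step.
        scaled = x * scale
        q = torch.clamp(torch.round(scaled), -qmax - 1, qmax)
        q = scaled + (q - scaled).detach()
        return q / scale

# Usage: wrap activations (or weights) of a layer during training.
quant = LearnableFixedPoint(total_bits=8)
y = quant(torch.randn(2, 16, requires_grad=True))
y.sum().backward()  # gradients reach both the input and frac_bits
```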
- Projected Stochastic Gradient Descent with Quantum Annealed Binary Gradients [51.82488018573326]
We present QP-SBGD, a novel layer-wise optimiser tailored towards training neural networks with binary weights.
BNNs reduce the computational requirements and energy consumption of deep learning models with minimal loss in accuracy.
Our algorithm is implemented layer-wise, making it suitable to train larger networks on resource-limited quantum hardware.
arXiv Detail & Related papers (2023-10-23T17:32:38Z)
- Efficient N:M Sparse DNN Training Using Algorithm, Architecture, and Dataflow Co-Design [15.47240906902083]
This paper presents a computation-efficient training scheme for N:M sparse DNNs using algorithm, architecture, and dataflow co-design.
At the algorithm level, a bidirectional weight pruning method, dubbed BDWP, is proposed to leverage the N:M sparsity of weights.
At the architecture level, a sparse accelerator for DNN training, namely SAT, is developed to support both the regular dense operations and the computation-efficient N:M sparse operations.
arXiv Detail & Related papers (2023-09-22T17:26:19Z)
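As background for the N:M sparsity used in the entry above: within every group of M consecutive weights, at most N are kept nonzero. Below is a standard magnitude-based 2:4 masking sketch for illustration; the paper's BDWP method prunes bidirectionally and is not reproduced here.

```python
import torch

def nm_sparsify(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every group of m (N:M sparsity).

    Standard magnitude-based masking for illustration; the BDWP method in the
    paper above is not reproduced here.
    """
    assert weight.numel() % m == 0, "weight count must be divisible by m"
    groups = weight.reshape(-1, m)
    # Indices of the n largest-magnitude entries within each group of m.
    topk = groups.abs().topk(n, dim=1).indices
    mask = torch.zeros_like(groups).scatter_(1, topk, 1.0)
    return (groups * mask).reshape(weight.shape)

w = torch.randn(8, 16)
w_sparse = nm_sparsify(w)  # 2:4 sparsity: at most 2 nonzeros per group of 4
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=1).max() <= 2
```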
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- ApproxTrain: Fast Simulation of Approximate Multipliers for DNN Training and Inference [4.386709201336175]
Hardware approximation techniques have shown their effectiveness in gaining resource efficiency in inference accelerators.
This paper presents ApproxTrain, an open-source framework that allows fast evaluation of DNN training and inference using simulated approximate multipliers (a toy stand-in for such a multiplier is sketched below).
arXiv Detail & Related papers (2022-09-09T07:42:05Z)
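To give a feel for what a "simulated approximate multiplier" does, the toy sketch below truncates low mantissa bits of each elementwise product inside a matrix multiply. This is only a stand-in: ApproxTrain itself simulates specific approximate multiplier designs on GPU, which this code does not reproduce.

```python
import torch

def truncate_mantissa(x: torch.Tensor, kept_bits: int = 12) -> torch.Tensor:
    """Zero out the low (23 - kept_bits) mantissa bits of float32 values.

    A toy model of an approximate multiplier's reduced-precision datapath;
    real approximate multiplier designs behave differently.
    """
    assert x.dtype == torch.float32
    bits = x.view(torch.int32)
    mask = torch.tensor(-1 << (23 - kept_bits), dtype=torch.int32)
    return (bits & mask).view(torch.float32)

def approx_matmul(a: torch.Tensor, b: torch.Tensor, kept_bits: int = 12) -> torch.Tensor:
    """Matrix multiply whose elementwise products pass through the approximate
    multiplier model; accumulation stays in full float32 precision."""
    prods = truncate_mantissa(a.unsqueeze(-1) * b.unsqueeze(0), kept_bits)
    return prods.sum(dim=1)

a = torch.randn(4, 8)
b = torch.randn(8, 3)
err = (approx_matmul(a, b) - a @ b).abs().max()  # error introduced by approximation
```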
- Recurrent Bilinear Optimization for Binary Neural Networks [58.972212365275595]
Existing BNN optimization methods neglect the intrinsic bilinear relationship between real-valued weights and scale factors.
Our work is the first attempt to optimize BNNs from the bilinear perspective.
We obtain robust RBONNs, which show impressive performance over state-of-the-art BNNs on various models and datasets.
arXiv Detail & Related papers (2022-09-04T06:45:33Z)
- Receptive Field-based Segmentation for Distributed CNN Inference Acceleration in Collaborative Edge Computing [93.67044879636093]
We study inference acceleration using distributed convolutional neural networks (CNNs) in a collaborative edge computing network.
We propose a novel collaborative edge computing scheme that uses fused-layer parallelization to partition a CNN model into multiple blocks of convolutional layers.
arXiv Detail & Related papers (2022-07-22T18:38:11Z)
- FlexBlock: A Flexible DNN Training Accelerator with Multi-Mode Block Floating Point Support [8.596477111386083]
This paper builds upon the algorithmic observation that DNN training can be accelerated by leveraging multiple BFP precisions.
We develop a flexible DNN training accelerator, dubbed FlexBlock, which supports three different BFP precision modes.
We evaluate the effectiveness of FlexBlock architecture using well-known DNNs on CIFAR, ImageNet and WMT14 datasets.
arXiv Detail & Related papers (2022-03-13T15:05:34Z)
- Two-Timescale End-to-End Learning for Channel Acquisition and Hybrid Precoding [94.40747235081466]
We propose an end-to-end deep learning-based joint transceiver design algorithm for millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) systems.
We develop a DNN architecture that maps the received pilots into feedback bits at the receiver, and then further maps the feedback bits into the hybrid precoder at the transmitter.
arXiv Detail & Related papers (2021-10-22T20:49:02Z)
- Distillation Guided Residual Learning for Binary Convolutional Neural Networks [83.6169936912264]
It is challenging to bridge the performance gap between a Binary CNN (BCNN) and a Floating-point CNN (FCNN).
We observe that this performance gap leads to substantial residuals between the intermediate feature maps of the BCNN and the FCNN.
To minimize the performance gap, we enforce the BCNN to produce intermediate feature maps similar to those of the FCNN.
This training strategy, i.e., optimizing each binary convolutional block with a block-wise distillation loss derived from the FCNN, leads to a more effective optimization of the BCNN (a simplified sketch of this loss follows below).
arXiv Detail & Related papers (2020-07-10T07:55:39Z)
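A hedged sketch of the block-wise distillation term mentioned in the entry above: an L2 penalty between each binary block's output feature map and the matching full-precision block's output, added on top of the task loss. The pairing of blocks and the loss weighting are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def blockwise_distillation_loss(bcnn_feats, fcnn_feats, weight: float = 1.0):
    """Sum of MSE losses between matching intermediate feature maps of the
    binary CNN (student) and the full-precision CNN (teacher).

    `bcnn_feats` / `fcnn_feats` are lists of same-shaped tensors collected
    block by block (e.g. via forward hooks); the teacher maps are detached
    so no gradient flows into the FCNN. Pairing and weighting here are
    illustrative assumptions.
    """
    loss = sum(F.mse_loss(b, f.detach()) for b, f in zip(bcnn_feats, fcnn_feats))
    return weight * loss

# Usage: total objective = task loss + block-wise distillation term.
# total_loss = F.cross_entropy(logits, labels) + blockwise_distillation_loss(b_feats, f_feats)
```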
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.