Design and Scaffolded Training of an Efficient DNN Operator for Computer
Vision on the Edge
- URL: http://arxiv.org/abs/2108.11441v1
- Date: Wed, 25 Aug 2021 19:22:25 GMT
- Title: Design and Scaffolded Training of an Efficient DNN Operator for Computer
Vision on the Edge
- Authors: Vinod Ganesan and Pratyush Kumar
- Abstract summary: FuSeConv is a drop-in replacement for depthwise separable convolutions.
FuSeConv factorizes convolutions fully along their spatial and depth dimensions.
Neural Operator Scaffolding scaffolds the training of FuSeConv by distilling knowledge from depthwise separable convolutions.
- Score: 3.3767251810292955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Massively parallel systolic arrays and resource-efficient depthwise separable
convolutions are two promising techniques to accelerate DNN inference on the
edge. Interestingly, their combination is inefficient: Computational patterns
of depthwise separable convolutions do not exhibit a rhythmic systolic flow and
lack sufficient data reuse to saturate systolic arrays. We formally analyse
this inefficiency and propose an efficient operator, an optimal hardware
dataflow, and a superior training methodology to alleviate it. The
efficient operator, called FuSeConv, is a drop-in replacement for depthwise
separable convolutions. FuSeConv factorizes convolutions fully along their
spatial and depth dimensions. The resultant computation efficiently maps to
systolic arrays. The optimal dataflow, called Spatial-Tiled Output Stationary
(ST-OS), maximizes the efficiency of FuSeConv on systolic arrays. It maps
independent convolutions to rows of the array to maximize resource utilization
with negligible VLSI overheads. Neural Operator Scaffolding (NOS) scaffolds the
training of FuSeConv by distilling knowledge from the expensive depthwise
separable convolutions. This bridges the accuracy gap between FuSeConv networks
and baselines. Additionally, NOS can be combined with Neural Architecture
Search (NAS) to trade-off latency and accuracy. The HW/SW co-design of FuSeConv
with ST-OS achieves a significant speedup of 4.1-9.25X with state-of-the-art
efficient networks for ImageNet. The parameter efficiency of FuSeConv and its
significant outperformance of depthwise separable convolutions on systolic
arrays illustrate its promise as a strong solution on the edge. Training
FuSeConv networks with NOS achieves accuracy comparable to the baselines.
Further, by combining NOS with NAS, we design networks that set a new state of
the art, improving on both accuracy and latency on systolic arrays.
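The operator itself is easy to express in a standard deep learning framework. Below is a minimal PyTorch sketch of a fully separable block in the spirit of FuSeConv: the KxK depthwise convolution of a depthwise separable block is factorized into 1D depthwise convolutions along the two spatial axes, followed by the usual 1x1 pointwise convolution. The class name, channel handling, and the sequential ordering of the two 1D convolutions are illustrative assumptions; the paper's exact operator (for example, how channels are split between horizontal and vertical branches) may differ.

```python
import torch
import torch.nn as nn

class FullySeparableBlock(nn.Module):
    """Illustrative sketch of a fully separable convolution block.

    Assumption: the KxK depthwise convolution is replaced by a 1xK and a
    Kx1 depthwise convolution (1D convolutions along width and height),
    followed by the usual 1x1 pointwise convolution. FuSeConv's exact
    channel splitting and branch structure may differ from this sketch.
    """

    def __init__(self, in_ch: int, out_ch: int, k: int = 3, stride: int = 1):
        super().__init__()
        pad = k // 2
        # Depthwise 1D convolution along the width (1 x K kernel).
        self.dw_w = nn.Conv2d(in_ch, in_ch, kernel_size=(1, k),
                              stride=(1, stride), padding=(0, pad),
                              groups=in_ch, bias=False)
        # Depthwise 1D convolution along the height (K x 1 kernel).
        self.dw_h = nn.Conv2d(in_ch, in_ch, kernel_size=(k, 1),
                              stride=(stride, 1), padding=(pad, 0),
                              groups=in_ch, bias=False)
        # Pointwise (1x1) convolution mixes information across channels (depth).
        self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.dw_h(self.dw_w(x))  # two 1D depthwise convs in place of one 2D depthwise conv
        return self.act(self.bn(self.pw(x)))


if __name__ == "__main__":
    block = FullySeparableBlock(in_ch=32, out_ch=64, k=3, stride=2)
    y = block(torch.randn(1, 32, 56, 56))
    print(y.shape)  # torch.Size([1, 64, 28, 28])
```

Because each 1D convolution reads an entire row (or column) of its input, independent rows and columns can be mapped to separate rows of a systolic array, which is roughly what the ST-OS dataflow exploits; NOS then recovers accuracy by distilling from a depthwise separable teacher during training.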
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup over baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - DCP: Learning Accelerator Dataflow for Neural Network via Propagation [52.06154296196845]
This work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort.
DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives.
For example, without using additional training data, DCP surpasses the GAMMA method that performs a full search using thousands of samples.
arXiv Detail & Related papers (2024-10-09T05:16:44Z) - A Generalization of Continuous Relaxation in Structured Pruning [0.3277163122167434]
Trends indicate that deeper and larger neural networks with an increasing number of parameters achieve higher accuracy than smaller neural networks.
We generalize structured pruning with algorithms for network augmentation, pruning, sub-network collapse and removal.
The resulting CNN executes efficiently on GPU hardware without computationally expensive sparse matrix operations.
arXiv Detail & Related papers (2023-08-28T14:19:13Z) - DVFO: Learning-Based DVFS for Energy-Efficient Edge-Cloud Collaborative
Inference [12.095934624748686]
We propose DVFO, a novel DVFS-enabled edge-cloud collaborative inference framework.
It automatically co-optimizes the CPU, GPU and memory frequencies of edge devices, and the feature maps to be offloaded to cloud servers.
It significantly reduces the energy consumption by 33% on average, compared to state-of-the-art schemes.
arXiv Detail & Related papers (2023-06-02T07:00:42Z) - TCT: Convexifying Federated Learning using Bootstrapped Neural Tangent
Kernels [141.29156234353133]
State-of-the-art federated learning methods can perform far worse than their centralized counterparts when clients have dissimilar data distributions.
We show this disparity can largely be attributed to optimization challenges presented by nonconvexity.
We propose a Train-Convexify neural network (TCT) procedure to sidestep this issue.
arXiv Detail & Related papers (2022-07-13T16:58:22Z) - S2Engine: A Novel Systolic Architecture for Sparse Convolutional Neural
Networks [5.417507302691321]
S2Engine transmits compressed data internally and allows each processing element to dynamically select aligned data from the compressed dataflow during convolution.
Compared to the naive systolic array, S2Engine achieves about $3.2\times$ and $3.0\times$ improvements in speed and energy efficiency, respectively.
arXiv Detail & Related papers (2021-06-15T06:08:37Z) - FuSeConv: Fully Separable Convolutions for Fast Inference on Systolic
Arrays [2.8583189395674653]
We propose FuSeConv as a drop-in replacement for depth-wise separable convolution.
FuSeConv generalizes the decomposition of convolutions fully to separable 1D convolutions along spatial and depth dimensions.
We achieve a significant speed-up of 3x-7x with the MobileNet family of networks on a systolic array of size 64x64, with comparable accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-05-27T20:19:39Z) - Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction for the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions, while achieving comparable error bounds both in theory and in practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z) - PHEW: Constructing Sparse Networks that Learn Fast and Generalize Well
without Training Data [10.01323660393278]
We show how to design sparse neural networks for faster convergence, without any training data, using the Synflow-L2 algorithm.
We propose a new method to construct sparse networks, without any training data, referred to as Paths with Higher-Edge Weights (PHEW).
arXiv Detail & Related papers (2020-10-22T00:20:59Z) - Distillation Guided Residual Learning for Binary Convolutional Neural
Networks [83.6169936912264]
It is challenging to bridge the performance gap between a Binary CNN (BCNN) and a Floating-point CNN (FCNN).
We observe that this performance gap leads to substantial residuals between the intermediate feature maps of the BCNN and the FCNN.
To minimize the performance gap, we enforce the BCNN to produce intermediate feature maps similar to those of the FCNN.
This training strategy, i.e., optimizing each binary convolutional block with a block-wise distillation loss derived from the FCNN, leads to a more effective optimization of the BCNN; a minimal sketch of such a block-wise distillation loss appears after this list.
arXiv Detail & Related papers (2020-07-10T07:55:39Z) - Toward fast and accurate human pose estimation via soft-gated skip
connections [97.06882200076096]
This paper is on highly accurate and highly efficient human pose estimation.
We re-analyze the design of skip connections in the context of improving both accuracy and efficiency over the state of the art.
Our model achieves state-of-the-art results on the MPII and LSP datasets.
arXiv Detail & Related papers (2020-02-25T18:51:51Z)