I/O Lower Bounds for Auto-tuning of Convolutions in CNNs
- URL: http://arxiv.org/abs/2012.15667v1
- Date: Thu, 31 Dec 2020 15:46:01 GMT
- Title: I/O Lower Bounds for Auto-tuning of Convolutions in CNNs
- Authors: Xiaoyang Zhang, Junmin Xiao, Guangming Tan
- Abstract summary: We develop a general I/O lower bound theory for a composite algorithm that consists of several different sub-computations.
We design near I/O-optimal dataflow strategies for the two main convolution algorithms by fully exploiting data reuse.
Experimental results show that our dataflow strategies with the auto-tuning approach achieve about a 3.32x performance speedup on average over cuDNN.
- Score: 2.571796445061562
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolution is the most time-consuming part of the computation in
convolutional neural networks (CNNs), which have achieved great success in
numerous applications. Due to complex data dependencies and the growing number
of model samples, convolution suffers from high data-movement (i.e., memory
access) overhead. This work provides a comprehensive analysis and
methodologies to minimize the communication cost of convolution in CNNs.
Through an in-depth analysis of recent I/O complexity theory under the
red-blue pebble game model, we develop a general I/O lower bound theory for
composite algorithms that consist of several different sub-computations. Based
on the proposed theory, we establish data-movement lower bounds for two
representative convolution algorithms in CNNs, namely direct convolution and
the Winograd algorithm. Next, guided by these I/O lower bounds, we design near
I/O-optimal dataflow strategies for the two convolution algorithms by fully
exploiting data reuse. Furthermore, to push the performance of the near
I/O-optimal dataflow strategies even further, we propose an aggressive
auto-tuning design based on the I/O lower bounds, which searches for an
optimal parameter configuration for direct convolution and the Winograd
algorithm on GPUs, such as the number of threads and the amount of shared
memory used in each thread block. Finally, experimental results on direct
convolution and the Winograd algorithm show that our dataflow strategies
combined with the auto-tuning approach achieve about a 3.32x performance
speedup on average over cuDNN. In addition, compared with TVM, which
represents the state of the art in auto-tuning, our I/O-lower-bound-based
auto-tuning method not only finds the optimal parameter configuration faster
but also delivers higher performance than the optimal solution provided by
TVM.
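To make the auto-tuning idea above concrete, the following is a minimal sketch: it scores candidate thread-block tilings of a direct convolution with a rough global-memory traffic model and keeps the cheapest configuration that fits in shared memory. The cost formula, tile parameters (tn, tk, thw), and candidate ranges are illustrative assumptions, not the paper's actual I/O lower bounds or dataflow.

```python
# Illustrative sketch only: brute-force a tile configuration for a direct
# convolution using a rough estimate of global-memory traffic, subject to a
# shared-memory capacity constraint.
from itertools import product

def ceil_div(a, b):
    return (a + b - 1) // b

def traffic_estimate(N, C, H, W, K, R, S, tn, tk, thw, smem_bytes, dtype=4):
    """Rough global-memory traffic (bytes) when each thread block produces a
    (tn x tk x thw) output tile and stages inputs/filters in shared memory."""
    out_pixels = (H - R + 1) * (W - S + 1)            # 'valid' conv, stride 1
    blocks = ceil_div(N, tn) * ceil_div(K, tk) * ceil_div(out_pixels, thw)
    in_bytes = tn * C * thw * dtype                   # input slice (halo ignored)
    flt_bytes = tk * C * R * S * dtype                # filter slice
    out_bytes = tn * tk * thw * dtype                 # output tile
    if in_bytes + flt_bytes > smem_bytes:             # tile does not fit in shared memory
        return None
    return blocks * (in_bytes + flt_bytes + out_bytes)

def autotune(N=32, C=64, H=56, W=56, K=64, R=3, S=3, smem_bytes=48 * 1024):
    """Pick the tile configuration with the smallest estimated traffic."""
    best = None
    for tn, tk, thw in product([1, 2, 4], [8, 16, 32, 64], [16, 32, 64, 128]):
        cost = traffic_estimate(N, C, H, W, K, R, S, tn, tk, thw, smem_bytes)
        if cost is not None and (best is None or cost < best[0]):
            best = (cost, (tn, tk, thw))
    return best

if __name__ == "__main__":
    cost, (tn, tk, thw) = autotune()
    print(f"best tile (tn={tn}, tk={tk}, thw={thw}): ~{cost / 2**20:.1f} MiB of traffic")
```

In the paper's setting, the cost guidance comes from the proved I/O lower bounds rather than an ad hoc traffic estimate like the one above, and the search also covers the Winograd algorithm.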
Related papers
- Ensemble Quadratic Assignment Network for Graph Matching [52.20001802006391]
Graph matching is a commonly used technique in computer vision and pattern recognition.
Recent data-driven approaches have improved the graph matching accuracy remarkably.
We propose a graph neural network (GNN) based approach to combine the advantages of data-driven and traditional methods.
arXiv Detail & Related papers (2024-03-11T06:34:05Z)
- Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO).
Results show that TSO is capable of identifying the best tensor slicing that minimizes execution time for a set of CNN models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z)
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O\big(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T}\big)$ with a communication cost of $O(k \log(d))$ at each iteration.
(A toy count-sketch illustration of this style of gradient compression appears after this list.)
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- Distributed Learning and Democratic Embeddings: Polynomial-Time Source Coding Schemes Can Achieve Minimax Lower Bounds for Distributed Gradient Descent under Communication Constraints [46.17631511884969]
We consider the problem of compressing a vector in the n-dimensional Euclidean space, subject to a bit budget of R bits per dimension.
We show that Democratic and Near-Democratic source-coding schemes are (near) optimal in the sense that the covering efficiency of the resulting quantizer is either dimension independent, or has a very weak logarithmic dependence.
We propose a distributed optimization algorithm: DGD-DEF, which employs our proposed coding strategy, and achieves the minimax optimal convergence rate to within (near) constant factors.
arXiv Detail & Related papers (2021-03-13T00:04:11Z)
- Normalized Convolution Upsampling for Refined Optical Flow Estimation [23.652615797842085]
Normalized Convolution UPsampler (NCUP) is an efficient joint upsampling approach to produce the full-resolution flow during the training of optical flow CNNs.
Our proposed approach formulates the upsampling task as a sparse problem and employs the normalized convolutional neural networks to solve it.
We achieve state-of-the-art results on the Sintel benchmark with a 6% error reduction, and on-par results on the KITTI dataset, while having 7.5% fewer parameters.
arXiv Detail & Related papers (2021-02-13T18:34:03Z)
- Analytical Characterization and Design Space Exploration for Optimization of CNNs [10.15406080228806]
Loop-level optimizations, including loop tiling and loop permutation, are fundamental transformations for reducing data movement.
This paper develops an analytical modeling approach for finding the best loop-level optimization configuration for CNNs on multi-core CPUs.
arXiv Detail & Related papers (2021-01-24T21:36:52Z)
- FBNetV3: Joint Architecture-Recipe Search using Predictor Pretraining [65.39532971991778]
We present an accuracy predictor that scores architecture and training recipes jointly, guiding both sample selection and ranking.
We run fast evolutionary searches in just CPU minutes to generate architecture-recipe pairs for a variety of resource constraints.
FBNetV3 makes up a family of state-of-the-art compact neural networks that outperform both automatically and manually-designed competitors.
arXiv Detail & Related papers (2020-06-03T05:20:21Z)
- FedPD: A Federated Learning Framework with Optimal Rates and Adaptivity to Non-IID Data [59.50904660420082]
Federated Learning (FL) has become a popular paradigm for learning from distributed data.
To effectively utilize data at different devices without moving them to the cloud, algorithms such as the Federated Averaging (FedAvg) have adopted a "computation then aggregation" (CTA) model.
arXiv Detail & Related papers (2020-05-22T23:07:42Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization at large scale with a deep neural network as the predictive model.
Our algorithm requires far fewer communication rounds than a naive parallel approach while retaining the same theoretical guarantees.
Experiments on several datasets demonstrate the effectiveness of our algorithm and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
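As a side illustration of the sketching-based compression mentioned in the Communication-Efficient Adam-Type Algorithms entry above, here is a toy count sketch of a gradient vector: d coordinates are folded into k signed buckets so that only k values (plus small hash seeds) need to be communicated per round. The count-sketch construction and all sizes below are generic illustrative assumptions, not taken from that paper.

```python
# Toy count-sketch compression of a gradient vector (illustrative only).
import numpy as np

def make_sketch_params(d, k, seed=0):
    r = np.random.default_rng(seed)
    bucket = r.integers(0, k, size=d)         # hash h: [d] -> [k]
    sign = r.choice([-1.0, 1.0], size=d)      # sign hash s: [d] -> {+1, -1}
    return bucket, sign

def compress(grad, bucket, sign, k):
    """Sketch: S[j] = sum over i with h(i)=j of s(i) * g(i)."""
    sketch = np.zeros(k)
    np.add.at(sketch, bucket, sign * grad)
    return sketch

def decompress(sketch, bucket, sign):
    """Unbiased estimate of each coordinate: g_i ~ s(i) * S[h(i)]."""
    return sign * sketch[bucket]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 10_000, 500                        # fold 10k coordinates into 500 counters
    grad = rng.standard_normal(d)
    grad[:20] += 10.0                         # a few large ("heavy hitter") coordinates
    bucket, sign = make_sketch_params(d, k, seed=42)
    est = decompress(compress(grad, bucket, sign, k), bucket, sign)
    top = np.argsort(-np.abs(grad))[:20]
    rel_err = np.linalg.norm(est[top] - grad[top]) / np.linalg.norm(grad[top])
    print(f"relative error on the largest coordinates: {rel_err:.3f}")
```

In a count sketch, the largest coordinates are recovered with comparatively small error, which is the usual motivation for this style of gradient compression.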
This list is automatically generated from the titles and abstracts of the papers on this site.