High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures
- URL: http://arxiv.org/abs/2408.00278v1
- Date: Thu, 1 Aug 2024 04:37:03 GMT
- Title: High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures
- Authors: Xiang Fu, Xinpeng Zhang, Jixiang Ma, Peng Zhao, Shuai Lu, Xu T. Liu,
- Abstract summary: This paper proposes three novel data layouts for im2win convolution: NHWC, CHWN, and CHWN8.
We compare the optimized im2win convolution with the direct convolution and PyTorch's im2col-based convolution across the aforementioned layouts on SIMD machines.
Our optimized im2win and direct convolutions achieved up to 95% and 94% of the machine's theoretical peak performance, respectively.
- Score: 26.146937503081876
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Convolution is the core component of deep neural networks, and it is computationally intensive and time-consuming. Tensor data layouts significantly impact convolution operations in terms of memory access and computational efficiency. Yet there is still no comprehensive performance characterization of data layouts on SIMD architectures across convolution methods. This paper proposes three novel data layouts for im2win convolution: NHWC, CHWN, and CHWN8, and introduces a set of general optimization techniques for both direct and im2win convolutions. We compare the optimized im2win convolution with the direct convolution and PyTorch's im2col-based convolution across the aforementioned layouts on SIMD machines. The experiments demonstrated that the im2win convolution with the new NHWC layout achieved up to a 355% speedup over the NCHW layout. Our optimizations also significantly improve the performance of both im2win and direct convolutions. Our optimized im2win and direct convolutions achieved up to 95% and 94% of the machine's theoretical peak performance, respectively.
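To make the layout discussion concrete, here is a minimal NumPy sketch (our illustration, not the paper's code) that expresses the three proposed layouts as axis permutations of an NCHW tensor; we assume CHWN8 denotes CHWN with the batch dimension blocked by 8 so that eight images sit contiguously for SIMD loads.

```python
import numpy as np

def nchw_to_nhwc(x):                 # x: (N, C, H, W)
    return np.ascontiguousarray(x.transpose(0, 2, 3, 1))

def nchw_to_chwn(x):
    return np.ascontiguousarray(x.transpose(1, 2, 3, 0))

def nchw_to_chwn8(x, block=8):
    # Assumed reading of "CHWN8": split N into blocks of `block` so the
    # innermost axis holds `block` batch elements contiguously.
    n, c, h, w = x.shape
    assert n % block == 0, "batch size must be a multiple of the block"
    return np.ascontiguousarray(
        x.transpose(1, 2, 3, 0).reshape(c, h, w, n // block, block))
```

The appeal of NHWC here is that the C input values read by one output pixel become unit-stride, which is exactly what a vectorized FMA loop wants.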
Related papers
- ConvBench: A Comprehensive Benchmark for 2D Convolution Primitive Evaluation [0.34952465649465553]
This paper proposes ConvBench, a primitive-level benchmark for the evaluation and comparison of convolution algorithms.
It assesses 9243 convolution operations derived from 1097 real-world deep learning models.
In its experiments, the assessed convolution algorithm was faster than Im2col-GEMM in 93.6% of the convolutions.
arXiv Detail & Related papers (2024-07-15T13:58:24Z) - Im2win: Memory Efficient Convolution On SIMD Architectures [2.153650601445911]
We propose a new memory-efficient data transformation algorithm, called im2win.
Our results show that our algorithm reduces the memory overhead by an average of 41.6% compared to PyTorch's convolution implementation.
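The memory saving is easiest to see in a single-channel sketch (our simplification, not the paper's exact layout): im2col duplicates the horizontal overlap between adjacent k-by-k windows, while im2win copies each input row only once per output row.

```python
import numpy as np

def im2win_2d(x, k, stride=1):
    """For each output row, copy the k input rows it reads into one
    contiguous window buffer; horizontal overlap is never duplicated."""
    h, w = x.shape
    h_out = (h - k) // stride + 1
    win = np.empty((h_out, k * w), dtype=x.dtype)
    for i in range(h_out):
        win[i] = x[i * stride : i * stride + k].ravel()
    return win   # (h_out, k*w), versus im2col's (h_out*w_out, k*k) rows
```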
arXiv Detail & Related papers (2023-06-25T19:21:10Z) - Im2win: An Efficient Convolution Paradigm on GPU [1.9162301033784574]
This paper proposes a convolution paradigm on GPU called im2win, which not only reduces the memory footprint but also offers continuous memory accesses.
We compare our implementation with the direct convolution, PyTorch's GEMM-based convolution, and six cuDNN-based convolution implementations on twelve state-of-the-art DNN benchmarks.
arXiv Detail & Related papers (2023-06-25T19:09:56Z) - Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions [1.2006896500048552]
Convolution is one of the most computationally intensive operations that must be performed for machine-learning model inference.
This paper proposes SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers.
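As a rough intuition for what slicing buys a direct convolution, the toy schedule below tiles each output row into short strips that a compiler can vectorize (names and the tiling factor are ours; SConv's actual slicing, packing, and ISA-specific microkernels are far more involved).

```python
import numpy as np

def direct_conv_sliced(x, wgt, tile=8):
    """Direct convolution, stride 1, no padding.
    x: (C_in, H, W); wgt: (C_out, C_in, k, k)."""
    c_in, h, w = x.shape
    c_out, _, k, _ = wgt.shape
    h_out, w_out = h - k + 1, w - k + 1
    y = np.zeros((c_out, h_out, w_out), dtype=x.dtype)
    for co in range(c_out):
        for i in range(h_out):
            for j0 in range(0, w_out, tile):      # slice along width
                j1 = min(j0 + tile, w_out)
                for ci in range(c_in):
                    for p in range(k):
                        for q in range(k):
                            # one contiguous strip of outputs per update
                            y[co, i, j0:j1] += (wgt[co, ci, p, q]
                                * x[ci, i + p, j0 + q : j1 + q])
    return y
```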
arXiv Detail & Related papers (2023-03-08T17:23:39Z) - Content-Aware Convolutional Neural Networks [98.97634685964819]
Convolutional Neural Networks (CNNs) have achieved great success due to the powerful feature learning ability of convolution layers.
We propose a Content-aware Convolution (CAC) that automatically detects the smooth windows and applies a 1x1 convolutional kernel to replace the original large kernel.
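A toy single-channel sketch of that idea follows (the variance test and threshold are stand-ins for the paper's actual smoothness detection):

```python
import numpy as np

def content_aware_conv(x, w_full, w_1x1, tau=1e-3):
    """Where a k-by-k window is smooth, apply the cheap 1x1 kernel to
    its center instead of the full kernel. Stride 1, no padding."""
    k = w_full.shape[0]
    h_out, w_out = x.shape[0] - k + 1, x.shape[1] - k + 1
    y = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            win = x[i:i + k, j:j + k]
            if win.var() < tau:                 # smooth window
                y[i, j] = w_1x1 * win[k // 2, k // 2]
            else:                               # detailed window
                y[i, j] = (w_full * win).sum()
    return y
```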
arXiv Detail & Related papers (2021-06-30T03:54:35Z) - Involution: Inverting the Inherence of Convolution for Visual Recognition [72.88582255910835]
We present a novel atomic operation for deep neural networks by inverting the principles of convolution, coined as involution.
The proposed involution operator could be leveraged as fundamental bricks to build the new generation of neural networks for visual recognition.
Our involution-based models improve the performance of convolutional baselines using ResNet-50 by up to 1.6% top-1 accuracy, 2.5% and 2.4% bounding box AP, and 4.7% mean IoU absolutely.
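A minimal single-group sketch of the operator (our simplification): the kernel is generated from the feature at each spatial position and shared across channels, inverting convolution's spatial-agnostic, channel-specific design.

```python
import numpy as np

def involution(x, kernel_gen, k=3):
    """x: (C, H, W); kernel_gen maps a length-C feature vector to k*k
    kernel weights (the paper uses a learned bottleneck; here it is an
    arbitrary function)."""
    c, h, w = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    y = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            ker = kernel_gen(x[:, i, j]).reshape(k, k)  # per-pixel kernel
            y[:, i, j] = (xp[:, i:i + k, j:j + k] * ker).sum(axis=(1, 2))
    return y

# Stand-in kernel generator: a fixed random linear map.
rng = np.random.default_rng(0)
W_gen = 0.1 * rng.standard_normal((9, 16))
y = involution(rng.standard_normal((16, 8, 8)), lambda v: W_gen @ v, k=3)
```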
arXiv Detail & Related papers (2021-03-10T18:40:46Z) - I/O Lower Bounds for Auto-tuning of Convolutions in CNNs [2.571796445061562]
We develop a general I/O lower bound theory for a composite algorithm which consists of several different sub-computations.
We design near-I/O-optimal dataflow strategies for the two main convolution algorithms by fully exploiting data reuse.
Experiment results show that our dataflow strategies with the auto-tuning approach can achieve about 3.32x performance speedup on average over cuDNN.
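For context on the kind of bound this theory generalizes (our addition, not the paper's result), the classic red-blue pebble game argument of Hong and Kung gives, for an $n \times n$ matrix multiplication with fast memory of size $S$, the I/O lower bound

$$Q = \Omega\!\left(\frac{n^{3}}{\sqrt{S}}\right),$$

and a dataflow approaches such a bound by tiling so that each block loaded into fast memory is reused on the order of $\sqrt{S}$ times before eviction.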
arXiv Detail & Related papers (2020-12-31T15:46:01Z) - Inception Convolution with Efficient Dilation Search [121.41030859447487]
Dilated convolution is a critical variant of standard convolution for controlling effective receptive fields and handling the large scale variance of objects.
We propose a new variant of dilated convolution, namely inception (dilated) convolution, where the convolutions have independent dilations among different axes, channels, and layers.
To fit the complex inception convolution to the data in practice, we develop a simple yet effective dilation search algorithm (EDO) based on statistical optimization.
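The core mechanism, dilation chosen independently per axis, reduces to an ordinary dilated convolution with separate horizontal and vertical rates; a minimal sketch follows (per-channel and per-layer independence would just apply it with different rates per channel or layer):

```python
import numpy as np

def conv2d_axis_dilated(x, w, d_h, d_w):
    """Single-channel convolution with independent dilations per spatial
    axis. x: (H, W); w: (k, k); stride 1, no padding."""
    k = w.shape[0]
    span_h, span_w = (k - 1) * d_h + 1, (k - 1) * d_w + 1
    h_out, w_out = x.shape[0] - span_h + 1, x.shape[1] - span_w + 1
    y = np.zeros((h_out, w_out))
    for p in range(k):
        for q in range(k):
            y += w[p, q] * x[p * d_h : p * d_h + h_out,
                             q * d_w : q * d_w + w_out]
    return y
```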
arXiv Detail & Related papers (2020-12-25T14:58:35Z) - Binarizing MobileNet via Evolution-based Searching [66.94247681870125]
We propose the use of evolutionary search to facilitate the construction and training scheme when binarizing MobileNet.
Inspired by one-shot architecture search frameworks, we manipulate the idea of group convolution to design efficient 1-Bit Convolutional Neural Networks (CNNs).
Our objective is to come up with a tiny yet efficient binary neural architecture by exploring the best candidates of the group convolution.
arXiv Detail & Related papers (2020-05-13T13:25:51Z) - Training Binary Neural Networks with Real-to-Binary Convolutions [52.91164959767517]
We show how to train binary networks to within a few percentage points of their full-precision counterparts.
We show how to build a strong baseline, which already achieves state-of-the-art accuracy.
We show that, when putting all of our improvements together, the proposed model beats the current state of the art by more than 5% top-1 accuracy on ImageNet.
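For readers unfamiliar with the underlying operation, here is a minimal binary convolution with a real-valued scaling factor, the XNOR-Net-style mechanism that real-to-binary approaches refine with data-driven re-scaling (this is not the paper's exact block):

```python
import numpy as np

def binary_conv1d(x, w):
    """1-D, single-channel, stride 1."""
    xb = np.where(x >= 0, 1.0, -1.0)     # 1-bit activations
    wb = np.where(w >= 0, 1.0, -1.0)     # 1-bit weights
    alpha = np.abs(w).mean()             # real re-scaling factor
    k = w.shape[0]
    y = np.empty(x.shape[0] - k + 1)
    for i in range(y.shape[0]):
        y[i] = alpha * (xb[i:i + k] * wb).sum()   # +/-1 dot product
    return y
```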
arXiv Detail & Related papers (2020-03-25T17:54:38Z) - XSepConv: Extremely Separated Convolution [60.90871656244126]
We propose a novel extremely separated convolutional block (XSepConv).
It fuses spatially separable convolutions into depthwise convolution to reduce both the computational cost and parameter size of large kernels.
XSepConv is designed to be an efficient alternative to vanilla depthwise convolution with large kernel sizes.
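A per-channel sketch of the decomposition (our toy version; the actual block also adds a small 2x2 depthwise convolution and improved symmetric padding, omitted here):

```python
import numpy as np

def xsep_depthwise(x, w_kx1, w_1xk):
    """Approximate a k-by-k depthwise kernel by a k-by-1 pass followed
    by a 1-by-k pass. x: (C, H, W); w_kx1, w_1xk: (C, k); no padding."""
    c, h, w = x.shape
    k = w_kx1.shape[1]
    h1 = h - k + 1
    y1 = np.zeros((c, h1, w))
    for p in range(k):                         # vertical k x 1 pass
        y1 += w_kx1[:, p, None, None] * x[:, p:p + h1, :]
    w1 = w - k + 1
    y2 = np.zeros((c, h1, w1))
    for q in range(k):                         # horizontal 1 x k pass
        y2 += w_1xk[:, q, None, None] * y1[:, :, q:q + w1]
    return y2
```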
arXiv Detail & Related papers (2020-02-27T11:46:17Z)