Im2win: An Efficient Convolution Paradigm on GPU
- URL: http://arxiv.org/abs/2306.14316v1
- Date: Sun, 25 Jun 2023 19:09:56 GMT
- Title: Im2win: An Efficient Convolution Paradigm on GPU
- Authors: Shuai Lu and Jun Chu and Luanzheng Guo and Xu T. Liu
- Abstract summary: This paper proposes a window-order-based convolution paradigm on GPU called im2win, which not only reduces the memory footprint but also offers continuous memory accesses.
We compare our implementation with the direct convolution, PyTorch's GEMM-based convolution, and six cuDNN-based convolution implementations on twelve state-of-the-art DNN benchmarks.
- Score: 1.9162301033784574
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Convolution is the most time-consuming operation in deep neural
networks, so its performance is critical to the overall performance of the
network. The commonly used methods for convolution on GPU include the
general matrix multiplication (GEMM)-based convolution and the direct
convolution. GEMM-based convolution relies on the im2col algorithm, which
results in a large memory footprint and reduced performance. Direct convolution
does not have the large memory footprint problem, but its performance is not on
par with the GEMM-based approach because of discontinuous memory accesses. This
paper proposes a window-order-based convolution paradigm on GPU, called im2win,
which not only reduces memory footprint but also offers continuous memory
accesses, resulting in improved performance. Furthermore, we apply a range of
optimization techniques on the convolution CUDA kernel, including shared
memory, tiling, micro-kernel, double buffering, and prefetching. We compare our
implementation with the direct convolution, PyTorch's GEMM-based convolution
with cuBLAS, and six cuDNN-based convolution implementations on twelve
state-of-the-art DNN benchmarks. The experimental results show that our
implementation 1) uses 23.1% less memory and achieves 3.5$\times$ the TFLOPS of
cuBLAS, 2) uses 32.8% less memory and achieves up to 1.8$\times$ the TFLOPS of
the best-performing convolutions in cuDNN, and 3) achieves up to 155$\times$
the TFLOPS of the direct convolution. We further perform an ablation study on the applied
optimization techniques and find that the micro-kernel has the greatest
positive impact on performance.
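To make the footprint and access-pattern argument concrete, below is a minimal NumPy sketch rather than the paper's CUDA implementation; all function names are illustrative. For an $H \times W$ input and a $K \times K$ kernel, im2col materializes a $K^2 \times H_o W_o$ buffer, duplicating each input element up to $K^2$ times, whereas a window-order layout stores each group of $K$ input rows once, so horizontally overlapping windows share storage and are read sequentially.

```python
# Minimal NumPy sketch (illustrative only; the paper's implementation is a
# set of CUDA kernels): single channel, stride 1, no padding.
import numpy as np

def conv2d_direct(x, w):
    """Reference direct convolution."""
    H, W = x.shape
    K = w.shape[0]
    Ho, Wo = H - K + 1, W - K + 1
    out = np.zeros((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            # Strided 2D reads: discontinuous memory access on real hardware.
            out[i, j] = np.sum(x[i:i + K, j:j + K] * w)
    return out

def conv2d_im2col(x, w):
    """im2col: each KxK window becomes a column of a (K*K, Ho*Wo) matrix,
    duplicating every input element up to K*K times (large footprint)."""
    H, W = x.shape
    K = w.shape[0]
    Ho, Wo = H - K + 1, W - K + 1
    cols = np.empty((K * K, Ho * Wo))
    for i in range(Ho):
        for j in range(Wo):
            cols[:, i * Wo + j] = x[i:i + K, j:j + K].ravel()
    # Convolution collapses into one GEMM: (1, K*K) @ (K*K, Ho*Wo).
    return (w.ravel() @ cols).reshape(Ho, Wo)

def conv2d_window_order(x, w):
    """Window-order (im2win-style) layout: each output row stores the K input
    rows it needs once, contiguously, so horizontally overlapping windows
    share storage (~K x smaller than im2col) and reads stay sequential."""
    H, W = x.shape
    K = w.shape[0]
    Ho, Wo = H - K + 1, W - K + 1
    rows = np.stack([x[i:i + K, :].ravel() for i in range(Ho)])  # (Ho, K*W)
    out = np.empty((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            win = rows[i].reshape(K, W)[:, j:j + K]  # window from row buffer
            out[i, j] = np.dot(win.ravel(), w.ravel())
    return out

x, w = np.random.rand(8, 8), np.random.rand(3, 3)
assert np.allclose(conv2d_direct(x, w), conv2d_im2col(x, w))
assert np.allclose(conv2d_direct(x, w), conv2d_window_order(x, w))
print("buffers: im2col", 9 * 36, "elements; window-order", 6 * 24,
      "elements; raw input", 64)
```

For the 8$\times$8 input and 3$\times$3 kernel above, the buffers hold 324 (im2col) versus 144 (window-order) elements. The paper's CUDA kernel layers shared memory, tiling, a micro-kernel, double buffering, and prefetching on top of this layout; none of that is modeled in the sketch.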
Related papers
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrarily small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
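As a toy illustration of the tiling idea, here is a hypothetical NumPy sketch, not the paper's implementation (which also tiles hierarchically across distributed devices): the log-sum-exp term of an InfoNCE-style contrastive loss is accumulated one block at a time, so the full similarity matrix is never materialized.

```python
# Hypothetical sketch of tile-based contrastive-loss accumulation.
# Simplified: tiles over keys only, single process, exact result.
import numpy as np

def lse_tiled(q, k, tile=128):
    """Per-row log(sum_j exp(q_i . k_j)), computed one (N, tile) block at a time."""
    n = q.shape[0]
    m = np.full(n, -np.inf)  # running row-wise max
    s = np.zeros(n)          # running sum of exp(sim - m)
    for j0 in range(0, n, tile):
        sim = q @ k[j0:j0 + tile].T          # one tile of the similarity matrix
        new_m = np.maximum(m, sim.max(axis=1))
        s = s * np.exp(m - new_m) + np.exp(sim - new_m[:, None]).sum(axis=1)
        m = new_m
    return m + np.log(s)

q = np.random.randn(512, 64)
k = np.random.randn(512, 64)
assert np.allclose(lse_tiled(q, k), np.log(np.exp(q @ k.T).sum(axis=1)))
```

Peak temporary memory drops from $O(N^2)$ to $O(N \times tile)$ while the result stays exact, which is where the reported memory reduction comes from.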
arXiv Detail & Related papers (2024-10-22T17:59:30Z)
- High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures [26.146937503081876]
This paper proposes three novel data layouts for im2win convolution: NHWC, CHWN, and CHWN8.
We compare the optimized im2win convolution with the direct convolution and PyTorch's im2col-based convolution across the aforementioned layouts on SIMD machines.
Our optimized im2win and direct convolutions achieved up to 95% and 94% of the machine's theoretical peak performance, respectively.
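For intuition on why the layout choice matters on SIMD machines, a tiny hypothetical NumPy example (the paper's blocked CHWN8 layout is omitted): NHWC places each pixel's channel values contiguously in memory, matching SIMD vector loads.

```python
# Toy illustration only: NCHW -> NHWC layout conversion.
import numpy as np

x_nchw = np.arange(2 * 3 * 4 * 4, dtype=np.float32).reshape(2, 3, 4, 4)
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))  # (N, H, W, C)

# In NHWC, the 3 channel values of pixel (n=0, h=1, w=2) are adjacent in memory.
assert np.array_equal(x_nhwc[0, 1, 2], x_nchw[0, :, 1, 2])
```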
arXiv Detail & Related papers (2024-08-01T04:37:03Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- ConvBench: A Comprehensive Benchmark for 2D Convolution Primitive Evaluation [0.34952465649465553]
This paper proposes ConvBench, a primitive-level benchmark for the evaluation and comparison of convolution algorithms.
It assesses 9243 convolution operations derived from 1097 real-world deep learning models.
The experiments showed speedups over Im2col-GEMM in 93.6% of the evaluated convolutions.
arXiv Detail & Related papers (2024-07-15T13:58:24Z)
- Im2win: Memory Efficient Convolution On SIMD Architectures [2.153650601445911]
We propose a new memory-efficient data transformation algorithm, called im2win.
Our results show that our algorithm reduces the memory overhead by 41.6% on average compared with PyTorch's convolution implementation.
arXiv Detail & Related papers (2023-06-25T19:21:10Z)
- Distributed Out-of-Memory NMF on CPU/GPU Architectures [1.0051474951635875]
We propose an efficient out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for HPC systems.
Benchmark results show significant speedups of 32x to 76x for the new GPU implementation over the CPU-based NMFk.
arXiv Detail & Related papers (2022-02-19T03:49:21Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- Content-Aware Convolutional Neural Networks [98.97634685964819]
Convolutional Neural Networks (CNNs) have achieved great success due to the powerful feature learning ability of convolution layers.
We propose a Content-aware Convolution (CAC) that automatically detects the smooth windows and applies a 1x1 convolutional kernel to replace the original large kernel.
arXiv Detail & Related papers (2021-06-30T03:54:35Z)
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
- XSepConv: Extremely Separated Convolution [60.90871656244126]
We propose a novel extremely separated convolutional block (XSepConv).
It fuses spatially separable convolutions into depthwise convolution to reduce both the computational cost and parameter size of large kernels.
XSepConv is designed to be an efficient alternative to vanilla depthwise convolution with large kernel sizes.
arXiv Detail & Related papers (2020-02-27T11:46:17Z)