Kernel-Segregated Transpose Convolution Operation
- URL: http://arxiv.org/abs/2209.03704v1
- Date: Thu, 8 Sep 2022 10:42:49 GMT
- Title: Kernel-Segregated Transpose Convolution Operation
- Authors: Vijay Srinivas Tida, Sai Venkatesh Chilukoti, Xiali Hei, Sonya Hsu
- Abstract summary: Transpose convolution layers are computationally intensive because the input feature map is enlarged by inserting zeros after each element in every row and column.
We propose an algorithmic-level optimization technique for the effective transpose convolution implementation to solve these problems.
- Score: 2.9822184411723645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transpose convolution has shown prominence in many deep learning
applications. However, transpose convolution layers are computationally
intensive due to the increased feature map size due to adding zeros after each
element in each row and column. Thus, convolution operation on the expanded
input feature map leads to poor utilization of hardware resources. The main
reason for unnecessary multiplication operations is zeros at predefined
positions in the input feature map. We propose an algorithmic-level
optimization technique for the effective transpose convolution implementation
to solve these problems. Based on kernel activations, we segregated the
original kernel into four sub-kernels. This scheme could reduce memory
requirements and unnecessary multiplications. Our proposed method achieved
$3.09\times$ ($3.02\times$) faster computation on the Titan X GPU (Intel
Dual-Core CPU) with a flower dataset from the Kaggle website. Furthermore, the proposed
optimization method can be generalized to existing devices without additional
hardware requirements. A simple deep learning model containing one transpose
convolution layer was used to evaluate the optimization method. It showed
$2.2\times$ faster training than the conventional implementation on the MNIST
dataset with an Intel Dual-Core CPU.
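The four-sub-kernel idea can be illustrated with a minimal NumPy sketch for stride 2 (our own single-channel illustration of the technique, not the authors' implementation; function names are ours). Each output "phase" (even/odd row, even/odd column) depends only on the kernel taps at matching offsets, so the kernel can be split into four sub-kernels that are convolved directly with the small input, never touching inserted zeros:

```python
import numpy as np

def conv2d_full(x, w):
    """Full 2D convolution: out[p, q] = sum_{i,j} x[i, j] * w[p - i, q - j]."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.zeros((H + kh - 1, W + kw - 1))
    for i in range(H):
        for j in range(W):
            out[i:i + kh, j:j + kw] += x[i, j] * w
    return out

def transpose_conv_naive(x, w, s=2):
    """Conventional implementation: insert s-1 zeros between input elements,
    then convolve. Most multiplications hit the inserted zeros and are wasted."""
    H, W = x.shape
    up = np.zeros(((H - 1) * s + 1, (W - 1) * s + 1))
    up[::s, ::s] = x
    return conv2d_full(up, w)

def transpose_conv_segregated(x, w, s=2):
    """Kernel-segregated implementation (sketch): split w into s*s = 4
    sub-kernels of interleaved taps, convolve the *small* input with each,
    and interleave the four small outputs into the final feature map."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.zeros(((H - 1) * s + kh, (W - 1) * s + kw))
    for r in range(s):
        for c in range(s):
            sub = w[r::s, c::s]      # sub-kernel for output phase (r, c)
            if sub.size == 0:
                continue
            y = conv2d_full(x, sub)  # convolution with no zero operands
            out[r:r + s * y.shape[0]:s, c:c + s * y.shape[1]:s] = y
    return out
```

For a 3x3 kernel at stride 2 the sub-kernels have sizes 2x2, 2x1, 1x2, and 1x1, so the segregated version performs 9 multiply-accumulates per input element versus the zero-padded convolution's many wasted ones, while producing an identical output.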
Related papers
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - Reduce Computational Complexity for Convolutional Layers by Skipping Zeros [9.833821501774596]
We propose an efficient algorithm for convolutional neural networks.
The C-K-S algorithm is accompanied by efficient GPU implementations.
Experiments show that C-K-S offers good performance in terms of speed and convergence.
arXiv Detail & Related papers (2023-06-28T06:21:22Z) - Im2win: Memory Efficient Convolution On SIMD Architectures [2.153650601445911]
We propose a new memory-efficient data transformation algorithm, called im2win.
Our results show that our algorithm reduces the memory overhead by an average of 41.6% compared to PyTorch's convolution implementation.
arXiv Detail & Related papers (2023-06-25T19:21:10Z) - Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z) - Content-Aware Convolutional Neural Networks [98.97634685964819]
Convolutional Neural Networks (CNNs) have achieved great success due to the powerful feature learning ability of convolution layers.
We propose a Content-aware Convolution (CAC) that automatically detects the smooth windows and applies a 1x1 convolutional kernel to replace the original large kernel.
arXiv Detail & Related papers (2021-06-30T03:54:35Z) - VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average $3712\times$ speedup with $1301.25\times$ energy reduction on CPU, and $35.4\times$ speedup with $17.66\times$ energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z) - Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
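The dilated convolution itself is a simple operation; a minimal NumPy sketch (our own illustration of the operation, not the paper's AVX-512 implementation) shows how tap j of the kernel is applied at offset j * dilation, growing the receptive field without adding parameters:

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """'Valid' 1D cross-correlation with a dilated kernel.
    Output length: len(x) - (len(w) - 1) * dilation."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    k = len(w)
    n_out = len(x) - (k - 1) * dilation
    out = np.zeros(n_out)
    for j in range(k):
        # tap j reads the input at stride-1 positions shifted by j * dilation
        out += w[j] * x[j * dilation : j * dilation + n_out]
    return out
```

With dilation=1 this reduces to an ordinary valid correlation (matching `np.correlate(x, w, 'valid')`); larger dilations skip over intermediate samples, which is what makes the layer attractive for long genomic sequences.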
arXiv Detail & Related papers (2021-04-16T09:54:30Z) - GPU-Accelerated Optimizer-Aware Evaluation of Submodular Exemplar
Clustering [5.897728689802829]
The optimization of submodular functions constitutes a viable way to perform clustering.
Strong approximation guarantees and feasible optimization w.r.t. streaming data make this clustering approach favorable.
Exemplar-based clustering is one of the possible submodular functions, but suffers from high computational complexity.
Half-precision GPU computation led to large speedups of up to 452x compared to single-precision, single-thread CPU computations.
arXiv Detail & Related papers (2021-01-21T18:23:44Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z) - Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms [1.3249453757295084]
We study training algorithms for deep learning on heterogeneous CPU+GPU architectures.
Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging.
We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization on several real datasets.
arXiv Detail & Related papers (2020-04-19T05:21:20Z) - Efficient Tensor Kernel methods for sparse regression [39.95662930240854]
We introduce suitable tensor kernels to promote sparsity in the solution of the underlying regression problem.
However, storing tensors requires a considerable amount of memory, ultimately limiting the method's applicability.
First, we directly reduce the memory requirement by introducing a new and more efficient layout for storing the data.
Second, we use a Nyström-type subsampling approach, which allows for a training phase with a smaller number of data points, reducing the computational cost.
arXiv Detail & Related papers (2020-03-23T18:26:56Z)
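The Nyström-type subsampling mentioned above can be sketched generically (our own illustration with a standard RBF kernel, not the paper's tensor-kernel code): m landmark points stand in for all n training points, yielding m-dimensional features whose inner products approximate the full n x n kernel matrix at O(n m) cost instead of O(n^2):

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix: k(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_features(X, landmarks, gamma=1.0):
    """Nyström feature map Phi such that Phi @ Phi.T ~= K(X, X).
    Only the n x m and m x m kernel blocks are ever formed."""
    Kmm = rbf(landmarks, landmarks, gamma)   # small m x m landmark kernel
    Knm = rbf(X, landmarks, gamma)           # n x m cross-kernel
    vals, vecs = np.linalg.eigh(Kmm)         # eigendecompose the small block
    vals = np.maximum(vals, 1e-12)           # clamp for numerical stability
    return Knm @ vecs / np.sqrt(vals)        # Knm @ Kmm^{-1/2}, n x m features
```

Any linear model trained on these features approximates the corresponding kernel method; when the landmarks are the full training set, the approximation is exact.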
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.