Efficient and Generic 1D Dilated Convolution Layer for Deep Learning
- URL: http://arxiv.org/abs/2104.08002v1
- Date: Fri, 16 Apr 2021 09:54:30 GMT
- Title: Efficient and Generic 1D Dilated Convolution Layer for Deep Learning
- Authors: Narendra Chaudhary, Sanchit Misra, Dhiraj Kalamkar, Alexander
Heinecke, Evangelos Georganas, Barukh Ziv, Menachem Adelman, Bharat Kaul
- Abstract summary: We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
- Score: 52.899995651639436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Convolutional neural networks (CNNs) have found many applications in tasks
involving two-dimensional (2D) data, such as image classification and image
processing. Therefore, 2D convolution layers have been heavily optimized on
CPUs and GPUs. However, in many applications, for example genomics and speech
recognition, the data can be one-dimensional (1D). Such applications can
benefit from optimized 1D convolution layers. In this work, we introduce our
efficient implementation of a generic 1D convolution layer covering a wide
range of parameters. It is optimized for x86 CPU architectures, in particular,
for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We use the LIBXSMM library's batch-reduce General Matrix Multiplication
(BRGEMM) kernel for FP32 and BFloat16 precision. We demonstrate that our
implementation can achieve up to 80% efficiency on Intel Xeon Cascade Lake and
Cooper Lake CPUs. Additionally, we show the generalization capability of our
BRGEMM based approach by achieving high efficiency across a range of
parameters. We consistently achieve higher efficiency than the 1D convolution
layer with Intel oneDNN library backend for varying input tensor widths, filter
widths, number of channels, filters, and dilation parameters. Finally, we
demonstrate the performance of our optimized 1D convolution layer by utilizing
it in the end-to-end neural network training with real genomics datasets and
achieve up to 6.86x speedup over the oneDNN library-based implementation on
Cascade Lake CPUs. We also demonstrate the scaling with 16 sockets of
Cascade/Cooper Lake CPUs and achieve significant speedup over eight V100 GPUs
using a similar power envelope. In the end-to-end training, we get a speedup of
1.41x on Cascade Lake with FP32, 1.57x on Cooper Lake with FP32, and 2.27x on
Cooper Lake with BFloat16 over eight V100 GPUs with FP32.
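The abstract describes building the 1D convolution on a batch-reduce GEMM (BRGEMM) kernel. The sketch below is not the authors' code and does not call LIBXSMM; it is a minimal NumPy illustration, under assumed tensor layouts, of why such a formulation is possible: a dilated 1D convolution (cross-correlation, as in deep learning frameworks) can be written as a sum of small matrix multiplications, one per filter tap, all accumulated into the same output block. The function names, the (C, W) input layout, the (K, C, S) filter layout, and the absence of padding and stride are assumptions made for illustration.

```python
import numpy as np


def dilated_conv1d_brgemm(x, w, dilation=1):
    """Dilated 1D convolution as a sum of per-tap GEMMs (illustrative only).

    x: input of shape (C, W)      -- C channels, width W (assumed layout)
    w: filters of shape (K, C, S) -- K filters, filter width S (assumed layout)
    Returns an output of shape (K, Q) with Q = W - (S - 1) * dilation.
    """
    C, W = x.shape
    K, C_w, S = w.shape
    assert C == C_w, "channel counts of input and filters must match"
    Q = W - (S - 1) * dilation
    assert Q > 0, "input is too narrow for this filter width and dilation"

    out = np.zeros((K, Q), dtype=x.dtype)
    for s in range(S):
        # One GEMM per filter tap: (K, C) @ (C, Q).
        # The input slice is shifted by s * dilation; every product is
        # reduced into the same output block, which is the batch-reduce step.
        start = s * dilation
        out += w[:, :, s] @ x[:, start:start + Q]
    return out


def dilated_conv1d_reference(x, w, dilation=1):
    """Naive loop version of the same operation, used only as a check."""
    C, W = x.shape
    K, _, S = w.shape
    Q = W - (S - 1) * dilation
    out = np.zeros((K, Q), dtype=x.dtype)
    for k in range(K):
        for q in range(Q):
            for c in range(C):
                for s in range(S):
                    out[k, q] += w[k, c, s] * x[c, q + s * dilation]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, 20))
    w = rng.standard_normal((4, 3, 5))
    y = dilated_conv1d_brgemm(x, w, dilation=2)
    print(y.shape, np.allclose(y, dilated_conv1d_reference(x, w, dilation=2)))
```

In this form the dilation only changes the offset of each input slice, so the same small GEMM shape is reused for every tap. A real kernel would additionally block over the width and channel dimensions and use vectorized FP32 or BFloat16 instructions, none of which is shown here.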
Related papers
- Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs [3.7101665559244874]
This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs) for the Intel Data Center GPU Max 1550.
We show with a simple model that this results in a significant increase in arithmetic intensity, leading to improved performance, especially for inference.
arXiv Detail & Related papers (2024-03-26T11:38:39Z)
- TorchSparse++: Efficient Training and Inference Framework for Sparse
Convolution on GPUs [20.4238781638402]
Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR, autonomous driving, and graph understanding in recommendation systems.
Existing GPU libraries offer two dataflow types for sparse convolution.
We introduce TorchSparse++, a new GPU library that achieves the best of both worlds.
arXiv Detail & Related papers (2023-10-25T21:02:38Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- Im2win: An Efficient Convolution Paradigm on GPU [1.9162301033784574]
This paper proposes a convolution paradigm on GPU called im2win, which not only reduces memory footprint but also offers continuous memory accesses.
We compare our implementation with direct convolution, PyTorch's GEMM-based convolution, and six DNN-based convolution implementations, using twelve state-of-the-art benchmarks.
arXiv Detail & Related papers (2023-06-25T19:09:56Z)
- EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for
Mobile Vision Applications [68.35683849098105]
We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-06-21T17:59:56Z)
- Distributed Out-of-Memory NMF on CPU/GPU Architectures [1.0051474951635875]
We propose an efficient out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for HPC systems.
Benchmark results show a significant speedup of 32x to 76x with the new GPU implementation over the CPU-based NMFk.
arXiv Detail & Related papers (2022-02-19T03:49:21Z)
- Content-Aware Convolutional Neural Networks [98.97634685964819]
Convolutional Neural Networks (CNNs) have achieved great success due to the powerful feature learning ability of convolution layers.
We propose a Content-aware Convolution (CAC) that automatically detects the smooth windows and applies a 1x1 convolutional kernel to replace the original large kernel.
arXiv Detail & Related papers (2021-06-30T03:54:35Z)
- DistGNN: Scalable Distributed Training for Large-Scale Graph Neural
Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)
- At-Scale Sparse Deep Neural Network Inference with Efficient GPU
Implementation [24.824295164938604]
This paper presents GPU performance optimization and scaling results for inference models of the Sparse Deep Neural Network Challenge 2020.
Sparse deep neural networks (SpDNN) have shown promise for reining in the memory footprint of large neural networks.
This work presents optimized sparse matrix multiplication kernels fused with the ReLU function.
arXiv Detail & Related papers (2020-07-28T12:09:43Z)
- FBNetV2: Differentiable Neural Architecture Search for Spatial and
Channel Dimensions [70.59851564292828]
Differentiable Neural Architecture Search (DNAS) has demonstrated great success in designing state-of-the-art, efficient neural networks.
We propose a memory and computationally efficient DNAS variant: DMaskingNAS.
This algorithm expands the search space by up to $10^{14}\times$ over conventional DNAS.
arXiv Detail & Related papers (2020-04-12T08:52:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.