Im2win: Memory Efficient Convolution On SIMD Architectures
- URL: http://arxiv.org/abs/2306.14320v1
- Date: Sun, 25 Jun 2023 19:21:10 GMT
- Title: Im2win: Memory Efficient Convolution On SIMD Architectures
- Authors: Shuai Lu and Jun Chu and Xu T. Liu
- Abstract summary: We propose a new memory-efficient data transformation algorithm, called im2win.
Our results show that our algorithm reduces the memory overhead by 41.6% on average compared to PyTorch's im2col-based convolution implementation.
- Score: 2.153650601445911
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Convolution is the most expensive operation among neural network operations,
thus its performance is critical to the overall performance of neural networks.
Commonly used convolution approaches, including general matrix multiplication
(GEMM)-based convolution and direct convolution, rely on im2col for data
transformation or do not use data transformation at all, respectively. However,
the im2col data transformation can lead to at least 2$\times$ memory footprint
compared to not using data transformation at all, thus limiting the size of
neural network models running on memory-limited systems. Meanwhile, not using
data transformation usually performs poorly due to nonconsecutive memory access
although it consumes less memory. To solve those problems, we propose a new
memory-efficient data transformation algorithm, called im2win. This algorithm
refactors a row of square or rectangular dot-product windows of the input
image and flattens the unique elements within these windows into a row of the
output tensor, which enables consecutive memory access and data reuse, and thus
greatly reduces the memory overhead. Furthermore, we propose a high-performance
im2win-based convolution algorithm with various optimizations, including
vectorization, loop reordering, etc. Our experimental results show that our
algorithm reduces the memory overhead by 41.6% on average compared to PyTorch's
im2col-based convolution implementation, and achieves average speedups of
3.6$\times$ and 5.3$\times$ over the im2col-based convolution and over
convolution without data transformation, respectively.
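To make the transformation concrete, below is a minimal single-channel, stride-1 NumPy sketch of the row-of-windows idea and a convolution over the transformed tensor. It illustrates the data-reuse and consecutive-access argument above; the function names are ours, and it is not the paper's exact memory layout or its vectorized, loop-reordered kernel.

```python
import numpy as np

def conv2d_direct(x, w):
    """Reference direct convolution (single channel, stride 1, no padding)."""
    Hi, Wi = x.shape
    Kh, Kw = w.shape
    Ho, Wo = Hi - Kh + 1, Wi - Kw + 1
    y = np.zeros((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            y[i, j] = np.sum(x[i:i + Kh, j:j + Kw] * w)
    return y

def im2win_like_transform(x, Kh):
    """For every row of sliding windows, copy the Kh input rows that feed it.

    Each input element is duplicated at most Kh times (once per overlapping
    window row) instead of up to Kh*Kw times as in im2col, and all windows in
    the same output row read from one contiguous slab.
    """
    Hi, Wi = x.shape
    Ho = Hi - Kh + 1
    win = np.empty((Ho, Kh * Wi))
    for i in range(Ho):
        win[i] = x[i:i + Kh, :].reshape(-1)   # Kh consecutive input rows, flattened
    return win

def conv2d_im2win_like(x, w):
    Hi, Wi = x.shape
    Kh, Kw = w.shape
    Ho, Wo = Hi - Kh + 1, Wi - Kw + 1
    win = im2win_like_transform(x, Kh)        # shape (Ho, Kh * Wi)
    y = np.empty((Ho, Wo))
    for i in range(Ho):
        slab = win[i].reshape(Kh, Wi)         # one contiguous row of windows
        for j in range(Wo):
            y[i, j] = np.sum(slab[:, j:j + Kw] * w)   # windows reuse the slab
    return y

x = np.random.rand(8, 8)
w = np.random.rand(3, 3)
assert np.allclose(conv2d_direct(x, w), conv2d_im2win_like(x, w))

# Transformed-tensor size: im2col stores Ho*Wo*Kh*Kw elements, while this
# row-of-windows slab stores Ho*Kh*Wi, which is smaller whenever Wo*Kw > Wi.
```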
Related papers
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrarily small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
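As a rough illustration of the tiling idea only (the InfoNCE-style loss form, tile size, and helper names below are our assumptions, and the paper's multi-level distributed scheme is not shown), here is a sketch that accumulates the loss tile by tile with a streaming log-sum-exp, so the full batch-by-batch similarity matrix is never materialized:

```python
import numpy as np

def contrastive_loss_full(q, k, tau=0.07):
    """Reference: materializes the full B x B similarity matrix."""
    logits = q @ k.T / tau
    return np.mean(np.log(np.exp(logits).sum(axis=1)) - np.diag(logits))

def contrastive_loss_tiled(q, k, tau=0.07, tile=32):
    """Same loss computed from tile x tile blocks with a streaming log-sum-exp,
    so peak memory is O(tile^2) instead of O(B^2)."""
    B = q.shape[0]
    row_max = np.full(B, -np.inf)   # running max per row
    row_sum = np.zeros(B)           # running sum of exp(logit - row_max) per row
    pos = np.empty(B)               # positive (diagonal) logits
    for i0 in range(0, B, tile):
        for j0 in range(0, B, tile):
            block = q[i0:i0 + tile] @ k[j0:j0 + tile].T / tau
            for r in range(block.shape[0]):    # positives lie on the global diagonal
                c = (i0 + r) - j0
                if 0 <= c < block.shape[1]:
                    pos[i0 + r] = block[r, c]
            bmax = block.max(axis=1)
            new_max = np.maximum(row_max[i0:i0 + tile], bmax)
            row_sum[i0:i0 + tile] = (row_sum[i0:i0 + tile]
                                     * np.exp(row_max[i0:i0 + tile] - new_max)
                                     + np.exp(block - new_max[:, None]).sum(axis=1))
            row_max[i0:i0 + tile] = new_max
    return np.mean(np.log(row_sum) + row_max - pos)

q = np.random.randn(128, 16); q /= np.linalg.norm(q, axis=1, keepdims=True)
k = np.random.randn(128, 16); k /= np.linalg.norm(k, axis=1, keepdims=True)
assert np.isclose(contrastive_loss_full(q, k), contrastive_loss_tiled(q, k))
```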
arXiv Detail & Related papers (2024-10-22T17:59:30Z) - High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures [26.146937503081876]
This paper proposes three novel data layouts for im2win convolution: NHWC, CHWN, and CHWN8.
We compare the optimized im2win convolution with the direct convolution and PyTorch's im2col-based convolution across the aforementioned layouts on SIMD machines.
Our optimized im2win and direct convolutions achieved up to 95% and 94% of the machine's theoretical peak performance, respectively.
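For readers unfamiliar with the layout names, a small illustrative sketch (our own helper functions, not the paper's code) of how the same logical element (n, c, h, w) maps to a different flat-memory offset under each layout; the blocked CHWN8 variant is omitted:

```python
import numpy as np

def offset_nchw(n, c, h, w, N, C, H, W):
    return ((n * C + c) * H + h) * W + w      # PyTorch's default layout

def offset_nhwc(n, c, h, w, N, C, H, W):
    return ((n * H + h) * W + w) * C + c      # channels vary fastest

def offset_chwn(n, c, h, w, N, C, H, W):
    return ((c * H + h) * W + w) * N + n      # batch index varies fastest

N, C, H, W = 2, 3, 4, 5
x = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)   # logical NCHW
nhwc = np.ascontiguousarray(x.transpose(0, 2, 3, 1))
chwn = np.ascontiguousarray(x.transpose(1, 2, 3, 0))
assert x.ravel()[offset_nchw(1, 2, 1, 3, N, C, H, W)] == x[1, 2, 1, 3]
assert nhwc.ravel()[offset_nhwc(1, 2, 1, 3, N, C, H, W)] == x[1, 2, 1, 3]
assert chwn.ravel()[offset_chwn(1, 2, 1, 3, N, C, H, W)] == x[1, 2, 1, 3]
```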
arXiv Detail & Related papers (2024-08-01T04:37:03Z) - Decreasing the Computing Time of Bayesian Optimization using Generalizable Memory Pruning [56.334116591082896]
Running Bayesian optimization on high-dimensional or massive data sets becomes intractable due to the time complexity of the surrogate model.
We present a wrapper combining memory pruning and bounded optimization that can be used with any surrogate model and acquisition function.
All model implementations are run on the MIT Supercloud state-of-the-art computing hardware.
arXiv Detail & Related papers (2023-09-08T14:05:56Z) - Eva: A General Vectorized Approximation Framework for Second-order Optimization [16.647611352181574]
We present a memory- and time-efficient second-order algorithm named Eva with two novel techniques.
We derive an efficient update formula that avoids explicitly computing matrix inverses by using the Sherman-Morrison formula.
Experiments show that Eva reduces the end-to-end training time by up to 2.05x and 2.42x compared to first-order SGD and existing second-order algorithms, respectively.
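For reference, a minimal NumPy check of the generic Sherman-Morrison rank-one identity that such inverse-free updates rely on; this is only the textbook identity, not Eva's specific Kronecker-factored update rule:

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """(A + u v^T)^{-1} from A^{-1} in O(n^2), without a fresh O(n^3) inversion."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)    # well-conditioned test matrix
u, v = rng.standard_normal(n), rng.standard_normal(n)

updated = sherman_morrison(np.linalg.inv(A), u, v)
assert np.allclose(updated, np.linalg.inv(A + np.outer(u, v)))
```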
arXiv Detail & Related papers (2023-08-04T03:51:38Z) - Im2win: An Efficient Convolution Paradigm on GPU [1.9162301033784574]
This paper proposes a convolution paradigm on GPU called im2win, which not only reduces the memory footprint but also offers continuous memory accesses.
We compare our implementation with the direct convolution, PyTorch's GEMM-based convolution, and six cuDNN-based convolution implementations, on twelve state-of-the-art benchmarks.
arXiv Detail & Related papers (2023-06-25T19:09:56Z) - RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z) - Kernel-Segregated Transpose Convolution Operation [2.9822184411723645]
Transpose convolution layers are computationally intensive because the feature map is enlarged by inserting zeros after each element in every row and column.
We propose an algorithmic-level optimization technique for the effective transpose convolution implementation to solve these problems.
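A small sketch of that zero-insertion step, which is what makes the naive approach expensive (the function below is our illustration; the kernel-segregated optimization itself is not shown):

```python
import numpy as np

def zero_insert(x, stride):
    """Insert (stride - 1) zeros between neighbouring elements along both
    spatial axes -- the enlarged intermediate a naive transpose convolution
    builds before running an ordinary convolution over it."""
    H, W = x.shape
    out = np.zeros(((H - 1) * stride + 1, (W - 1) * stride + 1), dtype=x.dtype)
    out[::stride, ::stride] = x
    return out

x = np.arange(1, 10, dtype=np.float32).reshape(3, 3)
up = zero_insert(x, stride=2)
print(up.shape)   # (5, 5): 16 of the 25 entries are zeros, yet a naive
                  # implementation still multiplies the kernel against them
```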
arXiv Detail & Related papers (2022-09-08T10:42:49Z) - StreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of KRR require that all the data is stored in the main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z) - Content-Aware Convolutional Neural Networks [98.97634685964819]
Convolutional Neural Networks (CNNs) have achieved great success due to the powerful feature learning ability of convolution layers.
We propose a Content-aware Convolution (CAC) that automatically detects the smooth windows and applies a 1x1 convolutional kernel to replace the original large kernel.
arXiv Detail & Related papers (2021-06-30T03:54:35Z) - Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
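As a plain scalar reference for what the layer computes (not the AVX-512-optimized implementation from the paper), a 1D dilated convolution with stride 1 and no padding:

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """y[i] = sum_k w[k] * x[i + k * dilation] -- each tap reaches `dilation`
    elements apart, widening the receptive field without extra weights."""
    K = len(w)
    span = (K - 1) * dilation + 1              # receptive field of one output
    y = np.empty(len(x) - span + 1)
    for i in range(len(y)):
        y[i] = sum(w[k] * x[i + k * dilation] for k in range(K))
    return y

x = np.arange(10, dtype=np.float64)
w = np.array([1.0, 0.0, -1.0])
print(dilated_conv1d(x, w, dilation=2))        # [-4. -4. -4. -4. -4. -4.]
```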
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.