Advancing Direct Convolution using Convolution Slicing Optimization and
ISA Extensions
- URL: http://arxiv.org/abs/2303.04739v1
- Date: Wed, 8 Mar 2023 17:23:39 GMT
- Title: Advancing Direct Convolution using Convolution Slicing Optimization and
ISA Extensions
- Authors: Victor Ferrari, Rafael Sousa, Marcio Pereira, João P. L. de
Carvalho, José Nelson Amaral, José Moreira, Guido Araujo
- Abstract summary: Convolution is one of the most computationally intensive operations that must be performed for machine-learning model inference.
This paper proposes SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers.
- Score: 1.2006896500048552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolution is one of the most computationally intensive operations that must
be performed for machine-learning model inference. A traditional approach to
compute convolutions is known as the Im2Col + BLAS method. This paper proposes
SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation
toolchain that can be integrated into machine-learning compilers. This
algorithm introduces: (a) Convolution Slicing Analysis (CSA) - a
convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse
over the cache hierarchy; (b) Convolution Slicing Optimization (CSO) - a
code-generation pass that uses CSA to generate a tiled direct-convolution
macro-kernel; and (c) Vector-Based Packing (VBP) - an architecture-specific
optimized input-tensor packing solution based on vector-register shift
instructions for convolutions with unitary stride. Experiments conducted on 393
convolutions from full ONNX-MLIR machine-learning models indicate that the
elimination of the Im2Col transformation and the use of fast packing routines
result in a total packing time reduction, on full model inference, of 2.0x -
3.9x on Intel x86 and 3.6x - 7.2x on IBM POWER10. The speed-up over an Im2Col +
BLAS method based on current BLAS implementations for end-to-end
machine-learning model inference is in the range of 9% - 25% for Intel x86 and
10% - 42% for IBM POWER10 architectures. The total convolution speedup for
model inference is 12% - 27% on Intel x86 and 26% - 46% on IBM POWER10. SConv
also outperforms BLAS GEMM, when computing pointwise convolutions, in more than
83% of the 219 tested instances.
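For context, the Im2Col + BLAS baseline that SConv replaces can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's code: the function names and the naive direct loop are assumptions, and SConv's actual macro-kernel is generated by CSO using CSA-derived tile sizes rather than written by hand.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unfold a (C, H, W) input into a (C*kh*kw, out_h*out_w) patch matrix.

    This explicit copy inflates the input by roughly a factor of kh*kw --
    the packing overhead that a direct-convolution scheme avoids.
    """
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i * stride:i * stride + kh, j * stride:j * stride + kw]
            cols[:, col] = patch.reshape(-1)
            col += 1
    return cols

def conv2d_im2col(x, weights, stride=1):
    """Baseline Im2Col + GEMM: one matrix multiply does all the work."""
    f, c, kh, kw = weights.shape
    cols = im2col(x, kh, kw, stride)        # explicit packing copy
    out = weights.reshape(f, -1) @ cols     # single GEMM (BLAS) call
    out_h = (x.shape[1] - kh) // stride + 1
    out_w = (x.shape[2] - kw) // stride + 1
    return out.reshape(f, out_h, out_w)

def conv2d_direct(x, weights, stride=1):
    """Naive direct convolution: no Im2Col buffer, reads the input in place."""
    f, c, kh, kw = weights.shape
    out_h = (x.shape[1] - kh) // stride + 1
    out_w = (x.shape[2] - kw) // stride + 1
    out = np.zeros((f, out_h, out_w), dtype=x.dtype)
    for n in range(f):
        for i in range(out_h):
            for j in range(out_w):
                window = x[:, i * stride:i * stride + kh, j * stride:j * stride + kw]
                out[n, i, j] = np.sum(weights[n] * window)
    return out
```

Both routes compute the same result; the difference the paper measures is the cost of materializing the `im2col` buffer versus packing tiles on the fly with cache-aware blocking and vector-register shifts.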
Related papers
- ConvBench: A Comprehensive Benchmark for 2D Convolution Primitive Evaluation [0.34952465649465553]
This paper proposes ConvBench, a primitive-level benchmark for the evaluation and comparison of convolution algorithms.
It assesses 9243 convolution operations derived from 1097 real-world deep learning models.
The experiments showed results faster than Im2col-GEMM in 93.6% of the convolutions.
arXiv Detail & Related papers (2024-07-15T13:58:24Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
AQLM is the first scheme that is optimal in terms of accuracy-vs-model-size when compressing to less than 3 bits per parameter.
We provide fast GPU and CPU implementations of AQLM for token generation.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- Im2win: Memory Efficient Convolution On SIMD Architectures [2.153650601445911]
We propose a new memory-efficient data transformation algorithm, called im2win.
Our results show that our algorithm reduces memory overhead by an average of 41.6% compared to PyTorch's convolution implementation.
arXiv Detail & Related papers (2023-06-25T19:21:10Z)
- Im2win: An Efficient Convolution Paradigm on GPU [1.9162301033784574]
This paper proposes a convolution paradigm called im2win, which not only reduces the memory footprint but also offers continuous memory accesses.
We compare our implementation with direct convolution, PyTorch's GEMM-based convolution, and six cuDNN-based convolution implementations, on twelve state-of-the-art benchmarks.
arXiv Detail & Related papers (2023-06-25T19:09:56Z)
- HDCC: A Hyperdimensional Computing compiler for classification on
embedded systems and high-performance computing [58.720142291102135]
This work introduces the HDCC compiler, the first open-source compiler that translates high-level descriptions of HDC classification methods into optimized C code.
HDCC is designed like a modern compiler, featuring an intuitive and descriptive input language, an intermediate representation (IR), and a retargetable backend.
To substantiate these claims, we conducted experiments with HDCC on several of the most popular datasets in the HDC literature.
arXiv Detail & Related papers (2023-04-24T19:16:03Z)
- HEAT: A Highly Efficient and Affordable Training System for
Collaborative Filtering Based Recommendation on CPUs [11.007606356081435]
Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation.
No prior work has optimized SimpleX on multi-core CPUs, leading to limited performance.
We propose an efficient CF training system (called HEAT) that fully enables the multi-level caching and multi-threading capabilities of modern CPUs.
arXiv Detail & Related papers (2023-04-14T18:07:26Z)
- Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- Inception Convolution with Efficient Dilation Search [121.41030859447487]
Dilated convolution is a critical variant of the standard convolutional neural network, used to control effective receptive fields and handle large scale variance of objects.
We propose a new variant of dilated convolution, namely inception (dilated) convolution, where the convolutions have independent dilation among different axes, channels, and layers.
To fit the complex inception convolution to the data, we develop a simple yet effective dilation search algorithm (EDO) based on statistical optimization.
arXiv Detail & Related papers (2020-12-25T14:58:35Z)
- Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [14.958793135751149]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM).
Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead.
We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
arXiv Detail & Related papers (2020-09-04T20:17:42Z)
- Dynamic Region-Aware Convolution [85.20099799084026]
We propose a new convolution called Dynamic Region-Aware Convolution (DRConv), which can automatically assign multiple filters to corresponding spatial regions.
On ImageNet classification, DRConv-based ShuffleNetV2-0.5x achieves state-of-the-art performance of 67.1% at 46M multiply-adds level with 6.3% relative improvement.
arXiv Detail & Related papers (2020-03-27T05:49:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.