Advancing Direct Convolution using Convolution Slicing Optimization and
ISA Extensions
- URL: http://arxiv.org/abs/2303.04739v1
- Date: Wed, 8 Mar 2023 17:23:39 GMT
- Title: Advancing Direct Convolution using Convolution Slicing Optimization and
ISA Extensions
- Authors: Victor Ferrari, Rafael Sousa, Marcio Pereira, João P. L. de
Carvalho, José Nelson Amaral, José Moreira, Guido Araujo
- Abstract summary: Convolution is one of the most computationally intensive operations that must be performed for machine-learning model inference.
This paper proposes SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers.
- Score: 1.2006896500048552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolution is one of the most computationally intensive operations that must
be performed for machine-learning model inference. A traditional approach to
compute convolutions is known as the Im2Col + BLAS method. This paper proposes
SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation
toolchain that can be integrated into machine-learning compilers. This
algorithm introduces: (a) Convolution Slicing Analysis (CSA) - a
convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse
over the cache hierarchy; (b) Convolution Slicing Optimization (CSO) - a
code-generation pass that uses CSA to generate a tiled direct-convolution
macro-kernel; and (c) Vector-Based Packing (VBP) - an architecture-specific
optimized input-tensor packing solution based on vector-register shift
instructions for convolutions with unitary stride. Experiments conducted on 393
convolutions from full ONNX-MLIR machine-learning models indicate that the
elimination of the Im2Col transformation and the use of fast packing routines
result in a total packing time reduction, on full model inference, of 2.0x -
3.9x on Intel x86 and 3.6x - 7.2x on IBM POWER10. The speed-up over an Im2Col +
BLAS method based on current BLAS implementations for end-to-end
machine-learning model inference is in the range of 9% - 25% for Intel x86 and
10% - 42% for IBM POWER10 architectures. The total convolution speedup for
model inference is 12% - 27% on Intel x86 and 26% - 46% on IBM POWER10. SConv
also outperforms BLAS GEMM, when computing pointwise convolutions, in more than
83% of the 219 tested instances.
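A note on what the passes amount to in code: the actual macro-kernel is
emitted by the MLIR/LLVM toolchain, so the following is only a minimal C++
sketch of a cache-tiled direct convolution in the spirit of CSO, for a single
image in NCHW layout with unit stride and no padding. The tile sizes TILE_OC
and TILE_OW are hypothetical placeholders for the values that CSA would derive
from the cache hierarchy.

    #include <algorithm>
    #include <cstddef>

    // Minimal sketch of a cache-tiled direct convolution (NCHW layout,
    // unit stride, no padding). In SConv the tile sizes would be chosen
    // by the Convolution Slicing Analysis (CSA) pass to match the cache
    // hierarchy; the values below are hypothetical placeholders.
    constexpr int TILE_OC = 16;  // output-channel tile (placeholder)
    constexpr int TILE_OW = 64;  // output-width tile (placeholder)

    void direct_conv_tiled(const float* in, const float* w, float* out,
                           int C, int H, int W,    // input chans/height/width
                           int K, int R, int S) {  // output chans, kernel h/w
      const int OH = H - R + 1, OW = W - S + 1;
      std::fill(out, out + (std::size_t)K * OH * OW, 0.0f);
      for (int k0 = 0; k0 < K; k0 += TILE_OC)        // slice output channels
        for (int ow0 = 0; ow0 < OW; ow0 += TILE_OW)  // slice output width
          for (int k = k0; k < std::min(k0 + TILE_OC, K); ++k)
            for (int c = 0; c < C; ++c)
              for (int oh = 0; oh < OH; ++oh)
                for (int ow = ow0; ow < std::min(ow0 + TILE_OW, OW); ++ow) {
                  float acc = 0.0f;
                  for (int r = 0; r < R; ++r)
                    for (int s = 0; s < S; ++s)
                      acc += in[((std::size_t)c * H + oh + r) * W + ow + s]
                           * w[(((std::size_t)k * C + c) * R + r) * S + s];
                  out[((std::size_t)k * OH + oh) * OW + ow] += acc;
                }
    }

The contrast with Im2Col + BLAS is that no patch matrix is ever materialized:
the loop nest reads the input tensor in place, which is where the reported
packing-time reduction comes from. VBP's shift-based register packing is an
architecture-specific refinement of the same idea and is not shown here.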
Related papers
- High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures [26.146937503081876]
This paper proposes three novel data layouts for im2win convolution: NHWC, CHWN, and CHWN8.
We compare the optimized im2win convolution with the direct convolution and PyTorch's im2col-based convolution across the aforementioned layouts on SIMD machines.
Our optimized im2win and direct convolutions achieved up to 95% and 94% of the machine's theoretical peak performance, respectively.
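The layout acronyms list the tensor dimensions from outermost to innermost; a
minimal sketch of the index arithmetic they imply is below. CHWN8 presumably
blocks the batch dimension by 8 for SIMD, but since the summary does not spell
out the blocked order, only NHWC and CHWN are shown.

    #include <cstddef>

    // Index arithmetic implied by the layout names (letters read from
    // outermost to innermost) for a tensor logically indexed by
    // (n, c, h, w) with extents N, C, H, W.
    inline std::size_t idx_nhwc(int n, int c, int h, int w,
                                int C, int H, int W) {
      return (((std::size_t)n * H + h) * W + w) * C + c;  // channels contiguous
    }
    inline std::size_t idx_chwn(int n, int c, int h, int w,
                                int N, int H, int W) {
      return (((std::size_t)c * H + h) * W + w) * N + n;  // batch contiguous
    }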
arXiv Detail & Related papers (2024-08-01T04:37:03Z) - ConvBench: A Comprehensive Benchmark for 2D Convolution Primitive Evaluation [0.34952465649465553]
This paper proposes ConvBench, a primitive-level benchmark for the evaluation and comparison of convolution algorithms.
It assesses 9243 convolution operations derived from 1097 real-world deep learning models.
Experiments with ConvBench showed results faster than Im2col-GEMM in 93.6% of the assessed convolutions.
arXiv Detail & Related papers (2024-07-15T13:58:24Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
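AQLM's full method (learned codebooks optimized jointly) is beyond a blurb,
but the additive-quantization decode step it builds on is simple: a weight
group is reconstructed as the sum of one codeword from each of M codebooks. A
minimal sketch follows; M, D, and the 8-bit code width are illustrative
placeholders, not AQLM's actual settings.

    #include <array>
    #include <vector>

    constexpr int M = 2;  // number of codebooks (placeholder)
    constexpr int D = 8;  // group (sub-vector) dimension (placeholder)

    // Each codebook holds 2^bits entries, each a D-dimensional codeword.
    using Codebook = std::vector<std::array<float, D>>;

    // Decode one weight group: sum one codeword from each codebook.
    void aq_decode(const Codebook books[M], const unsigned char codes[M],
                   float out[D]) {
      for (int d = 0; d < D; ++d) out[d] = 0.0f;
      for (int m = 0; m < M; ++m)
        for (int d = 0; d < D; ++d)
          out[d] += books[m][codes[m]][d];
    }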
arXiv Detail & Related papers (2024-01-11T18:54:44Z) - ACPO: AI-Enabled Compiler Framework [1.752593459729982]
This paper presents ACPO: An AI-Enabled Compiler Framework.
It provides LLVM with simple and comprehensive tools to benefit from employing ML models for different optimization passes.
We show that ACPO can provide a combined speedup of 4.5% on Polybench and 2.4% on Cbench when compared with LLVM's O3.
arXiv Detail & Related papers (2023-12-15T17:49:24Z) - Im2win: Memory Efficient Convolution On SIMD Architectures [2.153650601445911]
We propose a new memory-efficient data transformation algorithm, called im2win.
Our results show that our algorithm reduces memory overhead by 41.6% on average compared to PyTorch's convolution implementation.
arXiv Detail & Related papers (2023-06-25T19:21:10Z) - Im2win: An Efficient Convolution Paradigm on GPU [1.9162301033784574]
This paper proposes a convolution paradigm on GPU called im2win, which not only reduces the memory footprint but also offers continuous memory accesses.
We compare our implementation with the direct convolution, PyTorch's GEMM-based convolution, and six cuDNN-based convolution implementations, with twelve state-of-the-art benchmarks.
arXiv Detail & Related papers (2023-06-25T19:09:56Z) - HDCC: A Hyperdimensional Computing compiler for classification on
embedded systems and high-performance computing [58.720142291102135]
This work introduces HDCC, the first open-source compiler that translates high-level descriptions of HDC classification methods into optimized C code.
HDCC is designed like a modern compiler, featuring an intuitive and descriptive input language, an intermediate representation (IR), and a retargetable backend.
To substantiate these claims, we conducted experiments with HDCC on several of the most popular datasets in the HDC literature.
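HDCC's actual output is optimized C; the sketch below (C++, for consistency
with the other examples on this page) only illustrates the kind of kernel an
HDC classifier reduces to: binary hypervectors combined with XOR binding and
classified by Hamming distance. The dimensionality and encoding are
illustrative assumptions, not HDCC's generated code.

    #include <bitset>
    #include <cstddef>

    constexpr std::size_t DIM = 10000;  // hypervector width (illustrative)
    using HV = std::bitset<DIM>;

    // XOR binding: associates two hypervectors into a new one.
    inline HV bind(const HV& a, const HV& b) { return a ^ b; }

    // Classification: nearest class hypervector by Hamming distance.
    int classify(const HV& query, const HV* class_hvs, int num_classes) {
      int best = 0;
      std::size_t best_dist = DIM + 1;
      for (int c = 0; c < num_classes; ++c) {
        std::size_t dist = (query ^ class_hvs[c]).count();
        if (dist < best_dist) { best_dist = dist; best = c; }
      }
      return best;
    }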
arXiv Detail & Related papers (2023-04-24T19:16:03Z) - Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks, owing to its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of 32-bit optimizer states.
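The paper's actual scheme uses dynamic (non-linear) quantization maps; the
sketch below keeps only the block-wise part, with a plain linear int8 map:
each block of optimizer state is scaled by its own absolute maximum, so a
single outlier only degrades the precision of its own block. The block size
and the linear map are illustrative assumptions.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>

    constexpr int BLOCK = 256;  // block size (placeholder)

    // Quantize one block: int8 codes plus one float scale per block.
    void quantize_block(const float* x, int n, std::int8_t* q, float* scale) {
      float absmax = 0.0f;
      for (int i = 0; i < n; ++i) absmax = std::max(absmax, std::fabs(x[i]));
      *scale = absmax > 0.0f ? absmax / 127.0f : 1.0f;
      for (int i = 0; i < n; ++i)
        q[i] = (std::int8_t)std::lround(x[i] / *scale);
    }

    // Quantize a whole state tensor block by block.
    void quantize_state(const float* x, std::size_t n,
                        std::int8_t* q, float* scales) {
      for (std::size_t i = 0; i < n; i += BLOCK) {
        int len = (int)std::min<std::size_t>(BLOCK, n - i);
        quantize_block(x + i, len, q + i, &scales[i / BLOCK]);
      }
    }

    inline float dequantize(std::int8_t q, float scale) { return q * scale; }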
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - Inception Convolution with Efficient Dilation Search [121.41030859447487]
Dilated convolution is a critical variant of the standard convolutional neural network, used to control effective receptive fields and to handle large scale variance of objects.
We propose a new variant of dilated convolution, namely inception (dilated) convolution, where the convolutions have independent dilation among different axes, channels and layers.
We explore a practical method for fitting the complex inception convolution to the data: a simple yet effective dilation search algorithm (EDO) based on statistical optimization.
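A minimal sketch of the degree of freedom being searched over: a 2D
convolution whose dilation rates dh and dw are independent per axis (inception
convolution additionally lets them vary across channels and layers). Single
channel, unit stride, no padding; all names are illustrative.

    // Dilated 2D convolution with independent per-axis dilations dh, dw.
    // The output extent shrinks by the dilated kernel footprint.
    void conv2d_dilated(const float* in, const float* w, float* out,
                        int H, int W,    // input height/width
                        int R, int S,    // kernel height/width
                        int dh, int dw)  // per-axis dilation rates
    {
      const int OH = H - dh * (R - 1), OW = W - dw * (S - 1);
      for (int oh = 0; oh < OH; ++oh)
        for (int ow = 0; ow < OW; ++ow) {
          float acc = 0.0f;
          for (int r = 0; r < R; ++r)
            for (int s = 0; s < S; ++s)
              acc += in[(oh + r * dh) * W + (ow + s * dw)] * w[r * S + s];
          out[oh * OW + ow] = acc;
        }
    }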
arXiv Detail & Related papers (2020-12-25T14:58:35Z) - Dynamic Region-Aware Convolution [85.20099799084026]
We propose a new convolution called Dynamic Region-Aware Convolution (DRConv), which can automatically assign multiple filters to corresponding spatial regions.
On ImageNet classification, DRConv-based ShuffleNetV2-0.5x achieves state-of-the-art performance of 67.1% at 46M multiply-adds level with 6.3% relative improvement.
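DRConv learns the guide map that does the region assignment; the sketch below
takes that map as given and shows only the inference-time dispatch, in the
simplest case of a 1x1 convolution producing one output channel. All names and
shapes here are illustrative, not the paper's implementation.

    #include <cstddef>

    // Each spatial position p is assigned one of G filters by guide[p];
    // the 1x1 convolution at p then uses the assigned filter.
    void drconv_1x1(const float* in,       // C x H x W input (CHW)
                    const float* filters,  // G x C filter bank
                    const int* guide,      // H*W entries in [0, G)
                    float* out,            // H x W single-channel output
                    int C, int H, int W) {
      for (int p = 0; p < H * W; ++p) {
        const float* f = filters + (std::size_t)guide[p] * C;
        float acc = 0.0f;
        for (int c = 0; c < C; ++c)
          acc += in[(std::size_t)c * H * W + p] * f[c];
        out[p] = acc;
      }
    }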
arXiv Detail & Related papers (2020-03-27T05:49:57Z)