Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions
- URL: http://arxiv.org/abs/2303.04739v1
- Date: Wed, 8 Mar 2023 17:23:39 GMT
- Title: Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions
- Authors: Victor Ferrari, Rafael Sousa, Marcio Pereira, João P. L. de Carvalho, José Nelson Amaral, José Moreira, Guido Araujo
- Abstract summary: Convolution is one of the most computationally intensive operations that must be performed for machine-learning model inference.
This paper proposes SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers.
- Score: 1.2006896500048552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolution is one of the most computationally intensive operations that must
be performed for machine-learning model inference. A traditional approach to
compute convolutions is known as the Im2Col + BLAS method. This paper proposes
SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation
toolchain that can be integrated into machine-learning compilers. This
algorithm introduces: (a) Convolution Slicing Analysis (CSA) - a
convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse
over the cache hierarchy; (b) Convolution Slicing Optimization (CSO) - a
code-generation pass that uses CSA to generate a tiled direct-convolution
macro-kernel; and (c) Vector-Based Packing (VBP) - an architecture-specific
optimized input-tensor packing solution based on vector-register shift
instructions for convolutions with unitary stride. Experiments conducted on 393
convolutions from full ONNX-MLIR machine-learning models indicate that the
elimination of the Im2Col transformation and the use of fast packing routines
result in a total packing time reduction, on full model inference, of 2.0x -
3.9x on Intel x86 and 3.6x - 7.2x on IBM POWER10. The speed-up over an Im2Col +
BLAS method based on current BLAS implementations for end-to-end
machine-learning model inference is in the range of 9% - 25% for Intel x86 and
10% - 42% for IBM POWER10 architectures. The total convolution speedup for
model inference is 12% - 27% on Intel x86 and 26% - 46% on IBM POWER10. SConv
also outperforms BLAS GEMM when computing pointwise convolutions in more than
83% of the 219 tested instances.
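To make the baseline concrete, here is a minimal NumPy sketch of the Im2Col + BLAS method next to a plain direct convolution. It assumes unit stride and no padding, and the names are illustrative, not taken from SConv's MLIR/LLVM toolchain:

import numpy as np

def im2col(x, kh, kw):
    """Unfold a (C, H, W) input into the (C*kh*kw, OH*OW) patch matrix
    used by the Im2Col + BLAS method (unit stride, no padding)."""
    c, h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # every output position reads this (ci, i, j) kernel tap
                cols[row] = x[ci, i:i + oh, j:j + ow].reshape(-1)
                row += 1
    return cols

def conv_im2col(x, weights):
    """Im2Col + GEMM: one matrix multiply computes all outputs, at the
    cost of materializing the duplicated patch matrix."""
    m, c, kh, kw = weights.shape           # M filters of shape C x KH x KW
    cols = im2col(x, kh, kw)               # the packing step SConv removes
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    return (weights.reshape(m, -1) @ cols).reshape(m, oh, ow)

def conv_direct(x, weights):
    """Direct convolution: same arithmetic, no Im2Col buffer. SConv's
    CSA/CSO passes tile a loop nest like this for the cache hierarchy."""
    m, c, kh, kw = weights.shape
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((m, oh, ow), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            # accumulate one kernel tap over the whole output plane
            out += np.einsum('mc,chw->mhw', weights[:, :, i, j],
                             x[:, i:i + oh, j:j + ow])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
w = rng.standard_normal((4, 3, 3, 3))
assert np.allclose(conv_im2col(x, w), conv_direct(x, w))

Each input element is duplicated up to kh*kw times in the patch matrix; avoiding that copy, and packing tiles with vector shifts (VBP) instead, is where the reported packing-time reductions come from.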
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures [26.146937503081876]
This paper proposes three novel data layouts for im2win convolution: NHWC, CHWN, and CHWN8 (see the layout sketch below).
We compare the optimized im2win convolution with the direct convolution and PyTorch's im2col-based convolution across the aforementioned layouts on SIMD machines.
Our optimized im2win and direct convolutions achieved up to 95% and 94% of the machine's theoretical peak performance, respectively.
arXiv Detail & Related papers (2024-08-01T04:37:03Z)
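As a rough illustration of what these layout names mean (assumed index math only, not the paper's code): the physical layout fixes which logical dimension is contiguous in memory, and therefore which loop of a convolution kernel gets stride-1 vector loads.

import numpy as np

# One logical activation tensor under three physical layouts; the
# contiguous (stride-1) dimension differs, so a SIMD kernel would
# vectorize a different loop in each case.
N, C, H, W = 2, 16, 8, 8
nchw = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)

nhwc = np.ascontiguousarray(nchw.transpose(0, 2, 3, 1))  # C innermost
chwn = np.ascontiguousarray(nchw.transpose(1, 2, 3, 0))  # N innermost

# strides in elements: stride 1 along C for NHWC, along N for CHWN
print(np.array(nhwc.strides) // nhwc.itemsize)  # [1024  128   16    1]
print(np.array(chwn.strides) // chwn.itemsize)  # [128   16    2    1]

CHWN8 presumably blocks the batch dimension in groups of 8 so that a fixed-width vector register spans one block; that reading is inferred from the name, not verified against the paper.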
- ConvBench: A Comprehensive Benchmark for 2D Convolution Primitive Evaluation [0.34952465649465553]
This paper proposes ConvBench, a primitive-level benchmark for the evaluation and comparison of convolution algorithms.
It assesses 9243 convolution operations derived from 1097 real-world deep learning models.
The experiments showed the evaluated algorithm running faster than Im2col-GEMM in 93.6% of the convolutions.
arXiv Detail & Related papers (2024-07-15T13:58:24Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach from information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed (see the decode sketch below).
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
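A minimal sketch of the additive-quantization decode that AQLM builds on (shapes, names, and the codebook sizes are illustrative, not AQLM's implementation): each weight group is reconstructed as the sum of M codewords, one drawn from each of M codebooks.

import numpy as np

M, K, d = 2, 256, 8                  # codebooks, entries each, group size
rng = np.random.default_rng(0)
codebooks = rng.standard_normal((M, K, d)).astype(np.float32)
codes = rng.integers(0, K, size=(1024, M))   # one 8-bit index per codebook

def aq_decode(codes, codebooks):
    # gather one codeword per codebook and sum them for each weight group
    return sum(codebooks[m][codes[:, m]] for m in range(M))

w_approx = aq_decode(codes, codebooks)       # (1024, 8) reconstructed groups

With M = 2 codebooks of 256 entries over groups of 8 weights, the indices cost 16 bits per group, about 2 bits per weight, which is the extreme-compression regime the paper targets.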
- Im2win: Memory Efficient Convolution On SIMD Architectures [2.153650601445911]
We propose a new memory-efficient data transformation algorithm, called im2win.
Our results show that our algorithm reduces the memory overhead by average to 41.6% compared to the PyTorch's convolution implementation.
arXiv Detail & Related papers (2023-06-25T19:21:10Z)
- Im2win: An Efficient Convolution Paradigm on GPU [1.9162301033784574]
This paper proposes a convolution paradigm on GPU called im2win, which not only reduces memory footprint but also offers continuous memory accesses.
We compare our implementation with the direct convolution, PyTorch's GEMM-based convolution, and six cuDNN-based convolution implementations, with twelve state-of-the-art benchmarks.
arXiv Detail & Related papers (2023-06-25T19:09:56Z)
- HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs [11.007606356081435]
Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation.
However, no existing work has optimized SimpleX on multi-core CPUs, leading to limited performance.
We propose an efficient CF training system (called HEAT) that fully enables the multi-level caching and multi-threading capabilities of modern CPUs.
arXiv Detail & Related papers (2023-04-14T18:07:26Z)
- Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks, owing to its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states (see the block-wise quantization sketch below).
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
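A minimal sketch of the block-wise idea (a linear absmax grid over illustrative 256-element blocks; the paper itself uses dynamic quantization maps, and all names here are hypothetical):

import numpy as np

BLOCK = 256  # block size is illustrative; small blocks isolate outliers

def quantize_blockwise(state):
    """Quantize a flat fp32 optimizer-state tensor to int8 with one
    absmax scale per block."""
    pad = (-state.size) % BLOCK
    blocks = np.pad(state, (0, pad)).reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    q = np.round(blocks / scales * 127).astype(np.int8)
    return q, scales.squeeze(1), state.size

def dequantize_blockwise(q, scales, n):
    return (q.astype(np.float32) / 127 * scales[:, None]).reshape(-1)[:n]

rng = np.random.default_rng(0)
adam_v = rng.standard_normal(10_000).astype(np.float32) ** 2  # squared sum
q, s, n = quantize_blockwise(adam_v)
roundtrip = dequantize_blockwise(q, s, n)

Because each block carries its own scale, a single outlier only degrades precision inside its block rather than across the whole tensor, which is the core of the block-wise approach.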
- Inception Convolution with Efficient Dilation Search [121.41030859447487]
Dilated convolution is a critical variant of the standard convolutional neural network, used to control effective receptive fields and handle the large scale variance of objects.
We propose a new variant of dilated convolution, namely inception (dilated) convolution, where the convolutions have independent dilations among different axes, channels and layers.
To fit the complex inception convolution to the data, we develop a simple yet effective dilation search algorithm (EDO) based on statistical optimization (see the per-axis dilation sketch below).
arXiv Detail & Related papers (2020-12-25T14:58:35Z)
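A minimal sketch of a convolution with an independent dilation per spatial axis, the axis-wise degree of freedom described above (single channel only; the per-channel and per-layer variation the paper also proposes is omitted):

import numpy as np

def conv2d_axis_dilated(x, w, dil_h, dil_w):
    """Direct single-channel 2D convolution where kernel taps are spaced
    dil_h rows and dil_w columns apart."""
    kh, kw = w.shape
    eh, ew = (kh - 1) * dil_h + 1, (kw - 1) * dil_w + 1  # effective size
    oh, ow = x.shape[0] - eh + 1, x.shape[1] - ew + 1
    out = np.zeros((oh, ow), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            out += w[i, j] * x[i * dil_h:i * dil_h + oh,
                               j * dil_w:j * dil_w + ow]
    return out

x = np.arange(100, dtype=np.float64).reshape(10, 10)
w = np.ones((3, 3))
y = conv2d_axis_dilated(x, w, dil_h=2, dil_w=1)  # taller receptive field

Setting dil_h != dil_w stretches the receptive field along one axis only, which is what lets the searched dilations match anisotropic object scales.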
- Dynamic Region-Aware Convolution [85.20099799084026]
We propose a new convolution called Dynamic Region-Aware Convolution (DRConv), which can automatically assign multiple filters to corresponding spatial regions (a simplified sketch follows below).
On ImageNet classification, DRConv-based ShuffleNetV2-0.5x achieves state-of-the-art performance of 67.1% at 46M multiply-adds level with 6.3% relative improvement.
arXiv Detail & Related papers (2020-03-27T05:49:57Z)
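A heavily simplified sketch of the region-aware idea (1x1 filters and a fixed guide map; the actual DRConv learns the guide end-to-end and uses k-by-k filters, so treat everything here as an assumption):

import numpy as np

def drconv_sketch(x, filters, guide_logits):
    """Per-pixel filter selection: a guide map partitions the spatial
    plane into G regions and each region applies its own filter bank."""
    c, h, w = x.shape
    g, m, _ = filters.shape                  # G region banks, M outputs
    region = guide_logits.argmax(axis=0)     # (h, w) region index
    out = np.zeros((m, h, w), dtype=x.dtype)
    for r in range(g):
        mask = region == r
        # apply bank r only at the positions the guide assigns to it
        out[:, mask] = filters[r] @ x[:, mask]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 6)).astype(np.float32)
filters = rng.standard_normal((4, 16, 8)).astype(np.float32)  # G=4 banks
guide = rng.standard_normal((4, 6, 6)).astype(np.float32)
y = drconv_sketch(x, filters, guide)         # (16, 6, 6)

Unlike a standard convolution (one filter bank everywhere) or a fully dynamic one (one bank per pixel), this shares each bank within a region, which is how DRConv keeps the multiply-add count low.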