Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions
- URL: http://arxiv.org/abs/2303.04739v1
- Date: Wed, 8 Mar 2023 17:23:39 GMT
- Title: Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions
- Authors: Victor Ferrari, Rafael Sousa, Marcio Pereira, João P. L. de Carvalho, José Nelson Amaral, José Moreira, Guido Araujo
- Abstract summary: Convolution is one of the most computationally intensive operations that must be performed for machine-learning model inference.
This paper proposes SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers.
- Score: 1.2006896500048552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolution is one of the most computationally intensive operations that must
be performed for machine-learning model inference. A traditional approach to
compute convolutions is known as the Im2Col + BLAS method. This paper proposes
SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation
toolchain that can be integrated into machine-learning compilers. This
algorithm introduces: (a) Convolution Slicing Analysis (CSA) - a
convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse
over the cache hierarchy; (b) Convolution Slicing Optimization (CSO) - a
code-generation pass that uses CSA to generate a tiled direct-convolution
macro-kernel; and (c) Vector-Based Packing (VBP) - an architecture-specific
optimized input-tensor packing solution based on vector-register shift
instructions for convolutions with unitary stride. Experiments conducted on 393
convolutions from full ONNX-MLIR machine-learning models indicate that the
elimination of the Im2Col transformation and the use of fast packing routines
result in a total packing time reduction, on full model inference, of 2.0x -
3.9x on Intel x86 and 3.6x - 7.2x on IBM POWER10. The speed-up over an Im2Col +
BLAS method based on current BLAS implementations for end-to-end
machine-learning model inference is in the range of 9% - 25% for Intel x86 and
10% - 42% for IBM POWER10 architectures. The total convolution speedup for
model inference is 12% - 27% on Intel x86 and 26% - 46% on IBM POWER10. SConv
also outperforms BLAS GEMM when computing pointwise convolutions in more than
83% of the 219 tested instances.
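To make the baseline concrete, here is a minimal NumPy sketch of the Im2Col + BLAS method next to a plain direct convolution. It assumes unit stride and no padding, and the names are illustrative, not taken from SConv's MLIR/LLVM toolchain:

import numpy as np

def im2col(x, kh, kw):
    """Unfold a (C, H, W) input into the (C*kh*kw, OH*OW) patch matrix
    used by the Im2Col + BLAS method (unit stride, no padding)."""
    c, h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # every output position reads this (ci, i, j) kernel tap
                cols[row] = x[ci, i:i + oh, j:j + ow].reshape(-1)
                row += 1
    return cols

def conv_im2col(x, weights):
    """Im2Col + GEMM: one matrix multiply computes all outputs, at the
    cost of materializing the duplicated patch matrix."""
    m, c, kh, kw = weights.shape           # M filters of shape C x KH x KW
    cols = im2col(x, kh, kw)               # the packing step SConv removes
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    return (weights.reshape(m, -1) @ cols).reshape(m, oh, ow)

def conv_direct(x, weights):
    """Direct convolution: same arithmetic, no Im2Col buffer. SConv's
    CSA/CSO passes tile a loop nest like this for the cache hierarchy."""
    m, c, kh, kw = weights.shape
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((m, oh, ow), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            # accumulate one kernel tap over the whole output plane
            out += np.einsum('mc,chw->mhw', weights[:, :, i, j],
                             x[:, i:i + oh, j:j + ow])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
w = rng.standard_normal((4, 3, 3, 3))
assert np.allclose(conv_im2col(x, w), conv_direct(x, w))

Each input element is duplicated up to kh*kw times in the patch matrix; avoiding that copy, and packing tiles with vector shifts (VBP) instead, is where the reported packing-time reductions come from.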
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures [26.146937503081876]
This paper proposes three novel data layouts for im2win convolution: NHWC, CHWN, and CHWN8 (see the layout sketch below).
We compare the optimized im2win convolution with the direct convolution and PyTorch's im2col-based convolution across the aforementioned layouts on SIMD machines.
Our optimized im2win and direct convolutions achieved up to 95% and 94% of the machine's theoretical peak performance, respectively.
arXiv Detail & Related papers (2024-08-01T04:37:03Z)
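As a rough illustration of what these layout names mean (assumed index math only, not the paper's code): the physical layout fixes which logical dimension is contiguous in memory, and therefore which loop of a convolution kernel gets stride-1 vector loads.

import numpy as np

# One logical activation tensor under three physical layouts; the
# contiguous (stride-1) dimension differs, so a SIMD kernel would
# vectorize a different loop in each case.
N, C, H, W = 2, 16, 8, 8
nchw = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)

nhwc = np.ascontiguousarray(nchw.transpose(0, 2, 3, 1))  # C innermost
chwn = np.ascontiguousarray(nchw.transpose(1, 2, 3, 0))  # N innermost

# strides in elements: stride 1 along C for NHWC, along N for CHWN
print(np.array(nhwc.strides) // nhwc.itemsize)  # [1024  128   16    1]
print(np.array(chwn.strides) // chwn.itemsize)  # [128   16    2    1]

CHWN8 presumably blocks the batch dimension in groups of 8 so that a fixed-width vector register spans one block; that reading is inferred from the name, not verified against the paper.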
- ConvBench: A Comprehensive Benchmark for 2D Convolution Primitive Evaluation [0.34952465649465553]
This paper proposes ConvBench, a primitive-level benchmark for the evaluation and comparison of convolution algorithms.
It assesses 9243 convolution operations derived from 1097 real-world deep learning models.
The experiments showed the evaluated algorithm running faster than Im2col-GEMM in 93.6% of the convolutions.
arXiv Detail & Related papers (2024-07-15T13:58:24Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach from information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed (see the decode sketch below).
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
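A minimal sketch of the additive-quantization decode that AQLM builds on (shapes, names, and the codebook sizes are illustrative, not AQLM's implementation): each weight group is reconstructed as the sum of M codewords, one drawn from each of M codebooks.

import numpy as np

M, K, d = 2, 256, 8                  # codebooks, entries each, group size
rng = np.random.default_rng(0)
codebooks = rng.standard_normal((M, K, d)).astype(np.float32)
codes = rng.integers(0, K, size=(1024, M))   # one 8-bit index per codebook

def aq_decode(codes, codebooks):
    # gather one codeword per codebook and sum them for each weight group
    return sum(codebooks[m][codes[:, m]] for m in range(M))

w_approx = aq_decode(codes, codebooks)       # (1024, 8) reconstructed groups

With M = 2 codebooks of 256 entries over groups of 8 weights, the indices cost 16 bits per group, about 2 bits per weight, which is the extreme-compression regime the paper targets.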
- Im2win: Memory Efficient Convolution On SIMD Architectures [2.153650601445911]
We propose a new memory-efficient data transformation algorithm, called im2win.
Our results show that our algorithm reduces the memory overhead by average to 41.6% compared to the PyTorch's convolution implementation.
arXiv Detail & Related papers (2023-06-25T19:21:10Z)
- Im2win: An Efficient Convolution Paradigm on GPU [1.9162301033784574]
This paper proposes a convolution paradigm on GPU called im2win, which not only reduces memory footprint but also offers continuous memory accesses.
We compare our implementation with the direct convolution, PyTorch's GEMM-based convolution, and six cuDNN-based convolution implementations, with twelve state-of-the-art benchmarks.
arXiv Detail & Related papers (2023-06-25T19:09:56Z)
- HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs [11.007606356081435]
Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation.
However, no existing work has optimized SimpleX on multi-core CPUs, leading to limited performance.
We propose an efficient CF training system (called HEAT) that fully enables the multi-level caching and multi-threading capabilities of modern CPUs.
arXiv Detail & Related papers (2023-04-14T18:07:26Z)
- Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks, owing to its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states (see the block-wise quantization sketch below).
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
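A minimal sketch of the block-wise idea (a linear absmax grid over illustrative 256-element blocks; the paper itself uses dynamic quantization maps, and all names here are hypothetical):

import numpy as np

BLOCK = 256  # block size is illustrative; small blocks isolate outliers

def quantize_blockwise(state):
    """Quantize a flat fp32 optimizer-state tensor to int8 with one
    absmax scale per block."""
    pad = (-state.size) % BLOCK
    blocks = np.pad(state, (0, pad)).reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    q = np.round(blocks / scales * 127).astype(np.int8)
    return q, scales.squeeze(1), state.size

def dequantize_blockwise(q, scales, n):
    return (q.astype(np.float32) / 127 * scales[:, None]).reshape(-1)[:n]

rng = np.random.default_rng(0)
adam_v = rng.standard_normal(10_000).astype(np.float32) ** 2  # squared sum
q, s, n = quantize_blockwise(adam_v)
roundtrip = dequantize_blockwise(q, s, n)

Because each block carries its own scale, a single outlier only degrades precision inside its block rather than across the whole tensor, which is the core of the block-wise approach.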
- Inception Convolution with Efficient Dilation Search [121.41030859447487]
Dilated convolution is a critical variant of the standard convolutional neural network, used to control effective receptive fields and handle the large scale variance of objects.
We propose a new variant of dilated convolution, namely inception (dilated) convolution, where the convolutions have independent dilations among different axes, channels and layers.
To fit the complex inception convolution to the data, we develop a simple yet effective dilation search algorithm (EDO) based on statistical optimization (see the per-axis dilation sketch below).
arXiv Detail & Related papers (2020-12-25T14:58:35Z)
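A minimal sketch of a convolution with an independent dilation per spatial axis, the axis-wise degree of freedom described above (single channel only; the per-channel and per-layer variation the paper also proposes is omitted):

import numpy as np

def conv2d_axis_dilated(x, w, dil_h, dil_w):
    """Direct single-channel 2D convolution where kernel taps are spaced
    dil_h rows and dil_w columns apart."""
    kh, kw = w.shape
    eh, ew = (kh - 1) * dil_h + 1, (kw - 1) * dil_w + 1  # effective size
    oh, ow = x.shape[0] - eh + 1, x.shape[1] - ew + 1
    out = np.zeros((oh, ow), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            out += w[i, j] * x[i * dil_h:i * dil_h + oh,
                               j * dil_w:j * dil_w + ow]
    return out

x = np.arange(100, dtype=np.float64).reshape(10, 10)
w = np.ones((3, 3))
y = conv2d_axis_dilated(x, w, dil_h=2, dil_w=1)  # taller receptive field

Setting dil_h != dil_w stretches the receptive field along one axis only, which is what lets the searched dilations match anisotropic object scales.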
- Dynamic Region-Aware Convolution [85.20099799084026]
We propose a new convolution called Dynamic Region-Aware Convolution (DRConv), which can automatically assign multiple filters to corresponding spatial regions (a simplified sketch follows below).
On ImageNet classification, DRConv-based ShuffleNetV2-0.5x achieves state-of-the-art performance of 67.1% at 46M multiply-adds level with 6.3% relative improvement.
arXiv Detail & Related papers (2020-03-27T05:49:57Z)
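A heavily simplified sketch of the region-aware idea (1x1 filters and a fixed guide map; the actual DRConv learns the guide end-to-end and uses k-by-k filters, so treat everything here as an assumption):

import numpy as np

def drconv_sketch(x, filters, guide_logits):
    """Per-pixel filter selection: a guide map partitions the spatial
    plane into G regions and each region applies its own filter bank."""
    c, h, w = x.shape
    g, m, _ = filters.shape                  # G region banks, M outputs
    region = guide_logits.argmax(axis=0)     # (h, w) region index
    out = np.zeros((m, h, w), dtype=x.dtype)
    for r in range(g):
        mask = region == r
        # apply bank r only at the positions the guide assigns to it
        out[:, mask] = filters[r] @ x[:, mask]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 6)).astype(np.float32)
filters = rng.standard_normal((4, 16, 8)).astype(np.float32)  # G=4 banks
guide = rng.standard_normal((4, 6, 6)).astype(np.float32)
y = drconv_sketch(x, filters, guide)         # (16, 6, 6)

Unlike a standard convolution (one filter bank everywhere) or a fully dynamic one (one bank per pixel), this shares each bank within a region, which is how DRConv keeps the multiply-add count low.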