Sliding Window Sum Algorithms for Deep Neural Networks
- URL: http://arxiv.org/abs/2305.16513v1
- Date: Thu, 25 May 2023 22:37:40 GMT
- Title: Sliding Window Sum Algorithms for Deep Neural Networks
- Authors: Roman Snytsar
- Abstract summary: Sliding window sums are widely used for string indexing, hashing and time series analysis.
We have developed a family of generic vectorized sliding sum algorithms that provide a speedup of $O(P/w)$ for window size $w$ and number of processors $P$.
We show that the sliding sum convolution kernels are more efficient than the commonly used GEMM kernels on the CPU, and could even outperform their GPU counterparts.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sliding window sums are widely used for string indexing, hashing and time series analysis. We have developed a family of generic vectorized sliding sum algorithms that provide a speedup of $O(P/w)$ for window size $w$ and number of processors $P$. For a sum with a commutative operator the speedup improves to $O(P/\log w)$. Even more importantly, our algorithms exhibit efficient memory access patterns. In this paper we study the application of the sliding sum algorithms to the training and inference of Deep Neural Networks. We demonstrate how both pooling and convolution primitives can be expressed as sliding sums and evaluated by compute kernels with a shared structure. We show that the sliding sum convolution kernels are more efficient than the commonly used GEMM kernels on the CPU, and could even outperform their GPU counterparts.
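To make the shared kernel structure concrete, here is a minimal sketch (an illustrative NumPy reconstruction, not the paper's vectorized kernels; the function names, 1-D shapes, and stride-1 setting are assumptions) expressing average pooling, max pooling, and convolution as the same sliding-window reduction with different combining operators.

```python
# Minimal sketch: pooling and 1-D convolution expressed as sliding window sums.
# Names and shapes are illustrative assumptions, not the paper's actual kernels.
import numpy as np

def sliding_windows(x, w):
    """Return a (len(x) - w + 1, w) view of all length-w windows of x."""
    return np.lib.stride_tricks.sliding_window_view(x, w)

def avg_pool1d(x, w):
    # Average pooling is a stride-1 sliding sum scaled by 1/w.
    return sliding_windows(x, w).sum(axis=1) / w

def max_pool1d(x, w):
    # Max pooling is the same sliding reduction with max as the operator.
    return sliding_windows(x, w).max(axis=1)

def conv1d(x, kernel):
    # Convolution is a sliding sum whose terms are weighted by the kernel:
    # y[i] = sum_j kernel[j] * x[i + j]  (cross-correlation convention).
    return sliding_windows(x, len(kernel)) @ kernel

if __name__ == "__main__":
    x = np.arange(8, dtype=np.float64)
    k = np.array([1.0, 0.0, -1.0])
    print(avg_pool1d(x, 3))  # [1. 2. 3. 4. 5. 6.]
    print(conv1d(x, k))      # [-2. -2. -2. -2. -2. -2.]
```

For commutative, associative operators such as + and max, each per-window reduction can itself be evaluated in a logarithmic number of parallel steps, which is where the improved $O(P/\log w)$ bound quoted in the abstract comes from.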
Related papers
- Accelerating Machine Learning Primitives on Commodity Hardware [0.0]
We present an extensive study of the Sliding Window convolution technique as a more efficient alternative to the commonly used General Matrix Multiplication (GEMM) based convolution in Deep Neural Networks (DNNs).
Our results suggest that the Sliding Window computation kernels can outperform GEMM-based convolution on a CPU and even on dedicated hardware accelerators; a toy comparison of the two strategies is sketched after this list.
This could promote a wider adoption of AI on low-power and low-memory devices without the need for specialized hardware.
arXiv Detail & Related papers (2023-10-08T16:26:18Z) - Provable advantages of kernel-based quantum learners and quantum
preprocessing based on Grover's algorithm [0.0]
We find a speedup utilizing Grover's algorithm in the kernel of a support vector machine.
We also show that combining quantum computation in a preprocessing step with classical methods for classification further improves classification performance.
arXiv Detail & Related papers (2023-09-25T18:00:00Z) - Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO).
Results show that TSO is capable of identifying the best tensor slicing that minimizes execution time for a set of CNN models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z) - Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z) - Provably Faster Algorithms for Bilevel Optimization [54.83583213812667]
Bilevel optimization has been widely applied in many important machine learning applications.
We propose two new algorithms for bilevel optimization.
We show that both algorithms achieve a complexity of $\mathcal{O}(\epsilon^{-1.5})$, which outperforms all existing algorithms by an order of magnitude.
arXiv Detail & Related papers (2021-06-08T21:05:30Z) - Providing Meaningful Data Summarizations Using Examplar-based Clustering
in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z) - Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction of the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions, while achieving comparable error bounds both in theory and in practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z) - Scalable Plug-and-Play ADMM with Convergence Guarantees [24.957046830965822]
We propose an incremental variant of the widely used ADMM algorithm, making it scalable to large-scale datasets.
We theoretically analyze the convergence of the algorithm under a set of explicit assumptions.
arXiv Detail & Related papers (2020-06-05T04:10:15Z) - Minimal Filtering Algorithms for Convolutional Neural Networks [82.24592140096622]
We develop fully parallel hardware-oriented algorithms for implementing the basic filtering operation for M=3,5,7,9, and 11.
A fully parallel hardware implementation of the proposed algorithms in each case gives approximately 30 percent savings in the number of embedded multipliers.
arXiv Detail & Related papers (2020-04-12T13:18:25Z)
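As referenced above, the contrast between GEMM-based (im2col) convolution and direct sliding-window convolution can be made concrete with a toy comparison. The sketch below is an illustrative NumPy reconstruction, not code from either paper; the single-channel, stride-1, no-padding setting and all function names are assumptions.

```python
# Toy comparison of the two convolution strategies (illustrative only; shapes,
# names, and the stride-1 / no-padding / single-channel setting are assumptions).
import numpy as np

def conv2d_im2col_gemm(x, k):
    """GEMM-based convolution: materialize every patch, then one matrix multiply."""
    H, W = x.shape
    kh, kw = k.shape
    oh, ow = H - kh + 1, W - kw + 1
    # im2col: copy each kh*kw patch into a row of a large intermediate matrix.
    patches = np.lib.stride_tricks.sliding_window_view(x, (kh, kw))
    patches = patches.reshape(oh * ow, kh * kw)
    return (patches @ k.reshape(-1)).reshape(oh, ow)

def conv2d_sliding(x, k):
    """Direct sliding-window convolution: accumulate shifted, weighted slices of
    the input without building the im2col matrix."""
    H, W = x.shape
    kh, kw = k.shape
    oh, ow = H - kh + 1, W - kw + 1
    out = np.zeros((oh, ow), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * x[i:i + oh, j:j + ow]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((6, 6))
    k = rng.standard_normal((3, 3))
    # Both strategies compute the same result.
    assert np.allclose(conv2d_im2col_gemm(x, k), conv2d_sliding(x, k))
```

The im2col variant builds a large (output-pixels by kernel-size) intermediate matrix before a single matrix multiply, whereas the direct variant only accumulates shifted, weighted slices of the input; this difference in memory traffic and access pattern is what the sliding-window results above attribute their CPU advantage to.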