Related papers: Giga-scale Kernel Matrix Vector Multiplication on GPU

Giga-scale Kernel Matrix Vector Multiplication on GPU

URL: http://arxiv.org/abs/2202.01085v1
Date: Wed, 2 Feb 2022 15:28:15 GMT
Title: Giga-scale Kernel Matrix Vector Multiplication on GPU
Authors: Robert Hu, Dino Sejdinovic, Joan Alexis Glaun\`es
Abstract summary: Kernel matrix vector multiplication (KMVM) is a ubiquitous operation in machine learning and scientific computing, spanning from the kernel literature to signal processing. We propose a novel approximation procedure coined Faster-Fast and Free Memory Method ($textF3$M) to address these scaling issues for KMVM. We show that $textF3$M can compute a full KMVM for a billion points emphin under one minute on a high-end GPU, leading to a significant speed-up in comparison to existing CPU methods.
Score: 9.106412307976067
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Kernel matrix vector multiplication (KMVM) is a ubiquitous operation in machine learning and scientific computing, spanning from the kernel literature to signal processing. As kernel matrix vector multiplication tends to scale quadratically in both memory and time, applications are often limited by these computational scaling constraints. We propose a novel approximation procedure coined Faster-Fast and Free Memory Method ($\text{F}^3$M) to address these scaling issues for KMVM. Extensive experiments demonstrate that $\text{F}^3$M has empirical \emph{linear time and memory} complexity with a relative error of order $10^{-3}$ and can compute a full KMVM for a billion points \emph{in under one minute} on a high-end GPU, leading to a significant speed-up in comparison to existing CPU methods. We further demonstrate the utility of our procedure by applying it as a drop-in for the state-of-the-art GPU-based linear solver FALKON, \emph{improving speed 3-5 times} at the cost of $<$1\% drop in accuracy.

Related papers

Machine learning-driven conservative-to-primitive conversion in hybrid piecewise polytropic and tabulated equations of state [0.2999888908665658]
We present a novel machine learning (ML) method to accelerate conservative-to-primitive inversion in hydrodynamics simulations.<n>We employ feedforward neural networks (NNC2PS and NNC2PL), trained in PyTorch, and optimized for GPU inference using NVIDIART.<n>The mixed-precisionRT engine for NNC2PS inference speeds approximately 400 times faster than a traditional single-threaded implementation for a dataset size of 1,000,000 points.
arXiv Detail & Related papers (2024-12-10T19:00:01Z)
vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests. Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs. We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
Fast Evaluation of Additive Kernels: Feature Arrangement, Fourier Methods, and Kernel Derivatives [0.5735035463793009]
We present a technique based on the non-equispaced fast Fourier transform (NFFT) with rigorous error analysis. We show that this approach is also well suited to allow the approximation of the matrix that arises when the kernel is differentiated. We illustrate the performance of the additive kernel scheme with fast matrix vector products on a number of data sets.
arXiv Detail & Related papers (2024-04-26T11:50:16Z)
Large-Scale Gaussian Processes via Alternating Projection [23.79090469387859]
We propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling mini-batching. Our algorithm, based on alternating projection, has $mathcalO(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets.
arXiv Detail & Related papers (2023-10-26T04:20:36Z)
INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient. We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture. We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
Snacks: a fast large-scale kernel SVM solver [0.8602553195689513]
Snacks is a new large-scale solver for Kernel Support Vector Machines. Snacks relies on a Nystr"om approximation of the kernel matrix and an accelerated variant of the subgradient method.
arXiv Detail & Related papers (2023-04-17T04:19:20Z)
Sub-quadratic Algorithms for Kernel Matrices via Kernel Density Estimation [24.166833799353476]
We develop efficient reductions from $textitweighted edge sampling$ on kernel graphs, $textitsimulating random walks$ on kernel graphs, and $textitweighted sampling$ on matrices to Kernel Density Estimation. Our reductions are the central ingredient in each of our applications and we believe they may be of independent interest.
arXiv Detail & Related papers (2022-12-01T16:42:56Z)
Optimizing Data Collection in Deep Reinforcement Learning [4.9709347068704455]
GPU vectorization can achieve up to $1024times$ speedup over commonly used CPU simulators. We show that simulator kernel fusion speedups with a simple simulator are $11.3times$ and increase by up to $1024times$ as simulator complexity increases in terms of memory bandwidth requirements.
arXiv Detail & Related papers (2022-07-15T20:22:31Z)
Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications. We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
Recipe for Fast Large-scale SVM Training: Polishing, Parallelism, and more RAM! [0.0]
Support vector machines (SVMs) are a standard method in the machine learning toolbox. Non-linear kernel SVMs often deliver highly accurate predictors, however, at the cost of long training times. In this work, we combine both approaches to design an extremely fast dual SVM solver.
arXiv Detail & Related papers (2022-07-03T11:51:41Z)
PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning. However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware. PLSSVM can be used as a drop-in replacement for an LVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
Fast Sketching of Polynomial Kernels of Polynomial Degree [61.83993156683605]
kernel is especially important as other kernels can often be approximated by the kernel via a Taylor series expansion. Recent techniques in sketching reduce the dependence in the running time on the degree oblivious $q$ of the kernel. We give a new sketch which greatly improves upon this running time, by removing the dependence on $q$ in the leading order term.
arXiv Detail & Related papers (2021-08-21T02:14:55Z)
VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose textitVersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator. textitVersaGNN achieves on average 3712$times$ speedup with 1301.25$times$ energy reduction on CPU, and 35.4$times$ speedup with 17.66$times$ energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
GPU-Accelerated Primal Learning for Extremely Fast Large-Scale Classification [10.66048003460524]
One of the most efficient methods to solve L2-regularized primal problems, such as logistic regression and linear support vector machine (SVM) classification, is the widely used trust region Newton algorithm, TRON. We show that using judicious GPU-optimization principles, TRON training time for different losses and feature representations may be drastically reduced.
arXiv Detail & Related papers (2020-08-08T03:40:27Z)
Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems. Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections. Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.