Efficient GPU implementation of randomized SVD and its applications
- URL: http://arxiv.org/abs/2110.03423v2
- Date: Tue, 12 Mar 2024 12:12:29 GMT
- Title: Efficient GPU implementation of randomized SVD and its applications
- Authors: Łukasz Struski, Paweł Morkisz, Przemysław Spurek, Samuel Rodriguez Bernabeu, Tomasz Trzciński
- Abstract summary: Matrix decompositions are ubiquitous in machine learning, including applications in dimensionality reduction, data compression and deep learning algorithms.
Typical solutions for matrix decompositions have polynomial complexity, which significantly increases their computational cost and time.
We leverage efficient processing operations that can be run in parallel on modern Graphical Processing Units (GPUs) to reduce the computational burden of computing matrix decompositions.
- Score: 17.71779625877989
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Matrix decompositions are ubiquitous in machine learning, including
applications in dimensionality reduction, data compression and deep learning
algorithms. Typical solutions for matrix decompositions have polynomial
complexity which significantly increases their computational cost and time. In
this work, we leverage efficient processing operations that can be run in
parallel on modern Graphical Processing Units (GPUs), the predominant computing
architecture used, e.g., in deep learning, to reduce the computational burden of
computing matrix decompositions. More specifically, we reformulate the
randomized decomposition problem to incorporate fast matrix multiplication
operations (BLAS-3) as building blocks. We show that this formulation, combined
with fast random number generators, allows us to fully exploit the potential of
parallel processing implemented in GPUs. Our extensive evaluation confirms the
superiority of this approach over the competing methods and we release the
results of this research as a part of the official CUDA implementation
(https://docs.nvidia.com/cuda/cusolver/index.html).
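As a reference point, the sketch below shows the standard randomized SVD template (a Halko-style range finder with power iterations) that the BLAS-3 reformulation described above builds on. It is a minimal NumPy illustration with assumed default parameters, not the paper's cuSOLVER implementation (cuSOLVER exposes randomized SVD through its gesvdr routine); note that every heavy step is a matrix product, i.e., a GEMM that maps directly onto GPU hardware.

```python
import numpy as np

def randomized_svd(A, rank, n_oversample=10, n_power_iter=2, seed=None):
    """Approximate rank-`rank` SVD of A. Every expensive step below is a
    matrix product (BLAS-3 GEMM); on a GPU these become cuBLAS calls and
    the random projection uses a fast generator such as cuRAND."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = min(rank + n_oversample, min(m, n))
    Omega = rng.standard_normal((n, k))          # random test matrix
    Y = A @ Omega                                # GEMM: sample the range of A
    for _ in range(n_power_iter):                # power iterations sharpen the range
        Y = A @ (A.T @ Y)                        # two GEMMs per iteration
    Q, _ = np.linalg.qr(Y)                       # thin QR of a tall-skinny matrix
    B = Q.T @ A                                  # GEMM: project A down to k x n
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)  # SVD of the small matrix
    U = Q @ Ub                                   # GEMM: lift back to m x k
    return U[:, :rank], s[:rank], Vt[:rank]

# Example: U, s, Vt = randomized_svd(np.random.randn(4000, 600), rank=50)
```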
Related papers
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the explicit computation of the model's covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra [62.37017125812101]
We propose a simple but general framework for large-scale linear algebra problems in machine learning, named CoLA.
By combining a linear operator abstraction with compositional dispatch rules, CoLA automatically constructs memory and runtime efficient numerical algorithms.
We showcase its efficacy across a broad range of applications, including partial differential equations, Gaussian processes, equivariant model construction, and unsupervised learning.
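A toy sketch of the compositional idea (hypothetical names, not CoLA's actual API): represent matrices as lazy linear operators and dispatch on structure, so a product of operators is applied factor by factor instead of ever being materialized as a dense matrix.

```python
import numpy as np

class LinOp:
    """Minimal lazy linear operator; illustrative only, not CoLA's API."""
    def __init__(self, shape, matvec):
        self.shape, self.matvec = shape, matvec

    def __matmul__(self, other):
        if isinstance(other, LinOp):       # dispatch: compose lazily
            assert self.shape[1] == other.shape[0]
            return LinOp((self.shape[0], other.shape[1]),
                         lambda v: self.matvec(other.matvec(v)))
        return self.matvec(other)          # base case: apply to a vector

def dense(M):
    return LinOp(M.shape, lambda v: M @ v)

def diagonal(d):                            # O(n) storage and O(n) apply
    return LinOp((d.size, d.size), lambda v: d * v)

A = dense(np.random.randn(500, 500))
D = diagonal(np.random.rand(500))
x = np.random.randn(500)
y = (A @ D @ A) @ x   # three cheap matvecs; no 500x500 product is ever formed
```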
arXiv Detail & Related papers (2023-09-06T14:59:38Z)
- Randomized Polar Codes for Anytime Distributed Machine Learning [66.46612460837147]
We present a novel distributed computing framework that is robust to slow compute nodes, and is capable of both approximate and exact computation of linear operations.
We propose a sequential decoding algorithm designed to handle real-valued data while maintaining low computational complexity for recovery.
We demonstrate the potential applications of this framework in various contexts, such as large-scale matrix multiplication and black-box optimization.
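The polar-code construction itself is more involved, but the underlying coded-computation idea can be shown with a toy parity code (an illustrative stand-in, not the paper's scheme): split the operand across workers, add a redundant parity block, and recover the full product even if one worker straggles.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
x = rng.standard_normal(4)

A1, A2 = A[:3], A[3:]
tasks = {"w1": A1, "w2": A2, "w3": A1 + A2}   # third worker holds the parity block

# Suppose worker w2 is a straggler and never responds.
results = {w: M @ x for w, M in tasks.items() if w != "w2"}

y_top = results["w1"]
y_bottom = results["w3"] - results["w1"]      # recover A2 @ x from the parity
y = np.concatenate([y_top, y_bottom])
assert np.allclose(y, A @ x)                  # exact result despite the straggler
```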
arXiv Detail & Related papers (2023-09-01T18:02:04Z)
- Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
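For intuition, the sketch below runs the classical unshifted QR iteration, the building block behind QR-based eigendecomposition, over a whole batch of small symmetric matrices in one vectorized call. It is a pedagogical sketch assuming well-separated eigenvalues; production methods add shifts and deflation.

```python
import numpy as np

def batched_qr_eig(A, iters=200):
    """Unshifted QR iteration on a batch A of symmetric matrices, shape (b, n, n).
    Pedagogical only: real QR-based ED adds shifts and deflation."""
    V = np.broadcast_to(np.eye(A.shape[-1]), A.shape).copy()
    for _ in range(iters):
        Q, R = np.linalg.qr(A)   # batched QR in a single call
        A = R @ Q                # similarity transform: eigenvalues preserved
        V = V @ Q                # accumulate eigenvectors
    return np.einsum("bii->bi", A), V

batch = np.random.randn(32, 8, 8)
batch = (batch + batch.transpose(0, 2, 1)) / 2   # symmetrize
eigvals, eigvecs = batched_qr_eig(batch)
```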
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
- Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how the found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
- Fast and Accurate Pseudoinverse with Sparse Matrix Reordering and Incremental Approach [4.710916891482697]
The pseudoinverse is a generalization of the matrix inverse that has been extensively utilized in machine learning.
FastPI is a novel incremental singular value decomposition (SVD) based pseudoinverse method for sparse matrices.
We show that FastPI computes the pseudoinverse faster than other approximate methods without loss of accuracy.
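For context, below is the textbook SVD route to the Moore-Penrose pseudoinverse, i.e., the dense baseline that FastPI's incremental approach improves on (this sketch is the baseline, not FastPI itself):

```python
import numpy as np

def pinv_via_svd(A, rcond=1e-15):
    """Moore-Penrose pseudoinverse A+ = V diag(1/s) U^T with small singular
    values truncated; the dense baseline, not FastPI's incremental method."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    cutoff = rcond * s.max()
    s_inv = np.where(s > cutoff, 1.0 / s, 0.0)   # drop numerically-zero directions
    return (Vt.T * s_inv) @ U.T

A = np.random.randn(200, 50)
assert np.allclose(pinv_via_svd(A), np.linalg.pinv(A))
```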
arXiv Detail & Related papers (2020-11-09T07:47:10Z)
- Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions [2.3513645401551333]
We introduce the Concurrent ALS algorithm and library, which offers an interface to Matlab.
We show how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity.
Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.
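The fusion idea can be shown in miniature (hypothetical shapes, not the paper's CP-ALS kernels): many small matrix products that share one large operand become a single larger GEMM, so the shared operand is read from memory once instead of once per decomposition, which is what raises arithmetic intensity.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((5000, 64))                      # shared operand
Bs = [rng.standard_normal((64, 8)) for _ in range(16)]   # 16 independent factors

# Unfused: 16 small GEMMs; T is streamed from memory 16 times.
outs_loop = [T @ B for B in Bs]

# Fused: one GEMM against the concatenated factors; T is read once,
# so flops per byte moved (arithmetic intensity) goes up markedly.
fused = T @ np.concatenate(Bs, axis=1)                   # (5000, 128)
outs_fused = np.split(fused, 16, axis=1)

assert all(np.allclose(a, b) for a, b in zip(outs_loop, outs_fused))
```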
arXiv Detail & Related papers (2020-10-09T16:55:46Z)
- QR and LQ Decomposition Matrix Backpropagation Algorithms for Square, Wide, and Deep -- Real or Complex -- Matrices and Their Software Implementation [0.0]
This article presents matrix backpropagation algorithms for the QR decomposition of matrices $A_{m,n}$ that are either square ($m = n$), wide ($m < n$), or deep ($m > n$), with rank $k = \min(m, n)$.
We derive novel matrix backpropagation results for the pivoted (full-rank) QR decomposition and for the LQ decomposition of deep input matrices.
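For reference, one standard form of the square-case QR backward pass (the result already used in mainstream autodiff frameworks; the pivoted and LQ results this paper derives are more involved and not reproduced here): for $A = QR$ with incoming gradients $\bar{Q}$ and $\bar{R}$, $\bar{A} = (\bar{Q} + Q\,\mathrm{copyltu}(M))\,R^{-T}$ with $M = R\bar{R}^T - \bar{Q}^T Q$, where $\mathrm{copyltu}(M)$ copies the lower triangle of $M$ (diagonal included) onto the upper triangle.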
arXiv Detail & Related papers (2020-09-19T21:03:37Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far they could hardly be applied to large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
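Large-scale kernel solvers of this kind typically combine a Nyström approximation with preconditioned iterative methods; the sketch below shows only the Nyström piece, in its generic textbook form (not the paper's optimized GPU solver), to make the memory saving concrete.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
centers = X[rng.choice(len(X), size=100, replace=False)]  # Nystrom landmarks

K_nm = gaussian_kernel(X, centers)        # (n, m): only n*m kernel entries
K_mm = gaussian_kernel(centers, centers)  # (m, m): small
# The full n x n kernel matrix is approximated as K_nm @ pinv(K_mm) @ K_nm.T,
# so downstream computations touch O(n*m) memory instead of O(n^2).
K_approx = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T
```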
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content and is not responsible for any consequences arising from its use.