Kernel methods through the roof: handling billions of points efficiently
- URL: http://arxiv.org/abs/2006.10350v2
- Date: Thu, 26 Nov 2020 15:55:17 GMT
- Title: Kernel methods through the roof: handling billions of points efficiently
- Authors: Giacomo Meanti, Luigi Carratino, Lorenzo Rosasco, Alessandro Rudi
- Abstract summary: Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
- Score: 94.31450736250918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Kernel methods provide an elegant and principled approach to nonparametric
learning, but so far could hardly be used in large scale problems, since
naïve implementations scale poorly with data size. Recent advances have shown
the benefits of a number of algorithmic ideas, for example combining
optimization, numerical linear algebra and random projections. Here, we push
these efforts further to develop and test a solver that takes full advantage of
GPU hardware. Towards this end, we designed a preconditioned gradient solver
for kernel methods exploiting both GPU acceleration and parallelization with
multiple GPUs, implementing out-of-core variants of common linear algebra
operations to guarantee optimal hardware utilization. Further, we optimize the
numerical precision of different operations and maximize efficiency of
matrix-vector multiplications. As a result we can experimentally show dramatic
speedups on datasets with billions of points, while still guaranteeing state of
the art performance. Additionally, we make our software available as an easy to
use library.
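As a rough illustration of the computational pattern at the core of the paper, the sketch below solves a small kernel ridge regression problem with preconditioned conjugate gradient (PCG) instead of a direct factorization. It is a minimal NumPy toy with a simple Jacobi preconditioner; the paper's full solver additionally uses Nyström centers, a Cholesky-based preconditioner, mixed precision and out-of-core multi-GPU linear algebra, none of which are reproduced here. All names in the snippet are illustrative.
```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise squared distances -> Gaussian (RBF) kernel matrix.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def pcg(A, b, M_inv_diag, tol=1e-8, max_iter=500):
    # Preconditioned conjugate gradient for A x = b (diagonal preconditioner).
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv_diag * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

lam = 1e-3
K = gaussian_kernel(X, X)
A = K + lam * len(X) * np.eye(len(X))          # regularized kernel system
alpha = pcg(A, y, M_inv_diag=1.0 / np.diag(A))
print("train MSE:", np.mean((K @ alpha - y) ** 2))
```
The iterative solver only ever touches the kernel matrix through matrix-vector products, which is exactly what makes GPU acceleration and out-of-core execution possible at scale.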
Related papers
- Implementation and Analysis of GPU Algorithms for Vecchia Approximation [0.8057006406834466]
Vecchia Approximation is widely used to reduce the computational complexity of Gaussian-process likelihood evaluation and can be computed with embarrassingly parallel algorithms.
While multi-core software has been developed for Vecchia Approximation, software designed to run on graphics processing units (GPUs) is lacking.
We show that our new method outperforms the two existing approaches we compare against, and we present it in the GpGpU R package.
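To make the embarrassingly parallel structure concrete, here is a minimal NumPy sketch of a Vecchia-style log-likelihood for a 1-D Gaussian process with an exponential kernel: each observation is conditioned only on a few previously ordered neighbours, so the per-point terms are independent. This illustrates the approximation itself, not the GpGpU package; the kernel choice and neighbour rule are assumptions.
```python
import numpy as np

def exp_cov(X, Y, length=1.0):
    # Exponential (Ornstein-Uhlenbeck) covariance for 1-D locations.
    return np.exp(-np.abs(X[:, None] - Y[None, :]) / length)

def vecchia_loglik(x_loc, y, m=5, length=1.0):
    # Sum of independent conditional Gaussian log-densities; each term
    # conditions point i only on its m previous (ordered) neighbours.
    ll = 0.0
    for i in range(len(y)):
        nbr = np.arange(max(0, i - m), i)
        if len(nbr) == 0:
            mean, var = 0.0, 1.0                 # unit-variance kernel
        else:
            K_nn = exp_cov(x_loc[nbr], x_loc[nbr], length)
            k_in = exp_cov(x_loc[[i]], x_loc[nbr], length)[0]
            w = np.linalg.solve(K_nn, k_in)
            mean = w @ y[nbr]
            var = 1.0 - w @ k_in
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return ll

rng = np.random.default_rng(0)
x_loc = np.sort(rng.uniform(0, 10, 200))
L = np.linalg.cholesky(exp_cov(x_loc, x_loc) + 1e-9 * np.eye(200))
y = L @ rng.normal(size=200)                     # sample from the GP prior
print("Vecchia log-likelihood:", vecchia_loglik(x_loc, y))
```
Because the loop iterations are independent, they map naturally onto GPU thread blocks, which is the opportunity the paper exploits.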
arXiv Detail & Related papers (2024-07-03T01:24:44Z)
- Accelerating Cutting-Plane Algorithms via Reinforcement Learning Surrogates [49.84541884653309]
A current standard approach to solving convex discrete optimization problems is the use of cutting-plane algorithms.
Despite the existence of a number of general-purpose cut-generating algorithms, large-scale discrete optimization problems continue to suffer from intractability.
We propose a method for accelerating cutting-plane algorithms via reinforcement learning.
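For context, the sketch below runs a plain Kelley cutting-plane loop on a small convex problem over a box, using scipy.optimize.linprog for the master LP. The paper's contribution, learning cut generation with reinforcement learning, is not reproduced; the objective and tolerance here are arbitrary choices for illustration.
```python
import numpy as np
from scipy.optimize import linprog

f = lambda x: np.sum(x ** 2) + np.sum(x)       # convex objective
grad = lambda x: 2 * x + 1.0
lo, hi, dim = -2.0, 2.0, 3

cuts_a, cuts_b = [], []                        # cuts: t >= f(xk) + g.(x - xk)
x = np.full(dim, hi)
for it in range(30):
    g = grad(x)
    cuts_a.append(np.concatenate([g, [-1.0]])) # as g.x - t <= g.xk - f(xk)
    cuts_b.append(g @ x - f(x))
    # LP variables are (x, t); minimize t subject to all cuts and the box.
    c = np.concatenate([np.zeros(dim), [1.0]])
    bounds = [(lo, hi)] * dim + [(None, None)]
    res = linprog(c, A_ub=np.array(cuts_a), b_ub=np.array(cuts_b),
                  bounds=bounds)
    x, t = res.x[:dim], res.x[dim]
    if f(x) - t < 1e-6:                        # model matches the true value
        break
print("iterations:", it + 1, "x:", x.round(3))
```
Each iteration adds one cut at the current LP solution; the paper's point is that choosing cuts more cleverly (via a learned surrogate) can sharply reduce the number of such iterations.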
arXiv Detail & Related papers (2023-07-17T20:11:56Z)
- Bringing regularized optimal transport to lightspeed: a splitting method adapted for GPUs [9.297785393486976]
We present an efficient algorithm for regularized optimal transport.
In contrast to previous methods, we use the Douglas-Rachford splitting technique to develop an efficient solver that can handle a broad class of regularizers.
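As a reference point for the splitting technique, here is a minimal Douglas-Rachford iteration on a small lasso problem, where both proximal operators are cheap. The optimal-transport-specific proxes and the GPU implementation from the paper are not reproduced; the problem and step size below are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 100))
x_true = np.zeros(100); x_true[:5] = rng.normal(size=5)
b = A @ x_true + 0.01 * rng.normal(size=40)
lam, t = 0.1, 1.0

# prox of f(x) = 0.5||Ax - b||^2: solve (I + t A^T A) x = z + t A^T b.
M = np.eye(100) + t * A.T @ A
prox_f = lambda z: np.linalg.solve(M, z + t * A.T @ b)
# prox of g(x) = lam ||x||_1: soft-thresholding.
prox_g = lambda z: np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)

z = np.zeros(100)
for _ in range(500):
    x = prox_f(z)
    y = prox_g(2 * x - z)     # reflect through x, then prox of second term
    z = z + y - x             # Douglas-Rachford update of the driver sequence
print("nonzeros recovered:", np.sum(np.abs(prox_f(z)) > 1e-3))
```
Douglas-Rachford only requires the two proxes separately, which is why it accommodates a broad class of regularizers, the property the paper leverages.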
arXiv Detail & Related papers (2023-05-29T12:04:55Z)
- Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
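A minimal sketch of the batched-QR idea, assuming a recent NumPy (batched np.linalg.qr, available since 1.22): plain unshifted QR iteration applied to a whole stack of small symmetric matrices at once. The refinements that make this practical (shifts, deflation, GPU batching) are the paper's subject and are omitted here.
```python
import numpy as np

def batched_qr_eig(A, iters=200):
    # A: (batch, n, n) symmetric. Unshifted QR iteration A_{k+1} = R_k Q_k
    # drives A toward a diagonal of eigenvalues; accumulating the Q factors
    # yields the eigenvectors. Convergence is slow when two eigenvalues
    # have nearly equal magnitude (shifts fix this in practice).
    V = np.broadcast_to(np.eye(A.shape[-1]), A.shape).copy()
    for _ in range(iters):
        Q, R = np.linalg.qr(A)                    # batched QR factorization
        A = R @ Q
        V = V @ Q
    return np.diagonal(A, axis1=-2, axis2=-1), V

rng = np.random.default_rng(0)
B = rng.normal(size=(64, 8, 8))
S = (B + B.transpose(0, 2, 1)) / 2                # batch of symmetric matrices
w, V = batched_qr_eig(S)
w_ref = np.linalg.eigvalsh(S)
print("max eigenvalue error:", np.abs(np.sort(w, -1) - w_ref).max())
```
Processing the whole batch in one set of array operations, rather than looping over matrices, is what keeps small-matrix ED efficient on parallel hardware.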
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
- On a class of geodesically convex optimization problems solved via Euclidean MM methods [50.428784381385164]
We show how a class of geodesically convex problems can be rewritten as a difference of Euclidean convex functions, a structure shared by several problems in statistics and machine learning.
Ultimately, we hope this broadens the range of problems that simple Euclidean MM methods can address.
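To illustrate the Euclidean MM step on a difference-of-convex objective, the toy below minimizes u(x) - v(x) with u a quadratic and v a scaled l1 norm: since v is convex, its linearization at the current iterate is a global under-estimator, so the resulting convex surrogate can be minimized in closed form. The geodesic-convexity analysis that is the paper's actual subject is not reproduced.
```python
import numpy as np

# Example: u(x) = 0.5||x - a||^2 and v(x) = lam * ||x||_1, both convex.
a = np.array([2.0, -1.0, 0.3])
lam = 0.5

x = np.ones(3)
for _ in range(50):
    s = np.sign(x)            # subgradient of ||x||_1 at the current iterate
    # Surrogate: 0.5||x - a||^2 - lam * s.x, minimized at a + lam * s.
    x = a + lam * s           # each MM step cannot increase u(x) - v(x)
print("MM fixed point:", x)
```
The descent guarantee comes purely from the majorization; no step size tuning is needed, which is what makes MM schemes attractive for these reformulated problems.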
arXiv Detail & Related papers (2022-06-22T23:57:40Z)
- Efficient GPU implementation of randomized SVD and its applications [17.71779625877989]
Matrix decompositions are ubiquitous in machine learning, with applications in dimensionality reduction, data compression and deep learning algorithms.
Typical solutions for matrix decompositions have high computational complexity, which significantly increases their cost and running time.
We leverage efficient processing operations that can be run in parallel on modern Graphical Processing Units (GPUs) to reduce the computational burden of computing matrix decompositions.
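The underlying algorithm is the standard randomized SVD (random sketching, power iterations, then an exact SVD of a small projected matrix); a minimal CPU version in NumPy follows. The paper's contribution is the efficient GPU implementation, which this sketch does not attempt; rank, oversampling and iteration counts are illustrative.
```python
import numpy as np

def randomized_svd(A, rank, oversample=10, power_iters=2, seed=0):
    rng = np.random.default_rng(seed)
    # Sample the range of A with a Gaussian test matrix.
    Q = A @ rng.normal(size=(A.shape[1], rank + oversample))
    for _ in range(power_iters):
        Q = A @ (A.T @ Q)             # power iterations sharpen the spectrum
    Q, _ = np.linalg.qr(Q)            # orthonormal basis of the sampled range
    # Project into the small subspace and decompose there exactly.
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank]

rng = np.random.default_rng(1)
A = rng.normal(size=(2000, 50)) @ rng.normal(size=(50, 800))  # rank-50 matrix
U, s, Vt = randomized_svd(A, rank=50)
err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
print("relative reconstruction error:", err)
```
Every step is built from matrix multiplications and small dense factorizations, which is precisely why the method parallelizes so well on GPUs.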
arXiv Detail & Related papers (2021-10-05T07:42:41Z)
- Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how the discovered summaries help with steering this specific process to cut costs and reduce the production of defective parts.
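For intuition about exemplar-based summarization, the sketch below runs a greedy facility-location selection: each step adds the point that most improves the dataset's best-similarity coverage. It is a small CPU illustration with an assumed RBF similarity, not the paper's GPU implementation or its half-precision tricks.
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
# RBF similarity in (0, 1]; sim[i, j] = similarity of point i to candidate j.
sim = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

k, chosen = 5, []
best = np.zeros(len(X))                      # best coverage achieved so far
for _ in range(k):
    # Marginal gain of each candidate: total improvement in coverage if it
    # were added to the current exemplar set (greedy submodular step).
    gains = np.maximum(sim, best[:, None]).sum(0) - best.sum()
    j = int(np.argmax(gains))
    chosen.append(j)
    best = np.maximum(best, sim[:, j])       # update coverage with exemplar j
print("chosen exemplars:", chosen)
```
The gain computation is a dense elementwise maximum plus a reduction over the whole similarity matrix, which explains why GPU execution, especially in reduced precision, yields the large speedups the paper reports.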
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
- Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point using an optimal number of stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
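One way to read the idea: stale gradients held in the momentum buffer can be transported to the current iterate with a Hessian-vector product, since by Taylor expansion g(x_old) + H (x_new - x_old) approximates g(x_new). The sketch below applies this on a quadratic where the Hessian-vector product is exact; it is a plausible illustration of the principle, not the paper's exact algorithm or analysis.
```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(20, 20)); Q = Q @ Q.T / 20 + np.eye(20)  # SPD Hessian
b = rng.normal(size=20)
grad = lambda x: Q @ x - b        # gradient of 0.5 x.Q.x - b.x
hvp = lambda v: Q @ v             # Hessian-vector product (exact here)

x = np.zeros(20); x_prev = np.zeros(20); m = np.zeros(20)
lr, beta = 0.05, 0.9
for _ in range(200):
    # Transport the old momentum to the current iterate before mixing in
    # the fresh gradient, so the buffer averages *current* gradients
    # rather than stale ones.
    m = beta * (m + hvp(x - x_prev)) + (1 - beta) * grad(x)
    x_prev = x
    x = x - lr * m
print("distance to optimum:", np.linalg.norm(x - np.linalg.solve(Q, b)))
```
In practice the Hessian-vector product would be computed stochastically via automatic differentiation at roughly the cost of one extra gradient.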
arXiv Detail & Related papers (2021-03-04T19:01:20Z)
- Systolic Computing on GPUs for Productive Performance [2.8064596842326575]
We propose a language and compiler to productively build high-performance systolic arrays that run on GPUs.
A programmer specifies a projection of a dataflow compute onto a linear systolic array, while leaving the detailed implementation of the projection to a compiler.
The compiler implements the specified projection and maps the linear systolic array to the SIMD execution units and vector registers of GPUs.
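To make the projection idea concrete, here is a toy CPU simulation of a 1-D systolic array computing a matrix-vector product: each processing element (PE) holds one input entry, partial sums move one PE per clock tick, and matrix rows are fed in skewed by one tick per PE. The proposed language/compiler and the GPU mapping are not reproduced; this only illustrates the dataflow being projected.
```python
import numpy as np

def systolic_matvec(W, x):
    n_rows, n = W.shape
    acc = np.zeros(n)                   # partial sums inside the pipeline
    out = []
    for t in range(n_rows + n):         # pipeline fill + drain time
        # The rightmost PE emits one finished row sum per tick once full.
        if t >= n:
            out.append(acc[-1])
        # Shift partial sums one PE to the right each clock tick.
        acc[1:] = acc[:-1]
        acc[0] = 0.0
        # PE j consumes W[t - j, j] at tick t (skewed feeding schedule).
        for j in range(n):
            r = t - j
            if 0 <= r < n_rows:
                acc[j] += W[r, j] * x[j]
    return np.array(out)

rng = np.random.default_rng(0)
W, x = rng.normal(size=(5, 4)), rng.normal(size=4)
print("systolic:", systolic_matvec(W, x).round(4))
print("numpy   :", (W @ x).round(4))
```
Mapping each PE to a SIMD lane and the shift to register-to-register movement is, roughly, the compilation strategy the paper describes.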
arXiv Detail & Related papers (2020-10-29T18:49:54Z)
- Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms [1.3249453757295084]
We study training algorithms for deep learning on heterogeneous CPU+GPU architectures.
Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging.
We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization than existing approaches on several real datasets.
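As a simplified, single-process illustration of the trade-off, the sketch below splits each mini-batch between two simulated workers in proportion to assumed relative throughputs and averages their local SGD steps. The paper's actual asynchronous algorithms and CPU+GPU implementation are not reproduced; the speeds, learning rate and averaging rule are assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4000, 10))
w_true = rng.normal(size=10)
y = A @ w_true + 0.01 * rng.normal(size=4000)

speeds = {"cpu": 1.0, "gpu": 7.0}               # assumed relative throughputs
shares = {k: v / sum(speeds.values()) for k, v in speeds.items()}
w = np.zeros(10)
for _ in range(100):
    batch = rng.choice(len(A), size=512, replace=False)
    models, start = [], 0
    for name, share in shares.items():
        # Each worker gets a slice of the batch sized to its throughput,
        # so the fast and slow devices finish at about the same time.
        stop = start + int(share * len(batch))
        idx = batch[start:stop]; start = stop
        g = A[idx].T @ (A[idx] @ w - y[idx]) / len(idx)
        models.append(w - 0.05 * g)             # one local SGD step
    w = np.mean(models, axis=0)                 # synchronous model average
print("error vs true weights:", np.linalg.norm(w - w_true))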
arXiv Detail & Related papers (2020-04-19T05:21:20Z)