Large-Scale Gaussian Processes via Alternating Projection
- URL: http://arxiv.org/abs/2310.17137v2
- Date: Fri, 8 Mar 2024 21:25:44 GMT
- Title: Large-Scale Gaussian Processes via Alternating Projection
- Authors: Kaiwen Wu, Jonathan Wenger, Haydn Jones, Geoff Pleiss, Jacob R.
Gardner
- Abstract summary: We propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling mini-batching.
Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets.
- Score: 23.79090469387859
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training and inference in Gaussian processes (GPs) require solving linear
systems with $n\times n$ kernel matrices. To address the prohibitive
$\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative
methods, like conjugate gradients (CG). However, as datasets increase in
magnitude, the kernel matrices become increasingly ill-conditioned and still
require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases
the size of datasets GPs can be trained on, modern datasets reach scales beyond
its applicability. In this work, we propose an iterative method which only
accesses subblocks of the kernel matrix, effectively enabling mini-batching.
Our algorithm, based on alternating projection, has $\mathcal{O}(n)$
per-iteration time and space complexity, solving many of the practical
challenges of scaling GPs to very large datasets. Theoretically, we prove the
method enjoys linear convergence. Empirically, we demonstrate its fast
convergence in practice and robustness to ill-conditioning. On large-scale
benchmark datasets with up to four million data points, our approach
accelerates GP training and inference by speed-up factors up to $27\times$ and
$72 \times$, respectively, compared to CG.
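The linear-algebra core of the approach is straightforward to sketch. The snippet below is a minimal illustration, not the authors' implementation: it solves the GP training system $(K + \sigma^2 I)v = y$ with block-wise updates that materialize only one row-block of the kernel at a time, in the spirit of the alternating-projection scheme; the `rbf_kernel`, block size, sweep schedule, and exact update rule are placeholder choices of my own.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel between row-wise point sets A and B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def block_solve(X, y, noise=1e-2, block_size=256, epochs=20, seed=0):
    # Approximately solve (K + noise * I) v = y while touching only one
    # kernel row-block per step (a block Gauss-Seidel / alternating-projection
    # style sweep), so memory stays O(n * block_size).
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    v = np.zeros(n)
    blocks = [np.arange(i, min(i + block_size, n)) for i in range(0, n, block_size)]
    for _ in range(epochs):
        for b in rng.permutation(len(blocks)):
            idx = blocks[b]
            K_row = rbf_kernel(X[idx], X)                # one row-block of K
            resid = y[idx] - K_row @ v - noise * v[idx]  # residual on this block
            K_bb = K_row[:, idx] + noise * np.eye(len(idx))
            v[idx] += np.linalg.solve(K_bb, resid)       # project onto block equations
    return v

# Usage: the GP posterior mean at test points X_test is rbf_kernel(X_test, X) @ v.
X = np.random.randn(2000, 3)
y = np.sin(X.sum(-1)) + 0.1 * np.random.randn(2000)
v = block_solve(X, y)
```

With a constant block size, each block update costs $\mathcal{O}(n)$ time and memory, matching the per-iteration scaling described in the abstract; the paper's block selection, update rule, and convergence analysis are more involved than this sketch.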
Related papers
- Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Sparse Kernel Gaussian Processes through Iterative Charted Refinement
(ICR) [0.0]
We present a new, generative method named Iterative Charted Refinement (ICR) to model Gaussian Processes.
ICR represents long- and short-range correlations by combining views of the modeled locations at varying resolutions with a user-provided coordinate chart.
ICR outperforms existing methods in terms of computational speed by one order of magnitude on the CPU and GPU.
arXiv Detail & Related papers (2022-06-21T18:00:01Z) - Scalable First-Order Bayesian Optimization via Structured Automatic
Differentiation [4.061135251278187]
We show that a wide range of kernels gives rise to structured matrices, enabling an exact $\mathcal{O}(n^2d)$ matrix-vector multiply for gradient observations and $\mathcal{O}(n^2d^2)$ for Hessian observations.
Our methods apply to virtually all canonical kernels and automatically extend to complex kernels, like the neural network, radial basis function network, and spectral mixture kernels.
arXiv Detail & Related papers (2022-06-16T17:59:48Z) - The Fast Kernel Transform [21.001203328543006]
We propose the Fast Kernel Transform (FKT), a general algorithm to compute matrix-vector multiplications for datasets in moderate dimensions with quasilinear complexity.
The FKT is easily applied to a broad class of kernels, including Gaussian, Matérn, and Rational Quadratic covariance functions and physically motivated Green's functions.
We illustrate the efficacy and versatility of the FKT by providing timing and accuracy benchmarks and by applying it to scale stochastic neighbor embedding (t-SNE) and Gaussian processes to large real-world data sets.
arXiv Detail & Related papers (2021-06-08T16:15:47Z) - SigGPDE: Scaling Sparse Gaussian Processes on Sequential Data [16.463077353773603]
We develop SigGPDE, a new scalable sparse variational inference framework for Gaussian Processes (GPs) on sequential data.
We show that the gradients of the GP signature kernel are solutions of a hyperbolic partial differential equation (PDE).
This theoretical insight allows us to build an efficient back-propagation algorithm to optimize the ELBO.
arXiv Detail & Related papers (2021-05-10T09:10:17Z) - Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex optimization that finds an $\epsilon$-critical point using an optimal number of stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z) - Faster Kernel Interpolation for Gaussian Processes [30.04235162264955]
A key challenge in scaling Gaussian process (GP) regression to massive datasets is that exact inference requires computation with a dense $n \times n$ kernel matrix.
Structured kernel interpolation (SKI) is among the most scalable methods.
We show that the per-iteration cost of SKI can be reduced to $O(m \log m)$ after a single $O(n)$-time precomputation step.
We demonstrate speedups in practice for a wide range of $m$ and $n$ and apply the method to GP inference on a three-dimensional weather radar dataset with over 100 million points (a minimal interpolation-based matrix-vector multiply sketch appears after this list).
arXiv Detail & Related papers (2021-01-28T00:09:22Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an $\mathcal{O}\big(1/\sqrt{n}\big)$ optimization error, which is of the order of the intrinsic excess error bound of a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far have been hard to use in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
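For the structured kernel interpolation (SKI) entry above, the fast matrix-vector multiply it builds on is easy to illustrate. The following is a minimal 1D sketch under assumptions of my own (a regular grid, linear interpolation weights, a dense grid kernel), not the paper's code: the kernel matrix is approximated as $W K_{ZZ} W^\top$ with a sparse interpolation matrix $W$, and the quoted $O(m \log m)$ costs additionally rely on Toeplitz/FFT structure of $K_{ZZ}$ that this sketch does not exploit.

```python
import numpy as np
from scipy.sparse import csr_matrix

def rbf(a, b, lengthscale=1.0):
    # Squared-exponential kernel between 1D point sets a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def interp_weights(x, grid):
    # Sparse linear-interpolation weights mapping grid values to points x
    # (two nonzeros per row); assumes a regular grid covering the data.
    h = grid[1] - grid[0]
    lo = np.clip(((x - grid[0]) // h).astype(int), 0, len(grid) - 2)
    frac = (x - grid[lo]) / h
    rows = np.repeat(np.arange(len(x)), 2)
    cols = np.stack([lo, lo + 1], axis=1).ravel()
    vals = np.stack([1.0 - frac, frac], axis=1).ravel()
    return csr_matrix((vals, (rows, cols)), shape=(len(x), len(grid)))

def ski_matvec(x, grid, v, lengthscale=1.0):
    # Approximate K_xx @ v as W @ (K_zz @ (W.T @ v)).
    W = interp_weights(x, grid)
    K_zz = rbf(grid, grid, lengthscale)
    return W @ (K_zz @ (W.T @ v))

# Usage on a small toy problem.
x = np.sort(np.random.rand(5000))
grid = np.linspace(0.0, 1.0, 200)
v = np.random.randn(5000)
approx = ski_matvec(x, grid, v)
```

In a real SKI/KISS-GP implementation the grid kernel $K_{ZZ}$ is never formed densely; its Toeplitz (or Kronecker) structure lets the inner multiply run via FFTs, which is where the $O(m \log m)$ per-iteration cost comes from.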
This list is automatically generated from the titles and abstracts of the papers on this site.