On the Nystrom Approximation for Preconditioning in Kernel Machines
- URL: http://arxiv.org/abs/2312.03311v4
- Date: Wed, 24 Jan 2024 19:28:43 GMT
- Title: On the Nystrom Approximation for Preconditioning in Kernel Machines
- Authors: Amirhesam Abedsoltan, Parthe Pandit, Luis Rademacher, Mikhail Belkin
- Abstract summary: A Nystrom approximation of the spectral preconditioner is often cheaper to compute and store, and has demonstrated success in practical applications.
Specifically, we show that a sample of logarithmic size enables the Nystrom-based approximated preconditioner to accelerate gradient descent nearly as well as the exact preconditioner.
- Score: 13.085943975157985
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Kernel methods are a popular class of nonlinear predictive models in machine
learning. Scalable algorithms for learning kernel models need to be iterative
in nature, but convergence can be slow due to poor conditioning. Spectral
preconditioning is an important tool to speed up the convergence of such
iterative algorithms for training kernel models. However, computing and storing
a spectral preconditioner can be expensive, incurring large computational and
storage overheads that preclude the application of kernel methods to problems
with large datasets. A Nystrom approximation of the
spectral preconditioner is often cheaper to compute and store, and has
demonstrated success in practical applications. In this paper we analyze the
trade-offs of using such an approximated preconditioner. Specifically, we show
that a sample of logarithmic size (as a function of the size of the dataset)
enables the Nystrom-based approximated preconditioner to accelerate gradient
descent nearly as well as the exact preconditioner, while also reducing the
computational and storage overheads.
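
A minimal sketch of this idea (not the authors' code; the Gaussian kernel, uniform subsampling, step size, and the sample size m and eigendirection count q below are illustrative assumptions): estimate the top eigenpairs of the kernel matrix from a small subsample via the Nystrom extension, build the spectral preconditioner from them, and run preconditioned Richardson iteration on the kernel system K a = y.

```python
# Minimal sketch (not the authors' code): a Nystrom-approximated spectral
# preconditioner for the kernel system K a = y, solved by preconditioned
# Richardson iteration. Kernel choice, subsample size m, number of damped
# eigendirections q, and the step size are illustrative assumptions.
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def nystrom_top_eigs(X, m, q, bandwidth=1.0, seed=0):
    """Approximate the top q+1 eigenpairs of the n x n kernel matrix
    from an m-point uniform subsample (Nystrom extension)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=False)
    K_mm = gaussian_kernel(X[idx], X[idx], bandwidth)
    K_nm = gaussian_kernel(X, X[idx], bandwidth)
    lam_m, V = np.linalg.eigh(K_mm)                 # ascending order
    lam_m, V = lam_m[::-1][:q + 1], V[:, ::-1][:, :q + 1]
    lam = (n / m) * lam_m                           # scaled eigenvalue estimates
    U = np.sqrt(m / n) * K_nm @ (V / lam_m)         # extended (approx. orthonormal) eigenvectors
    return lam, U

def preconditioned_richardson(X, y, q=20, m=200, bandwidth=1.0, steps=100):
    """Solve K a = y with the approximate spectral preconditioner
    P = I - U_q diag(1 - lam_{q+1}/lam_i) U_q^T."""
    n = X.shape[0]
    m = min(m, n)
    K = gaussian_kernel(X, X, bandwidth)
    lam, U = nystrom_top_eigs(X, m, q, bandwidth)
    damp = 1.0 - lam[q] / lam[:q]                   # damping of the top-q modes
    Uq = U[:, :q]
    lr = 1.0 / lam[q]                               # step size set by the (q+1)-th eigenvalue
    a = np.zeros(n)
    for _ in range(steps):
        g = K @ a - y                               # residual of the kernel system
        a -= lr * (g - Uq @ (damp * (Uq.T @ g)))    # preconditioned update
    return a
```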
Related papers
- Fast training of large kernel models with delayed projections [14.459817519150997]
We present a new methodology for building kernel machines that can scale efficiently with both data size and model size.
Our algorithm introduces delayed projections to Preconditioned Stochastic Gradient Descent (PSGD), allowing the training of much larger models than was previously feasible.
We validate our algorithm, EigenPro4, demonstrating a drastic training speed-up over existing methods while maintaining comparable or better classification accuracy.
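
A hedged sketch of the preconditioned stochastic gradient step this line of work builds on, reusing the Uq (approximate top-q eigenvectors) and damp (damping factors) arrays as built in the sketch above; EigenPro4's delayed-projection schedule and model-size scaling are not reproduced here.

```python
# Hedged sketch of one preconditioned *stochastic* gradient step on the
# least-squares objective 0.5 * ||K a - y||^2, reusing Uq and damp from the
# Nystrom preconditioner sketch above. EigenPro4's delayed-projection
# mechanism is NOT reproduced; this only shows a generic PSGD update.
import numpy as np

def psgd_step(K, y, a, Uq, damp, lr, batch_idx):
    """One mini-batch preconditioned SGD update for 0.5 * ||K a - y||^2."""
    n = K.shape[0]
    Kb = K[batch_idx]                                          # kernel rows for the batch
    g = (n / len(batch_idx)) * Kb.T @ (Kb @ a - y[batch_idx])  # unbiased gradient estimate
    g -= Uq @ (damp * (Uq.T @ g))                              # apply the spectral preconditioner
    return a - lr * g
```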
arXiv Detail & Related papers (2024-11-25T18:42:13Z) - Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference [55.150117654242706]
We show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU.
As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty.
arXiv Detail & Related papers (2024-11-01T21:11:48Z) - Learning from Linear Algebra: A Graph Neural Network Approach to Preconditioner Design for Conjugate Gradient Solvers [42.69799418639716]
Deep learning models may be used to precondition residuals during the iterations of linear solvers such as the conjugate gradient (CG) method.
Neural network models require an enormous number of parameters to approximate well in this setup.
In our work, we recall well-established preconditioners from linear algebra and use them as a starting point for training the GNN.
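
For context, a minimal preconditioned conjugate gradient (PCG) loop with a classical Jacobi (diagonal) preconditioner, the kind of well-established baseline a learned GNN preconditioner would start from; purely illustrative, not the paper's model.

```python
# Illustrative preconditioned conjugate gradient (PCG) with a Jacobi (diagonal)
# preconditioner; M_inv applies the preconditioner to a residual vector.
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, max_iter=500):
    """Solve A x = b for symmetric positive definite A."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)                          # preconditioned residual
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Example: an ill-conditioned SPD system with a Jacobi preconditioner.
A = np.diag(np.linspace(1.0, 1e4, 200)) + 1e-3 * np.ones((200, 200))
b = np.random.default_rng(0).standard_normal(200)
x = pcg(A, b, M_inv=lambda r: r / np.diag(A))
```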
arXiv Detail & Related papers (2024-05-24T13:44:30Z) - Gaussian Process Regression under Computational and Epistemic Misspecification [4.5656369638728656]
In large data applications, computational costs can be reduced using low-rank or sparse approximations of the kernel.
This paper investigates the effect of such kernel approximations on the element error.
arXiv Detail & Related papers (2023-12-14T18:53:32Z) - Equation Discovery with Bayesian Spike-and-Slab Priors and Efficient Kernels [57.46832672991433]
We propose a novel equation discovery method based on Kernel learning and BAyesian Spike-and-Slab priors (KBASS).
We use kernel regression to estimate the target function, which is flexible, expressive, and more robust to data sparsity and noise.
We develop an expectation-propagation expectation-maximization algorithm for efficient posterior inference and function estimation.
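
To illustrate only the kernel-regression building block (the spike-and-slab priors and the EP-EM inference are not shown), here is a plain kernel ridge regression sketch with an assumed RBF kernel and regularization.

```python
# Plain kernel ridge regression with an assumed RBF kernel, illustrating the
# "kernel regression" estimate of the target function only; KBASS's
# spike-and-slab priors and EP-EM inference are not shown.
import numpy as np

def rbf(X, Z, bandwidth=0.5):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def kernel_ridge_fit(X, y, reg=1e-3, bandwidth=0.5):
    """Return a predictor fitted by kernel ridge regression."""
    K = rbf(X, X, bandwidth)
    alpha = np.linalg.solve(K + reg * np.eye(len(X)), y)
    return lambda X_new: rbf(X_new, X, bandwidth) @ alpha

# Sparse, noisy observations of a smooth target function.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)
f_hat = kernel_ridge_fit(X, y)
```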
arXiv Detail & Related papers (2023-10-09T03:55:09Z) - Randomized Polar Codes for Anytime Distributed Machine Learning [66.46612460837147]
We present a novel distributed computing framework that is robust to slow compute nodes, and is capable of both approximate and exact computation of linear operations.
We propose a sequential decoding algorithm designed to handle real-valued data while maintaining low computational complexity for recovery.
We demonstrate the potential applications of this framework in various contexts, such as large-scale matrix multiplication and black-box optimization.
arXiv Detail & Related papers (2023-09-01T18:02:04Z) - Reconstructing Kernel-based Machine Learning Force Fields with
Super-linear Convergence [0.18416014644193063]
We consider the broad class of Nyström-type methods to construct preconditioners.
All considered methods aim to identify a representative subset of inducing (kernel) columns to approximate the dominant kernel spectrum.
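
A small sketch of a column-based Nyström approximation, K ≈ K[:, S] K[S, S]^+ K[S, :], with uniform column selection standing in as a placeholder for the selection rules such methods compare.

```python
# Column-based Nystrom approximation K ≈ K[:, S] K[S, S]^+ K[S, :], returned
# as a factor L with K ≈ L @ L.T. Uniform column sampling is an assumption;
# the paper compares several rules for choosing the inducing columns.
import numpy as np

def nystrom_low_rank(K, m, seed=0):
    """Low-rank factor from m uniformly sampled kernel columns."""
    rng = np.random.default_rng(seed)
    S = rng.choice(K.shape[0], size=m, replace=False)
    C = K[:, S]                          # selected (inducing) columns
    W = K[np.ix_(S, S)]                  # core block on the subsample
    lam, V = np.linalg.eigh(W)
    lam = np.clip(lam, 1e-12, None)      # guard against near-zero eigenvalues
    return C @ (V / np.sqrt(lam))        # L = C W^{-1/2}
```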
arXiv Detail & Related papers (2022-12-24T13:45:50Z) - Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
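
As a hedged illustration of random feature approximation in general: random Fourier features for an RBF kernel, so kernel values become inner products of explicit features. The paper's RFA targets the NNGP kernel of a neural architecture; the RBF kernel here is only for illustration.

```python
# Random Fourier features for an RBF kernel: k(x, z) ≈ phi(x) @ phi(z).
# The paper's random feature approximation targets the NNGP kernel of a
# neural architecture; the RBF kernel here is only for illustration.
import numpy as np

def random_fourier_features(X, num_features=2048, bandwidth=1.0, seed=0):
    """Map X (n x d) to features whose inner products approximate an RBF kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, num_features)) / bandwidth   # frequencies ~ N(0, 1/bandwidth^2)
    b = rng.uniform(0, 2 * np.pi, num_features)              # random phases
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

# Kernel evaluations then reduce to cheap inner products of explicit features.
X = np.random.default_rng(1).standard_normal((100, 5))
Phi = random_fourier_features(X)
K_approx = Phi @ Phi.T                   # ≈ RBF kernel matrix with the given bandwidth
```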
arXiv Detail & Related papers (2022-10-21T15:56:13Z) - FaDIn: Fast Discretized Inference for Hawkes Processes with General
Parametric Kernels [82.53569355337586]
This work offers an efficient solution to temporal point processes inference using general parametric kernels with finite support.
The method's effectiveness is evaluated by modeling the occurrence of stimuli-induced patterns from brain signals recorded with magnetoencephalography (MEG).
Results show that the proposed approach leads to improved estimation of pattern latency compared to the state-of-the-art.
arXiv Detail & Related papers (2022-10-10T12:35:02Z) - Learning "best" kernels from data in Gaussian process regression. With
application to aerodynamics [0.4588028371034406]
We introduce algorithms to select/design kernels in Gaussian process regression/kriging surrogate modeling techniques.
A first class of algorithms is kernel flow, which was introduced in the context of classification in machine learning.
A second class of algorithms is called spectral kernel ridge regression, and aims at selecting a "best" kernel such that the norm of the function to be approximated is minimal.
arXiv Detail & Related papers (2022-06-03T07:50:54Z) - Reducing the Variance of Gaussian Process Hyperparameter Optimization
with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication.
We prove that preconditioning has an additional benefit that has been previously unexplored.
It can simultaneously reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z)