Representing Additive Gaussian Processes by Sparse Matrices
- URL: http://arxiv.org/abs/2305.00324v1
- Date: Sat, 29 Apr 2023 18:53:42 GMT
- Title: Representing Additive Gaussian Processes by Sparse Matrices
- Authors: Lu Zou, Haoyuan Chen, Liang Ding
- Abstract summary: Additive Mat\'ern Gaussian Processes (GPs) are among the most popular models for scalable high-dimensional problems.
Back-fitting algorithms can reduce the time complexity of computing the posterior mean from $O(n^3)$ to $O(n \log n)$ time.
Generalizing these algorithms to efficiently compute the posterior variance and maximum log-likelihood remains an open problem.
- Score: 18.618437338490487
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Among generalized additive models, additive Mat\'ern Gaussian Processes (GPs)
are one of the most popular for scalable high-dimensional problems. Thanks to
their additive structure and stochastic differential equation representation,
back-fitting-based algorithms can reduce the time complexity of computing the
posterior mean from $O(n^3)$ to $O(n\log n)$ time where $n$ is the data size.
However, generalizing these algorithms to efficiently compute the posterior
variance and maximum log-likelihood remains an open problem. In this study, we
demonstrate that for Additive Mat\'ern GPs, not only the posterior mean, but
also the posterior variance, log-likelihood, and gradient of these three
functions can be represented by formulas involving only sparse matrices and
sparse vectors. We show how to use these sparse formulas to generalize
back-fitting-based algorithms to efficiently compute the posterior mean,
posterior variance, log-likelihood, and gradient of these three functions for
additive GPs, all in $O(n \log n)$ time. We apply our algorithms to Bayesian
optimization and propose efficient algorithms for posterior updates,
hyperparameters learning, and computations of the acquisition function and its
gradient in Bayesian optimization. Given the posterior, our algorithms
significantly reduce the time complexity of computing the acquisition function
and its gradient from $O(n^2)$ to $O(\log n)$ for general learning rate, and
even to $O(1)$ for small learning rate.
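The sparse formulas above trade on the state-space (SDE) view of Mat\'ern kernels: restricted to sorted one-dimensional inputs, a Mat\'ern GP is Markovian, so its precision matrix is banded and posterior quantities reduce to banded linear algebra. The sketch below is not the authors' algorithm; it is a minimal illustration of that principle under the simplest assumption, a single Mat\'ern-1/2 (Ornstein-Uhlenbeck) component, using a tridiagonal precision matrix and sparse solves for the posterior mean and pointwise variance. All function names and parameter values are ours.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import splu

def ou_precision(x, lengthscale=1.0, variance=1.0):
    """Tridiagonal precision (inverse covariance) of a Matern-1/2 /
    Ornstein-Uhlenbeck GP at sorted, distinct 1-D inputs x.
    The Markov (SDE) structure of the kernel is what makes it banded."""
    n = len(x)
    rho = np.exp(-np.diff(x) / lengthscale)     # correlations of neighbouring points
    a = rho**2 / (1.0 - rho**2)
    main = np.ones(n)
    main[:-1] += a                              # contribution from the right neighbour
    main[1:] += a                               # contribution from the left neighbour
    off = -rho / (1.0 - rho**2)
    Q = sparse.diags([off, main, off], offsets=[-1, 0, 1], format="csc")
    return Q / variance

def posterior_mean_var(x, y, noise_var, **kernel_args):
    """Posterior mean at the training inputs and pointwise posterior variance,
    computed with sparse (banded) linear algebra only."""
    n = len(x)
    Q = ou_precision(x, **kernel_args)
    Q_post = (Q + sparse.eye(n) / noise_var).tocsc()   # posterior precision
    lu = splu(Q_post)                                  # banded LU, ~O(n) here
    mean = lu.solve(y / noise_var)
    # Pointwise variance = diagonal of Q_post^{-1}, recovered column by column
    # for clarity; selected-inverse recursions would give it in linear time.
    var = np.empty(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = 1.0
        var[i] = lu.solve(e)[i]
    return mean, var

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0.0, 10.0, size=500))
    y = np.sin(x) + 0.3 * rng.standard_normal(x.size)
    mean, var = posterior_mean_var(x, y, noise_var=0.09, lengthscale=1.5)
    print(mean[:3], var[:3])
```

For an additive model over several input coordinates, a back-fitting loop would cycle one-dimensional solves of this kind over the components; the paper's contribution is to extend such sparse representations to the posterior variance, the log-likelihood, and their gradients in $O(n \log n)$ time, and to the acquisition-function updates used in Bayesian optimization.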
Related papers
- Inverting the Leverage Score Gradient: An Efficient Approximate Newton Method [10.742859956268655]
This paper aims to recover the intrinsic model parameters given the leverage scores gradient.
We specifically scrutinize the inversion of the leverage score gradient, denoted as $g(x)$.
arXiv Detail & Related papers (2024-08-21T01:39:42Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics (ARC) have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously provide inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon > 0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Kernel Packet: An Exact and Scalable Algorithm for Gaussian Process
Regression with Mat\'ern Correlations [23.560067934682294]
We develop an exact and scalable algorithm for one-dimensional Gaussian process regression with Mat\'ern correlations.
The proposed algorithm is significantly superior to the existing alternatives in both the computational time and predictive accuracy.
arXiv Detail & Related papers (2022-03-07T03:30:35Z) - Computing the Newton-step faster than Hessian accumulation [8.147652597876862]
We show that given the computational graph of the function, this bound can be reduced to $O(m\tau^3)$, where $\tau, m$ are the width and size of a tree-decomposition of the graph.
The proposed algorithm generalizes nonlinear optimal-control methods based on LQR to general optimization problems and provides non-trivial gains in iteration-complexity even in cases where the Hessian is dense.
arXiv Detail & Related papers (2021-08-02T11:22:08Z) - Efficient Matrix-Free Approximations of Second-Order Information, with
Applications to Pruning and Optimization [16.96639526117016]
We investigate matrix-free, linear-time approaches for estimating Inverse-Hessian Vector Products (IHVPs).
These algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods.
arXiv Detail & Related papers (2021-07-07T17:01:34Z) - Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point using an optimal number of stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain a generalization bound of order $\mathcal{O}\big(1/\sqrt{n}\big)$, which is at the order of the intrinsic excess error bound of a learning model and is thus nearly optimal.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - Gradient Free Minimax Optimization: Variance Reduction and Faster
Convergence [120.9336529957224]
In this paper, we consider the non-strongly-convex setting of gradient-free minimax optimization, with convergence measured by the magnitude of the gradient.
We show that a novel zeroth-order variance-reduced descent algorithm achieves the best known query complexity.
arXiv Detail & Related papers (2020-06-16T17:55:46Z) - Private Stochastic Convex Optimization: Optimal Rates in Linear Time [74.47681868973598]
We study the problem of minimizing the population loss given i.i.d. samples from a distribution over convex loss functions.
A recent work of Bassily et al. has established the optimal bound on the excess population loss achievable given $n$ samples.
We describe two new techniques for deriving convex optimization algorithms both achieving the optimal bound on excess loss and using $O(\min\{n, n^2/d\})$ gradient computations.
arXiv Detail & Related papers (2020-05-10T19:52:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.