High-Dimensional Gaussian Process Inference with Derivatives
- URL: http://arxiv.org/abs/2102.07542v1
- Date: Mon, 15 Feb 2021 13:24:41 GMT
- Title: High-Dimensional Gaussian Process Inference with Derivatives
- Authors: Filip de Roos, Alexandra Gessner, Philipp Hennig
- Abstract summary: We show that in the low-data regime $N<D$, the Gram matrix can be decomposed in a manner that reduces the cost of inference to $\mathcal{O}(N^2D + (N^2)^3)$.
We demonstrate this potential in a variety of tasks relevant for machine learning, such as optimization and Hamiltonian Monte Carlo with predictive gradients.
- Score: 90.8033626920884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although it is widely known that Gaussian processes can be conditioned on
observations of the gradient, this functionality is of limited use due to the
prohibitive computational cost of $\mathcal{O}(N^3 D^3)$ in data points $N$ and
dimension $D$. The dilemma of gradient observations is that a single one of
them comes at the same cost as $D$ independent function evaluations, so the
latter are often preferred. Careful scrutiny reveals, however, that derivative
observations give rise to highly structured kernel Gram matrices for very
general classes of kernels (inter alia, stationary kernels). We show that in
the low-data regime $N<D$, the Gram matrix can be decomposed in a manner that
reduces the cost of inference to $\mathcal{O}(N^2D + (N^2)^3)$ (i.e., linear in
the number of dimensions) and, in special cases, to $\mathcal{O}(N^2D + N^3)$.
This reduction in complexity opens up new use-cases for inference with
gradients especially in the high-dimensional regime, where the
information-to-cost ratio of gradient observations significantly increases. We
demonstrate this potential in a variety of tasks relevant for machine learning,
such as optimization and Hamiltonian Monte Carlo with predictive gradients.
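To make the structure concrete, here is a minimal sketch (Python/NumPy, not from the paper) for the squared-exponential kernel: each $D \times D$ block of the gradient Gram matrix is a scaled identity plus a rank-one term, $k_{ij}\,(I/\ell^2 - r_{ij} r_{ij}^\top/\ell^4)$ with $r_{ij} = x_i - x_j$, so a matrix-vector product can be evaluated in $\mathcal{O}(N^2 D)$ without ever forming the $ND \times ND$ matrix. The helper names (`rbf`, `grad_gram`, `grad_gram_matvec`) are illustrative; the sketch only shows the kind of kernel structure the paper exploits and is not the authors' decomposition or implementation.

```python
import numpy as np

def rbf(X, Y, ell=1.0):
    """Squared-exponential kernel k(x, y) = exp(-||x - y||^2 / (2 ell^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def grad_gram(X, ell=1.0):
    """Dense ND x ND Gram matrix of gradient observations for the RBF kernel.
    Block (i, j) is d^2 k(x_i, x_j) / dx_i dx_j^T
                  = k_ij (I / ell^2 - r_ij r_ij^T / ell^4),  r_ij = x_i - x_j.
    Forming and factorising this matrix naively costs O(N^2 D^2) memory and
    O((ND)^3) time -- the cost the paper's decomposition avoids."""
    N, D = X.shape
    K = rbf(X, X, ell)
    R = X[:, None, :] - X[None, :, :]                             # (N, N, D)
    blocks = np.eye(D)[None, None] / ell**2 - R[..., :, None] * R[..., None, :] / ell**4
    G = K[..., None, None] * blocks                               # (N, N, D, D)
    return G.transpose(0, 2, 1, 3).reshape(N * D, N * D)

def grad_gram_matvec(X, v, ell=1.0):
    """Apply the same Gram matrix to a vector in O(N^2 D) time and memory,
    using the 'scaled identity + rank one' structure of every block."""
    N, D = X.shape
    V = v.reshape(N, D)
    K = rbf(X, X, ell)
    R = X[:, None, :] - X[None, :, :]                             # (N, N, D)
    out = (K @ V) / ell**2                                        # identity parts
    coeff = K * np.einsum('ijd,jd->ij', R, V)                     # k_ij (r_ij . V_j)
    out -= np.einsum('ij,ijd->id', coeff, R) / ell**4             # rank-one parts
    return out.reshape(-1)

# Low-data, high-dimensional regime: N < D.
rng = np.random.default_rng(0)
N, D = 5, 200
X = rng.standard_normal((N, D))
v = rng.standard_normal(N * D)
assert np.allclose(grad_gram(X) @ v, grad_gram_matvec(X, v))
```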
Related papers
- Highly Adaptive Ridge [84.38107748875144]
We propose a regression method that achieves an $n^{-2/3}$ dimension-free $L^2$ convergence rate in the class of right-continuous functions with square-integrable sectional derivatives.
HAR is exactly kernel ridge regression with a specific data-adaptive kernel based on a saturated zero-order tensor-product spline basis expansion.
We demonstrate empirical performance better than state-of-the-art algorithms, particularly for small datasets.
arXiv Detail & Related papers (2024-10-03T17:06:06Z)
- Exploiting Hankel-Toeplitz Structures for Fast Computation of Kernel Precision Matrices [14.25435308779899]
The Hilbert-space Gaussian Process (HGP) approach offers a hyperparameter-independent basis function approximation for speeding up GP inference.
In this paper, we lower the dominating computational complexity to $\mathcal{O}(NM)$ with no additional approximations.
Our contribution provides a pure speed-up of several existing, widely used GP approximations, without further approximations.
arXiv Detail & Related papers (2024-08-05T09:45:31Z)
- When big data actually are low-rank, or entrywise approximation of certain function-generated matrices [0.0]
We refute an argument made in the literature claiming that, for a specific class of analytic functions, such matrices admit an accurate entrywise approximation of rank that is independent of $m$.
We describe three narrower classes of functions for which $n \times n$ function-generated matrices can be approximated within an entrywise error of order $\varepsilon$ with rank $\mathcal{O}(\log(n)\,\varepsilon^{-2}\,\mathrm{polylog}(\varepsilon^{-1}))$ that is independent of the dimension $m$.
arXiv Detail & Related papers (2024-07-03T16:29:47Z)
- Contraction rates for conjugate gradient and Lanczos approximate posteriors in Gaussian process regression [0.0]
We analyze a class of recently proposed approximation algorithms from the field of probabilistic numerics.
We combine results from the numerical analysis literature with state-of-the-art concentration results for spectra of kernel matrices to obtain minimax contraction rates.
arXiv Detail & Related papers (2024-06-18T14:50:42Z)
- Efficient displacement convex optimization with particle gradient descent [57.88860627977882]
Particle gradient descent is widely used to optimize functions of probability measures.
This paper considers particle gradient descent with a finite number of particles and establishes theoretical guarantees for optimizing functions that are displacement convex in measures.
arXiv Detail & Related papers (2023-02-09T16:35:59Z)
- Scalable First-Order Bayesian Optimization via Structured Automatic Differentiation [4.061135251278187]
We show that a wide range of kernels gives rise to structured matrices, enabling an exact $\mathcal{O}(n^2d)$ matrix-vector multiply for gradient observations and $\mathcal{O}(n^2d^2)$ for Hessian observations.
Our methods apply to virtually all canonical kernels and automatically extend to complex kernels, like the neural network, radial basis function network, and spectral mixture kernels.
arXiv Detail & Related papers (2022-06-16T17:59:48Z)
- Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic gradient Langevin dynamics is one of the most fundamental algorithms for solving non-convex optimization problems.
In this paper, we study two variants of this kind, namely Variance Reduced Langevin Dynamics and Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z)
- Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process.
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
arXiv Detail & Related papers (2021-09-24T14:48:20Z)
- Minimum complexity interpolation in random features models [16.823029377470366]
Kernel methods are heavily affected by the curse of dimensionality.
We show that learning with $\mathcal{F}_p$ norms is tractable via an infinite-dimensional convex problem.
We introduce a proof technique based on uniform concentration in the dual.
arXiv Detail & Related papers (2021-03-30T00:00:02Z)
- Linear Time Sinkhorn Divergences using Positive Features [51.50788603386766]
Solving optimal transport with an entropic regularization requires computing an $n \times n$ kernel matrix that is repeatedly applied to a vector.
We propose to use instead ground costs of the form $c(x,y) = -\log\langle\varphi(x), \varphi(y)\rangle$ where $\varphi$ is a map from the ground space onto the positive orthant $\mathbb{R}^r_+$, with $r \ll n$.
arXiv Detail & Related papers (2020-06-12T10:21:40Z)
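As a small illustration of the last entry's idea (a sketch, not the authors' implementation): with ground cost $c(x,y) = -\log\langle\varphi(x),\varphi(y)\rangle$ and entropic regularization $\varepsilon = 1$, the Gibbs kernel $K = \exp(-c/\varepsilon)$ factors exactly as $\Phi_x \Phi_y^\top$, so each Sinkhorn iteration can apply $K$ and $K^\top$ through the $r$-dimensional factors in $\mathcal{O}(nr)$ time. The feature map `positive_features` below is a hypothetical nonnegative choice made up for the example.

```python
import numpy as np

def positive_features(Z, W, b):
    """A simple nonnegative feature map phi: R^d -> R^r_+ (an illustrative
    choice, not one of the feature maps studied in the paper)."""
    return np.exp(Z @ W + b)

def sinkhorn_factored(Px, Py, a, b, n_iter=500):
    """Sinkhorn iterations with Gibbs kernel K = Px @ Py.T (eps = 1 and
    c(x, y) = -log <phi(x), phi(y)>).  K is only applied through its factors,
    so each iteration costs O(n r) instead of O(n^2)."""
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (Px @ (Py.T @ v))   # u <- a / (K v),    O(n r)
        v = b / (Py @ (Px.T @ u))   # v <- b / (K^T u),  O(n r)
    # The optimal plan is diag(u) K diag(v); it never needs to be formed --
    # its marginals and the transport cost can also be evaluated via the factors.
    return u, v

rng = np.random.default_rng(0)
n, d, r = 2000, 5, 64
X, Y = rng.standard_normal((n, d)), rng.standard_normal((n, d)) + 0.5
W, b0 = 0.1 * rng.standard_normal((d, r)), 0.1 * rng.standard_normal(r)
a = b = np.full(n, 1.0 / n)
Px, Py = positive_features(X, W, b0), positive_features(Y, W, b0)
u, v = sinkhorn_factored(Px, Py, a, b)
row_marginal = u * (Px @ (Py.T @ v))   # ~ a once the iteration has converged
```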