High-Dimensional Gaussian Process Inference with Derivatives
- URL: http://arxiv.org/abs/2102.07542v1
- Date: Mon, 15 Feb 2021 13:24:41 GMT
- Title: High-Dimensional Gaussian Process Inference with Derivatives
- Authors: Filip de Roos, Alexandra Gessner, Philipp Hennig
- Abstract summary: We show that in the low-data regime $N<D$, the Gram matrix can be decomposed in a manner that reduces the cost of inference to $\mathcal{O}(N^2D + (N^2)^3)$.
We demonstrate this potential in a variety of tasks relevant for machine learning, such as optimization and Hamiltonian Monte Carlo with predictive gradients.
- Score: 90.8033626920884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although it is widely known that Gaussian processes can be conditioned on
observations of the gradient, this functionality is of limited use due to the
prohibitive computational cost of $\mathcal{O}(N^3 D^3)$ in data points $N$ and
dimension $D$. The dilemma of gradient observations is that a single one of
them comes at the same cost as $D$ independent function evaluations, so the
latter are often preferred. Careful scrutiny reveals, however, that derivative
observations give rise to highly structured kernel Gram matrices for very
general classes of kernels (inter alia, stationary kernels). We show that in
the low-data regime $N<D$, the Gram matrix can be decomposed in a manner that
reduces the cost of inference to $\mathcal{O}(N^2D + (N^2)^3)$ (i.e., linear in
the number of dimensions) and, in special cases, to $\mathcal{O}(N^2D + N^3)$.
This reduction in complexity opens up new use-cases for inference with
gradients especially in the high-dimensional regime, where the
information-to-cost ratio of gradient observations significantly increases. We
demonstrate this potential in a variety of tasks relevant for machine learning,
such as optimization and Hamiltonian Monte Carlo with predictive gradients.
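To make the cost structure concrete, here is a minimal sketch of the joint Gram matrix that arises when a GP with an RBF kernel is conditioned on function values and gradients. This is an illustration using the standard derivative-kernel identities, not the paper's decomposition; materializing and factorizing this $N(D+1) \times N(D+1)$ matrix directly is what incurs the prohibitive $\mathcal{O}(N^3D^3)$ cost that the paper's structured approach avoids.

```python
import numpy as np

def rbf_gram_with_grads(X, ell=1.0):
    """Joint Gram matrix of (f(x_i), grad f(x_i)) under an RBF kernel.

    Uses the standard identities for k(x, y) = exp(-||x - y||^2 / (2 ell^2)):
      cov(f_i, f_j)           = k(x_i, x_j)
      cov(f_i, grad f_j)      = k(x_i, x_j) * (x_i - x_j) / ell^2
      cov(grad f_i, grad f_j) = k(x_i, x_j) * (I/ell^2 - d d^T / ell^4)
    Returns an (N*(D+1)) x (N*(D+1)) symmetric PSD matrix.
    """
    N, D = X.shape
    diff = X[:, None, :] - X[None, :, :]                   # diff[i, j] = x_i - x_j
    K = np.exp(-np.sum(diff**2, axis=-1) / (2 * ell**2))   # (N, N) value kernel

    G = np.zeros((N * (D + 1), N * (D + 1)))
    I = np.eye(D)
    for i in range(N):
        for j in range(N):
            d = diff[i, j]
            r, c = i * (D + 1), j * (D + 1)
            G[r, c] = K[i, j]                                    # value-value
            G[r, c + 1:c + 1 + D] = K[i, j] * d / ell**2         # value-gradient
            G[r + 1:r + 1 + D, c] = -K[i, j] * d / ell**2        # gradient-value
            G[r + 1:r + 1 + D, c + 1:c + 1 + D] = K[i, j] * (
                I / ell**2 - np.outer(d, d) / ell**4)            # gradient-gradient
    return G
```

A single gradient observation thus contributes $D$ rows and columns, which is why naive inference scales like $D$ function evaluations per point; the paper's contribution is exploiting the block structure above so the cost grows only linearly in $D$.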
Related papers
- When big data actually are low-rank, or entrywise approximation of certain function-generated matrices [0.0]
We refute an argument made in the literature that, for a specific class of analytic functions, such matrices admit accurate entrywise approximation of rank that is independent of $m$.
We describe three narrower classes of functions for which $n \times n$ function-generated matrices can be approximated within an entrywise error of order $\varepsilon$ with rank $\mathcal{O}(\log(n)\,\varepsilon^{-2}\,\mathrm{polylog}(\varepsilon^{-1}))$ that is independent of the dimension $m$.
arXiv Detail & Related papers (2024-07-03T16:29:47Z) - Contraction rates for conjugate gradient and Lanczos approximate posteriors in Gaussian process regression [0.0]
We analyze a class of recently proposed approximation algorithms from the field of probabilistic numerics.
We combine results from the numerical analysis literature with state-of-the-art concentration results for spectra of kernel matrices to obtain minimax contraction rates.
arXiv Detail & Related papers (2024-06-18T14:50:42Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Efficient displacement convex optimization with particle gradient
descent [57.88860627977882]
Particle gradient descent is widely used to optimize functions of probability measures.
This paper considers particle gradient descent with a finite number of particles and establishes its theoretical guarantees to optimize functions that are displacement convex in measures.
arXiv Detail & Related papers (2023-02-09T16:35:59Z) - Scalable First-Order Bayesian Optimization via Structured Automatic
Differentiation [4.061135251278187]
We show that a wide range of kernels gives rise to structured matrices, enabling an exact $\mathcal{O}(n^2d)$ matrix-vector multiply for gradient observations and $\mathcal{O}(n^2d^2)$ for Hessian observations.
Our methods apply to virtually all canonical kernels and automatically extend to complex kernels, like the neural network, radial basis function network, and spectral mixture kernels.
arXiv Detail & Related papers (2022-06-16T17:59:48Z) - Gradient Descent for Low-Rank Functions [36.56489593549855]
In machine learning tasks, e.g., training deep neural networks, the loss function varies significantly in only a few directions of the input.
Our proposed LowRank Gradient Descent finds an $\epsilon$-approximate stationary point of a $p$-dimensional function, compared with the $\mathcal{O}(p\log(1/\epsilon))$ gradient complexity of GD and $\mathcal{O}(p/\epsilon^2)$ of SGD.
arXiv Detail & Related papers (2022-06-16T15:58:05Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic Gradient Langevin Dynamics is one of the most fundamental algorithms to solve non-convex optimization problems.
In this paper, we show two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process.
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
arXiv Detail & Related papers (2021-09-24T14:48:20Z) - Minimum complexity interpolation in random features models [16.823029377470366]
Kernel methods are heavily affected by the curse of dimensionality.
We show that learning with $\mathcal{F}_p$ norms is tractable via an infinite-dimensional convex problem.
We introduce a proof technique based on uniform concentration in the dual.
arXiv Detail & Related papers (2021-03-30T00:00:02Z) - Linear Time Sinkhorn Divergences using Positive Features [51.50788603386766]
Solving optimal transport with an entropic regularization requires computing an $n \times n$ kernel matrix that is repeatedly applied to a vector.
We propose to use instead ground costs of the form $c(x,y) = -\log\langle\varphi(x), \varphi(y)\rangle$ where $\varphi$ is a map from the ground space onto the positive orthant $\mathbb{R}^r_+$, with $r \ll n$.
arXiv Detail & Related papers (2020-06-12T10:21:40Z)
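The feature-map trick in the last entry can be sketched as follows. The random ReLU features here are a hypothetical stand-in, not the paper's positive feature construction; the point is only that with $K = \Phi\Phi^\top$ a kernel matrix-vector product costs $\mathcal{O}(nr)$ instead of $\mathcal{O}(n^2)$.

```python
import numpy as np

def positive_features(X, r, seed=0):
    """Map n points into the positive orthant R^r_+.

    Random ReLU features are an illustrative assumption; any feature map
    phi with strictly positive entries gives K = Phi Phi^T a valid
    (positive) kernel matrix as required by the log-cost construction.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], r))
    return np.maximum(X @ W, 0.0) + 1e-6   # (n, r), strictly positive

def kernel_matvec(Phi, v):
    # K v with K = Phi Phi^T, never materializing the n x n matrix:
    # O(n r) work instead of O(n^2).
    return Phi @ (Phi.T @ v)
```

Inside Sinkhorn iterations the kernel is applied to a vector many times, so replacing the dense $n \times n$ product with this factored form is what yields the linear-in-$n$ runtime.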
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.