Log-Linear-Time Gaussian Processes Using Binary Tree Kernels
- URL: http://arxiv.org/abs/2210.01633v1
- Date: Tue, 4 Oct 2022 14:30:06 GMT
- Title: Log-Linear-Time Gaussian Processes Using Binary Tree Kernels
- Authors: Michael K. Cohen, Samuel Daulton, Michael A. Osborne
- Abstract summary: We present a new kernel that allows for Gaussian process regression in $O((n+m)\log(n+m))$ time.
Our "binary tree" kernel places all data points on the leaves of a binary tree, with the kernel depending only on the depth of the deepest common ancestor.
On large datasets, the binary tree GP is fastest, and much faster than a Mat\'ern GP.
- Score: 26.526204269075766
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gaussian processes (GPs) produce good probabilistic models of functions, but
most GP kernels require $O((n+m)n^2)$ time, where $n$ is the number of data
points and $m$ the number of predictive locations. We present a new kernel that
allows for Gaussian process regression in $O((n+m)\log(n+m))$ time. Our "binary
tree" kernel places all data points on the leaves of a binary tree, with the
kernel depending only on the depth of the deepest common ancestor. We can store
the resulting kernel matrix in $O(n)$ space in $O(n \log n)$ time, as a sum of
sparse rank-one matrices, and approximately invert the kernel matrix in $O(n)$
time. Sparse GP methods also offer linear run time, but they predict less well
than higher dimensional kernels. On a classic suite of regression tasks, we
compare our kernel against Mat\'ern, sparse, and sparse variational kernels.
The binary tree GP assigns the highest likelihood to the test data on a
plurality of datasets, usually achieves lower mean squared error than the
sparse methods, and often ties or beats the Mat\'ern GP. On large datasets, the
binary tree GP is fastest, and much faster than a Mat\'ern GP.
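Below is a minimal sketch of the mechanism the abstract describes, not the authors' implementation: it assumes each point has already been assigned a binary leaf code (the paper's placement procedure is not reproduced here) and uses hypothetical per-depth weights. The kernel depends only on the depth of the deepest common ancestor (the longest common prefix of two codes), and the resulting sum-of-sparse-rank-one structure lets $Kv$ be computed without ever materializing the $n \times n$ matrix.

```python
# Sketch only: a kernel depending solely on the depth of the deepest common
# ancestor of two leaves in a binary tree, plus a matrix-vector product that
# exploits K = sum over tree nodes u of w_depth(u) * 1_u 1_u^T (sparse rank-one
# terms). Leaf codes and per-depth weights are hypothetical, not from the paper.
from collections import defaultdict
import numpy as np

def dca_depth(code_a: str, code_b: str) -> int:
    """Depth of the deepest common ancestor = longest common prefix length."""
    d = 0
    for a, b in zip(code_a, code_b):
        if a != b:
            break
        d += 1
    return d

def kernel(code_a: str, code_b: str, weights) -> float:
    """k(x, x') = sum of per-depth weights down to the deepest common ancestor."""
    return float(sum(weights[: dca_depth(code_a, code_b)]))

def kernel_matvec(codes, weights, v):
    """Compute K @ v in O(n * depth) time without materializing K."""
    result = np.zeros_like(v, dtype=float)
    for d, w in enumerate(weights, start=1):
        # Points sharing a depth-d prefix sit under the same depth-d ancestor;
        # that node contributes w * 1_u (1_u^T v), a sparse rank-one update.
        groups = defaultdict(list)
        for i, code in enumerate(codes):
            groups[code[:d]].append(i)
        for idx in groups.values():
            result[idx] += w * v[idx].sum()
    return result

# Usage: three points on a depth-3 tree; the first two share a depth-2 ancestor.
codes = ["000", "001", "110"]
weights = [0.5, 0.3, 0.2]          # hypothetical per-depth weights
v = np.array([1.0, 2.0, 3.0])
K = np.array([[kernel(a, b, weights) for b in codes] for a in codes])
assert np.allclose(K @ v, kernel_matvec(codes, weights, v))
```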
Related papers
- Large-Scale Gaussian Processes via Alternating Projection [23.79090469387859]
We propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling mini-batching.
Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets.
arXiv Detail & Related papers (2023-10-26T04:20:36Z) - Data-Driven Linear Complexity Low-Rank Approximation of General Kernel Matrices: A Geometric Approach [0.9453554184019107]
A kernel matrix may be defined as $K_{ij} = \kappa(x_i, y_j)$ where $\kappa(x, y)$ is a kernel function.
We seek a low-rank approximation to a kernel matrix where the sets of points $X$ and $Y$ are large.
arXiv Detail & Related papers (2022-12-24T07:15:00Z) - Sub-quadratic Algorithms for Kernel Matrices via Kernel Density Estimation [24.166833799353476]
We develop efficient reductions from \textit{weighted edge sampling} on kernel graphs, \textit{simulating random walks} on kernel graphs, and \textit{weighted sampling} on matrices to Kernel Density Estimation.
Our reductions are the central ingredient in each of our applications and we believe they may be of independent interest.
arXiv Detail & Related papers (2022-12-01T16:42:56Z) - On The Relative Error of Random Fourier Features for Preserving Kernel Distance [7.383448514243228]
We show that for a significant range of kernels, including the well-known Laplacian kernels, RFF cannot approximate the kernel distance with small relative error using low dimensions.
We make the first step towards data-oblivious dimension-reduction for general shift-invariant kernels, and we obtain a similar $\mathrm{poly}(\epsilon^{-1} \log n)$ dimension bound for Laplacian kernels.
arXiv Detail & Related papers (2022-10-01T10:35:12Z) - Leverage Score Sampling for Tensor Product Matrices in Input Sparsity Time [54.65688986250061]
We give an input sparsity time sampling algorithm for approximating the Gram matrix corresponding to the $q$-fold column-wise tensor product of $q$ matrices.
Our sampling technique relies on a collection of $q$ partially correlated random projections which can be simultaneously applied to a dataset $X$ in input-sparsity total time.
arXiv Detail & Related papers (2022-02-09T15:26:03Z) - Gaussian Process Bandit Optimization with Few Batches [49.896920704012395]
We introduce a batch algorithm inspired by finite-arm bandit algorithms.
We show that it achieves the cumulative regret upper bound $O^\ast(\sqrt{T\gamma_T})$ using $O(\log\log T)$ batches within time horizon $T$.
In addition, we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches.
arXiv Detail & Related papers (2021-10-15T00:54:04Z) - Random matrices in service of ML footprint: ternary random features with no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random technique, called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Fast Sketching of Polynomial Kernels of Polynomial Degree [61.83993156683605]
The polynomial kernel is especially important as other kernels can often be approximated by it via a Taylor series expansion.
Recent oblivious sketching techniques reduce the dependence of the running time on the degree $q$ of the kernel.
We give a new sketch which greatly improves upon this running time, by removing the dependence on $q$ in the leading order term.
arXiv Detail & Related papers (2021-08-21T02:14:55Z) - Faster Kernel Matrix Algebra via Density Estimation [46.253698241653254]
We study fast algorithms for computing fundamental properties of a positive semidefinite kernel matrix $K \in \mathbb{R}^{n \times n}$ corresponding to $n$ points.
arXiv Detail & Related papers (2021-02-16T18:25:47Z) - Linear Time Sinkhorn Divergences using Positive Features [51.50788603386766]
Solving optimal transport with entropic regularization requires computing an $n \times n$ kernel matrix that is repeatedly applied to a vector.
We propose to use instead ground costs of the form $c(x,y) = -\log\langle\varphi(x), \varphi(y)\rangle$ where $\varphi$ is a map from the ground space onto the positive orthant $\mathbb{R}^r_+$, with $r \ll n$ (a brief sketch of the resulting linear-time kernel application follows this list).
arXiv Detail & Related papers (2020-06-12T10:21:40Z)
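For the positive-features cost in the last entry, here is a minimal sketch of why it yields linear-time kernel applications. It assumes unit entropic regularization ($\varepsilon = 1$), so the Gibbs kernel $K_{ij} = \exp(-c(x_i, y_j))$ factors exactly as $\Phi_x \Phi_y^\top$ with $r$ columns; the nonnegative feature map below is a made-up example, not the construction from that paper.

```python
# Sketch only: with c(x, y) = -log<phi(x), phi(y)> and epsilon = 1, the Gibbs
# kernel K_ij = exp(-c(x_i, y_j)) equals <phi(x_i), phi(y_j)>, so K = Phi_x @ Phi_y.T
# has rank r and K @ v costs O(n r) instead of O(n^2). The positive feature map
# here is a hypothetical stand-in, not the one proposed in the paper.
import numpy as np

rng = np.random.default_rng(0)
n, dim, r = 5000, 3, 64
x = rng.normal(size=(n, dim))
y = rng.normal(size=(n, dim))
v = rng.random(n)

# Hypothetical positive feature map: random projections pushed through exp(),
# which keeps every feature strictly positive as the cost above requires.
W = rng.normal(size=(dim, r))
phi = lambda z: np.exp(z @ W) / np.sqrt(r)
Phi_x, Phi_y = phi(x), phi(y)      # each n x r with positive entries

# Linear-time application of the kernel: never form the n x n matrix.
Kv_fast = Phi_x @ (Phi_y.T @ v)    # O(n r)

# Sanity check against the explicit (quadratic) kernel on a small subset.
K_small = Phi_x[:100] @ Phi_y[:100].T
assert np.allclose(K_small @ v[:100], Phi_x[:100] @ (Phi_y[:100].T @ v[:100]))
```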
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.