Fast Matrix Square Roots with Applications to Gaussian Processes and
Bayesian Optimization
- URL: http://arxiv.org/abs/2006.11267v2
- Date: Mon, 30 Nov 2020 21:52:07 GMT
- Title: Fast Matrix Square Roots with Applications to Gaussian Processes and
Bayesian Optimization
- Authors: Geoff Pleiss, Martin Jankowiak, David Eriksson, Anil Damle, Jacob R.
Gardner
- Abstract summary: Matrix square roots and their inverses arise frequently in machine learning.
We introduce a highly-efficient quadratic-time algorithm for computing $\mathbf K^{1/2} \mathbf b$, $\mathbf K^{-1/2} \mathbf b$, and their derivatives through matrix-vector multiplications (MVMs).
Our method combines Krylov subspace methods with a rational approximation and typically achieves $4$ decimal places of accuracy with fewer than $100$ MVMs.
- Score: 24.085358075449406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Matrix square roots and their inverses arise frequently in machine learning,
e.g., when sampling from high-dimensional Gaussians $\mathcal{N}(\mathbf 0,
\mathbf K)$ or whitening a vector $\mathbf b$ against covariance matrix
$\mathbf K$. While existing methods typically require $O(N^3)$ computation, we
introduce a highly-efficient quadratic-time algorithm for computing $\mathbf
K^{1/2} \mathbf b$, $\mathbf K^{-1/2} \mathbf b$, and their derivatives through
matrix-vector multiplication (MVMs). Our method combines Krylov subspace
methods with a rational approximation and typically achieves $4$ decimal places
of accuracy with fewer than $100$ MVMs. Moreover, the backward pass requires
little additional computation. We demonstrate our method's applicability on
matrices as large as $50,\!000 \times 50,\!000$ - well beyond traditional
methods - with little approximation error. Applying this increased scalability
to variational Gaussian processes, Bayesian optimization, and Gibbs sampling
results in more powerful models with higher accuracy.
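The abstract describes computing $\mathbf K^{1/2} \mathbf b$ and $\mathbf K^{-1/2} \mathbf b$ purely through MVMs by combining Krylov subspace methods with a rational approximation. The snippet below is not the paper's algorithm; it is only a minimal NumPy sketch of the underlying Krylov idea, using a plain Lanczos iteration to approximate $\mathbf K^{1/2} \mathbf b$ for a symmetric positive semi-definite $\mathbf K$ accessed only through matrix-vector products. The function name `lanczos_sqrt_mv` and all parameter choices are illustrative assumptions, and the demo at the end simply compares against an exact eigendecomposition on a small, well-conditioned matrix.

```python
import numpy as np

def lanczos_sqrt_mv(matvec, b, num_mvms=100):
    """Illustrative sketch: approximate K^{1/2} b for a symmetric PSD K via Lanczos.

    Builds an orthonormal Krylov basis Q and a tridiagonal T with Q^T K Q ~= T,
    then returns ||b|| * Q @ sqrt(T) @ e_1 (the standard Lanczos f(A)b estimate).
    """
    n = b.shape[0]
    m = num_mvms
    Q = np.zeros((n, m))
    alpha = np.zeros(m)
    beta = np.zeros(m)                       # beta[j] couples columns j and j+1

    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(m):
        v = matvec(Q[:, j])                  # one MVM per iteration
        if j > 0:
            v -= beta[j - 1] * Q[:, j - 1]
        alpha[j] = Q[:, j] @ v
        v -= alpha[j] * Q[:, j]
        v -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ v)   # full reorthogonalisation
        if j + 1 == m:
            break
        beta[j] = np.linalg.norm(v)
        if beta[j] < 1e-12:                  # Krylov subspace is exhausted
            m = j + 1
            break
        Q[:, j + 1] = v / beta[j]

    # Small m x m problem: take the square root of T exactly via its eigendecomposition.
    T = np.diag(alpha[:m]) + np.diag(beta[:m - 1], 1) + np.diag(beta[:m - 1], -1)
    w, V = np.linalg.eigh(T)
    sqrt_T = V @ (np.sqrt(np.clip(w, 0.0, None))[:, None] * V.T)
    e1 = np.zeros(m)
    e1[0] = 1.0
    return np.linalg.norm(b) * (Q[:, :m] @ (sqrt_T @ e1))


# Toy usage: compare against an exact eigendecomposition on a well-conditioned PSD matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
K = A @ A.T / 500 + np.eye(500)              # symmetric positive definite, mild conditioning
b = rng.standard_normal(500)
approx = lanczos_sqrt_mv(lambda x: K @ x, b, num_mvms=100)
w, V = np.linalg.eigh(K)
exact = V @ (np.sqrt(w) * (V.T @ b))
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```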
Related papers
- Optimal Quantization for Matrix Multiplication [35.007966885532724]
We present a universal quantizer based on nested lattices with an explicit guarantee of approximation error for any (non-random) pair of matrices $A$, $B$ in terms of only the Frobenius norms $\|A\|_F$, $\|B\|_F$, and $\|A^\top B\|_F$.
arXiv Detail & Related papers (2024-10-17T17:19:48Z)
- Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z)
- Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers [16.046186753149]
The self-attention mechanism is key to the success of transformers in recent Large Language Models (LLMs).
We leverage the convolution-like structure of attention matrices to develop an efficient approximation method for attention using convolution matrices.
We hope our new paradigm for accelerating attention computation in transformer models can help their application to longer contexts.
arXiv Detail & Related papers (2024-05-08T17:11:38Z)
- Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that set it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that, in the worst case, exponential dependence on the number of heads $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z)
- Matrix Compression via Randomized Low Rank and Low Precision Factorization [47.902465710511485]
Modern matrices can involve billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage.
We propose an algorithm that exploits the low-rank structure of such matrices to obtain a low-rank decomposition of any matrix $\mathbf{A}$ as $\mathbf{L}\mathbf{R}$.
We empirically demonstrate the efficacy of our algorithm in image compression, nearest neighbor classification of image and text embeddings, and compressing the layers of LlaMa-$7$b.
arXiv Detail & Related papers (2023-10-17T06:56:57Z)
- Fast $(1+\varepsilon)$-Approximation Algorithms for Binary Matrix Factorization [54.29685789885059]
We introduce efficient $(1+\varepsilon)$-approximation algorithms for the binary matrix factorization (BMF) problem.
The goal is to approximate $\mathbf{A}$ as a product of low-rank factors.
Our techniques generalize to other common variants of the BMF problem.
arXiv Detail & Related papers (2023-06-02T18:55:27Z)
- Multiplicative updates for symmetric-cone factorizations [0.0]
We introduce and analyze the symmetric-cone multiplicative update (SCMU) algorithm for computing cone factorizations.
We show that the squared loss objective is non-decreasing along the trajectories of the SCMU algorithm.
arXiv Detail & Related papers (2021-08-02T09:23:39Z)
- Non-PSD Matrix Sketching with Applications to Regression and Optimization [56.730993511802865]
We present dimensionality reduction methods for non-PSD and "square-root" matrices.
We show how these techniques can be used for multiple downstream tasks.
arXiv Detail & Related papers (2021-06-16T04:07:48Z)
- Hutch++: Optimal Stochastic Trace Estimation [75.45968495410048]
We introduce a new randomized algorithm, Hutch++, which computes a $(1 \pm \epsilon)$ approximation to $\operatorname{tr}(A)$ for any positive semidefinite (PSD) $A$.
We show that it significantly outperforms Hutchinson's method in experiments.
arXiv Detail & Related papers (2020-10-19T16:45:37Z)
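The Hutch++ estimator is compact enough to sketch from the summary above: spend part of the MVM budget capturing a randomized low-rank "head" of $A$, whose trace is computed exactly, and the rest on a Hutchinson estimate of the residual. The NumPy snippet below is an illustrative rendering under those assumptions, not the authors' reference implementation; the function name, even three-way budget split, and blocked-matvec interface are assumptions.

```python
import numpy as np

def hutchpp(matvec, n, num_mvms=99, seed=0):
    """Illustrative Hutch++ trace estimate for a PSD matrix given a blocked matvec."""
    rng = np.random.default_rng(seed)
    m = num_mvms // 3
    S = rng.choice([-1.0, 1.0], size=(n, m))    # Rademacher sketch probes
    G = rng.choice([-1.0, 1.0], size=(n, m))    # Rademacher residual probes
    Q, _ = np.linalg.qr(matvec(S))              # orthonormal basis for the "head" of A
    G = G - Q @ (Q.T @ G)                       # project probes away from the head
    # exact trace over the captured subspace + Hutchinson estimate of the rest
    return np.trace(Q.T @ matvec(Q)) + np.trace(G.T @ matvec(G)) / m

# Toy usage on an explicit PSD matrix (the matvec is just dense multiplication here).
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 200))
K = A @ A.T                                      # rank-200 PSD matrix
est = hutchpp(lambda X: K @ X, n=2000, num_mvms=99)
print("estimate:", est, "exact:", np.trace(K))
```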
- Interpolating Log-Determinant and Trace of the Powers of Matrix $\mathbf{A} + t \mathbf{B}$ [1.5002438468152661]
We develop methods for the functions $t \mapsto \log\det\left( \mathbf{A} + t \mathbf{B} \right)$ and $t \mapsto \operatorname{trace}\left( (\mathbf{A} + t \mathbf{B})^p \right)$, where the matrices $\mathbf{A}$ and $\mathbf{B}$ are Hermitian and positive (semi-)definite and $p$ and $t$ are real variables.
arXiv Detail & Related papers (2020-09-15T23:11:17Z)
- Convergence of Sparse Variational Inference in Gaussian Processes Regression [29.636483122130027]
We show that a method with an overall computational cost of $\mathcal{O}(N (\log N)^{2D} (\log\log N)^2)$ can be used to perform inference.
arXiv Detail & Related papers (2020-08-01T19:23:34Z)