Related papers: Gradients of Functions of Large Matrices

URL: http://arxiv.org/abs/2405.17277v2
Date: Thu, 24 Oct 2024 15:04:19 GMT
Title: Gradients of Functions of Large Matrices
Authors: Nicholas Krämer, Pablo Moreno-Muñoz, Hrittik Roy, Søren Hauberg
Abstract summary: We show how to differentiate workhorses of numerical linear algebra efficiently. We derive previously unknown adjoint systems for Lanczos and Arnoldi iterations, implement them in JAX, and show that the resulting code can compete with Diffrax. All this is achieved without any problem-specific code optimisation.
Score: 18.361820028457718
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Tuning scientific and probabilistic machine learning models (for example, partial differential equations, Gaussian processes, or Bayesian neural networks) often relies on evaluating functions of matrices whose size grows with the data set or the number of parameters. While the state of the art for evaluating these quantities is almost always based on Lanczos and Arnoldi iterations, the present work is the first to explain how to differentiate these workhorses of numerical linear algebra efficiently. To get there, we derive previously unknown adjoint systems for Lanczos and Arnoldi iterations, implement them in JAX, and show that the resulting code can compete with Diffrax when it comes to differentiating PDEs and with GPyTorch for selecting Gaussian process models, and that it beats standard factorisation methods for calibrating Bayesian neural networks. All this is achieved without any problem-specific code optimisation. Find the code at https://github.com/pnkraemer/experiments-lanczos-adjoints and install the library with pip install matfree.
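To make the phrase "functions of matrices ... based on Lanczos iterations" concrete, here is a minimal sketch of the forward computation only: approximating f(A)v for a symmetric matrix A from a few matrix-vector products. It is written in plain NumPy rather than JAX, it is not the paper's or matfree's implementation, and it does not include the adjoint (gradient) systems the paper derives. The function name `lanczos_matfun_vec` and its arguments are hypothetical and chosen for illustration.

```python
import numpy as np


def lanczos_matfun_vec(matvec, v, num_steps, fun):
    """Approximate fun(A) @ v for a symmetric matrix A.

    `matvec` computes A @ x; A itself is never formed.
    Illustrative sketch: assumes num_steps <= len(v) and no breakdown
    (i.e. no off-diagonal coefficient becomes zero).
    """
    n = v.shape[0]
    norm_v = np.linalg.norm(v)
    Q = np.zeros((n, num_steps))      # orthonormal Krylov basis
    alphas = np.zeros(num_steps)      # diagonal of the tridiagonal matrix T
    betas = np.zeros(num_steps - 1)   # off-diagonal of T

    q = v / norm_v
    for k in range(num_steps):
        Q[:, k] = q
        w = matvec(q)
        alphas[k] = q @ w
        # Full reorthogonalisation against all basis vectors so far.
        w = w - Q[:, : k + 1] @ (Q[:, : k + 1].T @ w)
        if k < num_steps - 1:
            betas[k] = np.linalg.norm(w)
            q = w / betas[k]

    # Small tridiagonal matrix T = Q^T A Q.
    T = np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)
    eigvals, eigvecs = np.linalg.eigh(T)
    # fun(T) @ e_1, computed via the eigendecomposition of T.
    fun_T_e1 = eigvecs @ (fun(eigvals) * eigvecs[0, :])
    # fun(A) @ v is approximated by ||v|| * Q @ fun(T) @ e_1.
    return norm_v * (Q @ fun_T_e1)


# Usage: matrix logarithm applied to a vector, on a small test problem.
rng = np.random.default_rng(0)
B = rng.standard_normal((40, 40))
A = B @ B.T / 40 + np.eye(40)         # symmetric positive definite
v = rng.standard_normal(40)
approx = lanczos_matfun_vec(lambda x: A @ x, v, num_steps=25, fun=np.log)
```

Only products with A are needed, which is why this family of methods scales to matrices that are too large to factorise. The gradient of such a computation with respect to the parameters of A is the subject of the paper.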

Related papers

The fast committor machine: Interpretable prediction with kernels [0.0]
This paper introduces an efficient algorithm for approximating the committor, called the "fast committor machine" (FCM). The kernel function is constructed to emphasize low-dimensional subspaces that optimally describe the $A$ to $B$ transitions. The FCM yields higher accuracy and trains more quickly than a neural network with the same number of parameters.
arXiv Detail & Related papers (2024-05-16T19:22:49Z)
Cramer Type Distances for Learning Gaussian Mixture Models by Gradient Descent [0.0]
As of today, few known algorithms can fit or learn Gaussian mixture models. We propose a distance function called Sliced Cramér 2-distance for learning general multivariate GMMs. These features are especially useful for distributional reinforcement learning and Deep Q Networks.
arXiv Detail & Related papers (2023-07-13T13:43:02Z)
Fast variable selection makes scalable Gaussian process BSS-ANOVA a speedy and accurate choice for tabular and time series regression [0.0]
Gaussian processes (GPs) are non-parametric regression engines with a long history. One of a number of scalable GP approaches is the Karhunen-Loève (KL) decomposed kernel BSS-ANOVA, developed in 2009. A new method of forward variable selection quickly and effectively limits the number of terms, yielding a method with competitive accuracies.
arXiv Detail & Related papers (2022-05-26T23:41:43Z)
Gaussian Processes and Statistical Decision-making in Non-Euclidean Spaces [96.53463532832939]
We develop techniques for broadening the applicability of Gaussian processes. We introduce a wide class of efficient approximations built from this viewpoint. We develop a collection of Gaussian process models over non-Euclidean spaces.
arXiv Detail & Related papers (2022-02-22T01:42:57Z)
Adjoint-aided inference of Gaussian process driven differential equations [0.8257490175399691]
We show how the adjoint of a linear system can be used to efficiently infer forcing functions modelled as GPs. We demonstrate the approach on systems of both ordinary and partial differential equations.
arXiv Detail & Related papers (2022-02-09T17:35:14Z)
Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks [82.61182037130405]
Stochastic gradient descent (SGD) is the main approach for training deep networks. In this work, we compare Adam-based variants based on the difference between the present and the past gradients. We have tested ensembles of networks and the fusion with ResNet50 trained with gradient descent.
arXiv Detail & Related papers (2021-03-26T18:55:08Z)
Matérn Gaussian Processes on Graphs [67.13902825728718]
We leverage the partial differential equation characterization of Matérn Gaussian processes to study their analog for undirected graphs. We show that the resulting Gaussian processes inherit various attractive properties of their Euclidean and Riemannian analogs. This enables graph Matérn Gaussian processes to be employed in mini-batch and non-conjugate settings.
arXiv Detail & Related papers (2020-10-29T13:08:07Z)
Linear-Sample Learning of Low-Rank Distributions [56.59844655107251]
We show that learning $k \times k$, rank-$r$ matrices to normalized $L_1$ distance $\epsilon$ requires $\Omega(\frac{kr}{\epsilon^2})$ samples. We propose an algorithm that uses $\mathcal{O}(\frac{kr}{\epsilon^2}\log^2\frac{r}{\epsilon})$ samples, a number linear in the high dimension and nearly linear in the matrices' typically low rank.
arXiv Detail & Related papers (2020-09-30T19:10:32Z)
Multipole Graph Neural Operator for Parametric Partial Differential Equations [57.90284928158383]
One of the main challenges in using deep learning-based methods for simulating physical systems is formulating physics-based data. We propose a novel multi-level graph neural network framework that captures interaction at all ranges with only linear complexity. Experiments confirm our multi-graph network learns discretization-invariant solution operators to PDEs and can be evaluated in linear time.
arXiv Detail & Related papers (2020-06-16T21:56:22Z)
Quadruply Stochastic Gaussian Processes [10.152838128195466]
We introduce a variational inference procedure for training scalable Gaussian process (GP) models whose per-iteration complexity is independent of both the number of training points, $n$, and the number of basis functions used in the kernel approximation, $m$. We demonstrate accurate inference on large classification and regression datasets using GPs and relevance vector machines with up to $m = 10^7$ basis functions.
arXiv Detail & Related papers (2020-06-04T17:06:25Z)
Learning Gaussian Graphical Models via Multiplicative Weights [54.252053139374205]
We adapt an algorithm of Klivans and Meka based on the method of multiplicative weight updates. The algorithm enjoys a sample complexity bound that is qualitatively similar to others in the literature. It has a low runtime $O(mp^2)$ in the case of $m$ samples and $p$ nodes, and can trivially be implemented in an online manner.
arXiv Detail & Related papers (2020-02-20T10:50:58Z)
Particle-Gibbs Sampling For Bayesian Feature Allocation Models [77.57285768500225]
Most widely used MCMC strategies rely on an element-wise Gibbs update of the feature allocation matrix. We have developed a Gibbs sampler that can update an entire row of the feature allocation matrix in a single move. This sampler is impractical for models with a large number of features, as its computational complexity scales exponentially in the number of features. We develop a Particle Gibbs sampler that targets the same distribution as the row-wise Gibbs updates, but has computational complexity that only grows linearly in the number of features.
arXiv Detail & Related papers (2020-01-25T22:11:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.