Low-Precision Arithmetic for Fast Gaussian Processes
- URL: http://arxiv.org/abs/2207.06856v1
- Date: Thu, 14 Jul 2022 12:20:46 GMT
- Title: Low-Precision Arithmetic for Fast Gaussian Processes
- Authors: Wesley J. Maddox, Andres Potapczynski, Andrew Gordon Wilson
- Abstract summary: Low-precision arithmetic has had a transformative effect on the training of neural networks.
We propose a multi-faceted approach involving conjugate gradients with re-orthogonalization, mixed precision, and preconditioning.
Our approach significantly improves the numerical stability and practical performance of conjugate gradients in low-precision over a wide range of settings.
- Score: 39.720581185327816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Low-precision arithmetic has had a transformative effect on the training of
neural networks, reducing computation, memory and energy requirements. However,
despite its promise, low-precision arithmetic has received little attention for
Gaussian processes (GPs), largely because GPs require sophisticated linear
algebra routines that are unstable in low-precision. We study the different
failure modes that can occur when training GPs in half precision. To circumvent
these failure modes, we propose a multi-faceted approach involving conjugate
gradients with re-orthogonalization, mixed precision, and preconditioning. Our
approach significantly improves the numerical stability and practical
performance of conjugate gradients in low-precision over a wide range of
settings, enabling GPs to train on $1.8$ million data points in $10$ hours on a
single GPU, without any sparse approximations.
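Below is a minimal sketch of the recipe described in the abstract, assuming a dense kernel matrix, plain Gram-Schmidt re-orthogonalization against stored residuals, and no preconditioner; the paper's actual implementation is not reproduced here. Matrix-vector products run in half precision while the CG scalars, the solution, and the stored residuals stay in single precision. The function name mixed_precision_cg and all of its arguments are hypothetical illustrations, not taken from the paper's code.

```python
import numpy as np

def mixed_precision_cg(K, y, max_iters=100, tol=1e-3):
    """Solve K x = y with conjugate gradients, doing matvecs in fp16.

    Sketch only: residuals are re-orthogonalized against all previous
    (normalized) residuals to counteract the loss of conjugacy caused
    by half-precision round-off; no preconditioner is applied.
    """
    K16 = np.asarray(K, dtype=np.float16)      # kernel stored in half precision
    x = np.zeros_like(y, dtype=np.float32)     # solution accumulated in fp32
    r = y.astype(np.float32)                   # residual kept in fp32
    p = r.copy()
    history = []                               # normalized past residuals
    rs_old = float(r @ r)
    for _ in range(max_iters):
        Kp = (K16 @ p.astype(np.float16)).astype(np.float32)  # fp16 matvec
        alpha = rs_old / float(p @ Kp)
        x += alpha * p
        r -= alpha * Kp
        for r_prev in history:                 # Gram-Schmidt re-orthogonalization
            r -= (r @ r_prev) * r_prev
        rs_new = float(r @ r)
        if np.sqrt(rs_new) < tol:
            break
        history.append(r / np.sqrt(rs_new))
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy usage: RBF kernel with noise jitter on random 1-D inputs.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 1)).astype(np.float32)
K = np.exp(-0.5 * (X - X.T) ** 2) + 0.1 * np.eye(500, dtype=np.float32)
y = rng.normal(size=500).astype(np.float32)
solution = mixed_precision_cg(K, y)
```

Storing every residual, as above, trades memory for stability; the paper's approach combines this kind of re-orthogonalization with preconditioning and mixed-precision accumulation to keep CG stable at much larger scales.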
Related papers
- Beyond adaptive gradient: Fast-Controlled Minibatch Algorithm for large-scale optimization [1.6749379740049926]
We introduce F-CMA, a Fast-Controlled Mini-batch Algorithm with a random reshuffling method featuring a sufficient decrease condition and a line-search procedure to ensure loss reduction per epoch.
Tests show significant improvements, including a decrease in overall training time by 68%, an increase in per-epoch efficiency by up to 20%, and an increase in model accuracy by up to 5%.
arXiv Detail & Related papers (2024-11-24T11:46:47Z) - Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner, and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z) - Guaranteed Approximation Bounds for Mixed-Precision Neural Operators [83.64404557466528]
We build on the intuition that neural operator learning inherently induces an approximation error.
We show that our approach reduces GPU memory usage by up to 50% and improves throughput by 58% with little or no reduction in accuracy.
arXiv Detail & Related papers (2023-07-27T17:42:06Z) - Non-Convergence and Limit Cycles in the Adam optimizer [0.0]
We show that limit cycles of period 2 exist in batch mode for simple quadratic objective functions.
We analyze the stability of these limit cycles and relate our analysis to other results where approximate convergence was shown.
arXiv Detail & Related papers (2022-10-05T07:44:33Z) - Revisiting Active Sets for Gaussian Process Decoders [0.0]
We develop a new estimate of the log-marginal likelihood based on recently discovered links to cross-validation.
We demonstrate that the resulting stochastic active sets (SAS) approximation significantly improves the robustness of GP decoder training.
arXiv Detail & Related papers (2022-09-10T10:49:31Z) - Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z) - When are Iterative Gaussian Processes Reliably Accurate? [38.523693700243975]
Lanczos decompositions have achieved scalable Gaussian process inference with highly accurate point predictions.
We investigate CG tolerance, preconditioner rank, and Lanczos decomposition rank.
We show that L-BFGS-B is a compelling optimizer for Iterative GPs, achieving convergence with fewer updates.
arXiv Detail & Related papers (2021-12-31T00:02:18Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster in time, but only up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)