Low-Precision Arithmetic for Fast Gaussian Processes
- URL: http://arxiv.org/abs/2207.06856v1
- Date: Thu, 14 Jul 2022 12:20:46 GMT
- Title: Low-Precision Arithmetic for Fast Gaussian Processes
- Authors: Wesley J. Maddox, Andres Potapczynski, Andrew Gordon Wilson
- Abstract summary: Low-precision arithmetic has had a transformative effect on the training of neural networks.
We propose a multi-faceted approach involving conjugate gradients with re-orthogonalization, mixed precision, and preconditioning.
Our approach significantly improves the numerical stability and practical performance of conjugate gradients in low-precision over a wide range of settings.
- Score: 39.720581185327816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Low-precision arithmetic has had a transformative effect on the training of
neural networks, reducing computation, memory and energy requirements. However,
despite its promise, low-precision arithmetic has received little attention for
Gaussian processes (GPs), largely because GPs require sophisticated linear
algebra routines that are unstable in low-precision. We study the different
failure modes that can occur when training GPs in half precision. To circumvent
these failure modes, we propose a multi-faceted approach involving conjugate
gradients with re-orthogonalization, mixed precision, and preconditioning. Our
approach significantly improves the numerical stability and practical
performance of conjugate gradients in low-precision over a wide range of
settings, enabling GPs to train on $1.8$ million data points in $10$ hours on a
single GPU, without any sparse approximations.
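Below is a minimal sketch of the recipe described in the abstract, assuming a dense kernel matrix, plain Gram-Schmidt re-orthogonalization against stored residuals, and no preconditioner; the paper's actual implementation is not reproduced here. Matrix-vector products run in half precision while the CG scalars, the solution, and the stored residuals stay in single precision. The function name mixed_precision_cg and all of its arguments are hypothetical illustrations, not taken from the paper's code.

```python
import numpy as np

def mixed_precision_cg(K, y, max_iters=100, tol=1e-3):
    """Solve K x = y with conjugate gradients, doing matvecs in fp16.

    Sketch only: residuals are re-orthogonalized against all previous
    (normalized) residuals to counteract the loss of conjugacy caused
    by half-precision round-off; no preconditioner is applied.
    """
    K16 = np.asarray(K, dtype=np.float16)      # kernel stored in half precision
    x = np.zeros_like(y, dtype=np.float32)     # solution accumulated in fp32
    r = y.astype(np.float32)                   # residual kept in fp32
    p = r.copy()
    history = []                               # normalized past residuals
    rs_old = float(r @ r)
    for _ in range(max_iters):
        Kp = (K16 @ p.astype(np.float16)).astype(np.float32)  # fp16 matvec
        alpha = rs_old / float(p @ Kp)
        x += alpha * p
        r -= alpha * Kp
        for r_prev in history:                 # Gram-Schmidt re-orthogonalization
            r -= (r @ r_prev) * r_prev
        rs_new = float(r @ r)
        if np.sqrt(rs_new) < tol:
            break
        history.append(r / np.sqrt(rs_new))
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy usage: RBF kernel with noise jitter on random 1-D inputs.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 1)).astype(np.float32)
K = np.exp(-0.5 * (X - X.T) ** 2) + 0.1 * np.eye(500, dtype=np.float32)
y = rng.normal(size=500).astype(np.float32)
solution = mixed_precision_cg(K, y)
```

Storing every residual, as above, trades memory for stability; the paper's approach combines this kind of re-orthogonalization with preconditioning and mixed-precision accumulation to keep CG stable at much larger scales.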
Related papers
- Beyond adaptive gradient: Fast-Controlled Minibatch Algorithm for large-scale optimization [1.6749379740049926]
We introduce F-CMA, a Fast-Controlled Mini-batch Algorithm with a random reshuffling method featuring a sufficient decrease condition and a line-search procedure to ensure loss reduction per epoch.
Tests show significant improvements, including a decrease in overall training time by 68%, an increase in per-epoch efficiency by up to 20%, and an increase in model accuracy by up to 5%.
arXiv Detail & Related papers (2024-11-24T11:46:47Z) - Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner, and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z) - Guaranteed Approximation Bounds for Mixed-Precision Neural Operators [83.64404557466528]
We build on the intuition that neural operator learning inherently induces an approximation error.
We show that our approach reduces GPU memory usage by up to 50% and improves throughput by 58% with little or no reduction in accuracy.
arXiv Detail & Related papers (2023-07-27T17:42:06Z) - Non-Convergence and Limit Cycles in the Adam optimizer [0.0]
We show that limit cycles of period 2 exist in batch mode for simple quadratic objective functions.
We analyze the stability of these limit cycles and relate our analysis to other results where approximate convergence was shown.
arXiv Detail & Related papers (2022-10-05T07:44:33Z) - Revisiting Active Sets for Gaussian Process Decoders [0.0]
We develop a new estimate of the log-marginal likelihood based on recently discovered links to cross-validation.
We demonstrate that the resulting stochastic active sets (SAS) approximation significantly improves the robustness of GP decoder training.
arXiv Detail & Related papers (2022-09-10T10:49:31Z) - Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z) - When are Iterative Gaussian Processes Reliably Accurate? [38.523693700243975]
Lanczos decompositions have achieved scalable Gaussian process inference with highly accurate point predictions.
We investigate CG tolerance, preconditioner rank, and Lanczos decomposition rank.
We show that L-BFGS-B is a compelling optimizer for Iterative GPs, achieving convergence with fewer updates.
arXiv Detail & Related papers (2021-12-31T00:02:18Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster in time, but only up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)