Low-Precision Stochastic Gradient Langevin Dynamics
- URL: http://arxiv.org/abs/2206.09909v1
- Date: Mon, 20 Jun 2022 17:25:41 GMT
- Title: Low-Precision Stochastic Gradient Langevin Dynamics
- Authors: Ruqi Zhang, Andrew Gordon Wilson, Christopher De Sa
- Abstract summary: We provide the first study of low-precision Stochastic Gradient Langevin Dynamics (SGLD), showing that its costs can be significantly reduced without sacrificing performance.
We develop a new quantization function for SGLD that preserves the variance in each update step.
We demonstrate that low-precision SGLD achieves comparable performance to full-precision SGLD with only 8 bits on a variety of deep learning tasks.
- Score: 70.69923368584588
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While low-precision optimization has been widely used to accelerate deep
learning, low-precision sampling remains largely unexplored. As a consequence,
sampling is simply infeasible in many large-scale scenarios, despite providing
remarkable benefits to generalization and uncertainty estimation for neural
networks. In this paper, we provide the first study of low-precision Stochastic
Gradient Langevin Dynamics (SGLD), showing that its costs can be significantly
reduced without sacrificing performance, due to its intrinsic ability to handle
system noise. We prove that the convergence of low-precision SGLD with
full-precision gradient accumulators is less affected by the quantization error
than its SGD counterpart in the strongly convex setting. To further enable
low-precision gradient accumulators, we develop a new quantization function for
SGLD that preserves the variance in each update step. We demonstrate that
low-precision SGLD achieves comparable performance to full-precision SGLD with
only 8 bits on a variety of deep learning tasks.
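The core ideas above — a Langevin update followed by low-precision storage, with an unbiased quantizer so that rounding does not systematically bias the chain — can be illustrated with a minimal sketch. This is not the paper's exact algorithm; the function names, the fixed-point grid spacing `q_step`, and the use of plain stochastic rounding (rather than the paper's variance-preserving quantization function) are illustrative assumptions.

```python
import numpy as np

def stochastic_round(x, step):
    """Quantize x to a fixed-point grid of spacing `step` via stochastic rounding.

    Rounds each coordinate up with probability equal to its fractional position
    between grid points, so E[Q(x)] = x: the quantizer is unbiased, though it
    adds extra variance (at most step**2 / 4 per coordinate).
    """
    scaled = x / step
    floor = np.floor(scaled)
    prob_up = scaled - floor
    rounded = floor + (np.random.rand(*x.shape) < prob_up)
    return rounded * step

def sgld_step_low_precision(theta, grad, lr, temperature=1.0, q_step=2**-6):
    """One SGLD update with a low-precision parameter store.

    theta: current parameters (assumed to lie on the quantization grid)
    grad:  stochastic gradient estimate at theta
    """
    # Full-precision Langevin proposal: gradient step plus injected Gaussian noise.
    noise = np.sqrt(2.0 * lr * temperature) * np.random.randn(*theta.shape)
    proposal = theta - lr * grad + noise
    # Store the result in low precision via unbiased stochastic rounding.
    return stochastic_round(proposal, q_step)
```

Because SGLD already injects Gaussian noise each step, the extra noise from unbiased rounding is, intuitively, easier for the sampler to absorb than it would be for plain SGD — which is the intuition behind the paper's convergence results.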
Related papers
- Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results [60.92029979853314]
This paper investigates gradient-normalized SGD with(out) clipping (NSGDC) and its variance-reduction variant (NSGDC-VR).
We present significant improvements in the theoretical results for both algorithms.
arXiv Detail & Related papers (2024-10-21T22:40:42Z) - The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization [4.7256945641654164]
Stochastic gradient descent (SGD) is a widely used algorithm in machine learning, particularly for neural network training.
Recent studies on SGD for canonical quadratic optimization or linear regression show that it generalizes well in suitable high-dimensional settings.
This paper investigates SGD with two essential components in practice: exponentially decaying step size schedule and momentum.
arXiv Detail & Related papers (2024-09-15T14:20:03Z) - Langevin Dynamics: A Unified Perspective on Optimization via Lyapunov Potentials [15.718093624695552]
We analyze the convergence of Stochastic Gradient Langevin Dynamics (SGLD) to global minima based on Lyapunov potentials and optimization.
We provide 1) improved guarantees for SGLD as an optimizer relative to previous works, 2) the first finite gradient-complexity bounds for SGLD, and 3) a proof that if continuous-time Langevin dynamics succeeds at optimization, then discrete-time SGLD succeeds under mild regularity assumptions.
arXiv Detail & Related papers (2024-07-05T05:34:10Z) - Enhancing Low-Precision Sampling via Stochastic Gradient Hamiltonian Monte Carlo [19.99475224693552]
Low-precision training has emerged as a promising low-cost technique to enhance the training efficiency of deep neural networks.
This paper investigates low-precision sampling via Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) with low-precision and full-precision gradient accumulators.
arXiv Detail & Related papers (2023-10-25T03:06:48Z) - The Marginal Value of Momentum for Small Learning Rate SGD [20.606430391298815]
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise.
Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training where the optimal learning rate is not very large.
arXiv Detail & Related papers (2023-07-27T21:01:26Z) - Convergence of mean-field Langevin dynamics: Time and space
discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Stochastic Gradient Langevin Dynamics with Variance Reduction [6.243995448840211]
Stochastic gradient Langevin dynamics (SGLD) has gained the attention of global optimization researchers.
This paper proves improved convergence properties of SGLD on nonconvex objective functions using variance reduction.
arXiv Detail & Related papers (2021-02-12T20:22:56Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.