Signal Processing Meets SGD: From Momentum to Filter
- URL: http://arxiv.org/abs/2311.02818v5
- Date: Wed, 22 May 2024 09:11:31 GMT
- Title: Signal Processing Meets SGD: From Momentum to Filter
- Authors: Zhipeng Yao, Guiyuan Fu, Ying Li, Yu Zhang, Dazhou Li, Rui Yu
- Abstract summary: In deep learning, stochastic gradient descent (SGD) and its momentum-based variants are widely used for optimization.
We propose a novel optimization method designed to accelerate SGD's convergence without sacrificing generalization.
- Score: 6.751292200515353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In deep learning, stochastic gradient descent (SGD) and its momentum-based variants are widely used for optimization, but they typically suffer from slow convergence. Conversely, existing adaptive learning rate optimizers speed up convergence but often compromise generalization. To resolve this issue, we propose a novel optimization method designed to accelerate SGD's convergence without sacrificing generalization. Our approach reduces the variance of the historical gradient, improves first-order moment estimation of SGD by applying Wiener filter theory, and introduces a time-varying adaptive gain. Empirical results demonstrate that SGDF (SGD with Filter) effectively balances convergence and generalization compared to state-of-the-art optimizers.
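As a rough illustration of the idea in the abstract -- estimating the first moment of the gradient with an optimal linear (Wiener/Kalman-type) filter and a time-varying gain -- here is a minimal per-coordinate sketch. It is an assumption-laden toy, not the authors' exact SGDF update; the constants q and r and the variance bookkeeping below are choices made only for this example.

    import numpy as np

    def filtered_sgd_step(w, g, state, lr=0.1, q=1e-4, r=1e-2):
        # w: parameters; g: current stochastic (mini-batch) gradient at w.
        # state holds m (filtered first-moment estimate) and p (its uncertainty).
        m = state.get("m", np.zeros_like(w))
        p = state.get("p", np.ones_like(w))

        p = p + q                # predict: estimate uncertainty grows between steps
        k = p / (p + r)          # time-varying (Wiener/Kalman) gain
        m = m + k * (g - m)      # correct: blend the fresh gradient into the estimate
        p = (1.0 - k) * p        # uncertainty shrinks after the correction

        w = w - lr * m           # plain SGD step along the filtered gradient
        state["m"], state["p"] = m, p
        return w, state

In practice the measurement-noise term r could itself be estimated from the running variance of g, which is closer in spirit to the historical-gradient variance reduction mentioned in the abstract.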
Related papers
- Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results [60.92029979853314]
This paper investigates normalized SGD with clipping (NSGDC) and its variance-reduction variant (NSGDC-VR).
We present significant improvements in the theoretical results for both algorithms.
arXiv Detail & Related papers (2024-10-21T22:40:42Z)
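For context, the update rules below show one standard form of gradient normalization with and without clipping; they are a generic reference point and not necessarily the exact NSGDC/NSGDC-VR updates analyzed in the paper.

    w_{t+1} = w_t - \eta_t \, g_t / \|g_t\|                          (normalization only)
    w_{t+1} = w_t - \eta_t \, \min\{1, \tau / \|g_t\|\} \, g_t       (clipping at threshold \tau)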
- Faster Convergence of Stochastic Accelerated Gradient Descent under Interpolation [51.248784084461334]
We prove new convergence rates for a generalized version of stochastic Nesterov acceleration under interpolation conditions.
Our analysis reduces the dependence on the strong growth constant from $\rho$ to $\sqrt{\rho}$ compared to prior work.
arXiv Detail & Related papers (2024-04-03T00:41:19Z)
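For reference, a textbook form of stochastic Nesterov acceleration, written here only to fix notation (the paper analyzes a generalized variant):

    y_t     = x_t + \beta_t (x_t - x_{t-1})
    x_{t+1} = y_t - \eta_t \, g(y_t),    where g(y_t) is a stochastic gradient at y_t

Under interpolation (each per-sample loss is minimized at the same point), the stochastic gradient noise vanishes at the optimum, which is what enables the faster accelerated rates.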
- Score-Guided Intermediate Layer Optimization: Fast Langevin Mixing for Inverse Problem [97.64313409741614]
We prove fast mixing and characterize the stationary distribution of the Langevin Algorithm for inverting random weighted DNN generators.
We propose to do posterior sampling in the latent space of a pre-trained generative model.
arXiv Detail & Related papers (2022-06-18T03:47:37Z)
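The generic (unadjusted) Langevin update used for such posterior sampling is shown below; running it in the latent space z of a pre-trained generator G means the reconstruction is G(z). This is the standard template rather than the paper's specific algorithm.

    z_{k+1} = z_k + \eta \, \nabla_z \log p(z_k \mid y) + \sqrt{2\eta} \, \xi_k,    \xi_k \sim \mathcal{N}(0, I)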
- Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum under the same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z)
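The defining feature of the asynchronous setting is that gradients may be computed at stale iterates; a generic delayed-gradient model (an assumption about the setup, not necessarily the paper's exact formulation) is:

    x_{t+1} = x_t - \eta_t \, g(x_{t - d_t}),    where d_t \ge 0 is the (possibly large) delay of the worker reporting at time t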
- AdaL: Adaptive Gradient Transformation Contributes to Convergences and Generalizations [4.991328448898387]
We propose AdaL, which applies a transformation to the original gradient.
AdaL accelerates convergence by amplifying the gradient in the early stage, and later dampens oscillation and stabilizes the optimization by shrinking the gradient.
arXiv Detail & Related papers (2021-07-04T02:55:36Z)
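The summary does not spell out AdaL's exact transformation; purely to illustrate the "amplify early, shrink later" behaviour it describes, one can picture a monotone time-varying gain applied to the gradient (a hypothetical schedule, not AdaL's actual rule):

    \tilde g_t = \kappa(t) \, g_t,    with \kappa(t) > 1 for small t and \kappa(t) < 1 for large t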
- Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization [89.7882166459412]
Stochastic gradient noise (SGN) acts as implicit regularization for deep learning.
Some works attempted to artificially simulate SGN by injecting random noise to improve deep learning.
For simulating SGN at low computational costs and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach.
arXiv Detail & Related papers (2021-03-31T16:08:06Z)
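Schematically -- and only as an assumption suggested by the name, with the paper's normalization details omitted -- PNM keeps two momentum buffers updated on alternating mini-batches and combines them with a positive and a negative weight, so that the mismatch between the buffers injects extra gradient-noise-like variance:

    \Delta_t = (1 + \beta_0) \, m_t^{(a)} - \beta_0 \, m_t^{(b)},    x_{t+1} = x_t - \eta \, \Delta_t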
- The Role of Momentum Parameters in the Optimal Convergence of Adaptive Polyak's Heavy-ball Methods [12.93796690939018]
We prove that the adaptive Polyak's Heavy-ball (HB) method attains an optimal individual convergence rate of $O(\frac{1}{\sqrt{t}})$.
Our new analysis shows how the HB momentum and its time-varying weight help us to achieve the acceleration in convex optimization.
arXiv Detail & Related papers (2021-02-15T02:57:14Z)
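For reference, the Polyak heavy-ball update with time-varying step size and momentum weight reads:

    x_{t+1} = x_t - \alpha_t \, g(x_t) + \beta_t \, (x_t - x_{t-1})

The "individual" rate in the summary refers to the convergence of the iterates $x_t$ themselves rather than of their running average.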
- Convergence Analysis of Homotopy-SGD for non-convex optimization [43.71213126039448]
We present a first-order algorithm based on a combination of homotopy methods and SGD, called Homotopy-Stochastic Gradient Descent (H-SGD).
Under some assumptions, we conduct a theoretical analysis of the proposed algorithm.
Experimental results show that H-SGD can outperform SGD.
arXiv Detail & Related papers (2020-11-20T09:50:40Z)
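A generic homotopy/continuation plus SGD loop is sketched below to convey the idea; the surrogate objective, blending schedule, and analysis in H-SGD may differ, and grad_easy/grad_target are placeholder names introduced only for this example.

    def homotopy_sgd(w, grad_easy, grad_target, stages=10, steps=100, lr=0.01):
        # Gradually deform an easier surrogate objective into the target loss,
        # warm-starting SGD at every stage of the homotopy.
        for s in range(stages):
            lam = (s + 1) / stages          # homotopy parameter swept towards 1
            for _ in range(steps):
                g = (1 - lam) * grad_easy(w) + lam * grad_target(w)
                w = w - lr * g              # plain SGD step on the blended objective
        return w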
- Obtaining Adjustable Regularization for Free via Iterate Averaging [43.75491612671571]
Regularization for optimization is a crucial technique to avoid overfitting in machine learning.
We establish an averaging scheme that converts the iterates of SGD on an arbitrary strongly convex and smooth objective function to its regularized counterpart.
Our approaches can be used for accelerated and preconditioned optimization methods as well.
arXiv Detail & Related papers (2020-08-15T15:28:05Z)
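One well-known scheme of this kind is geometric (exponentially weighted) iterate averaging, shown here as a generic example rather than the paper's exact construction:

    \bar{x}_T = \sum_{t=0}^{T} \gamma^{T-t} x_t \Big/ \sum_{t=0}^{T} \gamma^{T-t},    \gamma \in (0, 1)

The SGD trajectory $\{x_t\}$ is left untouched and only post-processed, so the regularization effect, adjustable through $\gamma$, indeed comes "for free".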
- When Does Preconditioning Help or Hurt Generalization? [74.25170084614098]
We show how the implicit bias of first- and second-order methods affects the comparison of their generalization properties.
We discuss several approaches to manage the bias-variance tradeoff, and the potential benefit of interpolating between GD and NGD.
arXiv Detail & Related papers (2020-06-18T17:57:26Z)
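A common way to interpolate between the two regimes is a damped preconditioner, e.g.

    x_{t+1} = x_t - \eta \, (F + \lambda I)^{-1} g_t

which recovers a rescaled GD step as $\lambda \to \infty$ and natural gradient descent as $\lambda \to 0$ when $F$ is the Fisher information matrix; this is a standard parameterization and an assumption about, not a quote of, the paper's setup.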
- Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs [25.158203665218164]
We show that adaptive gradient methods can be faster than random shuffling SGD after finite time.
To the best of our knowledge, this is the first work to demonstrate that adaptive gradient methods can be faster than SGD after finite time.
arXiv Detail & Related papers (2020-06-12T09:39:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.