Momentum Doesn't Change the Implicit Bias
- URL: http://arxiv.org/abs/2110.03891v1
- Date: Fri, 8 Oct 2021 04:37:18 GMT
- Title: Momentum Doesn't Change the Implicit Bias
- Authors: Bohan Wang, Qi Meng, Huishuai Zhang, Ruoyu Sun, Wei Chen, Zhi-Ming Ma
- Abstract summary: We analyze the implicit bias of momentum-based optimization.
We construct new Lyapunov functions as a tool to analyze the gap between the model parameter and the max-margin solution.
- Score: 36.301490759243876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The momentum acceleration technique is widely adopted in many optimization
algorithms. However, the theoretical understanding of how the momentum affects
the generalization performance of the optimization algorithms is still unknown.
In this paper, we answer this question through analyzing the implicit bias of
momentum-based optimization. We prove that both SGD with momentum and Adam
converge to the $L_2$ max-margin solution for exponential-tailed loss, which is
the same as vanilla gradient descent. That means, these optimizers with
momentum acceleration still converge to a model with low complexity, which
provides guarantees on their generalization. Technically, to overcome the
difficulty brought by the error accumulation in analyzing the momentum, we
construct new Lyapunov functions as a tool to analyze the gap between the model
parameter and the max-margin solution.
Related papers
- A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO)
We introduce a novel perspective by turning a given BLO problem into a ii optimization, where the inner loss function becomes a smooth distribution, and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - Accelerated forward-backward and Douglas-Rachford splitting dynamics [0.0]
We study convergence properties of continuous-time variants of accelerated Forward-Backward (FB) and Douglas-Rachford (DR) splitting algorithms.
For FB splitting dynamics, we demonstrate that accelerated exponential convergence rate carries over to general strongly convex problems.
arXiv Detail & Related papers (2024-07-30T07:52:22Z) - Learning to Optimize with Stochastic Dominance Constraints [103.26714928625582]
In this paper, we develop a simple yet efficient approach for the problem of comparing uncertain quantities.
We recast inner optimization in the Lagrangian as a learning problem for surrogate approximation, which bypasses apparent intractability.
The proposed light-SD demonstrates superior performance on several representative problems ranging from finance to supply chain management.
arXiv Detail & Related papers (2022-11-14T21:54:31Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
gradient Langevin Dynamics is one of the most fundamental algorithms to solve non-eps optimization problems.
In this paper, we show two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Optimizing Information-theoretical Generalization Bounds via Anisotropic
Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD.
We prove that with constraint to guarantee low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance.
arXiv Detail & Related papers (2021-10-26T15:02:27Z) - Just a Momentum: Analytical Study of Momentum-Based Acceleration Methods
in Paradigmatic High-Dimensional Non-Convex Problem [12.132641563193584]
When over loss functions it is common practice to use momentum-based methods rather than vanilla gradient-based loss method.
We show how having a mass increases the effective step ball dynamics dynamics leading to up.
arXiv Detail & Related papers (2021-02-23T15:30:57Z) - The Role of Momentum Parameters in the Optimal Convergence of Adaptive
Polyak's Heavy-ball Methods [12.93796690939018]
We prove that the adaptive Polyak's Heavy-ball (HB) method attains an optimal individual convergence rate of $O(frac1sqrtt)$.
Our new analysis shows how the HB momentum and its time-varying weight help us to achieve the acceleration in convex optimization.
arXiv Detail & Related papers (2021-02-15T02:57:14Z) - A New Accelerated Stochastic Gradient Method with Momentum [4.967897656554012]
gradient descent with momentum (Sgdm) use weights that decay exponentially with the iteration times to generate an momentum term.
We provide theoretical convergence properties analyses for our method, which show both the exponentially decay weights and our inverse proportionally decay weights can limit the variance of the moving direction of parameters to be optimized to a region.
arXiv Detail & Related papers (2020-05-31T03:04:32Z) - Revisiting SGD with Increasingly Weighted Averaging: Optimization and
Generalization Perspectives [50.12802772165797]
The averaging technique combines all iterative solutions into a single solution.
Experiments have demonstrated trade-off and the effectiveness of averaging compared with other averaging schemes.
arXiv Detail & Related papers (2020-03-09T18:14:00Z) - Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum
under Heavy-Tailed Gradient Noise [39.9241638707715]
We show that FULD has similarities with enatural and egradient methods on their role in deep learning.
arXiv Detail & Related papers (2020-02-13T18:04:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.