How Does Adaptive Optimization Impact Local Neural Network Geometry?
- URL: http://arxiv.org/abs/2211.02254v1
- Date: Fri, 4 Nov 2022 04:05:57 GMT
- Title: How Does Adaptive Optimization Impact Local Neural Network Geometry?
- Authors: Kaiqi Jiang, Dhruv Malik, Yuanzhi Li
- Abstract summary: We argue that in the context of neural network optimization, this traditional viewpoint is insufficient.
We show that adaptive methods such as Adam bias the trajectories towards regions where one might expect faster convergence.
- Score: 32.32593743852949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive optimization methods are well known to achieve superior convergence
relative to vanilla gradient methods. The traditional viewpoint in
optimization, particularly in convex optimization, explains this improved
performance by arguing that, unlike vanilla gradient schemes, adaptive
algorithms mimic the behavior of a second-order method by adapting to the
global geometry of the loss function. We argue that in the context of neural
network optimization, this traditional viewpoint is insufficient. Instead, we
advocate for a local trajectory analysis. For iterate trajectories produced by
running a generic optimization algorithm OPT, we introduce
$R^{\text{OPT}}_{\text{med}}$, a statistic that is analogous to the condition
number of the loss Hessian evaluated at the iterates. Through extensive
experiments, we show that adaptive methods such as Adam bias the trajectories
towards regions where $R^{\text{Adam}}_{\text{med}}$ is small, where one might
expect faster convergence. By contrast, vanilla gradient methods like SGD bias
the trajectories towards regions where $R^{\text{SGD}}_{\text{med}}$ is
comparatively large. We complement these empirical observations with a
theoretical result that provably demonstrates this phenomenon in the simplified
setting of a two-layer linear network. We view our findings as evidence for the
need of a new explanation of the success of adaptive methods, one that is
different than the conventional wisdom.
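The precise definition of $R^{\text{OPT}}_{\text{med}}$ is given in the paper; as a rough illustration only, the sketch below computes a condition-number-like proxy (the ratio of the largest to the median absolute eigenvalue of the loss Hessian) at a single iterate of a toy two-layer linear network. The model, data, and the `r_med_proxy` helper are illustrative assumptions, not the paper's exact statistic.

```python
# Illustrative sketch (not the paper's definition): a condition-number-like
# statistic of the loss Hessian at one iterate, on a toy two-layer linear net.
import torch

torch.manual_seed(0)

# Toy data and a two-layer linear network, mirroring the paper's simplified setting.
x = torch.randn(64, 10)
y = torch.randn(64, 1)
w1 = torch.randn(10, 8, requires_grad=True)
w2 = torch.randn(8, 1, requires_grad=True)
params = [w1, w2]

def loss_fn():
    # Mean squared error of the linear network x @ w1 @ w2.
    return ((x @ w1 @ w2 - y) ** 2).mean()

def flat_hessian(loss, params):
    """Dense Hessian of `loss` w.r.t. the flattened parameters (fine at toy sizes)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    rows = []
    for g_i in flat_grad:
        row = torch.autograd.grad(g_i, params, retain_graph=True)
        rows.append(torch.cat([r.reshape(-1) for r in row]))
    return torch.stack(rows)

def r_med_proxy(loss, params):
    """Hypothetical proxy for R_med: max / median of absolute Hessian eigenvalues."""
    H = flat_hessian(loss, params)
    eigs = torch.linalg.eigvalsh(0.5 * (H + H.T)).abs()
    return (eigs.max() / eigs.median()).item()

print("R_med proxy at this iterate:", r_med_proxy(loss_fn(), params))
```

In the spirit of the paper, one would track such a statistic along the iterate trajectories of Adam and SGD and compare their typical magnitudes; the definition above is only a stand-in for that comparison.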
Related papers
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape, and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Neural Gradient Learning and Optimization for Oriented Point Normal
Estimation [53.611206368815125]
We propose a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation.
We learn an angular distance field based on local plane geometry to refine the coarse gradient vectors.
Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability of local feature description.
arXiv Detail & Related papers (2023-09-17T08:35:11Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyperparameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the learning rate $\alpha$ based on situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly in $n$.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - A Particle-based Sparse Gaussian Process Optimizer [5.672919245950197]
We present a new particle-swarm-based framework utilizing the underlying dynamical process of gradient descent.
The biggest advantage of this approach is greater exploration around the current state before deciding on a descent direction.
arXiv Detail & Related papers (2022-11-26T09:06:15Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to Non-uniform Smoothness (NS) and the Non-uniform Łojasiewicz inequality (NL).
These new definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work concerns zeroth-order (ZO) optimization, which does not require first-order gradient information.
We show that with a graceful design of coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity and function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous
Neural Networks [21.63353575405414]
We study the implicit bias of adaptive optimization algorithms on homogeneous neural networks.
It is the first work to study the convergent direction of adaptive optimizations on non-linear deep neural networks.
arXiv Detail & Related papers (2020-12-11T11:15:32Z) - BAMSProd: A Step towards Generalizing the Adaptive Optimization Methods
to Deep Binary Model [34.093978443640616]
Recent methods have significantly reduced the performance degradation of Binary Neural Networks (BNNs), but guaranteeing their effective and efficient training remains an unsolved problem.
We propose the BAMSProd algorithm, based on the key observation that the convergence of optimizing a deep binary model is strongly related to quantization errors.
arXiv Detail & Related papers (2020-09-29T06:12:32Z) - Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite
Epochs [25.158203665218164]
We show that adaptive gradient methods can be faster than random-shuffling SGD after finite time.
To the best of our knowledge, this is the first work to demonstrate that adaptive gradient methods can be faster than SGD after finite time.
arXiv Detail & Related papers (2020-06-12T09:39:47Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of an Optimistic gradient algorithm for non-concave min-max problems.
Our experiments on GAN training empirically confirm the advantage of adaptive gradient algorithms over non-adaptive ones.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)