Evolutionary Algorithms in the Light of SGD: Limit Equivalence, Minima Flatness, and Transfer Learning
- URL: http://arxiv.org/abs/2306.09991v1
- Date: Sat, 20 May 2023 22:26:44 GMT
- Title: Evolutionary Algorithms in the Light of SGD: Limit Equivalence, Minima Flatness, and Transfer Learning
- Authors: Andrei Kucharavy, Rachid Guerraoui and Ljiljana Dolamic
- Abstract summary: We show that a class of evolutionary algorithms (EAs) inspired by the Gillespie-Orr Mutational Landscapes model for natural evolution is formally equivalent to Stochastic Gradient Descent (SGD) in certain settings.
We then show that for ANNs trained to near-optimality or in the transfer learning setting, the equivalence also allows transferring the insights from the Mutational Landscapes model to SGD.
- Score: 7.262048441360132
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whenever applicable, Stochastic Gradient Descent (SGD) has shown itself
to be unreasonably effective. Instead of underperforming and getting trapped in
local minima due to batch noise, SGD leverages that noise to generalize
better and find minima that are good enough for the entire dataset. This led to
numerous theoretical and experimental investigations, especially in the context
of Artificial Neural Networks (ANNs), leading to better machine learning
algorithms. However, SGD is not applicable in a non-differentiable setting,
leaving all that prior research off the table.
In this paper, we show that a class of evolutionary algorithms (EAs) inspired
by the Gillespie-Orr Mutational Landscapes model for natural evolution is
formally equivalent to SGD in certain settings and, in practice, is well
adapted to large ANNs. We refer to such EAs as the Gillespie-Orr EA class (GO-EAs)
and empirically show how an insight transfer from SGD can work for them. We
then show that for ANNs trained to near-optimality or in the transfer learning
setting, the equivalence also allows transferring the insights from the
Mutational Landscapes model to SGD.
We then leverage this equivalence to experimentally show how SGD and GO-EAs
can provide mutual insight through examples of minima flatness, transfer
learning, and mixing of individuals in EAs applied to large models.
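The abstract does not spell out the GO-EA update rule, so the following is only a minimal sketch of a Gillespie-Orr-style mutation-selection loop shown next to a plain SGD step on a toy least-squares problem; the fitness definition, mutation scale, and offspring count are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, X, y):
    """Mean squared error of a linear model; stands in for an ANN loss."""
    return np.mean((X @ w - y) ** 2)

def sgd_step(w, X, y, lr=0.1, batch=8):
    """One SGD step on a random mini-batch (analytic gradient of the MSE)."""
    idx = rng.choice(len(y), size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch
    return w - lr * grad

def go_ea_step(w, X, y, n_offspring=16, mutation_scale=0.05, batch=8):
    """One mutation-selection step: sample small random mutations and keep the
    best offspring if it beats the parent on a random mini-batch (a minimal
    sketch of Gillespie-Orr-style adaptive-walk dynamics, not the paper's GO-EA)."""
    idx = rng.choice(len(y), size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    parent_fit = -loss(w, Xb, yb)                    # fitness = negative loss
    offspring = w + mutation_scale * rng.standard_normal((n_offspring, w.size))
    fits = np.array([-loss(o, Xb, yb) for o in offspring])
    best = fits.argmax()
    return offspring[best] if fits[best] > parent_fit else w

# Toy data: both update rules descend the same mini-batch landscape.
X = rng.standard_normal((256, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(256)
w_sgd = w_ea = np.zeros(5)
for _ in range(200):
    w_sgd = sgd_step(w_sgd, X, y)
    w_ea = go_ea_step(w_ea, X, y)
print(f"SGD loss: {loss(w_sgd, X, y):.4f}  GO-EA-style loss: {loss(w_ea, X, y):.4f}")
```

The intended parallel is that both updates move the parameters using noisy mini-batch information: SGD through an explicit gradient, the mutation-selection loop through accepted random perturbations.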
Related papers
- Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation [3.6185342807265415]
It remains an open research problem to explain the success and the limitations of SGD methods in rigorous theoretical terms.
In this work we prove, for a large class of SGD methods, that the considered optimization method fails with high probability to converge to global minimizers of the optimization problem.
The general non-convergence results of this work apply not only to the plain vanilla standard SGD method but also to a large class of accelerated and adaptive SGD methods.
arXiv Detail & Related papers (2024-10-14T14:11:37Z)
- The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization [4.7256945641654164]
Stochastic gradient descent (SGD) is a widely used algorithm in machine learning, particularly for neural network training.
Recent studies on SGD for canonical quadratic optimization or linear regression show that it attains good generalization in suitable high-dimensional settings.
This paper investigates SGD with two essential components in practice: exponentially decaying step size schedule and momentum.
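For reference, here is a minimal sketch of the two components that entry highlights, heavy-ball momentum combined with an exponentially decaying step size, applied to a toy high-dimensional quadratic with noisy gradients; the constants are illustrative and not taken from the paper.

```python
import numpy as np

def sgd_momentum_exp_decay(grad_fn, w0, lr0=0.5, decay=0.99, beta=0.9, steps=100):
    """Heavy-ball SGD: v_{t+1} = beta*v_t - lr_t*g_t, w_{t+1} = w_t + v_{t+1},
    with an exponentially decaying step size lr_t = lr0 * decay**t."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    for t in range(steps):
        lr_t = lr0 * decay ** t
        v = beta * v - lr_t * grad_fn(w)
        w = w + v
    return w

# Toy high-dimensional quadratic 0.5 * w^T diag(h) w with noisy gradients.
rng = np.random.default_rng(1)
h = np.linspace(0.1, 2.0, 50)                        # eigenvalue spectrum
noisy_grad = lambda w: h * w + 0.01 * rng.standard_normal(w.size)
w_final = sgd_momentum_exp_decay(noisy_grad, rng.standard_normal(50))
print("final objective:", 0.5 * np.sum(h * w_final ** 2))
```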
arXiv Detail & Related papers (2024-09-15T14:20:03Z)
- Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates [3.6185342807265415]
Deep learning algorithms are the key ingredients in many artificial intelligence (AI) systems.
Deep learning algorithms typically consist of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method.
arXiv Detail & Related papers (2024-07-11T00:10:35Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label-smoothing value during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
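As described, UAL adapts the label-smoothing value per sample according to its uncertainty. The sketch below shows one way such an objective could look; the linear mapping from uncertainty to the smoothing value and the `max_smoothing` cap are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def smoothed_cross_entropy(logits, target, smoothing):
    """Cross-entropy against a label-smoothed target distribution:
    (1 - smoothing) on the true class, smoothing spread over the others."""
    n_classes = logits.shape[-1]
    z = logits - logits.max()
    log_probs = z - np.log(np.sum(np.exp(z)))          # stable log-softmax
    soft_target = np.full(n_classes, smoothing / (n_classes - 1))
    soft_target[target] = 1.0 - smoothing
    return -np.sum(soft_target * log_probs)

def ual_style_loss(logits, target, uncertainty, max_smoothing=0.2):
    """Uncertainty-adaptive smoothing (illustrative): more uncertain samples
    get a softer target, confident samples get a near one-hot target."""
    smoothing = max_smoothing * float(np.clip(uncertainty, 0.0, 1.0))
    return smoothed_cross_entropy(logits, target, smoothing)

logits = np.array([2.0, 0.5, -1.0])
print(ual_style_loss(logits, target=0, uncertainty=0.1))   # confident sample
print(ual_style_loss(logits, target=0, uncertainty=0.9))   # uncertain sample
```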
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication [37.210933391984014]
Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice.
We provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions.
We also show the min-max optimality of accelerated mini-batch SGD for several problem classes.
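For readers unfamiliar with the setting, here is a minimal sketch of local SGD with intermittent communication: each worker runs several local SGD steps on its own (heterogeneous) data, and the workers average their parameters once per communication round. The data model and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def local_sgd(worker_data, w0, rounds=20, local_steps=10, lr=0.02):
    """Local SGD: every worker takes `local_steps` SGD steps on its own
    (heterogeneous) data, then all workers average their parameters once
    per communication round."""
    w_global = np.array(w0, dtype=float)
    for _ in range(rounds):
        local_models = []
        for X, y in worker_data:
            w = w_global.copy()
            for _ in range(local_steps):
                idx = rng.integers(len(y), size=8)
                grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
                w -= lr * grad
            local_models.append(w)
        w_global = np.mean(local_models, axis=0)     # intermittent communication
    return w_global

# Two workers with shifted data distributions (first-order heterogeneity).
w_true = np.array([1.0, -1.0, 2.0])
def make_worker(shift):
    X = rng.standard_normal((200, 3)) + shift
    return X, X @ w_true + 0.1 * rng.standard_normal(200)

w_hat = local_sgd([make_worker(0.0), make_worker(1.5)], np.zeros(3))
print("parameter error:", np.linalg.norm(w_hat - w_true))
```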
arXiv Detail & Related papers (2024-05-19T20:20:03Z)
- Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates [21.8377731053374]
We investigate the generalization properties of neural networks (NN) trained by the stochastic gradient descent (SGD) algorithm with large learning rates.
Under such a training regime, our finding is that the oscillation of the NN weights caused by large-learning-rate SGD training turns out to be beneficial to the generalization of the NN.
arXiv Detail & Related papers (2023-10-26T00:35:40Z)
- On-Device Domain Generalization [93.79736882489982]
Domain generalization is critical to on-device machine learning applications.
We find that knowledge distillation is a strong candidate for solving the problem.
We propose a simple idea called out-of-distribution knowledge distillation (OKD), which aims to teach the student how the teacher handles (synthetic) out-of-distribution data.
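A minimal sketch of an OKD-style objective, assuming the standard distillation recipe: cross-entropy on in-distribution data plus a temperature-scaled KL term that matches the student to the teacher on synthetic out-of-distribution inputs. The loss weighting and the way OOD inputs would be synthesized are assumptions, not the paper's specification.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def okd_style_loss(student_logits_id, labels_id,
                   student_logits_ood, teacher_logits_ood,
                   T=2.0, alpha=0.5):
    """Illustrative OKD-style objective: cross-entropy on in-distribution data
    plus a KL term pulling the student toward the teacher's behaviour on
    synthetic out-of-distribution inputs."""
    p_student_id = softmax(student_logits_id)
    ce = -np.mean(np.log(p_student_id[np.arange(len(labels_id)), labels_id]))
    p_teacher = softmax(teacher_logits_ood, T)
    p_student = softmax(student_logits_ood, T)
    kl = np.mean(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1))
    return ce + alpha * (T ** 2) * kl     # T^2 rescaling as in standard distillation

rng = np.random.default_rng(3)
loss = okd_style_loss(rng.standard_normal((4, 10)), np.array([1, 3, 0, 7]),
                      rng.standard_normal((4, 10)), rng.standard_normal((4, 10)))
print(loss)
```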
arXiv Detail & Related papers (2022-09-15T17:59:31Z)
- Understanding Overparameterization in Generative Adversarial Networks [56.57403335510056]
Training Generative Adversarial Networks (GANs) involves solving non-concave min-max optimization problems.
Theoretical work has highlighted the importance of gradient descent (GD) dynamics for reaching globally optimal solutions.
We show that in an overparameterized GAN with a one-layer neural network generator and a linear discriminator, gradient descent-ascent (GDA) converges to a global saddle point of the underlying non-concave min-max problem.
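As a reference for the update rule analyzed there, here is a minimal gradient descent-ascent (GDA) loop on a toy strongly convex-concave min-max problem; the objective is illustrative and not the GAN objective studied in the paper.

```python
import numpy as np

def gda(grad_x, grad_y, x0, y0, lr=0.05, steps=500):
    """Simultaneous gradient descent-ascent: x descends f(x, y) while
    y ascends it, with the same step size for both players."""
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x, y = x - lr * gx, y + lr * gy
    return x, y

# Toy min-max problem f(x, y) = 0.5*||x||^2 + x.y - 0.5*||y||^2
# (strongly convex-concave, so GDA converges to the saddle point at the origin).
grad_x = lambda x, y: x + y
grad_y = lambda x, y: x - y
x_star, y_star = gda(grad_x, grad_y, np.ones(3), -np.ones(3))
print(np.linalg.norm(x_star), np.linalg.norm(y_star))
```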
arXiv Detail & Related papers (2021-04-12T16:23:37Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum [97.84312669132716]
We disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and flat minima selection.
Our experiments demonstrate that the proposed adaptive inertia method can generalize significantly better than SGD and conventional adaptive gradient methods.
arXiv Detail & Related papers (2020-06-29T05:21:02Z)
- Interpretable Learning-to-Rank with Generalized Additive Models [78.42800966500374]
Interpretability of learning-to-rank models is a crucial yet relatively under-examined research area.
Recent progress on interpretable ranking models largely focuses on generating post-hoc explanations for existing black-box ranking models.
We lay the groundwork for intrinsically interpretable learning-to-rank by introducing generalized additive models (GAMs) into ranking tasks.
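To make the intrinsic interpretability concrete, here is a minimal sketch of a GAM-style ranking scorer in which each feature passes through its own sub-model and the contributions sum to the item score; the piecewise-constant sub-models and all names are illustrative, not the paper's architecture.

```python
import numpy as np

class TinyRankingGAM:
    """Illustrative generalized additive model for ranking: the score of an
    item is a sum of independent per-feature sub-model outputs, so each
    feature's contribution to the ranking can be inspected directly."""

    def __init__(self, n_features, n_bins=8):
        # One learned value per (feature, bin): a piecewise-constant sub-model.
        self.bin_values = np.zeros((n_features, n_bins))
        self.edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]

    def feature_contributions(self, x):
        bins = np.digitize(x, self.edges)
        return self.bin_values[np.arange(len(x)), bins]

    def score(self, x):
        return self.feature_contributions(x).sum()

gam = TinyRankingGAM(n_features=3)
gam.bin_values[0] += np.linspace(0, 1, 8)   # e.g. score grows with feature 0
item = np.array([0.9, 0.2, 0.5])
print("score:", gam.score(item), "per-feature:", gam.feature_contributions(item))
```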
arXiv Detail & Related papers (2020-05-06T01:51:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.