Lottery Tickets in Evolutionary Optimization: On Sparse
Backpropagation-Free Trainability
- URL: http://arxiv.org/abs/2306.00045v1
- Date: Wed, 31 May 2023 15:58:54 GMT
- Title: Lottery Tickets in Evolutionary Optimization: On Sparse
Backpropagation-Free Trainability
- Authors: Robert Tjarko Lange, Henning Sprekeler
- Abstract summary: We study gradient descent (GD)-based sparse training and evolution strategies (ES).
We find that ES explore diverse and flat local optima and do not preserve linear mode connectivity across sparsity levels and independent runs.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Is the lottery ticket phenomenon an idiosyncrasy of gradient-based training
or does it generalize to evolutionary optimization? In this paper we establish
the existence of highly sparse trainable initializations for evolution
strategies (ES) and characterize qualitative differences compared to gradient
descent (GD)-based sparse training. We introduce a novel signal-to-noise
iterative pruning procedure, which incorporates loss curvature information into
the network pruning step. This can enable the discovery of even sparser
trainable network initializations when using black-box evolution as compared to
GD-based optimization. Furthermore, we find that these initializations encode
an inductive bias, which transfers across different ES, related tasks and even
to GD-based training. Finally, we compare the local optima resulting from the
different optimization paradigms and sparsity levels. In contrast to GD, ES
explore diverse and flat local optima and do not preserve linear mode
connectivity across sparsity levels and independent runs. The results highlight
qualitative differences between evolution and gradient-based learning dynamics,
which can be uncovered by the study of iterative pruning procedures.
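As a concrete illustration of the signal-to-noise pruning idea: for an ES that maintains a per-weight search mean and standard deviation (e.g. a diagonal-Gaussian strategy), a natural per-weight score is |mean| / std. The sketch below is a minimal, hypothetical NumPy version of such an iterative pruning loop; `train_es` and its interface are assumptions, and the paper's actual criterion additionally incorporates loss-curvature information, which this sketch omits.

```python
import numpy as np

def snr_scores(mean, std, eps=1e-8):
    """Per-weight signal-to-noise ratio |mean| / std of a diagonal-Gaussian
    ES search distribution; low-SNR weights are pruning candidates."""
    return np.abs(mean) / (std + eps)

def iterative_snr_pruning(init_params, train_es, rounds=5, prune_frac=0.2):
    """Lottery-ticket-style iterative pruning loop: train with an ES, score
    surviving weights by SNR, prune the lowest-scoring fraction, rewind to
    the original initialization, and repeat.

    `train_es(params, mask)` is a hypothetical user-supplied routine that
    runs an ES on the masked network and returns the final per-weight mean
    and std of its search distribution."""
    mask = np.ones_like(init_params)
    for _ in range(rounds):
        mean, std = train_es(init_params * mask, mask)  # backprop-free training
        scores = snr_scores(mean, std).ravel()
        scores[mask.ravel() == 0] = np.inf              # ignore already-pruned weights
        k = int(prune_frac * mask.sum())                # prune a fraction of survivors
        prune_idx = np.argsort(scores)[:k]              # lowest-SNR surviving weights
        mask.flat[prune_idx] = 0.0
    return mask  # sparse mask over init_params = the candidate "ticket"
```

Rewinding to `init_params` after every pruning round follows the standard lottery-ticket protocol of searching for trainable sparse initializations rather than sparse trained solutions. The abstract's linear mode connectivity comparison can likewise be made concrete with a generic loss-barrier check (not the paper's exact protocol): interpolate linearly between two solutions and measure how far the loss rises above the line connecting the endpoint losses.

```python
import numpy as np

def lmc_barrier(params_a, params_b, loss_fn, n_points=11):
    """Loss barrier along the straight line between two solutions: the
    maximum excess of the interpolated loss over the linear interpolation
    of the endpoint losses. A near-zero barrier indicates linear mode
    connectivity."""
    alphas = np.linspace(0.0, 1.0, n_points)
    losses = np.array([loss_fn((1 - a) * params_a + a * params_b) for a in alphas])
    baseline = (1 - alphas) * losses[0] + alphas * losses[-1]
    return float(np.max(losses - baseline))
```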
Related papers
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Adaptive gradient algorithms such as Adam and its variants have been central to the development of this type of training.
We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z) - Variational Stochastic Gradient Descent for Deep Neural Networks [16.96187187108041]
The current state of the art comprises adaptive gradient-based optimization methods such as Adam.
Here, we propose to combine both approaches, resulting in the Variational Stochastic Gradient Descent (VSGD) optimizer.
We show how our VSGD method relates to other adaptive gradient-based methods such as Adam.
arXiv Detail & Related papers (2024-04-09T18:02:01Z) - Evolution Transformer: In-Context Evolutionary Optimization [6.873777465945062]
We introduce Evolution Transformer, a causal Transformer architecture, which can flexibly characterize a family of Evolution Strategies.
We train the model weights using Evolutionary Algorithm Distillation, a technique for supervised optimization of sequence models.
We analyze the resulting properties of the Evolution Transformer and propose a technique to fully self-referentially train the Evolution Transformer.
arXiv Detail & Related papers (2024-03-05T14:04:13Z) - On discretisation drift and smoothness regularisation in neural network
training [0.0]
We aim to take steps towards an improved understanding of deep learning, with a focus on optimisation and model regularisation.
We start by investigating gradient descent (GD), the discrete-time algorithm underlying most popular deep learning optimisation algorithms.
We derive novel continuous-time flows that account for discretisation drift. Unlike the negative gradient flow (NGF), these new flows can be used to describe learning-rate-specific behaviours of GD, such as training instabilities observed in supervised learning and two-player games.
We then translate insights from continuous time into mitigation strategies for unstable GD dynamics by constructing novel learning rate schedules and regularisers.
arXiv Detail & Related papers (2023-10-21T15:21:36Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential-rate), ab initio (hyper-parameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Variance-Reduced Gradient Estimation via Noise-Reuse in Online Evolution
Strategies [50.10277748405355]
Noise-Reuse Evolution Strategies (NRES) is a general class of unbiased online evolution strategies.
We show that NRES converges faster than existing automatic differentiation (AD) and ES methods in terms of wall-clock time and number of steps across a variety of applications (a minimal antithetic ES sketch appears after this list).
arXiv Detail & Related papers (2023-04-21T17:53:05Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective in solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We also show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Direct Evolutionary Optimization of Variational Autoencoders With Binary
Latents [0.0]
We show that it is possible to train Variational Autoencoders (VAEs) with discrete latents without sampling-based approximation and reparameterization.
In contrast to large supervised networks, the VAEs investigated here can, for example, denoise a single image without prior training on clean data and/or training on large image datasets.
arXiv Detail & Related papers (2020-11-27T12:42:12Z) - A Differential Game Theoretic Neural Optimizer for Training Residual
Networks [29.82841891919951]
We propose a generalized Differential Dynamic Programming (DDP) neural architecture that accepts both residual connections and convolution layers.
The resulting optimal control representation admits a game-theoretic perspective, in which training residual networks can be interpreted as cooperative trajectory optimization on state-augmented systems.
arXiv Detail & Related papers (2020-07-17T10:19:17Z) - Dynamic Hierarchical Mimicking Towards Consistent Optimization
Objectives [73.15276998621582]
We propose a generic feature learning mechanism to advance CNN training with enhanced generalization ability.
Partially inspired by DSN, we fork delicately designed side branches from the intermediate layers of a given neural network.
Experiments on both category and instance recognition tasks demonstrate the substantial improvements of our proposed method.
arXiv Detail & Related papers (2020-03-24T09:56:13Z)
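Several of the entries above, like the main paper, rely on black-box evolution strategies rather than backpropagation. As referenced in the NRES entry, here is a minimal antithetic-sampling ES step in NumPy in the generic OpenAI-ES style; it is a sketch of the general technique, not the specific noise-reuse scheme of NRES or the exact strategies used in the papers above.

```python
import numpy as np

def es_step(params, loss_fn, pop_size=64, sigma=0.1, lr=0.02, rng=None):
    """One antithetic-sampling ES update: estimate a search gradient of the
    smoothed loss E[loss(params + sigma * eps)] from paired perturbations
    +eps / -eps and take a step, without backpropagating through loss_fn."""
    rng = np.random.default_rng() if rng is None else rng
    half = rng.standard_normal((pop_size // 2, params.size))
    eps = np.concatenate([half, -half])                 # antithetic pairs
    losses = np.array([loss_fn(params + sigma * e.reshape(params.shape))
                       for e in eps])
    # Centered-rank fitness: lowest loss -> +0.5, highest loss -> -0.5.
    ranks = np.argsort(np.argsort(-losses))
    weights = ranks / (len(losses) - 1) - 0.5
    grad = (weights[:, None] * eps).sum(axis=0) / (len(losses) * sigma)
    return params + lr * grad.reshape(params.shape)     # ascend the fitness estimate

# Toy usage: a few backpropagation-free steps on a quadratic.
theta = np.ones(10)
for _ in range(100):
    theta = es_step(theta, lambda p: float(np.sum(p ** 2)))
```

Strategies that additionally adapt a per-weight standard deviation (SNES- or sep-CMA-ES-style) provide the per-weight mean and std assumed by the signal-to-noise pruning sketch earlier on this page.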
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.