Align, then memorise: the dynamics of learning with feedback alignment
- URL: http://arxiv.org/abs/2011.12428v2
- Date: Thu, 10 Jun 2021 14:20:37 GMT
- Title: Align, then memorise: the dynamics of learning with feedback alignment
- Authors: Maria Refinetti, Stéphane d'Ascoli, Ruben Ohana, Sebastian Goldt
- Abstract summary: Direct Feedback Alignment (DFA) is an efficient alternative to the ubiquitous backpropagation algorithm for training deep neural networks.
DFA successfully trains state-of-the-art models such as Transformers, but it notoriously fails to train convolutional networks.
Here, we propose a theory for the success of DFA.
- Score: 12.587037358391418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct Feedback Alignment (DFA) is emerging as an efficient and biologically
plausible alternative to the ubiquitous backpropagation algorithm for training
deep neural networks. Despite relying on random feedback weights for the
backward pass, DFA successfully trains state-of-the-art models such as
Transformers. On the other hand, it notoriously fails to train convolutional
networks. An understanding of the inner workings of DFA to explain these
diverging results remains elusive. Here, we propose a theory for the success of
DFA. We first show that learning in shallow networks proceeds in two steps: an
alignment phase, where the model adapts its weights to align the approximate
gradient with the true gradient of the loss function, is followed by a
memorisation phase, where the model focuses on fitting the data. This two-step
process has a degeneracy breaking effect: out of all the low-loss solutions in
the landscape, a network trained with DFA naturally converges to the solution
which maximises gradient alignment. We also identify a key quantity underlying
alignment in deep linear networks: the conditioning of the alignment matrices.
The latter enables a detailed understanding of the impact of data structure on
alignment, and suggests a simple explanation for the well-known failure of DFA
to train convolutional neural networks. Numerical experiments on MNIST and
CIFAR10 clearly demonstrate degeneracy breaking in deep non-linear networks and
show that the align-then-memorise process occurs sequentially from the bottom
layers of the network to the top.
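To make the mechanism concrete, below is a minimal, illustrative sketch of DFA on a one-hidden-layer network: the output error is sent back through a fixed random feedback matrix instead of the transpose of the forward weights, and the cosine similarity between the DFA pseudo-gradient and the true backpropagation gradient tracks the gradient alignment discussed above. The architecture, toy teacher data and hyper-parameters are assumptions chosen for illustration, not the paper's experimental setup.

```python
# Minimal sketch of Direct Feedback Alignment (DFA) on a one-hidden-layer
# network, with the true backprop gradient computed alongside so that the
# gradient alignment described in the abstract can be monitored.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, n = 50, 100, 1, 1000

# Toy regression data from a random linear teacher (illustrative assumption).
X = rng.normal(size=(n, d_in)) / np.sqrt(d_in)
Y = X @ rng.normal(size=(d_in, d_out))

W1 = rng.normal(size=(d_in, d_hid)) / np.sqrt(d_in)
W2 = rng.normal(size=(d_hid, d_out)) / np.sqrt(d_hid)
B = rng.normal(size=(d_out, d_hid)) / np.sqrt(d_hid)  # fixed random feedback matrix
lr = 0.2

def cosine(a, b):
    """Cosine similarity between two gradient matrices (flattened)."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for step in range(5001):
    # Forward pass.
    h = np.tanh(X @ W1)          # hidden activations
    e = h @ W2 - Y               # output error, shape (n, d_out)

    # True backprop gradient for the first layer ...
    g1_bp = X.T @ ((e @ W2.T) * (1 - h**2)) / n
    # ... versus the DFA pseudo-gradient: the error is projected back through
    # the fixed random matrix B instead of the transpose of W2.
    g1_dfa = X.T @ ((e @ B) * (1 - h**2)) / n
    g2 = h.T @ e / n             # the last layer is updated exactly as in backprop

    W1 -= lr * g1_dfa
    W2 -= lr * g2

    if step % 1000 == 0:
        loss = 0.5 * np.mean(e**2)
        align = cosine(g1_dfa, g1_bp)   # rises towards 1 during the alignment phase
        print(f"step {step:5d}  loss {loss:.4f}  gradient alignment {align:+.3f}")
```

For a single hidden layer, DFA coincides with feedback alignment; in deeper networks each hidden layer receives the output error through its own fixed random feedback matrix, which is the setting in which the paper's analysis of deep linear networks applies.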
Related papers
- The Law of Parsimony in Gradient Descent for Learning Deep Linear Networks [34.85235641812005]
We reveal a surprising "law of parsimony" in the learning dynamics when the data possesses low-dimensional structures.
This simplicity in learning dynamics could have significant implications for both efficient training and a better understanding of deep networks.
arXiv Detail & Related papers (2023-06-01T21:24:53Z) - Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective in solving forward and inverse differential equation problems.
However, PINNs can suffer from training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Adversarial Examples Exist in Two-Layer ReLU Networks for Low Dimensional Linear Subspaces [24.43191276129614]
We show that standard methods lead to non-robust neural networks.
We show that decreasing the initialization scale of the training algorithm, or adding $L_2$ regularization, can make the trained network more robust to adversarial perturbations.
arXiv Detail & Related papers (2023-03-01T19:10:05Z) - Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z) - Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
Higher-order statistics are exploited only later during training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but struggles with small models.
We introduce a one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z) - Training invariances and the low-rank phenomenon: beyond linear networks [44.02161831977037]
We show that when one trains a deep linear network with logistic or exponential loss on linearly separable data, the weights converge to rank-$1$ matrices, and we extend this result to a class of nonlinear ReLU-activated feedforward networks.
This is the first time a low-rank phenomenon is proven rigorously for such networks.
Our proof relies on a specific decomposition of the network into a multilinear function and another ReLU network whose weights are constant under a certain parameter directional convergence.
arXiv Detail & Related papers (2022-01-28T07:31:19Z) - Gradient-trained Weights in Wide Neural Networks Align Layerwise to Error-scaled Input Correlations [11.176824373696324]
We derive the layerwise weight dynamics of infinite-width neural networks with nonlinear activations trained by gradient descent.
We formulate backpropagation-free learning rules, named Align-zero and Align-ada, that theoretically achieve the same alignment as backpropagation.
arXiv Detail & Related papers (2021-06-15T21:56:38Z) - Solving Sparse Linear Inverse Problems in Communication Systems: A Deep Learning Approach With Adaptive Depth [51.40441097625201]
We propose an end-to-end trainable deep learning architecture for sparse signal recovery problems.
The proposed method learns how many layers to execute to emit an output, and the network depth is dynamically adjusted for each task in the inference phase.
arXiv Detail & Related papers (2020-10-29T06:32:53Z) - Deep Networks from the Principle of Rate Reduction [32.87280757001462]
This work attempts to interpret modern deep (convolutional) networks from the principles of rate reduction and (shift) invariant classification.
We show that the basic iterative gradient ascent scheme for optimizing the rate reduction of learned features naturally leads to a multi-layer deep network, one iteration per layer.
All components of this "white box" network have precise optimization, statistical, and geometric interpretation.
arXiv Detail & Related papers (2020-10-27T06:01:43Z)