The boundary of neural network trainability is fractal
- URL: http://arxiv.org/abs/2402.06184v1
- Date: Fri, 9 Feb 2024 04:46:48 GMT
- Title: The boundary of neural network trainability is fractal
- Authors: Jascha Sohl-Dickstein
- Abstract summary: Some fractals are computed by iterating a function.
Neural network training can result in convergent or divergent behavior.
We find that this boundary is fractal over more than ten decades of scale in all tested configurations.
- Score: 23.4886323538853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Some fractals -- for instance those associated with the Mandelbrot and
quadratic Julia sets -- are computed by iterating a function, and identifying
the boundary between hyperparameters for which the resulting series diverges or
remains bounded. Neural network training similarly involves iterating an update
function (e.g. repeated steps of gradient descent), can result in convergent or
divergent behavior, and can be extremely sensitive to small changes in
hyperparameters. Motivated by these similarities, we experimentally examine the
boundary between neural network hyperparameters that lead to stable and
divergent training. We find that this boundary is fractal over more than ten
decades of scale in all tested configurations.
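To make the analogy concrete, here is a minimal sketch (not the paper's code) of the kind of experiment the abstract describes: fix a tiny one-hidden-layer network and its initialization, sweep a two-dimensional grid of hyperparameters (here, per-layer learning rates; the paper sweeps several such hyperparameter pairs), run gradient descent from each grid point, and record whether the loss stays bounded or blows up. Names such as `diverges` are illustrative.

```python
# Minimal sketch: sweep a 2D grid of per-layer learning rates for a tiny
# one-hidden-layer tanh network, run full-batch gradient descent from a fixed
# initialization, and record whether training stays bounded or diverges --
# the same iterate-and-test-for-divergence recipe used to draw
# Mandelbrot-style images.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))                 # toy inputs
y = rng.normal(size=(16, 1))                 # toy targets
W1_0 = rng.normal(size=(4, 8)) / 2.0         # fixed initialization
W2_0 = rng.normal(size=(8, 1)) / np.sqrt(8)

def diverges(lr_hidden, lr_out, steps=200):
    W1, W2 = W1_0.copy(), W2_0.copy()
    for _ in range(steps):
        h = np.tanh(X @ W1)
        err = h @ W2 - y
        loss = float(np.mean(err ** 2))
        if not np.isfinite(loss) or loss > 1e6:
            return True                      # training blew up
        # full-batch gradients of the mean squared error
        gW2 = h.T @ err * (2.0 / len(X))
        gW1 = X.T @ ((err @ W2.T) * (1.0 - h ** 2)) * (2.0 / len(X))
        W1 -= lr_hidden * gW1
        W2 -= lr_out * gW2
    return False                             # stayed bounded

lrs = np.logspace(-2, 1, 64)                 # hyperparameter grid
image = np.array([[diverges(a, b) for b in lrs] for a in lrs])
print(image.astype(int))                     # 1 = divergent, 0 = convergent
```

Zooming the grid into ever smaller windows around the resulting convergent/divergent boundary and repeating the sweep is how structure can be probed across many decades of scale.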
Related papers
- Mapping the Edge of Chaos: Fractal-Like Boundaries in The Trainability of Decoder-Only Transformer Models [0.0]
Recent evidence from miniature neural networks suggests that the boundary separating these outcomes displays fractal characteristics.
This study extends these findings to medium-sized, decoder-only transformer architectures by employing a more consistent convergence measure.
The results show that the trainability frontier is not a simple threshold; rather, it forms a self-similar yet seemingly random structure at multiple scales (a box-counting sketch for quantifying such multi-scale structure appears after this list).
arXiv Detail & Related papers (2025-01-08T05:24:11Z)
- Topological obstruction to the training of shallow ReLU neural networks [0.0]
We study the interplay between the geometry of the loss landscape and the optimization trajectories of simple neural networks.
This paper reveals the presence of topological obstruction in the loss landscape of shallow ReLU neural networks trained using gradient flow.
arXiv Detail & Related papers (2024-10-18T19:17:48Z)
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence on the regression loss despite the non-convexity of the training landscape (the multi-step gradient descent reference algorithm is sketched after this list).
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z)
- Complex fractal trainability boundary can arise from trivial non-convexity [0.13597551064547497]
We investigate the loss landscape properties that might lead to fractal trainability boundaries.
We identify "roughness of perturbation", which measures the gradient's sensitivity to parameter changes (a toy gradient-sensitivity probe is sketched after this list).
These findings could lead to more consistent and predictable training strategies.
arXiv Detail & Related papers (2024-06-20T03:31:28Z)
- Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels [78.6096486885658]
We introduce lower bounds to the linearized Laplace approximation of the marginal likelihood.
These bounds are amenable to gradient-based optimization and allow trading off estimation accuracy against computational complexity.
arXiv Detail & Related papers (2023-06-06T19:02:57Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Limitations of neural network training due to numerical instability of backpropagation [2.255961793913651]
We study the training of deep neural networks by gradient descent where floating-point arithmetic is used to compute gradients.
It is highly unlikely to find ReLU neural networks that maintain, in the course of training with gradient descent, superlinearly many affine pieces with respect to their number of layers.
We conclude that approximating sequences of ReLU neural networks resulting from gradient descent in practice differ substantially from theoretically constructed sequences (a toy float32-versus-float64 gradient comparison appears after this list).
arXiv Detail & Related papers (2022-10-03T10:34:38Z)
- On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version (an empirical-NTK computation is sketched after this list).
arXiv Detail & Related papers (2022-03-27T15:22:19Z)
- Going beyond p-convolutions to learn grayscale morphological operators [64.38361575778237]
We present two new morphological layers based on the same principle as the p-convolutional layer.
arXiv Detail & Related papers (2021-02-19T17:22:16Z)
- Infinitely Deep Bayesian Neural Networks with Stochastic Differential Equations [37.02511585732081]
We perform scalable approximate inference in a recently-proposed family of continuous-depth neural networks.
We demonstrate gradient-based variational inference, producing arbitrarily-flexible approximate posteriors.
This approach further inherits the memory-efficient training and tunable precision of neural ODEs.
arXiv Detail & Related papers (2021-02-12T14:48:58Z)
- Multipole Graph Neural Operator for Parametric Partial Differential Equations [57.90284928158383]
One of the main challenges in using deep learning-based methods for simulating physical systems is formulating physics-based data.
We propose a novel multi-level graph neural network framework that captures interaction at all ranges with only linear complexity.
Experiments confirm our multi-graph network learns discretization-invariant solution operators to PDEs and can be evaluated in linear time.
arXiv Detail & Related papers (2020-06-16T21:56:22Z)
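Illustrative code sketches (referenced from the entries above; none of these reproduce the papers' code):

For the "Mapping the Edge of Chaos" entry: one standard way to quantify self-similar structure at multiple scales is a box-counting dimension estimate of the convergent/divergent boundary. This is a generic technique, sketched under the assumption of a square binary convergence map such as `image` from the first sketch; it is not necessarily the measure used in that paper, and the function names are illustrative.

```python
# Box-counting sketch: extract the boundary cells of a binary convergence map
# and estimate how the number of occupied boxes scales with box size.
import numpy as np

def boundary_mask(image):
    # a cell is on the boundary if it disagrees with a right or lower neighbour
    b = np.zeros(image.shape, dtype=bool)
    b[:-1, :] |= image[:-1, :] != image[1:, :]
    b[:, :-1] |= image[:, :-1] != image[:, 1:]
    return b

def box_counting_dimension(mask, sizes=(1, 2, 4, 8, 16)):
    n = mask.shape[0]                        # assumes a square mask
    counts = []
    for s in sizes:
        m = n - n % s                        # trim so the grid tiles evenly
        blocks = mask[:m, :m].reshape(m // s, s, m // s, s).any(axis=(1, 3))
        counts.append(blocks.sum())
    # slope of log N(s) against log(1/s) estimates the box-counting dimension
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(sizes)), np.log(counts), 1)
    return slope
```

Applied to a convergence map, a slope near 1 indicates a smooth boundary curve, while a slope noticeably above 1 indicates the rough, multi-scale behavior associated with a fractal boundary.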
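For the looped-transformer entry: the reference algorithm named in that title, multi-step gradient descent on an in-context least-squares problem, can be spelled out directly. This sketch shows only that reference update rule on a single prompt's (X, y) examples, not a transformer; the function name and hyperparameters are illustrative.

```python
# Multi-step gradient descent on an in-context linear regression problem: the
# update rule a looped transformer is analysed against (the transformer itself
# is not implemented here).
import numpy as np

def multistep_gd_prediction(X, y, x_query, steps=10, lr=0.1):
    """X: (n, d) in-context inputs, y: (n,) in-context targets."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(X)    # gradient of the half mean squared error
        w = w - lr * grad                    # one gradient descent step
    return x_query @ w                       # prediction at the query point

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
w_true = rng.normal(size=5)
y = X @ w_true
print(multistep_gd_prediction(X, y, rng.normal(size=5), steps=100, lr=0.2))
```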
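For the "trivial non-convexity" entry: a toy illustration of a gradient-sensitivity probe in the spirit of the "roughness of perturbation" quantity it names. The loss, perturbation scale, and averaging scheme here are all illustrative assumptions, not the paper's exact definition.

```python
# Gradient sensitivity of a trivially non-convex 1-D loss: a quadratic plus a
# small high-frequency ripple.  The probe reports how much the gradient moves
# under small random parameter perturbations.
import numpy as np

def grad(theta):
    # d/dtheta of  theta**2 + 0.05 * sin(50 * theta)
    return 2.0 * theta + 0.05 * 50.0 * np.cos(50.0 * theta)

def gradient_sensitivity(theta, eps=1e-3, trials=1000, seed=0):
    rng = np.random.default_rng(seed)
    deltas = eps * rng.uniform(-1.0, 1.0, size=trials)
    return float(np.mean(np.abs(grad(theta + deltas) - grad(theta))))

print(gradient_sensitivity(0.3))   # the ripple makes nearby gradients disagree
```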
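For the numerical-instability entry: a toy demonstration of how floating-point precision affects backpropagated gradients, comparing float32 and float64 for the same deep ReLU stack. This only illustrates the general phenomenon, not the paper's construction; depth, width, and initialization are arbitrary choices.

```python
# Backpropagate through a deep ReLU stack twice, once in float32 and once in
# float64, starting from identical weights, and compare the input gradients.
import numpy as np

def input_gradient(dtype, depth=30, width=32, seed=0):
    rng = np.random.default_rng(seed)                 # same weights both runs
    x = rng.normal(size=width).astype(dtype)
    Ws = [(rng.normal(size=(width, width)) * np.sqrt(2.0 / width)).astype(dtype)
          for _ in range(depth)]
    # forward pass, keeping pre-activations for the backward pass
    pre, h = [], x
    for W in Ws:
        z = W @ h
        pre.append(z)
        h = np.maximum(z, 0)
    # loss = 0.5 * ||h||^2 ; backpropagate through the ReLU layers to the input
    g = h.copy()
    for W, z in zip(reversed(Ws), reversed(pre)):
        g = W.T @ (g * (z > 0))
    return g

g32 = input_gradient(np.float32)
g64 = input_gradient(np.float64)
print(np.max(np.abs(g32.astype(np.float64) - g64)))  # gap from finite precision
```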
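For the randomly-pruned-NTK entry: a sketch of what comparing NTKs means operationally, using the empirical (finite-width) NTK of a scalar one-hidden-layer network and a randomly pruned copy. The paper's equivalence involves assumptions (such as width and a suitable rescaling of the surviving weights) that this toy computation does not reproduce; it only shows the kernel computation itself, and all names here are illustrative.

```python
# Empirical NTK K[i, j] = <grad_params f(x_i), grad_params f(x_j)> for a scalar
# one-hidden-layer ReLU network, with an optional random pruning mask on the
# first-layer weights.
import numpy as np

def param_gradient(x, W1, W2, mask):
    # f(x) = W2 . relu((mask * W1) @ x); pruned entries contribute no gradient
    z = (mask * W1) @ x
    a = np.maximum(z, 0.0)
    dW2 = a
    dW1 = (W2 * (z > 0))[:, None] * x[None, :] * mask
    return np.concatenate([dW1.ravel(), dW2])

def empirical_ntk(X, W1, W2, mask):
    J = np.stack([param_gradient(x, W1, W2, mask) for x in X])
    return J @ J.T

rng = np.random.default_rng(1)
d, width, n = 5, 512, 8
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(width, d)) * np.sqrt(2.0 / d)
W2 = rng.normal(size=width) / np.sqrt(width)
dense_mask = np.ones_like(W1)
prune_mask = (rng.random(W1.shape) < 0.5).astype(float)   # keep ~half the weights
K_dense = empirical_ntk(X, W1, W2, dense_mask)
K_pruned = empirical_ntk(X, W1, W2, prune_mask)
print(np.linalg.norm(K_dense), np.linalg.norm(K_pruned))
```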
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.