The boundary of neural network trainability is fractal
- URL: http://arxiv.org/abs/2402.06184v1
- Date: Fri, 9 Feb 2024 04:46:48 GMT
- Title: The boundary of neural network trainability is fractal
- Authors: Jascha Sohl-Dickstein
- Abstract summary: Some fractals are computed by iterating a function.
Neural network training can result in convergent or divergent behavior.
We find that this boundary is fractal over more than ten decades of scale in all tested configurations.
- Score: 23.4886323538853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Some fractals -- for instance those associated with the Mandelbrot and
quadratic Julia sets -- are computed by iterating a function, and identifying
the boundary between hyperparameters for which the resulting series diverges or
remains bounded. Neural network training similarly involves iterating an update
function (e.g. repeated steps of gradient descent), can result in convergent or
divergent behavior, and can be extremely sensitive to small changes in
hyperparameters. Motivated by these similarities, we experimentally examine the
boundary between neural network hyperparameters that lead to stable and
divergent training. We find that this boundary is fractal over more than ten
decades of scale in all tested configurations.
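To make the analogy concrete, here is a minimal sketch (not the paper's code) of the kind of experiment the abstract describes: fix a tiny one-hidden-layer network and its initialization, sweep a two-dimensional grid of hyperparameters (here, per-layer learning rates; the paper sweeps several such hyperparameter pairs), run gradient descent from each grid point, and record whether the loss stays bounded or blows up. Names such as `diverges` are illustrative.

```python
# Minimal sketch: sweep a 2D grid of per-layer learning rates for a tiny
# one-hidden-layer tanh network, run full-batch gradient descent from a fixed
# initialization, and record whether training stays bounded or diverges --
# the same iterate-and-test-for-divergence recipe used to draw
# Mandelbrot-style images.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))                 # toy inputs
y = rng.normal(size=(16, 1))                 # toy targets
W1_0 = rng.normal(size=(4, 8)) / 2.0         # fixed initialization
W2_0 = rng.normal(size=(8, 1)) / np.sqrt(8)

def diverges(lr_hidden, lr_out, steps=200):
    W1, W2 = W1_0.copy(), W2_0.copy()
    for _ in range(steps):
        h = np.tanh(X @ W1)
        err = h @ W2 - y
        loss = float(np.mean(err ** 2))
        if not np.isfinite(loss) or loss > 1e6:
            return True                      # training blew up
        # full-batch gradients of the mean squared error
        gW2 = h.T @ err * (2.0 / len(X))
        gW1 = X.T @ ((err @ W2.T) * (1.0 - h ** 2)) * (2.0 / len(X))
        W1 -= lr_hidden * gW1
        W2 -= lr_out * gW2
    return False                             # stayed bounded

lrs = np.logspace(-2, 1, 64)                 # hyperparameter grid
image = np.array([[diverges(a, b) for b in lrs] for a in lrs])
print(image.astype(int))                     # 1 = divergent, 0 = convergent
```

Zooming the grid into ever smaller windows around the resulting convergent/divergent boundary and repeating the sweep is how structure can be probed across many decades of scale.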
Related papers
- Mapping the Edge of Chaos: Fractal-Like Boundaries in The Trainability of Decoder-Only Transformer Models [0.0]
Recent evidence from miniature neural networks suggests that the boundary separating these outcomes displays fractal characteristics.
This study extends these findings to medium-sized, decoder-only transformer architectures by employing a more consistent convergence measure.
The results show that the trainability frontier is not a simple threshold; rather, it forms a self-similar yet seemingly random structure at multiple scales (a box-counting sketch for quantifying such multi-scale structure appears after this list).
arXiv Detail & Related papers (2025-01-08T05:24:11Z)
- Topological obstruction to the training of shallow ReLU neural networks [0.0]
We study the interplay between the geometry of the loss landscape and the optimization trajectories of simple neural networks.
This paper reveals the presence of topological obstruction in the loss landscape of shallow ReLU neural networks trained using gradient flow.
arXiv Detail & Related papers (2024-10-18T19:17:48Z)
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence on the regression loss despite the non-convexity of the training landscape (the multi-step gradient descent reference algorithm is sketched after this list).
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z)
- Complex fractal trainability boundary can arise from trivial non-convexity [0.13597551064547497]
We investigate the loss landscape properties that might lead to fractal trainability boundaries.
We identify "roughness of perturbation", which measures the gradient's sensitivity to parameter changes (a toy gradient-sensitivity probe is sketched after this list).
These findings could lead to more consistent and predictable training strategies.
arXiv Detail & Related papers (2024-06-20T03:31:28Z)
- Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels [78.6096486885658]
We introduce lower bounds to the linearized Laplace approximation of the marginal likelihood.
These bounds are amenable to gradient-based optimization and allow trading off estimation accuracy against computational complexity.
arXiv Detail & Related papers (2023-06-06T19:02:57Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Limitations of neural network training due to numerical instability of backpropagation [2.255961793913651]
We study the training of deep neural networks by gradient descent where floating-point arithmetic is used to compute gradients.
It is highly unlikely to find ReLU neural networks that maintain, in the course of training with gradient descent, superlinearly many affine pieces with respect to their number of layers.
We conclude that approximating sequences of ReLU neural networks resulting from gradient descent in practice differ substantially from theoretically constructed sequences (a toy float32-versus-float64 gradient comparison appears after this list).
arXiv Detail & Related papers (2022-10-03T10:34:38Z)
- On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version (an empirical-NTK computation is sketched after this list).
arXiv Detail & Related papers (2022-03-27T15:22:19Z)
- Going beyond p-convolutions to learn grayscale morphological operators [64.38361575778237]
We present two new morphological layers based on the same principle as the p-convolutional layer.
arXiv Detail & Related papers (2021-02-19T17:22:16Z)
- Infinitely Deep Bayesian Neural Networks with Stochastic Differential Equations [37.02511585732081]
We perform scalable approximate inference in a recently-proposed family of continuous-depth neural networks.
We demonstrate gradient-based variational inference, producing arbitrarily-flexible approximate posteriors.
This approach further inherits the memory-efficient training and tunable precision of neural ODEs.
arXiv Detail & Related papers (2021-02-12T14:48:58Z)
- Multipole Graph Neural Operator for Parametric Partial Differential Equations [57.90284928158383]
One of the main challenges in using deep learning-based methods for simulating physical systems is formulating physics-based data.
We propose a novel multi-level graph neural network framework that captures interaction at all ranges with only linear complexity.
Experiments confirm our multi-graph network learns discretization-invariant solution operators to PDEs and can be evaluated in linear time.
arXiv Detail & Related papers (2020-06-16T21:56:22Z)
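Illustrative code sketches (referenced from the entries above; none of these reproduce the papers' code):

For the "Mapping the Edge of Chaos" entry: one standard way to quantify self-similar structure at multiple scales is a box-counting dimension estimate of the convergent/divergent boundary. This is a generic technique, sketched under the assumption of a square binary convergence map such as `image` from the first sketch; it is not necessarily the measure used in that paper, and the function names are illustrative.

```python
# Box-counting sketch: extract the boundary cells of a binary convergence map
# and estimate how the number of occupied boxes scales with box size.
import numpy as np

def boundary_mask(image):
    # a cell is on the boundary if it disagrees with a right or lower neighbour
    b = np.zeros(image.shape, dtype=bool)
    b[:-1, :] |= image[:-1, :] != image[1:, :]
    b[:, :-1] |= image[:, :-1] != image[:, 1:]
    return b

def box_counting_dimension(mask, sizes=(1, 2, 4, 8, 16)):
    n = mask.shape[0]                        # assumes a square mask
    counts = []
    for s in sizes:
        m = n - n % s                        # trim so the grid tiles evenly
        blocks = mask[:m, :m].reshape(m // s, s, m // s, s).any(axis=(1, 3))
        counts.append(blocks.sum())
    # slope of log N(s) against log(1/s) estimates the box-counting dimension
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(sizes)), np.log(counts), 1)
    return slope
```

Applied to a convergence map, a slope near 1 indicates a smooth boundary curve, while a slope noticeably above 1 indicates the rough, multi-scale behavior associated with a fractal boundary.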
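For the looped-transformer entry: the reference algorithm named in that title, multi-step gradient descent on an in-context least-squares problem, can be spelled out directly. This sketch shows only that reference update rule on a single prompt's (X, y) examples, not a transformer; the function name and hyperparameters are illustrative.

```python
# Multi-step gradient descent on an in-context linear regression problem: the
# update rule a looped transformer is analysed against (the transformer itself
# is not implemented here).
import numpy as np

def multistep_gd_prediction(X, y, x_query, steps=10, lr=0.1):
    """X: (n, d) in-context inputs, y: (n,) in-context targets."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(X)    # gradient of the half mean squared error
        w = w - lr * grad                    # one gradient descent step
    return x_query @ w                       # prediction at the query point

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
w_true = rng.normal(size=5)
y = X @ w_true
print(multistep_gd_prediction(X, y, rng.normal(size=5), steps=100, lr=0.2))
```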
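For the "trivial non-convexity" entry: a toy illustration of a gradient-sensitivity probe in the spirit of the "roughness of perturbation" quantity it names. The loss, perturbation scale, and averaging scheme here are all illustrative assumptions, not the paper's exact definition.

```python
# Gradient sensitivity of a trivially non-convex 1-D loss: a quadratic plus a
# small high-frequency ripple.  The probe reports how much the gradient moves
# under small random parameter perturbations.
import numpy as np

def grad(theta):
    # d/dtheta of  theta**2 + 0.05 * sin(50 * theta)
    return 2.0 * theta + 0.05 * 50.0 * np.cos(50.0 * theta)

def gradient_sensitivity(theta, eps=1e-3, trials=1000, seed=0):
    rng = np.random.default_rng(seed)
    deltas = eps * rng.uniform(-1.0, 1.0, size=trials)
    return float(np.mean(np.abs(grad(theta + deltas) - grad(theta))))

print(gradient_sensitivity(0.3))   # the ripple makes nearby gradients disagree
```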
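For the numerical-instability entry: a toy demonstration of how floating-point precision affects backpropagated gradients, comparing float32 and float64 for the same deep ReLU stack. This only illustrates the general phenomenon, not the paper's construction; depth, width, and initialization are arbitrary choices.

```python
# Backpropagate through a deep ReLU stack twice, once in float32 and once in
# float64, starting from identical weights, and compare the input gradients.
import numpy as np

def input_gradient(dtype, depth=30, width=32, seed=0):
    rng = np.random.default_rng(seed)                 # same weights both runs
    x = rng.normal(size=width).astype(dtype)
    Ws = [(rng.normal(size=(width, width)) * np.sqrt(2.0 / width)).astype(dtype)
          for _ in range(depth)]
    # forward pass, keeping pre-activations for the backward pass
    pre, h = [], x
    for W in Ws:
        z = W @ h
        pre.append(z)
        h = np.maximum(z, 0)
    # loss = 0.5 * ||h||^2 ; backpropagate through the ReLU layers to the input
    g = h.copy()
    for W, z in zip(reversed(Ws), reversed(pre)):
        g = W.T @ (g * (z > 0))
    return g

g32 = input_gradient(np.float32)
g64 = input_gradient(np.float64)
print(np.max(np.abs(g32.astype(np.float64) - g64)))  # gap from finite precision
```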
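For the randomly-pruned-NTK entry: a sketch of what comparing NTKs means operationally, using the empirical (finite-width) NTK of a scalar one-hidden-layer network and a randomly pruned copy. The paper's equivalence involves assumptions (such as width and a suitable rescaling of the surviving weights) that this toy computation does not reproduce; it only shows the kernel computation itself, and all names here are illustrative.

```python
# Empirical NTK K[i, j] = <grad_params f(x_i), grad_params f(x_j)> for a scalar
# one-hidden-layer ReLU network, with an optional random pruning mask on the
# first-layer weights.
import numpy as np

def param_gradient(x, W1, W2, mask):
    # f(x) = W2 . relu((mask * W1) @ x); pruned entries contribute no gradient
    z = (mask * W1) @ x
    a = np.maximum(z, 0.0)
    dW2 = a
    dW1 = (W2 * (z > 0))[:, None] * x[None, :] * mask
    return np.concatenate([dW1.ravel(), dW2])

def empirical_ntk(X, W1, W2, mask):
    J = np.stack([param_gradient(x, W1, W2, mask) for x in X])
    return J @ J.T

rng = np.random.default_rng(1)
d, width, n = 5, 512, 8
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(width, d)) * np.sqrt(2.0 / d)
W2 = rng.normal(size=width) / np.sqrt(width)
dense_mask = np.ones_like(W1)
prune_mask = (rng.random(W1.shape) < 0.5).astype(float)   # keep ~half the weights
K_dense = empirical_ntk(X, W1, W2, dense_mask)
K_pruned = empirical_ntk(X, W1, W2, prune_mask)
print(np.linalg.norm(K_dense), np.linalg.norm(K_pruned))
```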
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.