Complex fractal trainability boundary can arise from trivial non-convexity
- URL: http://arxiv.org/abs/2406.13971v1
- Date: Thu, 20 Jun 2024 03:31:28 GMT
- Title: Complex fractal trainability boundary can arise from trivial non-convexity
- Authors: Yizhou Liu
- Abstract summary: We investigate the loss landscape properties that might lead to fractal trainability boundaries.
We identify "roughness of perturbation", which measures the gradient's sensitivity to parameter changes, as the factor controlling the fractal dimension of these boundaries.
We anticipate that our findings will lead to more consistent and predictable training strategies.
- Score: 0.13597551064547497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training neural networks involves optimizing parameters to minimize a loss function, where the nature of the loss function and the optimization strategy are crucial for effective training. Hyperparameter choices, such as the learning rate in gradient descent (GD), significantly affect the success and speed of convergence. Recent studies indicate that the boundary between bounded and divergent hyperparameters can be fractal, complicating reliable hyperparameter selection. However, the nature of this fractal boundary and methods to avoid it remain unclear. In this study, we focus on GD to investigate the loss landscape properties that might lead to fractal trainability boundaries. We discovered that fractal boundaries can emerge from simple non-convex perturbations, i.e., adding or multiplying cosine-type perturbations to quadratic functions. The observed fractal dimensions are influenced by factors like parameter dimension, type of non-convexity, perturbation wavelength, and perturbation amplitude. Our analysis identifies "roughness of perturbation", which measures the gradient's sensitivity to parameter changes, as the factor controlling fractal dimensions of trainability boundaries. We observed a clear transition from non-fractal to fractal trainability boundaries as roughness increases, with the critical roughness being that at which the perturbation renders the loss function non-convex. Thus, we conclude that fractal trainability boundaries can arise from very simple non-convexity. We anticipate that our findings will enhance the understanding of complex behaviors during neural network training, leading to more consistent and predictable training strategies.
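A minimal sketch of the setup the abstract describes, assuming an additive cosine perturbation of a one-dimensional quadratic: run plain GD for a sweep of learning rates and record which ones keep the iterates bounded. The amplitude, wavelength, iteration budget, and divergence threshold below are illustrative choices, not the paper's exact settings; zooming into the bounded/divergent transition region is where a fractal boundary would reveal itself.

```python
import numpy as np

def perturbed_quadratic_grad(x, amp=0.5, wavelength=0.1):
    """Gradient of f(x) = x^2 / 2 + amp * cos(x / wavelength).

    The additive cosine term is one of the 'trivial non-convexities' the
    abstract mentions; amp and wavelength are illustrative assumptions.
    """
    return x - (amp / wavelength) * np.sin(x / wavelength)

def trainable(lr, x0=1.0, steps=500, bound=1e6):
    """Return True if plain gradient descent stays bounded for this learning rate."""
    x = x0
    for _ in range(steps):
        x = x - lr * perturbed_quadratic_grad(x)
        if abs(x) > bound:
            return False
    return True

# Sweep learning rates and mark each one as bounded (trainable) or divergent.
lrs = np.linspace(0.5, 2.5, 2001)
outcomes = np.array([trainable(lr) for lr in lrs])
print("fraction of trainable learning rates:", outcomes.mean())
```

Repeating the sweep over narrower and narrower learning-rate windows around the transition is one way to probe whether the bounded/divergent boundary is self-similar.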
Related papers
- Mapping the Edge of Chaos: Fractal-Like Boundaries in The Trainability of Decoder-Only Transformer Models [0.0]
Recent evidence from miniature neural networks suggests that the boundary separating these outcomes displays fractal characteristics.
This study extends them to medium-sized, decoder-only transformer architectures by employing a more consistent convergence measure.
The results show that the trainability frontier is not a simple threshold; rather, it forms a self-similar yet seemingly random structure at multiple scales.
arXiv Detail & Related papers (2025-01-08T05:24:11Z) - Dissecting a Small Artificial Neural Network [0.0]
We investigate the loss landscape and backpropagation dynamics of convergence for the simplest possible artificial neural network representing the logical exclusive-OR (XOR) gate.
Cross-sections of the loss landscape in the nine-dimensional parameter space are found to exhibit distinct features, which help explain why backpropagation achieves convergence toward zero loss (a minimal cross-section sketch appears after this list).
arXiv Detail & Related papers (2025-01-03T21:14:46Z) - Topological obstruction to the training of shallow ReLU neural networks [0.0]
We study the interplay between the geometry of the loss landscape and the optimization trajectories of simple neural networks.
This paper reveals the presence of topological obstruction in the loss landscape of shallow ReLU neural networks trained using gradient flow.
arXiv Detail & Related papers (2024-10-18T19:17:48Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\!\left(\ln(T) / T^{1 - \frac{1}{\alpha}}\right)$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - The boundary of neural network trainability is fractal [23.4886323538853]
Some fractals are computed by iterating a function.
Neural network training can result in convergent or divergent behavior.
We find that this boundary is fractal over more than ten decades of scale in all tested configurations.
arXiv Detail & Related papers (2024-02-09T04:46:48Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels [78.6096486885658]
We introduce lower bounds to the linearized Laplace approximation of the marginal likelihood.
These bounds are amenable to gradient-based optimization and allow trading off estimation accuracy against computational complexity.
arXiv Detail & Related papers (2023-06-06T19:02:57Z) - Data-Driven Influence Functions for Optimization-Based Causal Inference [105.5385525290466]
We study a constructive algorithm that approximates Gateaux derivatives for statistical functionals by finite differencing.
We study the case where probability distributions are not known a priori but need to be estimated from data (a finite-differencing sketch appears after this list).
arXiv Detail & Related papers (2022-08-29T16:16:22Z) - On Convergence of Training Loss Without Reaching Stationary Points [62.41370821014218]
We show that Neural Network weight variables do not converge to stationary points where the gradient of the loss function vanishes.
We propose a new perspective based on the ergodic theory of dynamical systems.
arXiv Detail & Related papers (2021-10-12T18:12:23Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Asymptotic convergence rate of Dropout on shallow linear neural networks [0.0]
We analyze the convergence on objective functions induced by Dropout and Dropconnect, when applying them to shallow linear Neural Networks.
We obtain a local convergence proof of the gradient flow and a bound on the rate that depends on the data, the dropout probability, and the width of the NN.
arXiv Detail & Related papers (2020-12-01T19:02:37Z)
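The nine-dimensional parameter space mentioned for "Dissecting a Small Artificial Neural Network" corresponds to a 2-2-1 network: two hidden units with two weights and a bias each, plus an output unit with two weights and a bias. The sketch below is a hypothetical reconstruction of such a network with a sigmoid activation and squared-error loss, used only to illustrate how a one-dimensional cross-section of the loss landscape can be computed; the activation, loss, and random direction are assumptions, not necessarily the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR inputs and targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta):
    """Mean squared error of a 2-2-1 sigmoid network, flattened into 9 parameters."""
    W1 = theta[:4].reshape(2, 2)   # hidden-layer weights
    b1 = theta[4:6]                # hidden-layer biases
    w2 = theta[6:8]                # output weights
    b2 = theta[8]                  # output bias
    h = sigmoid(X @ W1.T + b1)
    out = sigmoid(h @ w2 + b2)
    return np.mean((out - y) ** 2)

# One-dimensional cross-section of the loss landscape: evaluate the loss along
# a random direction through a random point in the 9-dimensional parameter space.
theta0 = rng.normal(size=9)
direction = rng.normal(size=9)
direction /= np.linalg.norm(direction)
ts = np.linspace(-5, 5, 201)
section = [loss(theta0 + t * direction) for t in ts]
print("loss range along this cross-section:", min(section), max(section))
```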
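For "Data-Driven Influence Functions for Optimization-Based Causal Inference", the finite-differencing idea can be illustrated on a toy statistical functional. The sketch below approximates the Gateaux derivative of the sample mean by tilting the empirical distribution a small amount toward a point mass; the choice of functional and the step size eps are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def functional_mean(weights, data):
    """A simple statistical functional: the mean of `data` under distribution `weights`."""
    return np.dot(weights, data)

def gateaux_derivative_fd(functional, data, x_index, eps=1e-4):
    """Finite-difference approximation of the Gateaux derivative of `functional`
    at the empirical distribution, in the direction of a point mass at data[x_index].

    Mixes in a small mass eps at the chosen point: (1 - eps) * P_n + eps * delta_x.
    """
    n = len(data)
    base = np.full(n, 1.0 / n)          # empirical distribution P_n
    tilt = np.zeros(n)
    tilt[x_index] = 1.0                 # point mass delta_x
    mixed = (1.0 - eps) * base + eps * tilt
    return (functional(mixed, data) - functional(base, data)) / eps

data = np.array([1.0, 2.0, 3.0, 10.0])
# For the mean, the influence function at x is x - mean(data); compare the two:
approx = gateaux_derivative_fd(functional_mean, data, x_index=3)
print(approx, data[3] - data.mean())    # both should be close to 10 - 4 = 6
```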