The Multiscale Structure of Neural Network Loss Functions: The Effect on
Optimization and Origin
- URL: http://arxiv.org/abs/2204.11326v1
- Date: Sun, 24 Apr 2022 17:34:12 GMT
- Title: The Multiscale Structure of Neural Network Loss Functions: The Effect on
Optimization and Origin
- Authors: Chao Ma, Lei Wu, Lexing Ying
- Abstract summary: We study the structure of neural network loss functions and its implications for optimization in a region beyond the reach of a good quadratic approximation.
We show that training data with different magnitudes give rise to different scales of the loss function, producing subquadratic growth or multiple separate scales.
- Score: 12.092361450994318
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Local quadratic approximation has been extensively used to study the
optimization of neural network loss functions around the minimum. However, it
usually holds only in a very small neighborhood of the minimum and cannot
explain many phenomena observed during the optimization process. In this work,
we study the structure of neural network loss functions and its implications
for optimization in a region beyond the reach of a good quadratic
approximation. Numerically, we observe that neural network loss functions
possess a multiscale structure, manifested in two ways: (1) in a neighborhood
of minima, the loss mixes a continuum of scales and grows subquadratically, and
(2) in a larger region, the loss clearly shows several separate scales. Using
the subquadratic growth, we are able to explain the Edge of Stability
phenomenon [4] observed for the gradient descent (GD) method. Using the
separate scales, we explain the working mechanism of learning rate decay
through simple examples. Finally, we study the origin of the multiscale
structure and propose that the non-uniformity of training data is one of its
causes. By constructing a two-layer neural network problem, we show that
training data with different magnitudes give rise to different scales of the
loss function, producing subquadratic growth or multiple separate scales.
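The link between subquadratic growth and the Edge of Stability can be illustrated with a toy example. The following is a minimal sketch, not the paper's two-layer construction: it runs gradient descent on the one-dimensional loss f(x) = |x|^p with 1 < p < 2, whose subquadratic growth makes the local curvature blow up near the minimum. The exponent p, learning rate eta, and step count below are illustrative choices, not values taken from the paper.

```python
import numpy as np

# Minimal sketch: gradient descent on f(x) = |x|^p with 1 < p < 2.
# Near the minimum the growth is subquadratic, so the local curvature
# f''(x) = p*(p-1)*|x|^(p-2) blows up as x -> 0. With a fixed learning
# rate, the iterates therefore stop converging once the local sharpness
# becomes of order 2/eta, and they settle into an oscillation around the
# minimum -- an Edge-of-Stability-like regime.

p = 1.5      # subquadratic exponent (illustrative choice)
eta = 0.05   # fixed learning rate (illustrative choice)
x = 1.0      # initial point

for _ in range(200):
    grad = p * np.sign(x) * abs(x) ** (p - 1)
    x -= eta * grad

sharpness = p * (p - 1) * abs(x) ** (p - 2)
print(f"|x| after 200 steps : {abs(x):.5f}")
print(f"local sharpness     : {sharpness:.1f}")
print(f"2 / eta             : {2 / eta:.1f}")
# The iterates end up bouncing between +/- (eta*p/2)**(1/(2-p)) ~ 1.4e-3,
# where the local sharpness stays pinned at (p-1)*2/eta = 20 instead of
# growing without bound, as it would have to under exact convergence.
```

The same reasoning suggests why decaying the learning rate helps once the oscillatory regime is reached: a smaller eta lowers the sharpness level at which the iterates stabilize, letting them move closer to the minimum.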
Related papers
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities for analyzing closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Accelerated Neural Network Training with Rooted Logistic Objectives [13.400503928962756]
We derive a novel sequence of strictly convex functions that are at least as strict as the logistic loss.
Our results illustrate that training with the rooted loss function converges faster and yields performance improvements.
arXiv Detail & Related papers (2023-10-05T20:49:48Z) - A simple connection from loss flatness to compressed representations in neural networks [3.5502600490147196]
We show that in the final phase of learning in deep neural networks, the compression of the manifold of neural representations correlates with the flatness of the loss around the minima explored by SGD.
Our work builds upon the linear stability insight by Ma and Ying, deriving inequalities between various compression metrics and quantities involving sharpness.
arXiv Detail & Related papers (2023-10-03T03:36:29Z) - Early Stage Convergence and Global Convergence of Training Mildly
Parameterized Neural Networks [3.148524502470734]
We show that the loss decreases by a significant amount in the early stage of training, and that this decrease happens quickly.
We use a microscopic analysis of the activation patterns for the neurons, which helps us derive more powerful lower bounds for the gradient.
arXiv Detail & Related papers (2022-06-05T09:56:50Z) - Phenomenology of Double Descent in Finite-Width Neural Networks [29.119232922018732]
Double descent delineates the behaviour of models depending on the regime they belong to.
We use influence functions to derive suitable expressions of the population loss and its lower bound.
Building on our analysis, we investigate how the loss function affects double descent.
arXiv Detail & Related papers (2022-03-14T17:39:49Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations.
This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Mitigating Performance Saturation in Neural Marked Point Processes:
Architectures and Loss Functions [50.674773358075015]
We propose a simple graph-based network structure called GCHP, which utilizes only graph convolutional layers.
We show that GCHP can significantly reduce training time, and that the likelihood ratio loss with interarrival-time probability assumptions can greatly improve model performance.
arXiv Detail & Related papers (2021-07-07T16:59:14Z) - Topological obstructions in neural networks learning [67.8848058842671]
We study global properties of the loss gradient function flow.
We use topological data analysis of the loss function and its Morse complex to relate local behavior along gradient trajectories with global properties of the loss surface.
arXiv Detail & Related papers (2020-12-31T18:53:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.