The Multiscale Structure of Neural Network Loss Functions: The Effect on
Optimization and Origin
- URL: http://arxiv.org/abs/2204.11326v1
- Date: Sun, 24 Apr 2022 17:34:12 GMT
- Title: The Multiscale Structure of Neural Network Loss Functions: The Effect on
Optimization and Origin
- Authors: Chao Ma, Lei Wu, Lexing Ying
- Abstract summary: We study the structure of neural network loss functions and its implications for optimization in a region beyond the reach of a good quadratic approximation.
We show that training data with different magnitudes give rise to different scales of the loss function, producing subquadratic growth or multiple separate scales.
- Score: 12.092361450994318
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Local quadratic approximation has been extensively used to study the
optimization of neural network loss functions around the minimum. However, it
usually holds only in a very small neighborhood of the minimum and cannot
explain many phenomena observed during the optimization process. In this work,
we study the structure of neural network loss functions and its implications
for optimization in a region beyond the reach of a good quadratic
approximation. Numerically, we observe that neural network loss functions
possess a multiscale structure, manifested in two ways: (1) in a neighborhood
of minima, the loss mixes a continuum of scales and grows subquadratically, and
(2) in a larger region, the loss clearly shows several separate scales. Using
the subquadratic growth, we are able to explain the Edge of Stability
phenomenon [4] observed for the gradient descent (GD) method. Using the
separate scales, we explain the working mechanism of learning rate decay
through simple examples. Finally, we study the origin of the multiscale
structure and propose that the non-uniformity of training data is one of its
causes. By constructing a two-layer neural network problem, we show that
training data with different magnitudes give rise to different scales of the
loss function, producing subquadratic growth or multiple separate scales.
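The link between subquadratic growth and the Edge of Stability can be illustrated with a toy example. The following is a minimal sketch, not the paper's two-layer construction: it runs gradient descent on the one-dimensional loss f(x) = |x|^p with 1 < p < 2, whose subquadratic growth makes the local curvature blow up near the minimum. The exponent p, learning rate eta, and step count below are illustrative choices, not values taken from the paper.

```python
import numpy as np

# Minimal sketch: gradient descent on f(x) = |x|^p with 1 < p < 2.
# Near the minimum the growth is subquadratic, so the local curvature
# f''(x) = p*(p-1)*|x|^(p-2) blows up as x -> 0. With a fixed learning
# rate, the iterates therefore stop converging once the local sharpness
# becomes of order 2/eta, and they settle into an oscillation around the
# minimum -- an Edge-of-Stability-like regime.

p = 1.5      # subquadratic exponent (illustrative choice)
eta = 0.05   # fixed learning rate (illustrative choice)
x = 1.0      # initial point

for _ in range(200):
    grad = p * np.sign(x) * abs(x) ** (p - 1)
    x -= eta * grad

sharpness = p * (p - 1) * abs(x) ** (p - 2)
print(f"|x| after 200 steps : {abs(x):.5f}")
print(f"local sharpness     : {sharpness:.1f}")
print(f"2 / eta             : {2 / eta:.1f}")
# The iterates end up bouncing between +/- (eta*p/2)**(1/(2-p)) ~ 1.4e-3,
# where the local sharpness stays pinned at (p-1)*2/eta = 20 instead of
# growing without bound, as it would have to under exact convergence.
```

The same reasoning suggests why decaying the learning rate helps once the oscillatory regime is reached: a smaller eta lowers the sharpness level at which the iterates stabilize, letting them move closer to the minimum.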
Related papers
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities for analyzing closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Accelerated Neural Network Training with Rooted Logistic Objectives [13.400503928962756]
We derive a novel sequence of strictly convex functions that are at least as strict as the logistic loss.
Our results illustrate that training with the rooted loss function converges faster and yields performance improvements.
arXiv Detail & Related papers (2023-10-05T20:49:48Z) - A simple connection from loss flatness to compressed representations in neural networks [3.5502600490147196]
We show that in the final phase of learning in deep neural networks, the compression of the manifold of neural representations correlates with the flatness of the loss around the minima explored by SGD.
Our work builds upon the linear stability insight by Ma and Ying, deriving inequalities between various compression metrics and quantities involving sharpness.
arXiv Detail & Related papers (2023-10-03T03:36:29Z) - Early Stage Convergence and Global Convergence of Training Mildly
Parameterized Neural Networks [3.148524502470734]
We show that the loss decreases by a significant amount in the early stage of training, and that this decrease happens quickly.
We use a microscopic analysis of the activation patterns for the neurons, which helps us derive more powerful lower bounds for the gradient.
arXiv Detail & Related papers (2022-06-05T09:56:50Z) - Phenomenology of Double Descent in Finite-Width Neural Networks [29.119232922018732]
Double descent delineates the behaviour of models depending on the regime they belong to.
We use influence functions to derive suitable expressions of the population loss and its lower bound.
Building on our analysis, we investigate how the loss function affects double descent.
arXiv Detail & Related papers (2022-03-14T17:39:49Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations.
This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Mitigating Performance Saturation in Neural Marked Point Processes:
Architectures and Loss Functions [50.674773358075015]
We propose a simple graph-based network structure called GCHP, which utilizes only graph convolutional layers.
We show that GCHP can significantly reduce training time, and that the likelihood ratio loss with interarrival-time probability assumptions can greatly improve model performance.
arXiv Detail & Related papers (2021-07-07T16:59:14Z) - Topological obstructions in neural networks learning [67.8848058842671]
We study global properties of the loss gradient function flow.
We use topological data analysis of the loss function and its Morse complex to relate local behavior along gradient trajectories with global properties of the loss surface.
arXiv Detail & Related papers (2020-12-31T18:53:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.