Outliers with Opposing Signals Have an Outsized Effect on Neural Network
Optimization
- URL: http://arxiv.org/abs/2311.04163v1
- Date: Tue, 7 Nov 2023 17:43:50 GMT
- Title: Outliers with Opposing Signals Have an Outsized Effect on Neural Network
Optimization
- Authors: Elan Rosenfeld, Andrej Risteski
- Abstract summary: We identify a new phenomenon in neural network optimization which arises from the interaction of depth and a heavy-tailed structure in natural data.
In particular, it implies a conceptually new cause for progressive sharpening and the edge of stability.
We demonstrate the significant influence of paired groups of outliers in the training data with strong opposing signals.
- Score: 36.72245290832128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We identify a new phenomenon in neural network optimization which arises from
the interaction of depth and a particular heavy-tailed structure in natural
data. Our result offers intuitive explanations for several previously reported
observations about network training dynamics. In particular, it implies a
conceptually new cause for progressive sharpening and the edge of stability; we
also highlight connections to other concepts in optimization and generalization
including grokking, simplicity bias, and Sharpness-Aware Minimization.
Experimentally, we demonstrate the significant influence of paired groups of
outliers in the training data with strong opposing signals: consistent, large
magnitude features which dominate the network output throughout training and
provide gradients which point in opposite directions. Due to these outliers,
early optimization enters a narrow valley which carefully balances the opposing
groups; subsequent sharpening causes their loss to rise rapidly, oscillating
between high on one group and then the other, until the overall loss spikes. We
describe how to identify these groups, explore what sets them apart, and
carefully study their effect on the network's optimization and behavior. We
complement these experiments with a mechanistic explanation on a toy example of
opposing signals and a theoretical analysis of a two-layer linear network on a
simple model. Our finding enables new qualitative predictions of training
behavior which we confirm experimentally. It also provides a new lens through
which to study and improve modern training practices for stochastic
optimization, which we highlight via a case study of Adam versus SGD.
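The opposing-signals mechanism lends itself to a small simulation. Below is a minimal sketch (not the authors' code; the data model, architecture, and learning rate are illustrative assumptions): two outlier groups share one large-magnitude feature but carry opposite labels, so their gradients on that feature point in opposite directions, and tracking each group's loss under full-batch gradient descent illustrates the balancing and subsequent oscillation described above.

```python
# Minimal sketch: opposing signals from two outlier groups (illustrative assumptions only).
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 20
n_bulk, n_out = 200, 10

# Bulk data: small features, labels from a weak linear signal.
X_bulk = 0.1 * torch.randn(n_bulk, d)
y_bulk = (X_bulk[:, 0] > 0).float()

# Outliers: both groups share one large-magnitude feature, but their labels disagree.
big = torch.zeros(d)
big[1] = 5.0
X_a = big + 0.1 * torch.randn(n_out, d)   # group A, label 1
X_b = big + 0.1 * torch.randn(n_out, d)   # group B, label 0
X = torch.cat([X_bulk, X_a, X_b])
y = torch.cat([y_bulk, torch.ones(n_out), torch.zeros(n_out)])

model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.5)  # large step size, to approach the edge of stability

idx_a = slice(n_bulk, n_bulk + n_out)
idx_b = slice(n_bulk + n_out, n_bulk + 2 * n_out)
for step in range(300):
    opt.zero_grad()
    loss = loss_fn(model(X).squeeze(-1), y)
    loss.backward()
    opt.step()
    if step % 20 == 0:
        with torch.no_grad():
            la = loss_fn(model(X[idx_a]).squeeze(-1), y[idx_a]).item()
            lb = loss_fn(model(X[idx_b]).squeeze(-1), y[idx_b]).item()
        # The two groups' losses typically trade off against each other as sharpening sets in.
        print(f"step {step:3d}  total {loss.item():.3f}  groupA {la:.3f}  groupB {lb:.3f}")
```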
Related papers
- Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond [61.18736646013446]
In pursuit of a deeper understanding of the surprising behaviors of neural networks, we investigate the utility of a simple yet accurate model of a trained network.
Across three case studies, we illustrate how it can be applied to derive new empirical insights on a diverse range of prominent phenomena.
arXiv Detail & Related papers (2024-10-31T22:54:34Z) - Adversarial Training Can Provably Improve Robustness: Theoretical Analysis of Feature Learning Process Under Structured Data [38.44734564565478]
We provide a theoretical understanding of adversarial examples and adversarial training algorithms from the perspective of feature learning theory.
We show that the adversarial training method can provably strengthen the robust feature learning and suppress the non-robust feature learning.
arXiv Detail & Related papers (2024-10-11T03:59:49Z) - Improving Network Interpretability via Explanation Consistency Evaluation [56.14036428778861]
We propose a framework that acquires more explainable activation heatmaps and simultaneously increases the model performance.
Specifically, our framework introduces a new metric, i.e., explanation consistency, to reweight the training samples adaptively in model learning.
Our framework then promotes model learning by paying closer attention to those training samples with a high difference in explanations.
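A hedged sketch of this kind of explanation-based reweighting (the input-gradient explanation, the perturbation, and the weighting rule below are illustrative assumptions, not the paper's exact metric):

```python
# Sketch: reweight training samples by how inconsistent their explanations are.
import torch
import torch.nn.functional as F

def input_saliency(model, x, y):
    """Per-sample input-gradient 'explanation' for a classifier (an assumed stand-in)."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return grad.abs()

def reweighted_step(model, opt, x, y, eps=0.01, alpha=1.0):
    # Explanations on clean and slightly perturbed inputs.
    sal_clean = input_saliency(model, x, y)
    sal_pert = input_saliency(model, x + eps * torch.randn_like(x), y)
    # Per-sample inconsistency: how much the explanation changes under the perturbation.
    diff = (sal_clean - sal_pert).flatten(1).norm(dim=1)
    weights = 1.0 + alpha * diff / (diff.mean() + 1e-8)   # upweight inconsistent samples
    weights = (weights / weights.sum()).detach()

    opt.zero_grad()
    per_sample = F.cross_entropy(model(x), y, reduction="none")
    (weights * per_sample).sum().backward()
    opt.step()
```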
arXiv Detail & Related papers (2024-08-08T17:20:08Z) - Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy [75.15685966213832]
We analyze the rich directional structure of optimization trajectories represented by their pointwise parameters.
We show that, partway into training, switching to training only the scalar batchnorm parameters matches the performance of training the entire network.
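A minimal sketch of that batchnorm-only setup (the model choice and the switch point are illustrative assumptions):

```python
# Sketch: partway into training, freeze everything except BatchNorm scale/bias parameters.
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=10)

def train_only_batchnorm(model):
    for module in model.modules():
        is_bn = isinstance(module, nn.BatchNorm2d)
        for p in module.parameters(recurse=False):
            p.requires_grad_(is_bn)

# ... train all parameters for the first phase, then:
train_only_batchnorm(model)
opt = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.1)
```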
arXiv Detail & Related papers (2024-03-12T07:32:47Z) - No Wrong Turns: The Simple Geometry Of Neural Networks Optimization
Paths [12.068608358926317]
First-order optimization algorithms are known to efficiently locate favorable minima in deep neural networks.
We focus on the fundamental geometric properties of quantities sampled along two key optimization paths.
Our findings suggest that not only do optimization trajectories never encounter significant obstacles, but they also maintain stable dynamics during the majority of training.
arXiv Detail & Related papers (2023-06-20T22:10:40Z) - Towards Understanding the Dynamics of the First-Order Adversaries [40.54670072901657]
An acknowledged weakness of neural networks is their vulnerability to adversarial perturbations of the inputs.
One of the most popular defense mechanisms is to maximize the loss over the constrained perturbations on the inputs using projected gradient ascent and minimize over weights.
We investigate the non-concave landscape of the adversaries for a two-layer neural network with a quadratic loss.
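A hedged sketch of that min-max defense with a PGD-style inner maximization (epsilon, step size, and number of steps are illustrative assumptions):

```python
# Sketch: inner maximization over bounded input perturbations, outer minimization over weights.
import torch

def pgd_attack(model, x, y, loss_fn, eps=0.03, step=0.01, n_steps=10):
    """Projected gradient ascent on the loss over an L-infinity ball of radius eps."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(n_steps):
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step * grad.sign()    # ascent step
            delta.clamp_(-eps, eps)        # project back into the constraint set
    return (x + delta).detach()

def adversarial_training_step(model, opt, x, y, loss_fn):
    x_adv = pgd_attack(model, x, y, loss_fn)   # maximize loss over perturbations
    opt.zero_grad()
    loss_fn(model(x_adv), y).backward()        # minimize over weights
    opt.step()
```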
arXiv Detail & Related papers (2020-10-20T22:20:53Z) - On Robustness and Transferability of Convolutional Neural Networks [147.71743081671508]
Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts.
We study the interplay between out-of-distribution and transfer performance of modern image classification CNNs for the first time.
We find that increasing both the training set and model sizes significantly improves the distributional shift robustness.
arXiv Detail & Related papers (2020-07-16T18:39:04Z) - Improving Adversarial Robustness by Enforcing Local and Global
Compactness [19.8818435601131]
Adversarial training is the most successful method that consistently resists a wide range of attacks.
We propose the Adversary Divergence Reduction Network, which enforces local/global compactness and the clustering assumption.
The experimental results demonstrate that augmenting adversarial training with our proposed components can further improve the robustness of the network.
arXiv Detail & Related papers (2020-07-10T00:43:06Z) - Understanding the Effects of Data Parallelism and Sparsity on Neural
Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z) - The large learning rate phase of deep learning: the catapult mechanism [50.23041928811575]
We present a class of neural networks with solvable training dynamics.
We find good agreement between our model's predictions and training dynamics in realistic deep learning settings.
We believe our results shed light on characteristics of models trained at different learning rates.
arXiv Detail & Related papers (2020-03-04T17:52:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided (including all content) and is not responsible for any consequences of its use.