Understanding and Minimising Outlier Features in Neural Network Training
- URL: http://arxiv.org/abs/2405.19279v1
- Date: Wed, 29 May 2024 17:11:28 GMT
- Title: Understanding and Minimising Outlier Features in Neural Network Training
- Authors: Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann
- Abstract summary: Outlier Features (OF) are neurons whose activation magnitudes significantly exceed the average over a neural network's (NN) width.
We study how architectural and optimisation choices influence OFs, and provide practical insights to minimise OFs during training.
- Score: 33.980628229566555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Outlier Features (OF) are neurons whose activation magnitudes significantly exceed the average over a neural network's (NN) width. They are well known to emerge during standard transformer training and have the undesirable effect of hindering quantisation in afflicted models. Despite their practical importance, little is known about why OFs emerge during training or how one can minimise them. Our work focuses on these questions, first identifying several quantitative metrics, such as the kurtosis over neuron activation norms, to measure OFs. With these metrics, we study how architectural and optimisation choices influence OFs, and provide practical insights to minimise OFs during training. As highlights, we emphasise the importance of controlling signal propagation throughout training, and propose the Outlier Protected transformer block, which removes standard Pre-Norm layers to mitigate OFs, without loss of convergence speed or training stability. Overall, our findings shed new light on our understanding of, our ability to prevent, and the complexity of this important facet of NN training dynamics.
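As a rough illustration of the kurtosis metric named in the abstract, the sketch below computes the kurtosis of per-neuron activation norms for a single layer. This is a hedged reconstruction, not the paper's code: the paper's exact definition may differ, and all shapes, names, and values here are illustrative.

```python
import numpy as np

def activation_norm_kurtosis(acts: np.ndarray) -> float:
    """Kurtosis of per-neuron activation norms across the width dimension.

    acts: array of shape (batch, width) holding one layer's activations.
    Values near 3 suggest a Gaussian-like spread of norms; large values
    indicate that a few neurons (outlier features) dominate.
    """
    # RMS activation norm of each neuron, taken over the batch.
    neuron_norms = np.sqrt((acts ** 2).mean(axis=0))   # shape: (width,)
    centred = neuron_norms - neuron_norms.mean()
    m2 = (centred ** 2).mean()                         # second moment (variance)
    m4 = (centred ** 4).mean()                         # fourth moment
    return float(m4 / (m2 ** 2 + 1e-12))               # standard kurtosis

# Example: 4096 "typical" neurons plus 8 outliers with 50x larger scale.
rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 4096))
acts[:, :8] *= 50.0
print(activation_norm_kurtosis(acts))  # far above 3, flagging outlier features
```

On purely Gaussian activations this statistic sits near 3; the handful of planted outlier neurons drives it orders of magnitude higher, which is the signature such a metric is designed to flag.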
Related papers
- Discovering Long-Term Effects on Parameter Efficient Fine-tuning [36.83255498301937]
Pre-trained Artificial Neural Networks (ANNs) exhibit robust pattern recognition capabilities.
ANNs share extensive similarities with the human brain, specifically Biological Neural Networks (BNNs).
ANNs can acquire new knowledge through fine-tuning.
arXiv Detail & Related papers (2024-08-24T03:27:29Z)
- Post-Training Overfitting Mitigation in DNN Classifiers [31.513866929577336]
We show that post-training MM-based regularization substantially mitigates non-malicious overfitting due to class imbalances and overtraining.
Unlike adversarial training, which provides some resilience against attacks but harms clean (attack-free) generalization, we demonstrate an approach originating from adversarial learning.
arXiv Detail & Related papers (2023-09-28T20:16:24Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Training Integrable Parameterizations of Deep Neural Networks in the Infinite-Width Limit [0.0]
Large-width dynamics has emerged as a fruitful viewpoint and led to practical insights on real-world deep networks.
For two-layer neural networks, it has been understood that the nature of the trained model radically changes depending on the scale of the initial random weights.
We propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics.
arXiv Detail & Related papers (2021-10-29T07:53:35Z)
- Dynamic Neural Diversification: Path to Computationally Sustainable Neural Networks [68.8204255655161]
Small neural networks with a constrained number of trainable parameters can be suitable resource-efficient candidates for many simple tasks.
We explore the diversity of the neurons within the hidden layer during the learning process.
We analyze how the diversity of the neurons affects the model's predictions (a toy diversity metric is sketched after this entry).
arXiv Detail & Related papers (2021-09-20T15:12:16Z)
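The diversity notion in the entry above is left unspecified by this summary. Below is one plausible stand-in, assuming diversity is measured as the mean pairwise cosine distance between hidden-neuron weight vectors; the function name and setup are illustrative, not the paper's.

```python
import numpy as np

def neuron_diversity(weights: np.ndarray) -> float:
    """Mean pairwise cosine distance between hidden-neuron weight vectors.

    weights: (n_neurons, fan_in) weight matrix of one hidden layer.
    Returns a value in [0, 2]; higher means more diverse neurons.
    """
    w = weights / (np.linalg.norm(weights, axis=1, keepdims=True) + 1e-12)
    cos = w @ w.T                           # pairwise cosine similarities
    n = len(w)
    off_diag = cos[~np.eye(n, dtype=bool)]  # drop self-similarities
    return float(1.0 - off_diag.mean())

rng = np.random.default_rng(0)
print(neuron_diversity(rng.normal(size=(64, 128))))                   # random init: near 1
print(neuron_diversity(np.tile(rng.normal(size=(1, 128)), (64, 1))))  # collapsed neurons: near 0
```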
- ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training [68.63354877166756]
ActNN is a memory-efficient training framework that stores randomly quantized activations for backpropagation.
ActNN reduces the activation memory footprint by 12x and enables training with a 6.6x to 14x larger batch size (a toy 2-bit quantiser is sketched after this entry).
arXiv Detail & Related papers (2021-04-29T05:50:54Z)
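To make the 2-bit activation compression above concrete, here is a toy stochastic quantiser. It is only a sketch of the general idea: the real ActNN uses per-group scaling and fused CUDA kernels, and its API differs from the hypothetical functions below.

```python
import numpy as np

def quantize_2bit(x: np.ndarray, rng: np.random.Generator):
    """Stochastically round activations to 2-bit codes (4 levels).

    Returns the quantized codes plus the (min, scale) needed to dequantize.
    Stochastic rounding keeps the quantizer unbiased in expectation, which
    is what makes training on compressed activations workable.
    """
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 3.0 + 1e-12         # 4 levels -> 3 intervals
    normalized = (x - lo) / scale            # values in [0, 3]
    floor = np.floor(normalized)
    prob_up = normalized - floor             # distance to the next level
    codes = floor + (rng.random(x.shape) < prob_up)
    return codes.astype(np.uint8), lo, scale

def dequantize(codes: np.ndarray, lo: float, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8)).astype(np.float32)
codes, lo, scale = quantize_2bit(acts, rng)
approx = dequantize(codes, lo, scale)
print(np.abs(acts - approx).max())  # error bounded by one quantization step
```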
- Attribute-Guided Adversarial Training for Robustness to Natural Perturbations [64.35805267250682]
We propose an adversarial training approach which learns to generate new samples so as to maximize the classifier's exposure to the attribute space.
Our approach enables deep neural networks to be robust against a wide range of naturally occurring perturbations.
arXiv Detail & Related papers (2020-12-03T10:17:30Z)
- A Fully Tensorized Recurrent Neural Network [48.50376453324581]
We introduce a "fully tensorized" RNN architecture which jointly encodes the separate weight matrices within each recurrent cell using a lightweight tensor-train factorization.
This approach reduces model size by several orders of magnitude, while still maintaining similar or better performance compared to standard RNNs (a parameter-count sketch follows this entry).
arXiv Detail & Related papers (2020-10-08T18:24:12Z)
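The orders-of-magnitude size reduction claimed above follows directly from tensor-train (TT) parameter counts. The sketch below, with illustrative dimensions and ranks rather than the paper's, factorizes one 1024x1024 matrix into TT cores and compares parameter counts.

```python
import numpy as np

# Factor a 1024x1024 weight matrix as a tensor-train (TT) matrix.
# Row/column dims are factored as 1024 = 4*4*4*4*4; the TT-rank is illustrative.
m_dims = [4, 4, 4, 4, 4]
n_dims = [4, 4, 4, 4, 4]
rank = 8
ranks = [1] + [rank] * (len(m_dims) - 1) + [1]

rng = np.random.default_rng(0)
# Core k has shape (r_{k-1}, m_k, n_k, r_k).
cores = [rng.normal(size=(ranks[k], m_dims[k], n_dims[k], ranks[k + 1]))
         for k in range(len(m_dims))]

def tt_to_dense(cores):
    """Contract TT cores back into the full (prod m) x (prod n) matrix."""
    out = cores[0]                                    # (1, m1, n1, r1)
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    # out axes: (1, m1, n1, m2, n2, ..., mK, nK, 1)
    out = out.squeeze(axis=(0, -1))
    K = len(cores)
    m_axes = list(range(0, 2 * K, 2))
    n_axes = list(range(1, 2 * K, 2))
    out = out.transpose(m_axes + n_axes)              # group row dims, then column dims
    return out.reshape(np.prod(m_dims), np.prod(n_dims))

dense_params = np.prod(m_dims) * np.prod(n_dims)      # 1,048,576 parameters
tt_params = sum(c.size for c in cores)                # a few thousand parameters
print(dense_params, tt_params, dense_params / tt_params)
print(tt_to_dense(cores).shape)                       # (1024, 1024)
```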
- Feature Purification: How Adversarial Training Performs Robust Deep Learning [66.05472746340142]
We present a principle we call Feature Purification: one of the causes of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training process of a neural network.
We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle.
arXiv Detail & Related papers (2020-05-20T16:56:08Z)
- Entropy-Based Modeling for Estimating Soft Errors Impact on Binarized Neural Network Inference [2.249916681499244]
We present relatively accurate statistical models that delineate the impact of both single-event upsets (SEUs) and multi-bit upsets (MBUs) across layers and within each layer of the selected convolutional neural network.
These models can be used to evaluate the error resiliency of an NN topology before adopting it in safety-critical applications (a toy fault-injection sketch follows this entry).
arXiv Detail & Related papers (2020-04-10T16:10:24Z)
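As a toy version of the fault model above, the sketch below injects sign flips into a binarized weight tensor; one flip approximates an SEU and several flips an MBU. The statistical modeling itself is not reproduced here, and the helper function is hypothetical.

```python
import numpy as np

def inject_upsets(binary_weights: np.ndarray, n_bits: int,
                  rng: np.random.Generator) -> np.ndarray:
    """Flip n_bits randomly chosen weights in a {-1, +1} binarized layer.

    n_bits = 1 models a single-event upset (SEU); n_bits > 1 models a
    multi-bit upset (MBU). For binarized weights, a bit flip is a sign flip.
    """
    corrupted = binary_weights.copy()
    flat = corrupted.reshape(-1)               # view into the copy
    idx = rng.choice(flat.size, size=n_bits, replace=False)
    flat[idx] *= -1
    return corrupted

rng = np.random.default_rng(0)
layer = np.sign(rng.normal(size=(128, 128))).astype(np.int8)  # binarized layer
seu = inject_upsets(layer, n_bits=1, rng=rng)
mbu = inject_upsets(layer, n_bits=8, rng=rng)
print((layer != seu).sum(), (layer != mbu).sum())  # 1 and 8 flipped weights
```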
- Analyzing Redundancy in Pretrained Transformer Models [41.07850306314594]
We define a notion of Redundancy, which we categorize into two classes: General Redundancy and Task-specific Redundancy.
We present an efficient feature-based transfer learning procedure, which maintains 97% performance while using at most 10% of the original neurons (a toy neuron-selection sketch follows this entry).
arXiv Detail & Related papers (2020-04-08T14:29:23Z)
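The neuron-selection idea above can be sketched with a simple correlation-based filter. This is an illustrative stand-in, not the paper's actual procedure, and the helper names and data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_neurons(features: np.ndarray, labels: np.ndarray, keep_frac: float):
    """Rank neurons by label correlation and keep the top fraction.

    features: (n_samples, n_neurons) activations from a pretrained model.
    A simple stand-in for the paper's feature-selection procedure.
    """
    centred_f = features - features.mean(axis=0)
    centred_y = labels - labels.mean()
    # |correlation| between each neuron's activation and the label.
    scores = np.abs(centred_f.T @ centred_y) / (
        np.linalg.norm(centred_f, axis=0) * np.linalg.norm(centred_y) + 1e-12)
    k = max(1, int(keep_frac * features.shape[1]))
    return np.argsort(scores)[-k:]

# Toy data: 768 "neurons", only the first 20 carry the class signal.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
X = rng.normal(size=(1000, 768))
X[:, :20] += y[:, None]                        # informative neurons

kept = select_neurons(X, y, keep_frac=0.10)     # keep at most 10% of neurons
clf = LogisticRegression(max_iter=1000).fit(X[:, kept], y)
print(len(kept), clf.score(X[:, kept], y))      # 76 neurons, high accuracy
```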
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.