Early Period of Training Impacts Out-of-Distribution Generalization
- URL: http://arxiv.org/abs/2403.15210v1
- Date: Fri, 22 Mar 2024 13:52:53 GMT
- Title: Early Period of Training Impacts Out-of-Distribution Generalization
- Authors: Chen Cecilia Liu, Iryna Gurevych
- Abstract summary: We investigate the relationship between learning dynamics and OOD generalization during the early period of neural network training.
We show that selecting the number of trainable parameters at different times during training has a minuscule impact on ID results.
The absolute values of sharpness and trace of Fisher Information in the initial period of training are not indicative of OOD generalization.
- Score: 56.283944756315066
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Prior research has found that differences in the early period of neural network training significantly impact the performance of in-distribution (ID) tasks. However, neural networks are often sensitive to out-of-distribution (OOD) data, making them less reliable in downstream applications. Yet, the impact of the early training period on OOD generalization remains understudied due to its complexity and lack of effective analytical methodologies. In this work, we investigate the relationship between learning dynamics and OOD generalization during the early period of neural network training. We utilize the trace of Fisher Information and sharpness, with a focus on gradual unfreezing (i.e., progressively unfreezing parameters during training) as the methodology for investigation. Through a series of empirical experiments, we show that 1) selecting the number of trainable parameters at different times during training (realized by gradual unfreezing) has a minuscule impact on ID results, but greatly affects the generalization to OOD data; 2) the absolute values of sharpness and trace of Fisher Information in the initial period of training are not indicative of OOD generalization, but the relative values could be; 3) the trace of Fisher Information and sharpness may be used as indicators for the removal of interventions during the early period of training for better OOD generalization.
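To make the methodology concrete, below is a minimal sketch of gradual unfreezing combined with logging an empirical Fisher Information trace during the early period of training. The block partitioning, unfreezing schedule, and hyperparameters are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch (not the authors' exact setup): gradual unfreezing while
# logging an empirical Fisher Information trace during early training.
import torch
import torch.nn as nn

def empirical_fisher_trace(model, loss):
    """Approximate tr(F) by the sum of squared gradients of the batch loss."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return sum((g ** 2).sum().item() for g in grads)

def train_with_gradual_unfreezing(model, blocks, loader, max_steps, unfreeze_every=100):
    # `blocks` lists sub-modules from the output side to the input side; only the
    # first (e.g. the classification head) is trainable at step 0 (hypothetical schedule).
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad = (i == 0)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    next_block = 1
    for step, (x, y) in enumerate(loader):
        if step >= max_steps:
            break
        if step > 0 and step % unfreeze_every == 0 and next_block < len(blocks):
            for p in blocks[next_block].parameters():
                p.requires_grad = True              # release one more block
            next_block += 1
        loss = criterion(model(x), y)
        fisher_trace = empirical_fisher_trace(model, loss)   # early-period indicator
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 50 == 0:
            print(f"step {step:5d}  loss {loss.item():.4f}  tr(F) ~ {fisher_trace:.2f}")
```

Per the abstract, it is the relative change of such indicators across the early steps, rather than their absolute values, that relates to OOD generalization.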
Related papers
- Gradient-Regularized Out-of-Distribution Detection [28.542499196417214]
A key challenge for neural networks in real-life applications is that these models make overconfident errors when the data is not from the original training distribution.
We propose the idea of leveraging the information embedded in the gradient of the loss function during training to enable the network to learn a desired OOD score for each sample.
We also develop a novel energy-based sampling method to allow the network to be exposed to more informative OOD samples during the training phase.
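For context on the energy-based sampling mentioned above, a commonly used energy score can be computed directly from classifier logits; the sketch below is a generic building block with a placeholder temperature, not the paper's full gradient-regularization method.

```python
# Sketch: a standard energy score computed from classifier logits. This is a
# generic OOD-scoring building block, not the gradient-regularization method itself.
import torch

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Higher energy -> more OOD-like; lower energy -> more in-distribution.
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

# Hypothetical usage: rank a candidate pool and keep the highest-energy (most
# informative) samples for exposure during training.
logits = torch.randn(8, 10)               # a batch of 8 samples, 10 classes
scores = energy_score(logits)
informative_idx = torch.topk(scores, k=4).indices
```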
arXiv Detail & Related papers (2024-04-18T17:50:23Z) - Mixture Data for Training Cannot Ensure Out-of-distribution Generalization [21.801115344132114]
We show that increasing the size of training data does not always lead to a reduction in the test generalization error.
In this work, we quantitatively redefine OOD data as those situated outside the convex hull of mixed training data.
Our proof of the new risk bound confirms that the efficacy of well-trained models can be guaranteed only for unseen data within this convex hull.
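This convex-hull definition can be checked with a small feasibility test: a point counts as in-distribution if it can be written as a convex combination of training points. The sketch below uses a plain linear program and is illustrative, not the paper's implementation.

```python
# Sketch: testing whether a feature vector lies inside the convex hull of the
# training features, i.e. the geometric notion of "in-distribution" used above.
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point: np.ndarray, train: np.ndarray) -> bool:
    """True if `point` is a convex combination of the rows of `train`."""
    n = train.shape[0]
    # Feasibility LP: find w >= 0 with sum(w) = 1 and train.T @ w = point.
    A_eq = np.vstack([train.T, np.ones((1, n))])
    b_eq = np.append(point, 1.0)
    result = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
    return result.success

train_feats = np.random.rand(100, 2)                       # toy 2-D features
print(in_convex_hull(np.array([0.5, 0.5]), train_feats))   # likely inside
print(in_convex_hull(np.array([2.0, 2.0]), train_feats))   # outside the hull
```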
arXiv Detail & Related papers (2023-12-25T11:00:38Z) - Can Pre-trained Networks Detect Familiar Out-of-Distribution Data? [37.36999826208225]
We study the effect of OOD data seen during pre-training (PT-OOD) on the OOD detection performance of pre-trained networks.
We find that the low linear separability of PT-OOD in the feature space heavily degrades the PT-OOD detection performance.
We propose a unique solution to large-scale pre-trained models: Leveraging powerful instance-by-instance discriminative representations of pre-trained models.
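One simple way to exploit such instance-by-instance discriminative representations, sketched below, is a nearest-neighbour distance score in the encoder's feature space; the choice of k and the L2 normalisation are assumptions rather than the paper's exact configuration.

```python
# Sketch: a nearest-neighbour OOD score on features from a pre-trained encoder,
# one way to use instance-discriminative representations (k and the L2
# normalisation are illustrative choices, not the paper's configuration).
import numpy as np

def knn_ood_score(test_feats: np.ndarray, train_feats: np.ndarray, k: int = 10):
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    # Pairwise Euclidean distances between test and training features.
    dists = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=-1)
    # Distance to the k-th nearest training feature: larger -> more OOD-like.
    return np.sort(dists, axis=1)[:, k - 1]
```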
arXiv Detail & Related papers (2023-10-02T02:01:00Z) - LINe: Out-of-Distribution Detection by Leveraging Important Neurons [15.797257361788812]
We introduce a new aspect for analyzing the difference in model outputs between in-distribution data and OOD data.
We propose a novel method, Leveraging Important Neurons (LINe), for post-hoc out-of-distribution detection.
arXiv Detail & Related papers (2023-03-24T13:49:05Z) - Pseudo-OOD training for robust language models [78.15712542481859]
OOD detection is a key component of a reliable machine-learning model for any industry-scale application.
We propose POORE (POsthoc pseudo-Ood REgularization), which generates pseudo-OOD samples using in-distribution (IND) data.
We extensively evaluate our framework on three real-world dialogue systems, achieving new state-of-the-art in OOD detection.
arXiv Detail & Related papers (2022-10-17T14:32:02Z) - Do Deep Neural Networks Always Perform Better When Eating More Data? [82.6459747000664]
We design experiments under Independent and Identically Distributed (IID) and Out-of-Distribution (OOD) conditions.
Under the IID condition, the amount of information determines the effectiveness of each sample, while the contribution of samples and the difference between classes determine the amount of class information.
Under the OOD condition, the cross-domain degree of samples determines their contributions, and bias-fitting caused by irrelevant elements is a significant factor in cross-domain performance.
arXiv Detail & Related papers (2022-05-30T15:40:33Z) - On the Impact of Spurious Correlation for Out-of-distribution Detection [14.186776881154127]
We present a new formalization and model the data shifts by taking into account both the invariant and environmental features.
Our results suggest that the detection performance is severely worsened when the correlation between spurious features and labels is increased in the training set.
arXiv Detail & Related papers (2021-09-12T23:58:17Z) - Improved OOD Generalization via Adversarial Training and Pre-training [49.08683910076778]
In this paper, we theoretically show that a model robust to input perturbations generalizes well on OOD data.
Inspired by previous findings that adversarial training helps improve input-robustness, we show that adversarially trained models have a convergent excess risk on OOD data.
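As a reference point for the input-robustness connection, below is a minimal FGSM-style adversarial training step; the perturbation budget epsilon is a placeholder and the sketch is not the training recipe used in the paper.

```python
# Sketch: one FGSM-style adversarial training step, illustrating the kind of
# input-robust training this result connects to OOD generalization
# (epsilon is a placeholder, not a value from the paper).
import torch
import torch.nn as nn

def adversarial_training_step(model, x, y, optimizer, epsilon=8 / 255):
    criterion = nn.CrossEntropyLoss()
    x_adv = x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(criterion(model(x_adv), y), x_adv)[0]
    x_adv = (x + epsilon * grad.sign()).detach()     # worst-case input perturbation
    optimizer.zero_grad()
    loss = criterion(model(x_adv), y)                # train on the perturbed input
    loss.backward()
    optimizer.step()
    return loss.item()
```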
arXiv Detail & Related papers (2021-05-24T08:06:35Z) - Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization [111.57403811375484]
We show that gradient descent implicitly penalizes the trace of the Fisher Information Matrix from the beginning of training.
We highlight that in the absence of implicit or explicit regularization, the trace of the FIM can increase to a large value early in training.
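A minimal sketch of an explicit Fisher penalty follows, keeping an approximate trace of the Fisher Information Matrix small during early training; the squared-gradient proxy and the weight lam are assumptions, not the paper's exact formulation.

```python
# Sketch: an explicit Fisher penalty that keeps an approximate tr(FIM) small
# during early training (the squared-gradient proxy and the weight `lam` are
# assumptions, not the paper's exact formulation).
import torch

def loss_with_fisher_penalty(model, criterion, x, y, lam=0.1):
    loss = criterion(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)   # differentiable
    fisher_trace = sum((g ** 2).sum() for g in grads)
    return loss + lam * fisher_trace
```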
arXiv Detail & Related papers (2020-12-28T11:17:46Z) - Learn what you can't learn: Regularized Ensembles for Transductive Out-of-distribution Detection [76.39067237772286]
We show that current out-of-distribution (OOD) detection algorithms for neural networks produce unsatisfactory results in a variety of OOD detection scenarios.
This paper studies how such "hard" OOD scenarios can benefit from adjusting the detection method after observing a batch of the test data.
We propose a novel method that uses an artificial labeling scheme for the test data and regularization to obtain ensembles of models that produce contradictory predictions only on the OOD samples in a test batch.
arXiv Detail & Related papers (2020-12-10T16:55:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.