Deep Exploration of Epoch-wise Double Descent in Noisy Data: Signal Separation, Large Activation, and Benign Overfitting
- URL: http://arxiv.org/abs/2601.08316v1
- Date: Tue, 13 Jan 2026 08:13:15 GMT
- Title: Deep Exploration of Epoch-wise Double Descent in Noisy Data: Signal Separation, Large Activation, and Benign Overfitting
- Authors: Tomoki Kubo, Ryuken Uda, Yusuke Iida
- Abstract summary: "Deep double descent" is one of the key phenomena underlying the generalization capability of deep learning models. In this study, epoch-wise double descent was investigated by focusing on the evolution of internal structures. Results: The model achieved strong re-generalization on test data even after perfectly fitting noisy training data during the double descent phase.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep double descent is one of the key phenomena underlying the generalization capability of deep learning models. In this study, epoch-wise double descent, i.e., delayed generalization that follows overfitting, was empirically investigated by focusing on the evolution of internal structures. Fully connected neural networks of three different sizes were trained on the CIFAR-10 dataset with 30% label noise. By decomposing the loss curves into signal contributions from clean and noisy training data, the epoch-wise evolutions of internal signals were analyzed separately. Three main findings were obtained from this analysis. First, the model achieved strong re-generalization on test data even after perfectly fitting noisy training data during the double descent phase, corresponding to a "benign overfitting" state. Second, noisy data were learned after clean data, and as learning progressed, their corresponding internal activations became increasingly separated in outer layers; this enabled the model to overfit only the noisy data. Third, a single, very large activation emerged in the shallow layer across all models; this phenomenon, referred to as "outliers," "massive activations," or "super activations" in recent large language models, evolves together with re-generalization. The magnitude of the large activation correlated with input patterns but not with output patterns. These empirical findings directly link the recent key phenomena of "deep double descent," "benign overfitting," and "large activation," and support the proposal of a novel scenario for understanding deep double descent.
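The abstract describes a concrete analysis pipeline: train fully connected networks on CIFAR-10 with 30% label noise, decompose the training loss into contributions from clean and noisy examples, and track activations in the shallow layer. The sketch below (PyTorch) illustrates that pipeline under assumptions of ours, not the paper's exact setup: layer widths, the optimizer, batch size, number of epochs, and helper names such as `IndexedCIFAR10` and `noise_mask` are illustrative choices.

```python
# Minimal sketch: signal separation of the training loss into clean vs. noisy
# contributions, plus logging of the largest first-hidden-layer activation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

NUM_CLASSES, NOISE_RATE = 10, 0.3

class IndexedCIFAR10(datasets.CIFAR10):
    """CIFAR-10 that also returns the sample index, so each batch can be
    split into its clean and label-noise subsets."""
    def __getitem__(self, idx):
        img, target = super().__getitem__(idx)
        return img, target, idx

data = IndexedCIFAR10("./data", train=True, download=True,
                      transform=transforms.ToTensor())

# Flip 30% of the training labels uniformly at random and remember which ones
# (a flip may occasionally coincide with the true label).
g = torch.Generator().manual_seed(0)
labels = torch.tensor(data.targets)
noise_mask = torch.rand(len(labels), generator=g) < NOISE_RATE
labels[noise_mask] = torch.randint(0, NUM_CLASSES, (int(noise_mask.sum()),), generator=g)
data.targets = labels.tolist()

loader = DataLoader(data, batch_size=256, shuffle=True)

# Small fully connected network; the abstract reports a single very large
# activation emerging in the shallow (first hidden) layer.
model = nn.Sequential(nn.Flatten(), nn.Linear(3072, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, NUM_CLASSES))
opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

for epoch in range(200):   # long training so any delayed generalization can show up
    clean_sum = noisy_sum = 0.0
    clean_n = noisy_n = 0
    max_act = 0.0
    for xb, yb, idx in loader:
        h1 = F.relu(model[1](model[0](xb)))           # first hidden layer
        max_act = max(max_act, h1.abs().max().item())
        per_sample = F.cross_entropy(model(xb), yb, reduction="none")
        opt.zero_grad(); per_sample.mean().backward(); opt.step()
        # Signal separation: accumulate loss over clean vs. label-noise samples.
        mb = noise_mask[idx]
        clean_sum += per_sample[~mb].sum().item(); clean_n += int((~mb).sum())
        noisy_sum += per_sample[mb].sum().item();  noisy_n += int(mb.sum())
    print(f"epoch {epoch:3d}  clean {clean_sum / max(clean_n, 1):.4f}  "
          f"noisy {noisy_sum / max(noisy_n, 1):.4f}  max act {max_act:.2f}")
```

Plotting the clean and noisy per-epoch losses (and the max-activation trace) against test error is then enough to see whether the noisy subset is fit later than the clean subset and whether a large activation grows during re-generalization.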
Related papers
- Understanding Embedding Scaling in Collaborative Filtering [12.221835332469228]
We conduct large-scale experiments across 10 datasets with varying sparsity levels and scales. We observe two novel phenomena: double-peak and logarithmic. We gain an understanding of the underlying causes of the double-peak phenomenon.
arXiv Detail & Related papers (2025-09-19T07:33:50Z) - Unveiling Multiple Descents in Unsupervised Autoencoders [25.244065166421517]
We show for the first time that double and triple descent can be observed with nonlinear unsupervised autoencoders. Through extensive experiments on both synthetic and real datasets, we uncover model-wise, epoch-wise, and sample-wise double descent.
arXiv Detail & Related papers (2024-06-17T16:24:23Z) - Understanding the Double Descent Phenomenon in Deep Learning [49.1574468325115]
This tutorial sets the classical statistical learning framework and introduces the double descent phenomenon.
By looking at a number of examples, Section 2 introduces inductive biases that appear to play a key role in double descent.
Section 3 explores double descent with two linear models, and gives other points of view from recent related works.
arXiv Detail & Related papers (2024-03-15T16:51:24Z) - Unraveling the Enigma of Double Descent: An In-depth Analysis through the Lens of Learned Feature Space [12.907949196758565]
Double descent presents a counter-intuitive aspect within the machine learning domain.
We argue that double descent arises in imperfect models trained with noisy data.
arXiv Detail & Related papers (2023-10-20T15:10:16Z) - Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with quadratic loss function, fully connected feedforward architecture, ReLU activations, Gaussian data instances, adversarial labels.
The results strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the "NTK regime".
arXiv Detail & Related papers (2022-12-05T14:47:52Z) - Deep Double Descent via Smooth Interpolation [2.141079906482723]
We quantify the sharpness of fit to the training data by studying the loss landscape w.r.t. the input variable locally around each training point (a minimal gradient-based sharpness proxy is sketched after this list).
Our findings show that loss sharpness in the input space follows both model- and epoch-wise double descent, with worse peaks observed around noisy targets.
While small interpolating models sharply fit both clean and noisy data, large interpolating models express a smooth loss landscape, in contrast to existing intuition.
arXiv Detail & Related papers (2022-09-21T02:46:13Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - When and how epochwise double descent happens [7.512375012141203]
An 'epochwise double descent' effect exists in which the generalization error initially drops, then rises, and finally drops again with increasing training time.
This presents a practical problem in that the amount of time required for training is long, and early stopping based on validation performance may result in suboptimal generalization.
We show that epochwise double descent requires a critical amount of noise to occur, but above a second critical noise level early stopping remains effective.
arXiv Detail & Related papers (2021-08-26T19:19:17Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics [118.75207687144817]
We introduce Data Maps, a model-based tool to characterize and diagnose datasets.
We leverage a largely ignored source of information: the behavior of the model on individual instances during training (a minimal confidence/variability sketch appears after this list).
Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
arXiv Detail & Related papers (2020-09-22T20:19:41Z)
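For the input-space sharpness measure mentioned in the "Deep Double Descent via Smooth Interpolation" entry above, a simple proxy is the norm of the loss gradient with respect to the input at each training point. This is a minimal sketch under that assumption; the paper's exact sharpness definition may differ.

```python
# Minimal sketch: gradient-norm proxy for input-space loss sharpness.
import torch
import torch.nn.functional as F

def input_sharpness(model, x, y):
    """Return the L2 norm of d(loss)/d(input) for each example in the batch."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y, reduction="sum")   # sum keeps per-example grads independent
    (grad,) = torch.autograd.grad(loss, x)
    return grad.flatten(1).norm(dim=1)                     # one sharpness value per example
```

Evaluating this separately on clean and noisy training points over training epochs is one way to reproduce a model-wise or epoch-wise sharpness curve.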
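The "Dataset Cartography" entry above characterizes examples by the model's behavior on them during training. A minimal sketch of the confidence/variability coordinates it describes, assuming the per-epoch gold-label probabilities have already been recorded, is:

```python
# Minimal sketch: data-map coordinates from recorded training dynamics.
import numpy as np

def data_map_coordinates(gold_probs: np.ndarray):
    """gold_probs: shape (num_epochs, num_examples); entry [e, i] is the model's
    probability of example i's gold label at the end of epoch e."""
    confidence = gold_probs.mean(axis=0)     # high -> easy-to-learn region
    variability = gold_probs.std(axis=0)     # high -> ambiguous region
    return confidence, variability
```

Plotting confidence against variability gives the map; label-noise examples typically fall in the low-confidence, low-variability (hard-to-learn) region.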