Why Loss Re-weighting Works If You Stop Early: Training Dynamics of Unconstrained Features
- URL: http://arxiv.org/abs/2601.12011v1
- Date: Sat, 17 Jan 2026 11:26:53 GMT
- Title: Why Loss Re-weighting Works If You Stop Early: Training Dynamics of Unconstrained Features
- Authors: Yize Zhao, Christos Thrampoulidis,
- Abstract summary: We introduce a small-scale model (SSM) to transparently demonstrate and analyze this phenomenon.<n>On the one hand, the SSM reveals how vanilla empirical risk minimization preferentially learns to distinguish majority classes over minorities early in training.<n>In stark contrast, reweighting restores balanced learning dynamics, enabling the simultaneous learning of features associated with both majorities and minorities.
- Score: 34.88156871518115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The application of loss reweighting in modern deep learning presents a nuanced picture. While it fails to alter the terminal learning phase in overparameterized deep neural networks (DNNs) trained on high-dimensional datasets, empirical evidence consistently shows it offers significant benefits early in training. To transparently demonstrate and analyze this phenomenon, we introduce a small-scale model (SSM). This model is specifically designed to abstract the inherent complexities of both the DNN architecture and the input data, while maintaining key information about the structure of imbalance within its spectral components. On the one hand, the SSM reveals how vanilla empirical risk minimization preferentially learns to distinguish majority classes over minorities early in training, consequently delaying minority learning. In stark contrast, reweighting restores balanced learning dynamics, enabling the simultaneous learning of features associated with both majorities and minorities.
Related papers
- Differential Privacy in Two-Layer Networks: How DP-SGD Harms Fairness and Robustness [2.9327666088683664]
This paper introduces a unified feature-centric framework to analyze the feature learning dynamics of differentially private gradient.<n>We demonstrate that the noise required for privacy leads to suboptimal feature learning networks.
arXiv Detail & Related papers (2026-03-05T07:19:31Z) - Theoretical Analysis of Contrastive Learning under Imbalanced Data: From Training Dynamics to a Pruning Solution [33.0181633510156]
We develop a theoretical framework to analyze the training dynamics of contrastive learning with Transformer-based encoders under imbalanced data.<n>Our results reveal that neuron weights evolve through three distinct stages of training, with different dynamics for majority features, minority features, and noise.<n>Inspired by these neuron-level behaviors, we show that pruning restores performance degraded by imbalance and enhances feature separation, offering both conceptual insights and practical guidance.
arXiv Detail & Related papers (2026-02-10T23:06:12Z) - The Importance of Being Lazy: Scaling Limits of Continual Learning [60.97756735877614]
We show that increasing model width is only beneficial when it reduces the amount of feature learning, yielding more laziness.<n>We study the intricate relationship between feature learning, task non-stationarity, and forgetting, finding that high feature learning is only beneficial with highly similar tasks.
arXiv Detail & Related papers (2025-06-20T10:12:38Z) - Understanding Sharpness Dynamics in NN Training with a Minimalist Example: The Effects of Dataset Difficulty, Depth, Stochasticity, and More [10.65078014704416]
When training deep neural networks with sharpness often increases, before saturating at the edge of stability.<n>In this work, we study this phenomenon using a minimalist model: a deep linear network with a single neuron per layer.<n>We show that this simple model effectively captures the sharpness dynamics observed in recent empirical studies, offering a simple testbed to better understand neural network training.
arXiv Detail & Related papers (2025-06-07T22:35:13Z) - Disentangling the Causes of Plasticity Loss in Neural Networks [55.23250269007988]
We show that loss of plasticity can be decomposed into multiple independent mechanisms.
We show that a combination of layer normalization and weight decay is highly effective at maintaining plasticity in a variety of synthetic nonstationary learning tasks.
arXiv Detail & Related papers (2024-02-29T00:02:33Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-vary learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Harnessing the Power of Explanations for Incremental Training: A
LIME-Based Approach [6.244905619201076]
In this work, model explanations are fed back to the feed-forward training to help the model generalize better.
The framework incorporates the custom weighted loss with Elastic Weight Consolidation (EWC) to maintain performance in sequential testing sets.
The proposed custom training procedure results in a consistent enhancement of accuracy ranging from 0.5% to 1.5% throughout all phases of the incremental learning setup.
arXiv Detail & Related papers (2022-11-02T18:16:17Z) - Critical Learning Periods for Multisensory Integration in Deep Networks [112.40005682521638]
We show that the ability of a neural network to integrate information from diverse sources hinges critically on being exposed to properly correlated signals during the early phases of training.
We show that critical periods arise from the complex and unstable early transient dynamics, which are decisive of final performance of the trained system and their learned representations.
arXiv Detail & Related papers (2022-10-06T23:50:38Z) - Symplectic Momentum Neural Networks -- Using Discrete Variational
Mechanics as a prior in Deep Learning [7.090165638014331]
This paper introduces Sympic Momentum Networks (SyMo) as models from a discrete formulation of mechanics for non-separable mechanical systems.
We show that such combination not only allows these models tol earn from limited data but also provides the models with the capability of preserving the symplectic form and show better long-term behaviour.
arXiv Detail & Related papers (2022-01-20T16:33:19Z) - Reducing Catastrophic Forgetting in Self Organizing Maps with
Internally-Induced Generative Replay [67.50637511633212]
A lifelong learning agent is able to continually learn from potentially infinite streams of pattern sensory data.
One major historic difficulty in building agents that adapt is that neural systems struggle to retain previously-acquired knowledge when learning from new samples.
This problem is known as catastrophic forgetting (interference) and remains an unsolved problem in the domain of machine learning to this day.
arXiv Detail & Related papers (2021-12-09T07:11:14Z) - Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.