Enlightenment Period Improving DNN Performance
- URL: http://arxiv.org/abs/2504.01737v2
- Date: Wed, 29 Oct 2025 06:27:21 GMT
- Title: Enlightenment Period Improving DNN Performance
- Authors: Tiantian Liu, Meng Wan, Jue Wang, Ningming Nie,
- Abstract summary: We show that applying Mixup data augmentation during this phase has a dual effect: it introduces a Gradient Interference Effect that hinders performance. We propose three strategies that improve performance by solely adjusting the training data distribution within this brief period.
- Score: 6.039062704986015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The start of deep neural network training is characterized by a brief yet critical phase that lasts from the beginning of the training until the accuracy reaches approximately 50%. During this phase, disordered representations rapidly transition toward ordered structure, and we term this phase the Enlightenment Period. Through theoretical modeling based on phase transition theory and experimental validation, we reveal that applying Mixup data augmentation during this phase has a dual effect: it introduces a Gradient Interference Effect that hinders performance, while also providing a beneficial Activation Revival Effect to restore gradient updates for saturated neurons. We further demonstrate that this negative interference diminishes as the sample set size or the model parameter size increases, thereby shifting the balance between these two effects. Based on these findings, we propose three strategies that improve performance by solely adjusting the training data distribution within this brief period: the Mixup Pause Strategy for small-scale scenarios, the Alpha Boost Strategy for large-scale scenarios with underfitting, and the High-Loss Removal Strategy for tasks where Mixup is inapplicable (e.g., time series and large language models). Extensive experiments show that these strategies achieve superior performance across diverse architectures such as ViT and ResNet on datasets including CIFAR and ImageNet-1K. Ultimately, this work offers a novel perspective on enhancing model performance by strategically capitalizing on the dynamics of the brief and crucial early stages of training. Code is available at https://anonymous.4open.science/r/code-A5F1/.
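The Mixup Pause Strategy described in the abstract can be sketched in a few lines: apply standard Mixup only after training accuracy has climbed past the end of the Enlightenment Period (roughly 50%). The function names, the 0.5 threshold parameterization, and the Beta(α, α) mixing are a minimal illustrative sketch of the idea, not the paper's exact implementation.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Standard Mixup: convexly combine random pairs of samples and labels.

    x: batch of inputs, shape (B, ...); y: one-hot labels, shape (B, C).
    """
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))        # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix

def apply_mixup(x, y, train_acc, acc_threshold=0.5, alpha=0.2, rng=None):
    """Mixup Pause Strategy (sketch): skip Mixup while training accuracy
    is below the threshold (the 'Enlightenment Period'), then enable it."""
    if train_acc < acc_threshold:
        return x, y  # pause Mixup during the early, gradient-sensitive phase
    return mixup_batch(x, y, alpha=alpha, rng=rng)
```

In a training loop this gate would be evaluated once per epoch from the running training accuracy; because the pause only covers the brief early phase, the schedule changes nothing about the rest of training.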
Related papers
- Efficient Conditional Generation on Scale-based Visual Autoregressive Models [26.81493253536486]
Efficient Control Model (ECM) is a plug-and-play framework featuring a lightweight control module that introduces control signals via a distributed architecture. ECM refines conditional features using real-time generated tokens and a shared feed-forward network (FFN) designed to maximize the utilization of its limited capacity. Our method achieves high-fidelity and diverse control over image generation, surpassing existing baselines while significantly improving both training and inference efficiency.
arXiv Detail & Related papers (2025-10-07T06:27:03Z) - Learning to Reason as Action Abstractions with Scalable Mid-Training RL [55.24192942739207]
An effective mid-training phase should identify a compact set of useful actions and enable fast selection. We propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm.
arXiv Detail & Related papers (2025-09-30T05:34:20Z) - Decomposing the Entropy-Performance Exchange: The Missing Keys to Unlocking Effective Reinforcement Learning [106.68304931854038]
Reinforcement learning with verifiable rewards (RLVR) has been widely used for enhancing the reasoning abilities of large language models (LLMs). We conduct a systematic empirical analysis of the entropy-performance exchange mechanism of RLVR across different levels of granularity. Our analysis reveals that, in the rising stage, entropy reduction in negative samples facilitates the learning of effective reasoning patterns. In the plateau stage, learning efficiency strongly correlates with high-entropy tokens present in low-perplexity samples and those located at the end of sequences.
arXiv Detail & Related papers (2025-08-04T10:08:10Z) - Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers [5.187307904567701]
We propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation. We show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by $\sim$12.8%.
arXiv Detail & Related papers (2025-05-25T12:25:50Z) - Quiet Feature Learning in Algorithmic Tasks [1.9249287163937978]
We train Transformer-based language models on ten foundational algorithmic tasks. We observe pronounced phase transitions in their loss curves that deviate from established power-law scaling trends.
arXiv Detail & Related papers (2025-05-06T22:18:50Z) - An Overview of Low-Rank Structures in the Training and Adaptation of Large Models [52.67110072923365]
Recent research has uncovered a widespread phenomenon in deep networks: the emergence of low-rank structures.
These implicit low-dimensional patterns provide valuable insights for improving the efficiency of training and fine-tuning large-scale models.
We present a comprehensive review of advances in exploiting low-rank structures for deep learning and shed light on their mathematical foundations.
arXiv Detail & Related papers (2025-03-25T17:26:09Z) - Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance. We introduce novel algorithms for dynamic, instance-level data reweighting. Our framework allows us to devise reweighting strategies that deprioritize redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z) - Spurious Forgetting in Continual Learning of Language Models [20.0936011355535]
Recent advancements in large language models (LLMs) reveal a perplexing phenomenon in continual learning.
Despite extensive training, models experience significant performance declines.
This study proposes that such performance drops often reflect a decline in task alignment rather than true knowledge loss.
arXiv Detail & Related papers (2025-01-23T08:09:54Z) - On Multi-Stage Loss Dynamics in Neural Networks: Mechanisms of Plateau and Descent Stages [1.5235340620594793]
We identify three distinct stages observed in the loss curve during training: the initial plateau stage, the initial descent stage, and the secondary plateau stage.
Through rigorous analysis, we reveal the underlying challenges contributing to slow training during the plateau stages.
arXiv Detail & Related papers (2024-10-26T08:16:00Z) - Towards Robust Transcription: Exploring Noise Injection Strategies for Training Data Augmentation [55.752737615873464]
This study investigates the impact of white noise at various Signal-to-Noise Ratio (SNR) levels on state-of-the-art APT models.
We hope this research provides valuable insights as preliminary work toward developing transcription models that maintain consistent performance across a range of acoustic conditions.
arXiv Detail & Related papers (2024-10-18T02:31:36Z) - Improved Noise Schedule for Diffusion Training [51.849746576387375]
We propose a novel approach to design the noise schedule for enhancing the training of diffusion models. We empirically demonstrate the superiority of our noise schedule over the standard cosine schedule.
arXiv Detail & Related papers (2024-07-03T17:34:55Z) - Contrastive-Adversarial and Diffusion: Exploring pre-training and fine-tuning strategies for sulcal identification [3.0398616939692777]
Techniques like adversarial learning, contrastive learning, diffusion denoising learning, and ordinary reconstruction learning have become standard.
The study aims to elucidate the advantages of pre-training techniques and fine-tuning strategies to enhance the learning process of neural networks.
arXiv Detail & Related papers (2024-05-29T15:44:51Z) - Impact of Noisy Supervision in Foundation Model Learning [91.56591923244943]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets. We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z) - Efficient Stagewise Pretraining via Progressive Subnetworks [53.00045381931778]
The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective when compared to stacking-based approaches.
This paper challenges this notion by demonstrating that, with proper design, dropping strategies can be competitive, if not better, than stacking methods.
We propose an instantiation of this framework - Random Part Training (RAPTR) - that selects and trains only a random subnetwork at each step, progressively increasing the size in stages.
arXiv Detail & Related papers (2024-02-08T18:49:09Z) - Adaptive Training Meets Progressive Scaling: Elevating Efficiency in Diffusion Models [52.1809084559048]
We propose a novel two-stage divide-and-conquer training strategy termed TDC Training. It groups timesteps based on task similarity and difficulty, assigning highly customized denoising models to each group, thereby enhancing the performance of diffusion models. While two-stage training avoids the need to train each model separately, the total training cost is even lower than training a single unified denoising model.
arXiv Detail & Related papers (2023-12-20T03:32:58Z) - Unraveling the Temporal Dynamics of the Unet in Diffusion Models [33.326244121918634]
Diffusion models introduce Gaussian noise into training data and reconstruct the original data iteratively.
Central to this iterative process is a single Unet, adapting across time steps to facilitate generation.
Recent work revealed the presence of composition and denoising phases in this generation process.
arXiv Detail & Related papers (2023-12-17T04:40:33Z) - Analyzing and Improving the Training Dynamics of Diffusion Models [36.37845647984578]
We identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture.
We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity.
arXiv Detail & Related papers (2023-12-05T11:55:47Z) - Small-scale proxies for large-scale Transformer training instabilities [69.36381318171338]
We seek ways to reproduce and study training stability and instability at smaller scales.
By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates.
We study methods such as warm-up, weight decay, and the $\mu$Param to train small models that achieve similar losses across orders of magnitude of learning rate variation.
arXiv Detail & Related papers (2023-09-25T17:48:51Z) - A Loss Curvature Perspective on Training Instability in Deep Learning [28.70491071044542]
We study the evolution of the loss Hessian across many classification tasks in order to understand the effect curvature of the loss has on the training dynamics.
Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization.
arXiv Detail & Related papers (2021-10-08T20:25:48Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of the "break-even" point on this trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.