Over-Alignment vs Over-Fitting: The Role of Feature Learning Strength in Generalization
- URL: http://arxiv.org/abs/2602.00827v1
- Date: Sat, 31 Jan 2026 17:43:02 GMT
- Title: Over-Alignment vs Over-Fitting: The Role of Feature Learning Strength in Generalization
- Authors: Taesun Yeom, Taehyeok Ha, Jaeho Lee
- Abstract summary: We develop a theoretical analysis of gradient flow dynamics in two-layer ReLU nets trained with logistic loss. An excessively large FLS induces an $\textit{over-alignment}$ phenomenon that degrades generalization, while an overly small FLS leads to $\textit{over-fitting}$.
- Score: 8.58740389510812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feature learning strength (FLS), i.e., the inverse of the effective output scaling of a model, plays a critical role in shaping the optimization dynamics of neural nets. While its impact has been extensively studied under the asymptotic regimes -- both in training time and FLS -- existing theory offers limited insight into how FLS affects generalization in practical settings, such as when training is stopped upon reaching a target training risk. In this work, we investigate the impact of FLS on generalization in deep networks under such practical conditions. Through empirical studies, we first uncover the emergence of an $\textit{optimal FLS}$ -- neither too small nor too large -- that yields substantial generalization gains. This finding runs counter to the prevailing intuition that stronger feature learning universally improves generalization. To explain this phenomenon, we develop a theoretical analysis of gradient flow dynamics in two-layer ReLU nets trained with logistic loss, where FLS is controlled via initialization scale. Our main theoretical result establishes the existence of an optimal FLS arising from a trade-off between two competing effects: An excessively large FLS induces an $\textit{over-alignment}$ phenomenon that degrades generalization, while an overly small FLS leads to $\textit{over-fitting}$.
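The abstract defines FLS as the inverse of the effective output scaling of a model. A minimal sketch of that parameterization, in the two-layer ReLU setting the paper analyzes, is below; the function names and the use of a single scalar `fls` are illustrative assumptions, not the paper's exact construction (the paper controls FLS via initialization scale).

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 5, 8                      # input dimension, hidden width
W = rng.normal(size=(h, d))      # first-layer weights
a = rng.normal(size=h)           # second-layer weights
x = rng.normal(size=d)

def two_layer_relu(x, W, a, fls):
    """Two-layer ReLU net whose output is scaled by 1/fls.

    Here `fls` stands in for the feature learning strength:
    since FLS is the inverse of the effective output scaling,
    doubling `fls` halves the network output. A large FLS
    (small output scale) pushes gradient flow toward the
    feature-learning regime; a small FLS (large output scale)
    pushes it toward the lazy regime.
    """
    return (1.0 / fls) * (a @ np.maximum(W @ x, 0.0))

def logistic_loss(f, y):
    # Logistic loss on a single example, label y in {-1, +1}.
    return np.log1p(np.exp(-y * f))
```

Under this parameterization the paper's trade-off can be read off directly: sweeping `fls` from small to large moves training from an over-fitting-prone regime to an over-alignment-prone one, with the optimal FLS in between.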
Related papers
- Data Distribution as a Lever for Guiding Optimizers Toward Superior Generalization in LLMs [60.68927774057402]
We show, for the first time, that a lower simplicity bias (SB) induces better generalization. Motivated by this insight, we demonstrate that reshaping the training data distribution by upsampling or augmenting examples learned later in training similarly reduces SB and leads to improved generalization. Our strategy improves the performance of multiple language models, including Phi2-2.7B, Llama3.2-1B, Gemma3-1B-PT, and Qwen3-0.6B-Base, achieving relative accuracy gains of up to 18% when fine-tuned with AdamW and Muon.
arXiv Detail & Related papers (2026-01-31T07:40:36Z) - How to Set the Learning Rate for Large-Scale Pre-training? [73.03133634525635]
We formalize this investigation into two distinct research paradigms: Fitting and Transfer. Within the Fitting paradigm, we introduce a scaling law for the search factor, effectively reducing the search complexity from O(n^3) to O(n*C_D*C_) via predictive modeling. We extend the principles of Transfer to the Mixture of Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons.
arXiv Detail & Related papers (2026-01-08T15:55:13Z) - How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns [51.02752099869218]
Large Language Models (LLMs) display strikingly different generalization behaviors. We introduce a novel benchmark that decomposes reasoning into atomic core skills. We show that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns.
arXiv Detail & Related papers (2025-12-30T08:16:20Z) - Why Do Transformers Fail to Forecast Time Series In-Context? [21.43699354236011]
Time series forecasting (TSF) remains a challenging and largely unsolved problem in machine learning. Empirical evidence consistently shows that even powerful Transformers often fail to outperform simpler models. We provide a theoretical analysis of Transformers' limitations for TSF through the lens of In-Context Learning (ICL) theory.
arXiv Detail & Related papers (2025-10-10T18:34:19Z) - Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs [13.036236161537147]
Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence has shown that RL fine-tuning improves the capability of LLMs beyond what SFT alone achieves. However, the underlying mechanisms by which RL fine-tuning enhances the capability of various LLMs with distinct intrinsic characteristics remain underexplored.
arXiv Detail & Related papers (2025-09-25T11:51:05Z) - SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs [53.77646961962239]
Supervised Fine-Tuning (SFT) is a common approach to adapt Large Language Models (LLMs) to specialized tasks. We show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation.
arXiv Detail & Related papers (2025-09-25T05:28:22Z) - Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules [9.332823269318842]
Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models. We establish a Functional Scaling Law that captures the full loss trajectory under arbitrary learning rate schedules (LRSs). We derive explicit scaling relations in both data- and compute-limited regimes.
arXiv Detail & Related papers (2025-09-23T16:05:16Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss also allows for more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Efficient and Flexible Neural Network Training through Layer-wise Feedback Propagation [49.44309457870649]
Layer-wise Feedback Propagation (LFP) is a novel training principle for neural network-like predictors. LFP decomposes a reward to individual neurons based on their respective contributions. Our method then implements a greedy approach, reinforcing helpful parts of the network and weakening harmful ones.
arXiv Detail & Related papers (2023-08-23T10:48:28Z) - Deep Active Learning by Leveraging Training Dynamics [57.95155565319465]
We propose a theory-driven deep active learning method (dynamicAL) which selects samples to maximize training dynamics.
We show that dynamicAL not only outperforms other baselines consistently but also scales well on large deep learning models.
arXiv Detail & Related papers (2021-10-16T16:51:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.