From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning
- URL: http://arxiv.org/abs/2510.24812v1
- Date: Tue, 28 Oct 2025 07:53:24 GMT
- Title: From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning
- Authors: Junsoo Oh, Jerry Song, Chulhee Yun
- Abstract summary: We provide a formal analysis of weak-to-strong generalization from a linear CNN (weak) to a two-layer ReLU CNN (strong). Our analysis identifies two regimes -- data-scarce and data-abundant -- based on the signal-to-noise characteristics of the dataset.
- Score: 27.3606707777401
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weak-to-strong generalization refers to the phenomenon where a stronger model trained under supervision from a weaker one can outperform its teacher. While prior studies aim to explain this effect, most theoretical insights are limited to abstract frameworks or linear/random feature models. In this paper, we provide a formal analysis of weak-to-strong generalization from a linear CNN (weak) to a two-layer ReLU CNN (strong). We consider structured data composed of label-dependent signals of varying difficulty and label-independent noise, and analyze gradient descent dynamics when the strong model is trained on data labeled by the pretrained weak model. Our analysis identifies two regimes -- data-scarce and data-abundant -- based on the signal-to-noise characteristics of the dataset, and reveals distinct mechanisms of weak-to-strong generalization. In the data-scarce regime, generalization occurs via benign overfitting or fails via harmful overfitting, depending on the amount of data, and we characterize the transition boundary. In the data-abundant regime, generalization emerges in the early phase through label correction, but we observe that overtraining can subsequently degrade performance.
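The pipeline the abstract describes can be sketched end to end. The numpy toy below is an illustrative assumption, not the paper's exact setting (data model, dimensions, widths, and step sizes are all made up for the sketch): pretrain a linear "weak" learner on true labels, relabel fresh data with it, then train a two-layer ReLU "strong" learner by gradient descent on those weak labels only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data model (an assumption, not the paper's exact setting):
# each example is a label-dependent signal plus label-independent noise.
d, n = 20, 200
signal = rng.standard_normal(d)
signal /= np.linalg.norm(signal)

def sample(n):
    y = rng.choice([-1.0, 1.0], size=n)
    X = 2.0 * y[:, None] * signal[None, :] + 0.5 * rng.standard_normal((n, d))
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

# 1) Pretrain the weak model: a linear classifier fit by gradient descent
#    on the logistic loss with the true labels.
Xw, yw = sample(n)
w = np.zeros(d)
for _ in range(500):
    grad = -(yw * sigmoid(-yw * (Xw @ w))) @ Xw / n
    w -= 0.5 * grad

# 2) Weak supervision: the strong model only ever sees the weak model's labels.
Xs, _ = sample(n)
y_weak = np.sign(Xs @ w)

# 3) Train the strong model: a two-layer ReLU network with fixed +-1/m output
#    heads; the first layer is trained by gradient descent on the weak labels.
m = 16                                   # hidden width (illustrative)
W = 0.01 * rng.standard_normal((m, d))
a = np.repeat([1.0, -1.0], m // 2) / m   # fixed second layer

def forward(X, W):
    return np.maximum(X @ W.T, 0.0) @ a

for _ in range(2000):
    pre = Xs @ W.T
    out = np.maximum(pre, 0.0) @ a
    g_out = -(y_weak * sigmoid(-y_weak * out)) / n   # dloss/dout
    g_pre = np.outer(g_out, a) * (pre > 0)           # backprop through ReLU
    W -= 1.0 * (g_pre.T @ Xs)

# Evaluate both models against the *true* labels on fresh data.
Xt, yt = sample(1000)
acc_weak = np.mean(np.sign(Xt @ w) == yt)
acc_strong = np.mean(np.sign(forward(Xt, W)) == yt)
print(f"weak acc: {acc_weak:.3f}, strong acc: {acc_strong:.3f}")
```

In this easy signal-to-noise setting both models end up accurate; the paper's regimes concern how the strong model's accuracy relative to its teacher depends on sample size and training length, which the sketch does not attempt to reproduce.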
Related papers
- Balanced Anomaly-guided Ego-graph Diffusion Model for Inductive Graph Anomaly Detection [20.567053994822867]
Graph anomaly detection is crucial in applications like fraud detection and cybersecurity. We propose a novel data-centric framework that integrates dynamic graph modeling with balanced anomaly synthesis. Our framework features: (1) a discrete ego-graph diffusion model, which captures the local topology of anomalies to generate ego-graphs aligned with the anomalous structural distribution, and (2) a curriculum anomaly augmentation mechanism, which dynamically adjusts synthetic data generation during training, focusing on underrepresented anomaly patterns to improve detection and generalization.
arXiv Detail & Related papers (2026-02-05T02:46:54Z) - The Impact of Anisotropic Covariance Structure on the Training Dynamics and Generalization Error of Linear Networks [10.58305210918603]
We study the impact of data anisotropy on the learning dynamics and generalization error of a two-layer linear network in a linear regression setting. Our findings offer deep theoretical insights into how data anisotropy shapes the learning trajectory and final performance.
arXiv Detail & Related papers (2026-01-11T15:37:39Z) - Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data [89.96277093034547]
We introduce EntroDrop, an entropy-guided token dropout method that functions as structured data regularization. We show that EntroDrop consistently outperforms standard regularization baselines and maintains robust performance throughout extended multi-epoch training.
arXiv Detail & Related papers (2025-12-29T12:35:51Z) - Robustness of Probabilistic Models to Low-Quality Data: A Multi-Perspective Analysis [23.834741751854448]
A systematic, comparative investigation into the effects of low-quality data reveals a stark spectrum of robustness across modern probabilistic models. We find that autoregressive language models, from token prediction to sequence-to-sequence tasks, are remarkably resilient. Under the same levels of data corruption, class-conditional diffusion models degrade catastrophically.
arXiv Detail & Related papers (2025-12-11T02:10:41Z) - A Large-scale Benchmark on Geological Fault Delineation Models: Domain Shift, Training Dynamics, Generalizability, Evaluation and Inferential Behavior [11.859145373647474]
We present the first large-scale benchmarking study designed to provide guidelines for domain shift strategies in seismic interpretation. Our benchmark spans over 200 combinations of model architectures, datasets, and training strategies, across three datasets. Our analysis shows that common fine-tuning practices can lead to catastrophic forgetting when source and target datasets are disjoint.
arXiv Detail & Related papers (2025-05-13T13:56:43Z) - In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z) - A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops [55.07063067759609]
High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). Some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding.
arXiv Detail & Related papers (2025-02-26T06:18:13Z) - Restoring balance: principled under/oversampling of data for optimal classification [0.0]
Class imbalance in real-world data poses a common bottleneck for machine learning tasks. Mitigation strategies, such as under- or oversampling the data depending on their abundances, are routinely proposed and tested empirically. We provide a sharp prediction of the effects of under/oversampling strategies depending on the class imbalance, the first and second moments of the data, and the metrics of performance considered.
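The under/oversampling strategies this abstract analyzes can be sketched in a few lines. The snippet below uses the plain random variants on a hypothetical imbalanced dataset (class ratio, feature dimension, and seed are all illustrative assumptions, not the paper's analytical setting).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical imbalanced dataset: 90 majority (label 0) vs 10 minority (label 1).
X = rng.standard_normal((100, 3))
y = np.array([0] * 90 + [1] * 10)

def undersample(X, y, rng):
    """Randomly drop majority examples until the classes are balanced."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

def oversample(X, y, rng):
    """Randomly duplicate minority examples until the classes are balanced."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_max, replace=True)
        for c in classes
    ])
    return X[keep], y[keep]

Xu, yu = undersample(X, y, rng)
Xo, yo = oversample(X, y, rng)
print(np.bincount(yu), np.bincount(yo))   # -> [10 10] [90 90]
```

Undersampling discards data (here 80 majority examples), while oversampling repeats it; the paper's contribution is predicting which trade-off wins as a function of the imbalance and the data moments.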
arXiv Detail & Related papers (2024-05-15T17:45:34Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
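A minimal sketch of why the unhinged loss admits closed-form dynamics, taking the standard definition L(y, f(x)) = 1 - y f(x) on a linear model (the toy dataset and step size below are assumptions): the gradient is constant in the parameters, so T steps of gradient descent collapse to a single expression.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy dataset for a linear model f(x) = w . x.
n, d = 50, 4
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

def unhinged_loss(w):
    # Unhinged loss 1 - y * f(x), averaged over the dataset.
    return np.mean(1.0 - y * (X @ w))

# The gradient -mean(y * x) does not depend on w, so T steps of gradient
# descent admit a closed form: w_T = w_0 + T * lr * mean(y * x).
lr, T = 0.1, 25
g = -(y[:, None] * X).mean(axis=0)

w = np.zeros(d)
for _ in range(T):
    w -= lr * g                       # plain gradient descent

w_closed = T * lr * (y[:, None] * X).mean(axis=0)
assert np.allclose(w, w_closed)       # iterates match the closed form
print(unhinged_loss(np.zeros(d)), unhinged_loss(w))
```

This linearity is what makes the training trajectory analyzable in closed form, in contrast to losses like the logistic or hinge loss whose gradients depend on the current parameters.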
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z) - Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data [39.53312099194621]
We consider the generalization error of two-layer neural networks trained by gradient descent. We show that neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly fitting any noisy training labels, and simultaneously achieve minimax optimal test error. In contrast to previous work on benign overfitting that requires linear or kernel-based predictors, our analysis holds in a setting where both the model and the learning dynamics are fundamentally nonlinear.
arXiv Detail & Related papers (2022-02-11T23:04:00Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the role of stochasticity in its success is still unclear.
We show that heavy tails commonly arise in the parameters of stochastic optimizers due to multiplicative noise.
A detailed analysis describes the key factors, including the step size and the data, and exhibits similar results on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.