Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension
- URL: http://arxiv.org/abs/2502.05075v5
- Date: Fri, 20 Jun 2025 10:26:36 GMT
- Title: Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension
- Authors: Yijun Dong, Yicheng Li, Yunai Li, Jason D. Lee, Qi Lei
- Abstract summary: Weak-to-strong (W2S) generalization is a type of finetuning (FT) where a strong student model is trained on pseudo-labels generated by a weak teacher. We analyze W2S in the ridgeless regression setting from a variance reduction perspective.
- Score: 48.431551146556714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weak-to-strong (W2S) generalization is a type of finetuning (FT) where a strong (large) student model is trained on pseudo-labels generated by a weak teacher. Surprisingly, W2S FT often outperforms the weak teacher. We seek to understand this phenomenon through the observation that FT often occurs in intrinsically low-dimensional spaces. Leveraging the low intrinsic dimensionality of FT, we analyze W2S in the ridgeless regression setting from a variance reduction perspective. For a strong student-weak teacher pair with sufficiently expressive low-dimensional feature subspaces $\mathcal{V}_s, \mathcal{V}_w$, we provide an exact characterization of the variance that dominates the generalization error of W2S. This unveils a virtue of discrepancy between the strong and weak models in W2S: the variance of the weak teacher is inherited by the strong student in $\mathcal{V}_s \cap \mathcal{V}_w$, while reduced by a factor of $\mathrm{dim}(\mathcal{V}_s)/N$ in the subspace of discrepancy $\mathcal{V}_w \setminus \mathcal{V}_s$ with $N$ pseudo-labels for W2S. Our analysis further casts light on the sample complexities and the scaling of performance gap recovery in W2S. The analysis is supported by experiments on synthetic regression problems, as well as real vision and NLP tasks.
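The variance-reduction claim in the abstract can be illustrated with a small simulation. The sketch below is a toy construction and not the authors' experimental setup: it assumes isotropic Gaussian inputs, takes $\mathcal{V}_s$ and $\mathcal{V}_w$ to be coordinate subspaces sharing `d_sw` coordinates, places the ground truth in their intersection, and uses ordinary least squares for both models. The function name `w2s_toy` and all parameter names and default values are hypothetical.

```python
import numpy as np

def w2s_toy(n=100, N=2000, d_s=20, d_w=20, d_sw=5, sigma=1.0, trials=50, seed=0):
    """Toy weak-to-strong regression; returns (teacher risk, student risk).

    Assumptions (illustrative only): x ~ N(0, I), V_s and V_w are coordinate
    subspaces overlapping in d_sw coordinates, and the true weights lie in
    the overlap, so both models are unbiased and only variance remains.
    """
    rng = np.random.default_rng(seed)
    D = d_s + d_w - d_sw                  # ambient dimension spanned by V_s and V_w
    s_idx = np.arange(0, d_s)             # coordinates of V_s
    w_idx = np.arange(d_s - d_sw, D)      # coordinates of V_w (d_w of them)
    shared = np.arange(d_s - d_sw, d_s)   # coordinates of the intersection

    theta_star = np.zeros(D)
    theta_star[shared] = rng.normal(size=d_sw)

    risk_w, risk_s = [], []
    for _ in range(trials):
        # Weak teacher: least squares on n noisy labels, weak features only.
        X = rng.normal(size=(n, D))
        y = X @ theta_star + sigma * rng.normal(size=n)
        theta_w = np.zeros(D)
        theta_w[w_idx] = np.linalg.lstsq(X[:, w_idx], y, rcond=None)[0]

        # Strong student: least squares on N pseudo-labels, strong features only.
        Xu = rng.normal(size=(N, D))
        pseudo = Xu @ theta_w
        theta_s = np.zeros(D)
        theta_s[s_idx] = np.linalg.lstsq(Xu[:, s_idx], pseudo, rcond=None)[0]

        # For isotropic inputs the excess risk is the squared parameter error.
        risk_w.append(np.sum((theta_w - theta_star) ** 2))
        risk_s.append(np.sum((theta_s - theta_star) ** 2))
    return float(np.mean(risk_w)), float(np.mean(risk_s))

if __name__ == "__main__":
    teacher_risk, student_risk = w2s_toy()
    print(f"teacher excess risk: {teacher_risk:.4f}")
    print(f"student excess risk: {student_risk:.4f}")
```

In this toy setting, the teacher's error along the shared coordinates passes to the student essentially unchanged, while its error along the coordinates outside $\mathcal{V}_s$ acts like label noise that the student averages over the $N$ pseudo-labels. The student's average excess risk should therefore fall below the teacher's when `d_s / N` is small.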
Related papers
- On the Superlinear Relationship between SGD Noise Covariance and Loss Landscape Curvature [1.6773271875801752]
Stochastic Gradient Descent (SGD) introduces anisotropic noise that is correlated with the local curvature of the loss landscape, thereby biasing optimization toward flat minima. We show that this assumption holds only under restrictive conditions that are typically violated in deep neural networks. Experiments across datasets, architectures, and loss functions validate these bounds, providing a unified characterization of the noise-curvature relationship in deep learning.
arXiv Detail & Related papers (2026-02-05T12:35:13Z) - Does Weak-to-strong Generalization Happen under Spurious Correlations? [17.02943058643617]
Key problem in weak-to-strong (W2S) generalization: when fine-tuning a strong pre-trained student with pseudolabels from a weaker teacher on a downstream task with spurious correlations, does W2S happen, and how to improve it upon failures? We consider two sources of spurious correlations caused by group imbalance: (i) a weak teacher fine-tuned on group-imbalanced labeled data with a minority group of fraction $\eta_\ell$, and (ii) a group-imbalanced unlabeled set pseudolabeled by teacher with minority fraction $\eta_u$.
arXiv Detail & Related papers (2025-09-28T17:57:49Z) - On the Emergence of Weak-to-Strong Generalization: A Bias-Variance Perspective [14.65315912348303]
Weak-to-strong generalization (W2SG) refers to the phenomenon where a strong student model, trained on a dataset labeled by a weak teacher, outperforms the teacher on the target task. Recent studies attribute this performance gain to the prediction misfit between the student and teacher models. We show that W2SG is more likely to emerge when the student model approximates its posterior mean teacher, rather than mimicking an individual teacher.
arXiv Detail & Related papers (2025-05-30T07:52:43Z) - Emergence and scaling laws in SGD learning of shallow neural networks [64.48316762675141]
We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data. We provide a precise analysis of SGD dynamics for the training of a student two-layer network to minimize the mean squared error (MSE) objective.
arXiv Detail & Related papers (2025-04-28T16:58:55Z) - Criteria and Bias of Parameterized Linear Regression under Edge of Stability Regime [38.134523847923646]
The Edge of Stability (EoS) phenomenon refers to gradient descent converging even when the step-size exceeds the classical stability threshold.
We show that EoS occurs even when the loss $l$ is quadratic under proper conditions.
We also shed some new light on the implicit bias of diagonal linear networks when a larger step-size is employed.
arXiv Detail & Related papers (2024-12-11T02:07:37Z) - A Unified Analysis for Finite Weight Averaging [50.75116992029417]
Averaging iterates of Stochastic Gradient Descent (SGD) has achieved empirical success in training deep learning models, with examples such as Stochastic Weight Averaging (SWA), Exponential Moving Average (EMA), and LAtest Weight Averaging (LAWA).
In this paper, we generalize LAWA as Finite Weight Averaging (FWA) and explain their advantages compared to SGD from the perspective of optimization and generalization.
arXiv Detail & Related papers (2024-11-20T10:08:22Z) - A Statistical Analysis of Deep Federated Learning for Intrinsically Low-dimensional Data [31.52603443208588]
This paper delves into the generalization properties of deep federated regression within a two-stage sampling model. Our findings reveal that the intrinsic dimension, characterized by the entropic dimension, plays a pivotal role in determining the convergence rates for deep learners.
arXiv Detail & Related papers (2024-10-28T01:36:25Z) - Learning Orthogonal Multi-Index Models: A Fine-Grained Information Exponent Analysis [54.57279006229212]
Information exponent has played an important role in predicting the sample complexity of online gradient descent. In this work, we show that by considering both second- and higher-order terms, we can first learn the relevant space using the second-order terms. The overall sample and complexity of online SGD is $\tilde{O}(d P^{L-1})$.
arXiv Detail & Related papers (2024-10-13T00:14:08Z) - Wasserstein Distributionally Robust Multiclass Support Vector Machine [1.8570591025615457]
We study the problem of multiclass classification for settings where data features $\mathbf{x}$ and their labels $\mathbf{y}$ are uncertain.
We use Wasserstein distributionally robust optimization to develop a robust version of the multiclass support vector machine (SVM) characterized by the Crammer-Singer (CS) loss.
Our numerical experiments demonstrate that our model outperforms state-of-the-art OVA models in settings where the training data is highly imbalanced.
arXiv Detail & Related papers (2024-09-12T21:40:04Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - On Learning Latent Models with Multi-Instance Weak Supervision [57.18649648182171]
We consider a weakly supervised learning scenario where the supervision signal is generated by a transition function $\sigma$ of labels associated with multiple input instances.
Our problem is met in different fields, including latent structural learning and neuro-symbolic integration.
arXiv Detail & Related papers (2023-06-23T22:05:08Z) - How many dimensions are required to find an adversarial example? [0.0]
We investigate how adversarial vulnerability depends on $\dim(V)$.
In particular, we show that the adversarial success of standard PGD attacks with $\ell_p$ norm constraints behaves like a monotonically increasing function of $\epsilon$.
arXiv Detail & Related papers (2023-03-24T17:36:15Z) - Statistical Learning under Heterogeneous Distribution Shift [71.8393170225794]
The ground-truth predictor is additive: $\mathbb{E}[\mathbf{z} \mid \mathbf{x}, \mathbf{y}] = f_\star(\mathbf{x}) + g_\star(\mathbf{y})$.
arXiv Detail & Related papers (2023-02-27T16:34:21Z) - Universality class of Ising critical states with long-range losses [0.0]
We show that spatially resolved dissipation can act on $d$-dimensional spin systems in the Ising universality class.
We consider power-law decaying spin losses with a Lindbladian spectrum closing at small momenta as $\propto q^\alpha$.
arXiv Detail & Related papers (2021-08-27T17:59:51Z) - Analytic Study of Families of Spurious Minima in Two-Layer ReLU Neural Networks [15.711517003382484]
We show that the Hessian spectrum concentrates near positive values, with the exception of $\Theta(d)$ eigenvalues which grow linearly with $d$.
This makes possible the creation and the annihilation of minima using powerful tools from bifurcation theory.
arXiv Detail & Related papers (2021-07-21T22:05:48Z) - Locality defeats the curse of dimensionality in convolutional teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
arXiv Detail & Related papers (2021-06-16T08:27:31Z) - Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins [92.7662890047311]
We show that gradient descent finds halfspaces with classification error $\tilde{O}(\mathsf{OPT}^{1/2}) + \varepsilon$ in $\mathrm{poly}(d, 1/\varepsilon)$ time and sample complexity.
arXiv Detail & Related papers (2020-10-01T16:48:33Z) - Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions [11.70706646606773]
We study the dynamics of optimization and the generalization properties of one-hidden layer neural networks.
We consider a teacher-student scenario where the teacher has the same structure as the student with a hidden layer of smaller width.
We show that under the same conditions gradient descent dynamics on the empirical loss converges and leads to small generalization error.
arXiv Detail & Related papers (2020-06-27T22:13:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.