Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension
- URL: http://arxiv.org/abs/2502.05075v5
- Date: Fri, 20 Jun 2025 10:26:36 GMT
- Title: Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension
- Authors: Yijun Dong, Yicheng Li, Yunai Li, Jason D. Lee, Qi Lei
- Abstract summary: Weak-to-strong (W2S) generalization is a type of finetuning (FT) where a strong student model is trained on pseudo-labels generated by a weak teacher. We analyze W2S in the ridgeless regression setting from a variance reduction perspective.
- Score: 48.431551146556714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weak-to-strong (W2S) generalization is a type of finetuning (FT) where a strong (large) student model is trained on pseudo-labels generated by a weak teacher. Surprisingly, W2S FT often outperforms the weak teacher. We seek to understand this phenomenon through the observation that FT often occurs in intrinsically low-dimensional spaces. Leveraging the low intrinsic dimensionality of FT, we analyze W2S in the ridgeless regression setting from a variance reduction perspective. For a strong student-weak teacher pair with sufficiently expressive low-dimensional feature subspaces $\mathcal{V}_s, \mathcal{V}_w$, we provide an exact characterization of the variance that dominates the generalization error of W2S. This unveils a virtue of discrepancy between the strong and weak models in W2S: the variance of the weak teacher is inherited by the strong student in $\mathcal{V}_s \cap \mathcal{V}_w$, while reduced by a factor of $\mathrm{dim}(\mathcal{V}_s)/N$ in the subspace of discrepancy $\mathcal{V}_w \setminus \mathcal{V}_s$ with $N$ pseudo-labels for W2S. Our analysis further casts light on the sample complexities and the scaling of performance gap recovery in W2S. The analysis is supported by experiments on synthetic regression problems, as well as real vision and NLP tasks.
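To make the variance-reduction mechanism concrete, below is a minimal synthetic sketch in Python (NumPy), not the paper's exact experimental protocol: a weak teacher and a strong student each perform ridgeless (minimum-norm) regression within low-dimensional feature subspaces $\mathcal{V}_w$ and $\mathcal{V}_s$ that share a small task-relevant subspace. All dimensions, sample sizes, the noise level, and the helper `ridgeless_fit` are illustrative assumptions, not the authors' code.

```python
# Toy weak-to-strong (W2S) ridgeless regression sketch (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d, k = 200, 5            # ambient dimension, shared task-subspace dimension
d_w, d_s = 50, 60        # intrinsic dims of the weak / strong feature subspaces
n_w, N = 200, 5000       # teacher's labeled samples, pseudo-labels for W2S
noise = 1.0

def orth(A):
    """Orthonormal basis for the column span of A."""
    return np.linalg.qr(A)[0]

V_star = orth(rng.standard_normal((d, k)))                           # shared task directions
V_w = orth(np.hstack([V_star, rng.standard_normal((d, d_w - k))]))   # weak subspace contains V_star
V_s = orth(np.hstack([V_star, rng.standard_normal((d, d_s - k))]))   # strong subspace contains V_star

beta = V_star @ rng.standard_normal(k)    # target realizable by both models

def ridgeless_fit(X, y, V):
    """Minimum-norm least squares restricted to the column span of V."""
    return V @ (np.linalg.pinv(X @ V) @ y)

# 1) Weak teacher: fit on a small, noisily labeled set.
X_w = rng.standard_normal((n_w, d))
y_w = X_w @ beta + noise * rng.standard_normal(n_w)
beta_w = ridgeless_fit(X_w, y_w, V_w)

# 2) Weak-to-strong: the strong student fits only the teacher's pseudo-labels.
X_pl = rng.standard_normal((N, d))
beta_s = ridgeless_fit(X_pl, X_pl @ beta_w, V_s)

# 3) Excess risk under isotropic covariates equals ||b - beta||^2.
for name, b in [("weak teacher", beta_w), ("W2S student ", beta_s)]:
    print(f"{name}: excess risk = {np.sum((b - beta) ** 2):.3f}")
```

In this toy setup, the teacher's estimation noise spread over $\mathcal{V}_w \setminus \mathcal{V}_s$ is largely filtered out when the student refits the pseudo-labels on many samples, so the student's excess risk typically lands below its teacher's, loosely mirroring the $\mathrm{dim}(\mathcal{V}_s)/N$ variance reduction described in the abstract.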
Related papers
- On the Emergence of Weak-to-Strong Generalization: A Bias-Variance Perspective [14.65315912348303]
Weak-to-strong generalization (W2SG) refers to the phenomenon where a strong student model, trained on a dataset labeled by a weak teacher, outperforms the teacher on the target task. Recent studies attribute this performance gain to the prediction misfit between the student and teacher models. We show that W2SG is more likely to emerge when the student model approximates its posterior mean teacher, rather than mimicking an individual teacher.
arXiv Detail & Related papers (2025-05-30T07:52:43Z) - Criteria and Bias of Parameterized Linear Regression under Edge of Stability Regime [38.134523847923646]
Gradient descent converging even when the step-size exceeds the classical stability threshold is usually known as the Edge of Stability (EoS) phenomenon.
We show that EoS occurs even when the loss $\ell$ is quadratic, under proper conditions.
We also shed some new light on the implicit bias of diagonal linear networks when a larger step-size is employed.
arXiv Detail & Related papers (2024-12-11T02:07:37Z) - A Unified Analysis for Finite Weight Averaging [50.75116992029417]
Averaging iterates of Stochastic Gradient Descent (SGD) has achieved empirical success in training deep learning models, with methods such as Stochastic Weight Averaging (SWA), Exponential Moving Average (EMA), and LAtest Weight Averaging (LAWA).
In this paper, we generalize LAWA as Finite Weight Averaging (FWA) and explain its advantages compared to SGD from the perspective of optimization and generalization.
arXiv Detail & Related papers (2024-11-20T10:08:22Z) - Wasserstein Distributionally Robust Multiclass Support Vector Machine [1.8570591025615457]
We study the problem of multiclass classification for settings where data features $\mathbf{x}$ and their labels $\mathbf{y}$ are uncertain.
We use Wasserstein distributionally robust optimization to develop a robust version of the multiclass support vector machine (SVM) characterized by the Crammer-Singer (CS) loss.
Our numerical experiments demonstrate that our model outperforms state-of-the-art OVA models in settings where the training data is highly imbalanced.
arXiv Detail & Related papers (2024-09-12T21:40:04Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space in which to model functions represented by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - On Learning Latent Models with Multi-Instance Weak Supervision [57.18649648182171]
We consider a weakly supervised learning scenario where the supervision signal is generated by a transition function $\sigma$ of labels associated with multiple input instances.
Our problem is met in different fields, including latent structural learning and neuro-symbolic integration.
arXiv Detail & Related papers (2023-06-23T22:05:08Z) - How many dimensions are required to find an adversarial example? [0.0]
We investigate how adversarial vulnerability depends on $\dim(V)$.
In particular, we show that the adversarial success of standard PGD attacks with $\ell^p$ norm constraints behaves like a monotonically increasing function of $\epsilon$.
arXiv Detail & Related papers (2023-03-24T17:36:15Z) - Statistical Learning under Heterogeneous Distribution Shift [71.8393170225794]
The ground-truth predictor is additive: $\mathbb{E}[\mathbf{z} \mid \mathbf{x}, \mathbf{y}] = f_\star(\mathbf{x}) + g_\star(\mathbf{y})$.
arXiv Detail & Related papers (2023-02-27T16:34:21Z) - Universality class of Ising critical states with long-range losses [0.0]
We show that spatially resolved dissipation can act on $d$-dimensional spin systems in the Ising universality class.
We consider power-law decaying spin losses with a Lindbladian spectrum closing at small momenta as $\propto q^\alpha$.
arXiv Detail & Related papers (2021-08-27T17:59:51Z) - Analytic Study of Families of Spurious Minima in Two-Layer ReLU Neural Networks [15.711517003382484]
We show that the Hessian spectrum concentrates near positive values, with the exception of $\Theta(d)$ eigenvalues which grow linearly with $d$.
This makes it possible to study the creation and annihilation of minima using powerful tools from bifurcation theory.
arXiv Detail & Related papers (2021-07-21T22:05:48Z) - Locality defeats the curse of dimensionality in convolutional teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
arXiv Detail & Related papers (2021-06-16T08:27:31Z) - Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins [92.7662890047311]
We show that gradient descent finds halfspaces with classification error $\tilde{O}(\mathsf{OPT}^{1/2}) + \varepsilon$ in $\mathrm{poly}(d, 1/\varepsilon)$ time and sample complexity.
arXiv Detail & Related papers (2020-10-01T16:48:33Z) - Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions [11.70706646606773]
We study the dynamics of optimization and the generalization properties of one-hidden-layer neural networks.
We consider a teacher-student scenario where the teacher has the same structure as the student with a hidden layer of smaller width.
We show that under the same conditions gradient descent dynamics on the empirical loss converges and leads to small generalization error.
arXiv Detail & Related papers (2020-06-27T22:13:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.