On the Emergence of Weak-to-Strong Generalization: A Bias-Variance Perspective
- URL: http://arxiv.org/abs/2505.24313v1
- Date: Fri, 30 May 2025 07:52:43 GMT
- Title: On the Emergence of Weak-to-Strong Generalization: A Bias-Variance Perspective
- Authors: Gengze Xu, Wei Yao, Ziqiao Wang, Yong Liu
- Abstract summary: Weak-to-strong generalization (W2SG) refers to the phenomenon where a strong student model, trained on a dataset labeled by a weak teacher, outperforms the teacher on the target task. Recent studies attribute this performance gain to the prediction misfit between the student and teacher models. We show that W2SG is more likely to emerge when the student model approximates its posterior mean teacher, rather than mimicking an individual teacher.
- Score: 14.65315912348303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weak-to-strong generalization (W2SG) refers to the phenomenon where a strong student model, trained on a dataset labeled by a weak teacher, ultimately outperforms the teacher on the target task. Recent studies attribute this performance gain to the prediction misfit between the student and teacher models. In this work, we theoretically investigate the emergence of W2SG through a generalized bias-variance decomposition of Bregman divergence. Specifically, we show that the expected population risk gap between the student and teacher is quantified by the expected misfit between the two models. While this aligns with previous results, our analysis removes several restrictive assumptions, most notably, the convexity of the student's hypothesis class, required in earlier works. Moreover, we show that W2SG is more likely to emerge when the student model approximates its posterior mean teacher, rather than mimicking an individual teacher. Using a concrete example, we demonstrate that if the student model has significantly larger capacity than the teacher, it can indeed converge to this posterior mean. Our analysis also suggests that avoiding overfitting to the teacher's supervision and reducing the entropy of the student's predictions further facilitate W2SG. In addition, we show that the reverse cross-entropy loss, unlike the standard forward cross-entropy, is less sensitive to the predictive uncertainty of the teacher. Finally, we empirically verify our theoretical insights and demonstrate that incorporating the reverse cross-entropy loss consistently improves student performance.
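The abstract's claim that the reverse cross-entropy is less sensitive to the teacher's predictive uncertainty can be illustrated with a small numerical sketch. The definitions below follow the usual convention (forward CE treats the teacher's distribution as the target; reverse CE swaps the two roles), and the probability vectors are made-up toy values, not taken from the paper.

```python
import numpy as np

def forward_ce(teacher, student, eps=1e-12):
    """Standard (forward) cross-entropy: teacher distribution as target, student as prediction."""
    return -np.sum(teacher * np.log(student + eps))

def reverse_ce(teacher, student, eps=1e-12):
    """Reverse cross-entropy: student distribution as target, teacher as prediction."""
    return -np.sum(student * np.log(teacher + eps))

# One fixed student prediction and two hypothetical teachers over 3 classes:
# a confident teacher and an uncertain (high-entropy) teacher.
student   = np.array([0.70, 0.20, 0.10])
confident = np.array([0.90, 0.05, 0.05])
uncertain = np.array([0.40, 0.30, 0.30])

for name, teacher in [("confident teacher", confident), ("uncertain teacher", uncertain)]:
    print(f"{name}: forward CE = {forward_ce(teacher, student):.3f}, "
          f"reverse CE = {reverse_ce(teacher, student):.3f}")
```

In this toy example the forward cross-entropy grows substantially when the teacher becomes uncertain, while the reverse cross-entropy barely changes, consistent with the sensitivity claim above; it illustrates the loss definitions only and is not a reproduction of the paper's experiments.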
Related papers
- On the Mechanisms of Weak-to-Strong Generalization: A Theoretical Perspective [28.005935031887038]
Weak-to-strong generalization, where a student model trained on imperfect labels surpasses its teacher, has been widely observed. In this paper, through a theoretical analysis of simple models, we uncover three core mechanisms that can drive this phenomenon.
arXiv Detail & Related papers (2025-05-23T20:09:09Z) - Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization [69.96794098855938]
Weak-to-strong generalization (W2SG) offers a promising framework for supervising increasingly capable language models (LLMs). Traditional W2SG methods rely on passive learning, where a weak teacher provides noisy demonstrations to train a strong student. We introduce Alice, a framework that leverages complementary knowledge between teacher and student to enhance the learning process.
arXiv Detail & Related papers (2025-04-09T22:33:06Z) - Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension [48.431551146556714]
Weak-to-strong (W2S) generalization is a type of finetuning (FT) where a strong student model is trained on pseudo-labels generated by a weak teacher. We analyze W2S in the ridgeless regression setting from a variance reduction perspective (a toy sketch of this variance-reduction intuition appears after this list).
arXiv Detail & Related papers (2025-02-07T16:46:43Z) - Theoretical Analysis of Weak-to-Strong Generalization [23.235671743867492]
We show that existing weak supervision theory fails to account for pseudolabel correction and coverage expansion.
Our bounds capture the intuition that weak-to-strong generalization occurs when the strong model is unable to fit the mistakes of the weak teacher without incurring additional error.
arXiv Detail & Related papers (2024-05-25T03:48:12Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning [68.76846801719095]
We show exactly when and where double descent occurs, and that its location is not inherently tied to the interpolation threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z) - Class-Imbalanced Graph Learning without Class Rebalancing [62.1368829847041]
Class imbalance is prevalent in real-world node classification tasks and poses great challenges for graph learning models.
In this work, we approach the root cause of class-imbalance bias from a topological paradigm.
We devise a lightweight topological augmentation framework BAT to mitigate the class-imbalance bias without class rebalancing.
arXiv Detail & Related papers (2023-08-27T19:01:29Z) - On student-teacher deviations in distillation: does it pay to disobey? [54.908344098305804]
Knowledge distillation has been widely used to improve the test accuracy of a "student" network.
Despite being trained to fit the teacher's probabilities, the student may not only significantly deviate from the teacher probabilities, but may also outdo the teacher in performance.
arXiv Detail & Related papers (2023-01-30T14:25:02Z) - Learning curves for the multi-class teacher-student perceptron [5.480546613836199]
One of the most classical results in high-dimensional learning theory provides a closed-form expression for the generalisation error of binary classification.
Both Bayes-optimal estimation and empirical risk minimisation (ERM) were extensively analysed for this setting.
Yet, an analogous analysis for the corresponding multi-class teacher-student perceptron was missing.
arXiv Detail & Related papers (2022-03-22T23:16:36Z) - On the Role of Optimization in Double Descent: A Least Squares Study [30.44215064390409]
We show an excess risk bound for the gradient descent solution of the least squares objective.
We find that in the case of noiseless regression, double descent is explained solely by optimization-related quantities.
We empirically explore if our predictions hold for neural networks.
arXiv Detail & Related papers (2021-07-27T09:13:11Z) - Optimization Variance: Exploring Generalization Properties of DNNs [83.78477167211315]
The test error of a deep neural network (DNN) often demonstrates double descent.
We propose a novel metric, optimization variance (OV), to measure the diversity of model updates.
arXiv Detail & Related papers (2021-06-03T09:34:17Z)
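As referenced in the "Discrepancies are Virtue" entry above, the variance-reduction intuition behind weak-to-strong gains in regression can be sketched with a toy simulation. Everything below (dimensions, noise level, the min-norm least-squares student) is an assumption chosen for illustration, not the exact setting analyzed in that paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weak-to-strong setup in a linear model (illustrative assumptions only).
n, p, sigma = 500, 20, 1.0
beta = rng.normal(size=p)

X_train = rng.normal(size=(n, p))
X_test  = rng.normal(size=(2000, p))
y_test  = X_test @ beta                                  # clean test targets

# "Weak teacher": its predictions equal the truth plus independent noise,
# so its population risk is roughly sigma**2.
pseudo_labels      = X_train @ beta + sigma * rng.normal(size=n)
teacher_test_preds = X_test @ beta + sigma * rng.normal(size=2000)

# "Strong student": min-norm least squares fit to the teacher's pseudo-labels.
beta_hat, *_ = np.linalg.lstsq(X_train, pseudo_labels, rcond=None)
student_test_preds = X_test @ beta_hat

print("teacher test MSE:", np.mean((teacher_test_preds - y_test) ** 2))
print("student test MSE:", np.mean((student_test_preds - y_test) ** 2))
```

Because the student averages the teacher's independent per-sample errors over n points, its test error shrinks to roughly sigma**2 * p / n, well below the teacher's roughly sigma**2: a variance-reduction effect, shown here only as a heuristic illustration.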