Weak-to-Strong Generalization Even in Random Feature Networks, Provably
- URL: http://arxiv.org/abs/2503.02877v1
- Date: Tue, 04 Mar 2025 18:58:00 GMT
- Title: Weak-to-Strong Generalization Even in Random Feature Networks, Provably
- Authors: Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora, Zhiyuan Li, Nathan Srebro
- Abstract summary: We show that weak-to-strong generalization does not require a strong learner like GPT-4. We demonstrate, prove, and understand how the student can outperform the teacher, even though trained only on data labeled by the weak teacher.
- Score: 54.68030827799126
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong learner like GPT-4. We consider a student and a teacher that are random feature models, described by two-layer networks with a random, fixed bottom layer and a trained top layer. A "weak" teacher, with a small number of units (i.e. random features), is trained on the population, and a "strong" student, with a much larger number of units (i.e. random features), is trained only on labels generated by the weak teacher. We demonstrate, prove, and understand how the student can outperform the teacher, even though it is trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.
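To make the setup concrete, the following is a minimal sketch of the teacher/student construction described in the abstract, assuming a NumPy implementation with a toy target, illustrative widths and sample sizes, and ridge regression standing in for early stopping; it is not the authors' code.
```python
# Minimal weak-to-strong sketch with random feature (RF) models: two-layer
# networks whose bottom layer is random and fixed and whose top layer is the
# only trained part. The top layer is fit by ridge regression, used here as a
# stand-in for early stopping; target, widths, samples, and ridge strengths
# are all illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 20                                            # input dimension (assumed)

def target(X):
    """Toy +/-1 target that the narrow teacher cannot represent exactly."""
    return np.sign(X[:, 0] * X[:, 1] + 0.5 * X[:, 2])

def rf_features(X, W):
    """ReLU random features: fixed random bottom layer W, nothing here is trained."""
    return np.maximum(X @ W, 0.0)

def fit_top_layer(Phi, y, ridge):
    """Ridge-regression top layer; the penalty plays the role of early stopping."""
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + ridge * np.eye(k), Phi.T @ y)

# "Weak" teacher: few random features, fit on clean labels
# (a large sample standing in for the population).
W_teacher = rng.normal(size=(d, 50)) / np.sqrt(d)
X_pop = rng.normal(size=(20000, d))
a_teacher = fit_top_layer(rf_features(X_pop, W_teacher), target(X_pop), ridge=1e-3)

# "Strong" student: many more random features, fit ONLY on the teacher's labels.
W_student = rng.normal(size=(d, 1000)) / np.sqrt(d)
X_stu = rng.normal(size=(10000, d))
y_weak = rf_features(X_stu, W_teacher) @ a_teacher            # weak supervision
a_student = fit_top_layer(rf_features(X_stu, W_student), y_weak, ridge=1e-1)

# Evaluate both against the true target on fresh data.
X_te = rng.normal(size=(10000, d))
y_te = target(X_te)

def error(pred):
    return np.mean(np.sign(pred) != y_te)

print("teacher 0-1 error:", error(rf_features(X_te, W_teacher) @ a_teacher))
print("student 0-1 error:", error(rf_features(X_te, W_student) @ a_student))
```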
Related papers
- Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization [69.96794098855938]
Weak-to-strong generalization (W2SG) offers a promising framework for supervising increasingly capable large language models (LLMs).
Traditional W2SG methods rely on passive learning, where a weak teacher provides noisy demonstrations to train a strong student.
We introduce Alice, a framework that leverages complementary knowledge between teacher and student to enhance the learning process.
arXiv Detail & Related papers (2025-04-09T22:33:06Z) - Provable Weak-to-Strong Generalization via Benign Overfitting [3.4652800888823294]
We consider the inverted situation, where a weak teacher supervises a strong student with imperfect pseudolabels.
We theoretically investigate weak-to-strong generalization for binary and multilabel classification.
Our techniques should eventually extend to weak-to-strong multiclass classification.
arXiv Detail & Related papers (2024-10-06T22:10:50Z) - Co-Supervised Learning: Improving Weak-to-Strong Generalization with
Hierarchical Mixture of Experts [81.37287967870589]
We propose to harness a diverse set of specialized teachers, instead of a single generalist one, that collectively supervises the strong student.
Our approach resembles the classical hierarchical mixture of experts, with two components tailored for co-supervision.
We validate the proposed method through visual recognition tasks on the OpenAI weak-to-strong benchmark and additional multi-domain datasets.
arXiv Detail & Related papers (2024-02-23T18:56:11Z) - Switching Temporary Teachers for Semi-Supervised Semantic Segmentation [45.20519672287495]
The teacher-student framework, prevalent in semi-supervised semantic segmentation, mainly employs an exponential moving average (EMA) to update a single teacher's weights based on the student's (see the EMA sketch after this list).
This paper introduces Dual Teacher, a simple yet effective approach that employs dual temporary teachers aiming to alleviate the coupling problem for the student.
arXiv Detail & Related papers (2023-10-28T08:49:16Z) - How does GPT-2 compute greater-than?: Interpreting mathematical
abilities in a pre-trained language model [52.92472140375308]
We use mechanistic interpretability techniques to explain the mathematical abilities of GPT-2 small.
We show that GPT-2 small's final multi-layer perceptrons boost the probability of end years greater than the start year.
Our results suggest that GPT-2 small computes greater-than using a complex but general mechanism.
arXiv Detail & Related papers (2023-04-30T21:44:21Z) - Self-Training with Differentiable Teacher [80.62757989797095]
Self-training achieves enormous success in various semi-supervised and weakly-supervised learning tasks.
The method can be interpreted as a teacher-student framework, where the teacher generates pseudo-labels, and the student makes predictions.
We propose a differentiable self-training method that treats the teacher-student interaction as a Stackelberg game.
arXiv Detail & Related papers (2021-09-15T02:06:13Z) - Does Knowledge Distillation Really Work? [106.38447017262183]
We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood.
We identify difficulties in optimization as a key reason for why the student is unable to match the teacher.
arXiv Detail & Related papers (2021-06-10T17:44:02Z) - Subclass Distillation [94.18870689772544]
We show that it is possible to transfer most of the generalization ability of a teacher to a student.
For datasets where there are known, natural subclasses, we demonstrate that the teacher learns similar subclasses.
For clickthrough datasets where the subclasses are unknown, we demonstrate that subclass distillation allows the student to learn faster and better.
arXiv Detail & Related papers (2020-02-10T16:45:30Z)
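As a side note on the Dual Teacher entry above: the EMA teacher update it builds on is, in its standard mean-teacher form, just an exponential moving average of the student's parameters. The sketch below assumes parameters stored as a dict of NumPy arrays and shows the baseline single-teacher update, not the paper's dual-teacher schedule.
```python
# Standard mean-teacher EMA update (assumed baseline, not the Dual Teacher
# method itself): the teacher's weights trail the student's as an exponential
# moving average instead of being trained by gradient descent.
def ema_update(teacher_params, student_params, momentum=0.99):
    """In place: teacher <- momentum * teacher + (1 - momentum) * student."""
    for name in teacher_params:
        teacher_params[name] = (momentum * teacher_params[name]
                                + (1.0 - momentum) * student_params[name])
```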