Selective Weak-to-Strong Generalization
- URL: http://arxiv.org/abs/2511.14166v1
- Date: Tue, 18 Nov 2025 06:03:25 GMT
- Title: Selective Weak-to-Strong Generalization
- Authors: Hao Lang, Fei Huang, Yongbin Li
- Abstract summary: We propose a selective W2SG framework to avoid using weak supervision when unnecessary. We train a binary classifier P(IK) to identify questions that a strong model can answer and use its self-generated labels for alignment.
- Score: 75.5234414246513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Future superhuman models will surpass human abilities, and humans will only be able to *weakly* supervise them. To alleviate the lack of high-quality data for model alignment, some works on weak-to-strong generalization (W2SG) finetune a strong pretrained model with a weak supervisor so that it can generalize beyond the weak supervision. However, the invariable use of weak supervision in existing methods raises robustness issues, with a proportion of weak labels proving harmful to models. In this paper, we propose a selective W2SG framework that avoids using weak supervision when it is unnecessary. We train a binary classifier P(IK) to identify questions that a strong model can answer and use its self-generated labels for alignment. We further refine weak labels with a graph smoothing method. Extensive experiments on three benchmarks show that our method consistently outperforms competitive baselines. Further analyses show that P(IK) can generalize across tasks and difficulties, which indicates that selective W2SG can help superalignment.
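The abstract describes the selection mechanism only at a high level. Below is a minimal sketch of that idea, assuming a logistic-regression P(IK) probe over fixed feature vectors, a hand-picked confidence threshold `tau`, and sklearn's LabelSpreading as a generic stand-in for the paper's graph smoothing step; none of this is the authors' code.

```python
# Minimal sketch of selective W2SG (illustrative; not the authors' implementation).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# Toy setup: 200 questions as 16-dim feature vectors.
X = rng.normal(size=(200, 16))
knows = (X[:, 0] > 0).astype(int)            # 1 if the strong model "knows" the answer
self_labels = rng.integers(0, 2, size=200)   # strong model's self-generated labels
weak_labels = rng.integers(0, 2, size=200)   # weak supervisor's labels

# Train the P(IK) probe on questions where the strong model can be checked.
pik = LogisticRegression().fit(X[:100], knows[:100])
p_ik = pik.predict_proba(X[100:])[:, 1]

# Stand-in for the paper's graph smoothing: spread weak labels over a kNN graph.
smoother = LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2)
smoothed_weak = smoother.fit(X[100:], weak_labels[100:]).transduction_

# Selection rule: use self-generated labels where P(IK) is high,
# fall back to the smoothed weak labels otherwise.
tau = 0.7                                    # assumed confidence threshold
labels = np.where(p_ik > tau, self_labels[100:], smoothed_weak)
print(f"self-labelled fraction: {(p_ik > tau).mean():.0%}")
```

In the paper, P(IK) and the self-generated labels come from the strong model itself; the fixed feature vectors and the threshold here are simplifications for the sketch.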
Related papers
- Weak-to-Strong Generalization under Distribution Shifts [6.711930932187631]
We propose RAVEN, a robust weak-to-strong generalization framework. RAVEN learns the optimal combinations of weak models in addition to parameters of the strong model. Our results show that RAVEN assigns higher weights to more accurate weak models, demonstrating its ability to automatically identify trustworthy supervision.
arXiv Detail & Related papers (2025-10-24T10:46:50Z)
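As a loose illustration of learning combination weights over weak models (a toy squared-error objective chosen for the sketch; not RAVEN's actual training procedure, which learns the strong model jointly):

```python
# Toy sketch: learn softmax weights over weak models so the weighted vote
# matches a reference signal (here, the strong model's own predictions).
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 3                                 # questions, weak models
weak_probs = rng.uniform(size=(k, n))         # each weak model's P(y=1 | x)
strong_probs = rng.uniform(size=n)            # strong model's own P(y=1 | x)

theta = np.zeros(k)                           # unnormalised log-weights
for _ in range(200):
    w = np.exp(theta) / np.exp(theta).sum()   # softmax weights over weak models
    err = w @ weak_probs - strong_probs       # combined vote vs. reference
    grad_w = weak_probs @ err / n             # dL/dw for L = mean squared error / 2
    theta -= 0.5 * w * (grad_w - w @ grad_w)  # chain rule through the softmax
print("learned weights:", np.round(np.exp(theta) / np.exp(theta).sum(), 3))
```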
- How to Mitigate Overfitting in Weak-to-strong Generalization? [50.37526669608372]
Weak-to-strong generalization aims to elicit the capabilities of strong models through weak supervisors. Strong models exhibit significant overfitting in weak-to-strong generalization. We propose a two-stage framework that simultaneously improves the quality of supervision signals and the quality of input questions.
arXiv Detail & Related papers (2025-03-06T09:32:39Z)
- Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions [12.956498486569103]
Weak-to-Strong Generalization (W2SG) serves as an important analogy for understanding how humans might guide superhuman intelligence in the future. We show that W2SG can be characterized using kernels derived from the principal components of weak and strong models' internal representations.
arXiv Detail & Related papers (2025-02-02T01:11:51Z)
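One illustrative reading of this claim (our construction, not the paper's code): form projection kernels from the top principal components of each model's representations and measure how much the two subspaces overlap.

```python
# Projection kernels from top principal components of model representations.
import numpy as np

rng = np.random.default_rng(2)
H_weak = rng.normal(size=(300, 64))      # weak model activations (samples x dims)
H_strong = rng.normal(size=(300, 256))   # strong model activations

def pc_kernel(H, r):
    """n x n projection kernel onto the span of the top-r principal components."""
    U, _, _ = np.linalg.svd(H - H.mean(axis=0), full_matrices=False)
    return U[:, :r] @ U[:, :r].T

r = 10
K_w, K_s = pc_kernel(H_weak, r), pc_kernel(H_strong, r)
overlap = np.trace(K_w @ K_s) / r        # 1.0 means identical r-dim subspaces
print(f"kernel overlap: {overlap:.3f}")
```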
- Debate Helps Weak-to-Strong Generalization [68.70065254564642]
We investigate ways of improving human supervision with a strong pretrained model and then supervise the strong model with enhanced weak human supervision. We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model. Experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment.
arXiv Detail & Related papers (2025-01-21T05:36:13Z)
- Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization [68.62228569439478]
We investigate whether there exists an issue of weak-to-strong deception. We find that the deception intensifies as the capability gap between weak and strong models increases. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.
arXiv Detail & Related papers (2024-06-17T11:36:39Z)
- Quantifying the Gain in Weak-to-Strong Generalization [14.453654853392619]
We show that the improvement in performance achieved by strong models over their weaker counterparts is quantified by the misfit error incurred by the strong model on labels generated by the weaker model.
For instance, we can predict the amount by which the strong model will improve over the weak model, and choose among different weak models for training the strong model, based on the misfit error.
arXiv Detail & Related papers (2024-05-24T00:14:16Z)
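Schematically, and in notation we introduce here (the abstract states the result only in words), the quantification can be written as:

```latex
% Assumed notation: f_w = weak model, f_{w2s} = strong model finetuned
% on weak labels, f^\ast = ground truth; squared loss throughout.
% Gain of the strong model over the weak one ~ misfit on weak labels:
\mathbb{E}\!\left[(f_w(x)-f^\ast(x))^2\right]
  - \mathbb{E}\!\left[(f_{w2s}(x)-f^\ast(x))^2\right]
  \;\approx\;
  \underbrace{\mathbb{E}\!\left[(f_{w2s}(x)-f_w(x))^2\right]}_{\text{misfit error}}
```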
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision [55.196139002977525]
Superhuman models will behave in complex ways too difficult for humans to reliably evaluate.
Can weak model supervision elicit the full capabilities of a much stronger model?
We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors.
arXiv Detail & Related papers (2023-12-14T23:07:33Z)
- Generative Modeling Helps Weak Supervision (and Vice Versa) [87.62271390571837]
We propose a model fusing weak supervision and generative adversarial networks.
It captures discrete variables in the data alongside the label estimate derived from weak supervision.
It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels.
arXiv Detail & Related papers (2022-03-22T20:24:21Z)