How to Mitigate Overfitting in Weak-to-strong Generalization?
- URL: http://arxiv.org/abs/2503.04249v1
- Date: Thu, 06 Mar 2025 09:32:39 GMT
- Title: How to Mitigate Overfitting in Weak-to-strong Generalization?
- Authors: Junhao Shi, Qinyuan Cheng, Zhaoye Fei, Yining Zheng, Qipeng Guo, Xipeng Qiu
- Abstract summary: Weak-to-strong generalization aims to elicit the capabilities of strong models through weak supervisors. Strong models exhibit significant overfitting in weak-to-strong generalization. We propose a two-stage framework that simultaneously improves the quality of supervision signals and the quality of input questions.
- Score: 50.37526669608372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aligning powerful AI models on tasks that surpass human evaluation capabilities is the central problem of superalignment. To address this problem, weak-to-strong generalization aims to elicit the capabilities of strong models through weak supervisors and to ensure that the behavior of strong models aligns with the intentions of weak supervisors without unsafe behaviors such as deception. Although weak-to-strong generalization exhibits certain generalization capabilities, strong models overfit significantly in this setting: because strong models fit their training data so readily, erroneous labels from weak supervisors may lead to overfitting. In addition, simply filtering out incorrect labels may degrade question quality, leaving strong models with weak generalization ability on hard questions. To mitigate overfitting in weak-to-strong generalization, we propose a two-stage framework that simultaneously improves the quality of supervision signals and the quality of input questions. Experimental results on three series of large language models and two mathematical benchmarks demonstrate that our framework significantly improves PGR compared to naive weak-to-strong generalization, even achieving up to 100% PGR on some models.
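The abstract reports results as PGR without defining it. In the weak-to-strong literature (for example the "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision" paper listed below), PGR stands for performance gap recovered: the fraction of the gap between the weak supervisor and the strong model's ceiling that the weakly supervised strong model closes. The sketch below assumes that standard definition; the function name and the accuracy values are hypothetical and are not taken from this paper.

```python
def performance_gap_recovered(weak_acc, weak_to_strong_acc, strong_ceiling_acc):
    """Return PGR = (weak_to_strong - weak) / (strong_ceiling - weak).

    weak_acc:            accuracy of the weak supervisor
    weak_to_strong_acc:  accuracy of the strong model trained on weak labels
    strong_ceiling_acc:  accuracy of the strong model trained on ground truth
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        # PGR is only meaningful when the strong ceiling beats the weak model.
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong_acc - weak_acc) / gap


# Illustrative (made-up) accuracies: the weakly supervised strong model
# closes roughly half of the gap between weak supervisor and strong ceiling.
pgr = performance_gap_recovered(0.40, 0.55, 0.70)
print(f"PGR = {pgr:.2f}")
```

Under this definition, a PGR of 100% means the strong model trained on weak labels matches the strong model trained on ground-truth labels, and a PGR of 0% means it does no better than its weak supervisor.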
Related papers
- Understanding the Capabilities and Limitations of Weak-to-Strong Generalization [40.793180521446466]
We provide theoretical insights into weak-to-strong generalization.
We show that the weak model should demonstrate strong generalization performance and maintain well-calibrated predictions.
We extend the work of Charikar et al. (2024) to a loss function based on Kullback-Leibler divergence.
arXiv Detail & Related papers (2025-02-03T15:48:28Z)
- Debate Helps Weak-to-Strong Generalization [68.70065254564642]
We investigate ways of improving human supervision with a strong pretrained model and then supervise the strong model with enhanced weak human supervision.
We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model.
Experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment.
arXiv Detail & Related papers (2025-01-21T05:36:13Z)
- Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models [29.11210975481761]
We study whether a stronger model can inherit trustworthiness properties when fine-tuned on a weaker model's outputs.
Our work provides valuable insights into the potential and limitations of weak-to-strong generalization.
arXiv Detail & Related papers (2024-12-31T12:40:02Z)
- Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization [68.62228569439478]
We investigate whether there exists an issue of weak-to-strong deception.
We find that the deception intensifies as the capability gap between weak and strong models increases.
Our work highlights the urgent need to pay more attention to the true reliability of superalignment.
arXiv Detail & Related papers (2024-06-17T11:36:39Z)
- Quantifying the Gain in Weak-to-Strong Generalization [14.453654853392619]
We show that the improvement in performance achieved by strong models over their weaker counterparts is quantified by the misfit error incurred by the strong model on labels generated by the weaker model.
For instance, we can predict the amount by which the strong model will improve over the weak model, and also choose among different weak models to train the strong model, based on its misfit error.
arXiv Detail & Related papers (2024-05-24T00:14:16Z)
- Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts [81.37287967870589]
We propose to harness a diverse set of specialized teachers, instead of a single generalist one, that collectively supervises the strong student.
Our approach resembles the classical hierarchical mixture of experts, with two components tailored for co-supervision.
We validate the proposed method through visual recognition tasks on the OpenAI weak-to-strong benchmark and additional multi-domain datasets.
arXiv Detail & Related papers (2024-02-23T18:56:11Z)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models [55.919653720979824]
This paper focuses on the concept of weak-to-strong generalization, which involves using a weaker model to supervise a stronger one.
We introduce a novel and adaptively adjustable loss function for weak-to-strong supervision.
Our approach not only exceeds the performance benchmarks set by strong-to-strong generalization but also surpasses the outcomes of fine-tuning strong models with whole datasets.
arXiv Detail & Related papers (2024-02-06T06:30:34Z)
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision [55.196139002977525]
Superhuman models will behave in complex ways too difficult for humans to reliably evaluate.
Can weak model supervision elicit the full capabilities of a much stronger model?
We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors.
arXiv Detail & Related papers (2023-12-14T23:07:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.