Weak-to-Strong Generalization under Distribution Shifts
- URL: http://arxiv.org/abs/2510.21332v1
- Date: Fri, 24 Oct 2025 10:46:50 GMT
- Title: Weak-to-Strong Generalization under Distribution Shifts
- Authors: Myeongho Jeon, Jan Sobotka, Suhwan Choi, Maria Brbić
- Abstract summary: We propose RAVEN, a robust weak-to-strong generalization framework. RAVEN learns the optimal combinations of weak models in addition to parameters of the strong model. Our results show that RAVEN assigns higher weights to more accurate weak models, demonstrating its ability to automatically identify trustworthy supervision.
- Score: 6.711930932187631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As future superhuman models become increasingly complex, accurately supervising their behavior may exceed human capabilities. Recent works have demonstrated that in such scenarios, weak models can effectively supervise strong models, a phenomenon known as weak-to-strong generalization. However, we find that naive weak-to-strong generalization fails under distribution shifts, often leading to worse performance of the strong model than its weak supervisors. To address this, we propose RAVEN, a robust weak-to-strong generalization framework that dynamically learns the optimal combinations of weak models in addition to parameters of the strong model. We demonstrate the effectiveness of RAVEN on image classification, text classification, and preference alignment tasks. RAVEN outperforms alternative baselines by over 30% on out-of-distribution tasks while matching or surpassing existing methods on in-distribution tasks. Moreover, our results show that RAVEN assigns higher weights to more accurate weak models, demonstrating its ability to automatically identify trustworthy supervision.
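The abstract's core idea, learning a convex combination of weak supervisors jointly with the strong model's parameters, can be illustrated with a minimal sketch. All names, shapes, and the optimizer below are illustrative assumptions, not RAVEN's actual implementation: weak supervision is mixed via learned softmax weights, and both the mixture weights and a linear "strong model" are fit to the resulting pseudo-labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 hypothetical weak supervisors give soft labels for 8 inputs
# over 2 classes. Shapes and names here are illustrative, not RAVEN's API.
n, d, c, k = 8, 4, 2, 3
X = rng.normal(size=(n, d))
weak_probs = rng.dirichlet(np.ones(c), size=(k, n))  # (k, n, c)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def loss(params):
    """Cross-entropy of a linear strong model against a learned convex
    mixture of the weak supervisors' labels (the combination idea only)."""
    W = params[: d * c].reshape(d, c)
    alpha = softmax(params[d * c :])                 # mixture weights, sum to 1
    pseudo = np.einsum("k,knc->nc", alpha, weak_probs)
    p = softmax(X @ W)
    return -(pseudo * np.log(p + 1e-9)).sum() / n

# Joint optimization of strong-model weights and combination weights via
# plain numerical gradient descent (a sketch, not the paper's optimizer).
params = np.zeros(d * c + k)
eps, lr = 1e-5, 0.2
for _ in range(200):
    g = np.zeros_like(params)
    for i in range(params.size):
        e = np.zeros_like(params)
        e[i] = eps
        g[i] = (loss(params + e) - loss(params - e)) / (2 * eps)
    params -= lr * g

alpha = softmax(params[d * c :])
print(alpha)  # learned weights over the three weak supervisors
```

On real data, the paper reports that such learned weights concentrate on the more accurate weak models; the sketch only shows the mechanics of jointly optimizing the mixture and the strong model.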
Related papers
- Contrastive Weak-to-strong Generalization [50.5986177336082]
We propose Contrastive Weak-to-Strong Generalization (ConG) to advance weak-to-strong generalization. This framework employs contrastive decoding between pre- and post-alignment weak models to generate higher-quality samples.
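The abstract does not spell out ConG's exact formula; the sketch below shows the standard contrastive-decoding recipe it builds on, under the assumption that the next-token logits of the post-alignment weak model are contrasted against those of its pre-alignment counterpart. The `beta` parameter and toy logits are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def contrastive_decode(logits_post, logits_pre, beta=1.0):
    """Amplify what alignment changed: boost tokens the post-alignment
    weak model prefers relative to its pre-alignment counterpart."""
    adjusted = logits_post + beta * (logits_post - logits_pre)
    return softmax(adjusted)

# Toy vocabulary of 4 tokens; alignment shifted mass toward token 2.
logits_pre = np.array([2.0, 1.0, 0.5, 0.0])
logits_post = np.array([2.0, 1.0, 2.5, 0.0])

p_plain = softmax(logits_post)
p_cong = contrastive_decode(logits_post, logits_pre, beta=1.5)
print(p_cong.argmax())  # token favored after the contrastive adjustment
```

The contrast sharpens exactly the preferences that alignment introduced, which is the intuition behind using the decoded samples as higher-quality supervision.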
arXiv Detail & Related papers (2025-10-09T07:37:23Z)
- Weak-to-Strong Generalization with Failure Trajectories: A Tree-based Approach to Elicit Optimal Policy in Strong Models [26.393123295267642]
Weak-to-Strong generalization (W2SG) is a new trend to elicit the full capabilities of a strong model with supervision from a weak model. We fine-tune a strong model with trajectories of intermediate actions generated by a weak model. Our empirical evaluations demonstrate substantial improvements in reasoning and decision-making capabilities across diverse task domains.
arXiv Detail & Related papers (2025-07-25T00:17:09Z)
- How to Mitigate Overfitting in Weak-to-strong Generalization? [50.37526669608372]
Weak-to-strong generalization aims to elicit the capabilities of strong models through weak supervisors. Strong models exhibit significant overfitting in weak-to-strong generalization. We propose a two-stage framework that simultaneously improves the quality of supervision signals and the quality of input questions.
arXiv Detail & Related papers (2025-03-06T09:32:39Z)
- Relating Misfit to Gain in Weak-to-Strong Generalization Beyond the Squared Loss [4.4505368723466585]
We study weak-to-strong generalization for convex combinations of $k$ strong models in the strong class. We obtain a similar misfit-based characterization of performance gain, up to an additional error term that vanishes as $k$ gets large.
arXiv Detail & Related papers (2025-01-31T12:57:58Z)
- Debate Helps Weak-to-Strong Generalization [68.70065254564642]
We investigate ways of improving human supervision with a strong pretrained model and then supervise the strong model with enhanced weak human supervision. We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model. Experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment.
arXiv Detail & Related papers (2025-01-21T05:36:13Z)
- Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization [68.62228569439478]
We investigate whether there exists an issue of weak-to-strong deception. We find that the deception intensifies as the capability gap between weak and strong models increases. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.
arXiv Detail & Related papers (2024-06-17T11:36:39Z)
- Quantifying the Gain in Weak-to-Strong Generalization [14.453654853392619]
We show that the improvement in performance achieved by strong models over their weaker counterparts is quantified by the misfit error incurred by the strong model on labels generated by the weaker model.
For instance, we can predict the amount by which the strong model will improve over the weak model, and also choose among different weak models to train the strong model, based on its misfit error.
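For the squared loss, the claim that the strong model's gain over its weak supervisor equals its misfit on the weak labels is a Pythagorean identity: when the strong model is the least-squares projection of the weak labels onto a class containing the truth, the residual is orthogonal to that class. A minimal numerical sketch (synthetic data; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# The column space of X plays the role of the strong class; the true
# target lies inside it, while the weak labels carry added noise.
n, d = 50, 5
X = rng.normal(size=(n, d))
f_true = X @ rng.normal(size=d)          # ground truth, inside the strong class
f_weak = f_true + rng.normal(size=n)     # noisy weak supervision

# Strong model = least-squares fit (projection) of the weak labels.
coef, *_ = np.linalg.lstsq(X, f_weak, rcond=None)
f_strong = X @ coef

loss_weak = np.mean((f_weak - f_true) ** 2)
loss_strong = np.mean((f_strong - f_true) ** 2)
misfit = np.mean((f_strong - f_weak) ** 2)

# Pythagoras: gain of the strong model over the weak supervisor
# equals the strong model's misfit on the weak labels.
print(loss_weak - loss_strong, misfit)
```

Since the misfit is computable without ground-truth labels, it can be used exactly as the abstract suggests: to predict the strong model's improvement and to choose among candidate weak supervisors.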
arXiv Detail & Related papers (2024-05-24T00:14:16Z)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models [55.919653720979824]
This paper focuses on the concept of weak-to-strong generalization, which involves using a weaker model to supervise a stronger one.
We introduce a novel and adaptively adjustable loss function for weak-to-strong supervision.
Our approach not only exceeds the performance benchmarks set by strong-to-strong generalization but also surpasses the outcomes of fine-tuning strong models with whole datasets.
arXiv Detail & Related papers (2024-02-06T06:30:34Z)
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision [55.196139002977525]
Superhuman models will behave in complex ways too difficult for humans to reliably evaluate.
Can weak model supervision elicit the full capabilities of a much stronger model?
We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors.
arXiv Detail & Related papers (2023-12-14T23:07:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.