Quantifying the Gain in Weak-to-Strong Generalization
- URL: http://arxiv.org/abs/2405.15116v1
- Date: Fri, 24 May 2024 00:14:16 GMT
- Title: Quantifying the Gain in Weak-to-Strong Generalization
- Authors: Moses Charikar, Chirag Pabbaraju, Kirankumar Shiragur
- Abstract summary: We show that the improvement in performance achieved by strong models over their weaker counterparts is quantified by the misfit error incurred by the strong model on labels generated by the weaker model.
For instance, we can predict the amount by which the strong model will improve over the weak model, and also choose among different weak models to train the strong model, based on the strong model's misfit error on their labels.
- Score: 14.453654853392619
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models have shown capabilities that are extraordinary and near-superhuman. These models operate with such complexity that reliably evaluating and aligning them proves challenging for humans. This leads to the natural question: can guidance from weak models (like humans) adequately direct the capabilities of strong models? In a recent and somewhat surprising work, Burns et al. (2023) empirically demonstrated that when strong models (like GPT-4) are finetuned using labels generated by weak supervisors (like GPT-2), the strong models outperform their weaker counterparts -- a phenomenon they term weak-to-strong generalization. In this work, we present a theoretical framework for understanding weak-to-strong generalization. Specifically, we show that the improvement in performance achieved by strong models over their weaker counterparts is quantified by the misfit error incurred by the strong model on labels generated by the weaker model. Our theory reveals several curious algorithmic insights. For instance, we can predict the amount by which the strong model will improve over the weak model, and also choose among different weak models to train the strong model, based on its misfit error. We validate our theoretical findings through various empirical assessments.
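Below is a minimal numerical sketch of this claim (illustrative only, not the authors' code): in a squared-loss regression setting, a weak supervisor generates labels, a strong model from a richer (convex) class is fit on those weak labels, and the strong model's realized gain over the weak model is compared against its misfit error on the weak labels. The data distribution, model classes, and polynomial degree are assumptions chosen for illustration.
```python
# Illustrative sketch only: a weak supervisor labels data, a strong model is
# "finetuned" on those weak labels, and the strong model's gain over the weak
# model is compared against its misfit error on the weak labels.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(5000, 1))
y = np.sin(3.0 * X[:, 0])  # ground truth, hidden from the strong model

# Weak supervisor: a coarse piecewise-constant model fit on the true labels.
weak = DecisionTreeRegressor(max_depth=2).fit(X, y)
weak_labels = weak.predict(X)

# Strong model: a richer (convex) class fit only on the weak labels.
strong = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
strong.fit(X, weak_labels)
strong_preds = strong.predict(X)

err_weak = np.mean((weak_labels - y) ** 2)            # weak model's true error
err_strong = np.mean((strong_preds - y) ** 2)         # strong model's true error
misfit = np.mean((strong_preds - weak_labels) ** 2)   # misfit on weak labels

print(f"weak error:   {err_weak:.4f}")
print(f"strong error: {err_strong:.4f}")
print(f"gain = {err_weak - err_strong:.4f}  ~  misfit = {misfit:.4f}")
```
Because bounded-degree polynomials form a linear (hence convex) class that nearly contains the ground truth here, least-squares fitting acts as an orthogonal projection, and a Pythagorean argument gives err_weak ≈ err_strong + misfit. The same misfit quantity could, as the abstract suggests, be compared across candidate weak supervisors when choosing which one to finetune on.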
Related papers
- Exploring Scaling Trends in LLM Robustness [8.057932419561428]
Language model capabilities predictably improve from scaling a model's size and training data.
These models are vulnerable to adversarial prompts, such as "jailbreaks" that hijack models to perform undesired behaviors.
We find that larger models respond substantially better to adversarial training, but there is little to no benefit from model scale in the absence of explicit defenses.
arXiv Detail & Related papers (2024-07-25T17:26:41Z)
- Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization [29.74441821506767]
We investigate whether a weak-to-strong deception problem arises, in which strong models deceive their weak supervisors.
We find that the deception phenomenon may intensify as the capability gap between weak and strong models increases.
Our work highlights the urgent need to pay more attention to the true reliability of superalignment.
arXiv Detail & Related papers (2024-06-17T11:36:39Z)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models [55.919653720979824]
This paper focuses on the concept of weak-to-strong generalization, which involves using a weaker model to supervise a stronger one.
We introduce a novel and adaptively adjustable loss function for weak-to-strong supervision.
Our approach not only exceeds the performance benchmarks set by strong-to-strong generalization but also surpasses the outcomes of fine-tuning strong models with whole datasets.
arXiv Detail & Related papers (2024-02-06T06:30:34Z)
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision [55.196139002977525]
Superhuman models will behave in complex ways too difficult for humans to reliably evaluate.
Can weak model supervision elicit the full capabilities of a much stronger model?
We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors.
arXiv Detail & Related papers (2023-12-14T23:07:33Z)
- Generative Modeling Helps Weak Supervision (and Vice Versa) [87.62271390571837]
We propose a model fusing weak supervision and generative adversarial networks.
It captures discrete variables in the data alongside the weak-supervision-derived label estimate.
It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels.
arXiv Detail & Related papers (2022-03-22T20:24:21Z)
- Predicting on the Edge: Identifying Where a Larger Model Does Better [61.793778186198864]
We show that large models improve most on the examples where the small model is most uncertain.
We show that a switcher model, which defers examples to a larger model when the small model is uncertain, can achieve striking improvements in performance and resource usage (a minimal sketch of such a switcher appears after this list).
arXiv Detail & Related papers (2022-02-15T18:53:14Z)
- Clustering Effect of (Linearized) Adversarial Robust Models [60.25668525218051]
We propose a novel understanding of adversarial robustness and apply it to further tasks, including domain adaptation and robustness boosting.
Experimental evaluations demonstrate the rationality and superiority of our proposed clustering strategy.
arXiv Detail & Related papers (2021-11-25T05:51:03Z)
- Voting based ensemble improves robustness of defensive models [82.70303474487105]
We study whether it is possible to create an ensemble to further improve robustness.
By ensembling several state-of-the-art pre-trained defense models, our method can achieve a 59.8% robust accuracy.
arXiv Detail & Related papers (2020-11-28T00:08:45Z)
- Provably robust deep generative models [1.52292571922932]
We propose a method for training provably robust generative models, specifically a provably robust version of the variational auto-encoder (VAE).
We show that it is able to produce generative models that are substantially more robust to adversarial attacks.
arXiv Detail & Related papers (2020-04-22T14:47:41Z)
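As referenced in the "Predicting on the Edge" entry above, here is a minimal sketch of a confidence-based switcher (an illustration under assumed interfaces, not code from that paper): the small model handles an example unless its top-class probability falls below a threshold, in which case the example is deferred to the large model. The model interfaces and the threshold value are hypothetical.
```python
# Hypothetical sketch of a confidence-based switcher: route an input to the
# large model only when the small model's top-class probability is low.
from typing import Callable

import numpy as np


def switcher(
    small_proba: Callable[[np.ndarray], np.ndarray],  # (n, d) -> (n, k) class probabilities
    large_proba: Callable[[np.ndarray], np.ndarray],
    X: np.ndarray,
    threshold: float = 0.8,  # illustrative confidence cutoff
) -> np.ndarray:
    """Predict with the small model; defer uncertain rows to the large model."""
    p_small = small_proba(X)
    confident = p_small.max(axis=1) >= threshold
    preds = p_small.argmax(axis=1)
    if not confident.all():
        # Only the uncertain examples pay the large model's inference cost.
        p_large = large_proba(X[~confident])
        preds[~confident] = p_large.argmax(axis=1)
    return preds
```
The design choice is that only the deferred rows incur the large model's inference cost, which is where the resource savings described in that entry would come from.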
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.