Related papers: Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

URL: http://arxiv.org/abs/2406.11431v1
Date: Mon, 17 Jun 2024 11:36:39 GMT
Title: Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Authors: Wenkai Yang, Shiqi Shen, Guangyao Shen, Zhi Gong, Yankai Lin,
Abstract summary: We investigate whether there exists an issue of weak-to-strong deception. We find that the deception phenomenon may intensify as the capability gap between weak and strong models increases. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.
Score: 29.74441821506767
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Superalignment, where humans are weak supervisors of superhuman models, has become an important and widely discussed issue in the current era of rapid development of Large Language Models (LLMs). The recent work preliminarily studies this problem by using weak models to supervise strong models. It discovers that weakly supervised strong students can consistently outperform weak teachers towards the alignment target, leading to a weak-to-strong generalization phenomenon. However, we are concerned that behind such a promising phenomenon, whether there exists an issue of weak-to-strong deception, where strong models may deceive weak models by exhibiting well-aligned in areas known to weak models but producing misaligned behaviors in cases weak models do not know. We then take an initial step towards exploring this security issue in a specific but realistic multi-objective alignment case, where there may be some alignment targets conflicting with each other (e.g., helpfulness v.s. harmlessness). Such a conflict is likely to cause strong models to deceive weak models in one alignment dimension to gain high reward in other alignment dimension. Our experiments on both the reward modeling task and the preference optimization scenario indicate: (1) the weak-to-strong deception exists; (2) the deception phenomenon may intensify as the capability gap between weak and strong models increases. We also discuss potential solutions and find bootstrapping with an intermediate model can mitigate the deception to some extent. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.

Related papers

How to Mitigate Overfitting in Weak-to-strong Generalization? [50.37526669608372]
Weak-to-strong generalization aims to elicit the capabilities of strong models through weak supervisors. Strong models exhibit significant overfitting in weak-to-strong generalization. We propose a two-stage framework that simultaneously improves the quality of supervision signals and the quality of input questions.
arXiv Detail & Related papers (2025-03-06T09:32:39Z)
Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions [12.956498486569103]
Weak-to-Strong Generalization (W2SG) serves as an important analogy for understanding how humans might guide superhuman intelligence in the future. We show that W2SG can be characterized using kernels derived from the principal components of weak and strong models' internal representations.
arXiv Detail & Related papers (2025-02-02T01:11:51Z)
Debate Helps Weak-to-Strong Generalization [68.70065254564642]
We investigate ways of improving human supervision with a strong pretrained model and then supervise the strong model with enhanced weak human supervision. We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model. Experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment.
arXiv Detail & Related papers (2025-01-21T05:36:13Z)
Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning [10.752609242505953]
Traditional alignment methods rely on human feedback to fine-tune models. Superhuman models whose outputs may surpass human understanding poses significant challenges. Recent works use weak supervisors to elicit knowledge from much stronger models.
arXiv Detail & Related papers (2024-10-16T14:40:32Z)
Effects of Scale on Language Model Robustness [7.725206196110384]
We show that adversarially trained larger models generalize faster and better to modified attacks not seen during training when compared with smaller models. We also analyze the offense/defense balance of increasing compute, finding parity in some settings and an advantage for offense in others.
arXiv Detail & Related papers (2024-07-25T17:26:41Z)
Quantifying the Gain in Weak-to-Strong Generalization [14.453654853392619]
We show that the improvement in performance achieved by strong models over their weaker counterparts is quantified by the misfit error incurred by the strong model on labels generated by the weaker model. For instance, we can predict the amount by which the strong model will improve over the weak model, and also choose among different weak models to train the strong model, based on its misfit error.
arXiv Detail & Related papers (2024-05-24T00:14:16Z)
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models [55.919653720979824]
This paper focuses on the concept of weak-to-strong generalization, which involves using a weaker model to supervise a stronger one. We introduce a novel and adaptively adjustable loss function for weak-to-strong supervision. Our approach not only exceeds the performance benchmarks set by strong-to-strong generalization but also surpasses the outcomes of fine-tuning strong models with whole datasets.
arXiv Detail & Related papers (2024-02-06T06:30:34Z)
Rethinking Robustness of Model Attributions [24.317595434521504]
We show that many attribution methods are fragile and have proposed improvements in either these methods or the model training. We observe two main causes for fragile attributions: first, the existing metrics of robustness over-penalize even reasonable local shifts in attribution. We propose simple ways to strengthen existing metrics and attribution methods that incorporate locality of pixels in robustness metrics and diversity of pixel locations in attributions.
arXiv Detail & Related papers (2023-12-16T20:20:38Z)
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision [55.196139002977525]
Superhuman models will behave in complex ways too difficult for humans to reliably evaluate. Can weak model supervision elicit the full capabilities of a much stronger model? We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors.
arXiv Detail & Related papers (2023-12-14T23:07:33Z)
Fairness Increases Adversarial Vulnerability [50.90773979394264]
This paper shows the existence of a dichotomy between fairness and robustness, and analyzes when achieving fairness decreases the model robustness to adversarial samples. Experiments on non-linear models and different architectures validate the theoretical findings in multiple vision domains. The paper proposes a simple, yet effective, solution to construct models achieving good tradeoffs between fairness and robustness.
arXiv Detail & Related papers (2022-11-21T19:55:35Z)
"What's in the box?!": Deflecting Adversarial Attacks by Randomly Deploying Adversarially-Disjoint Models [71.91835408379602]
adversarial examples have been long considered a real threat to machine learning models. We propose an alternative deployment-based defense paradigm that goes beyond the traditional white-box and black-box threat models.
arXiv Detail & Related papers (2021-02-09T20:07:13Z)
Orthogonal Deep Models As Defense Against Black-Box Attacks [71.23669614195195]
We study the inherent weakness of deep models in black-box settings where the attacker may develop the attack using a model similar to the targeted model. We introduce a novel gradient regularization scheme that encourages the internal representation of a deep model to be orthogonal to another. We verify the effectiveness of our technique on a variety of large-scale models.
arXiv Detail & Related papers (2020-06-26T08:29:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.