Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
- URL: http://arxiv.org/abs/2501.00418v1
- Date: Tue, 31 Dec 2024 12:40:02 GMT
- Title: Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
- Authors: Martin Pawelczyk, Lillian Sun, Zhenting Qi, Aounon Kumar, Himabindu Lakkaraju
- Abstract summary: We study whether a stronger model can inherit trustworthiness properties when fine-tuned on a weaker model's outputs. Our work provides valuable insights into the potential and limitations of weak-to-strong generalization.
- Score: 29.11210975481761
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid proliferation of generative AI, especially large language models, has led to their integration into a variety of applications. A key phenomenon known as weak-to-strong generalization - where a strong model trained on a weak model's outputs surpasses the weak model in task performance - has gained significant attention. Yet, whether critical trustworthiness properties such as robustness, fairness, and privacy can generalize similarly remains an open question. In this work, we study this question by examining whether a stronger model can inherit trustworthiness properties when fine-tuned on a weaker model's outputs, a process we term weak-to-strong trustworthiness generalization. To address this, we introduce two foundational training strategies: 1) Weak Trustworthiness Finetuning (Weak TFT), which leverages trustworthiness regularization during the fine-tuning of the weak model, and 2) Weak and Weak-to-Strong Trustworthiness Finetuning (Weak+WTS TFT), which extends regularization to both the weak and strong models. Our experimental evaluation on real-world datasets reveals that while some trustworthiness properties, such as fairness, adversarial robustness, and OOD robustness, transfer significantly better when both models are regularized, others, such as privacy, show no signs of weak-to-strong trustworthiness generalization. As the first study to explore trustworthiness generalization via weak-to-strong generalization, our work provides valuable insights into the potential and limitations of this approach.
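To make the setup concrete, the following is a minimal, hypothetical sketch of the Weak+WTS TFT idea as described in the abstract: a strong model is fine-tuned on the weak model's pseudo-labels while a trustworthiness regularizer is added to the loss. The paper does not specify the regularizers, models, or hyperparameters used here; the FGSM-style robustness penalty, the toy MLPs, and all function names below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


def trustworthiness_penalty(model, x, pseudo_labels, eps=0.01):
    """Stand-in trustworthiness regularizer: a one-step (FGSM-style) adversarial
    robustness penalty. The paper's actual regularizers (fairness, privacy, etc.)
    may take a different form."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), pseudo_labels)
    grad = torch.autograd.grad(loss, x_adv)[0]
    x_adv = (x_adv + eps * grad.sign()).detach()          # perturb the inputs
    return F.cross_entropy(model(x_adv), pseudo_labels)   # loss on perturbed inputs


def weak_to_strong_tft(weak_model, strong_model, loader, lam=0.5, epochs=1, lr=1e-4):
    """Fine-tune the strong model on the weak model's hard pseudo-labels while
    adding a trustworthiness regularizer (a Weak+WTS TFT-flavored sketch)."""
    opt = torch.optim.AdamW(strong_model.parameters(), lr=lr)
    weak_model.eval()
    for _ in range(epochs):
        for x, _ in loader:                               # ground-truth labels are ignored
            with torch.no_grad():
                pseudo = weak_model(x).argmax(dim=-1)     # weak supervision signal
            task_loss = F.cross_entropy(strong_model(x), pseudo)
            reg_loss = trustworthiness_penalty(strong_model, x, pseudo)
            loss = task_loss + lam * reg_loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return strong_model


if __name__ == "__main__":
    # Toy stand-ins for the weak and strong models, plus synthetic binary data.
    weak = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
    strong = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
    data = TensorDataset(torch.randn(128, 16), torch.randint(0, 2, (128,)))
    weak_to_strong_tft(weak, strong, DataLoader(data, batch_size=32))
```

Under the same assumptions, the Weak TFT variant would instead apply the regularizer only while fine-tuning the weak model, before its pseudo-labels are used to supervise the strong model.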
Related papers
- A Biologically Inspired Trust Model for Open Multi-Agent Systems that is Resilient to Rapid Performance Fluctuations [0.0]
Existing trust models face challenges related to agent mobility, changing behaviors, and the cold start problem.
We introduce a biologically inspired trust model in which trustees assess their own capabilities and store trust data locally.
This design improves mobility support, reduces communication overhead, resists disinformation, and preserves privacy.
arXiv Detail & Related papers (2025-04-17T08:21:54Z) - How to Mitigate Overfitting in Weak-to-strong Generalization? [50.37526669608372]
Weak-to-strong generalization aims to elicit the capabilities of strong models through weak supervisors.
Strong models exhibit significant overfitting in weak-to-strong generalization.
We propose a two-stage framework that simultaneously improves the quality of supervision signals and the quality of input questions.
arXiv Detail & Related papers (2025-03-06T09:32:39Z) - Debate Helps Weak-to-Strong Generalization [68.70065254564642]
We investigate ways of improving human supervision with a strong pretrained model and then supervising the strong model with the enhanced weak human supervision.
We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model.
Experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment.
arXiv Detail & Related papers (2025-01-21T05:36:13Z) - Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization [68.62228569439478]
We investigate whether there exists an issue of weak-to-strong deception.
We find that the deception intensifies as the capability gap between weak and strong models increases.
Our work highlights the urgent need to pay more attention to the true reliability of superalignment.
arXiv Detail & Related papers (2024-06-17T11:36:39Z) - Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models [14.5291643644017]
We introduce the concept of Confidence-Probability Alignment.
We probe the alignment between models' internal and expressed confidence.
Among the models analyzed, OpenAI's GPT-4 showed the strongest confidence-probability alignment.
arXiv Detail & Related papers (2024-05-25T15:42:04Z) - Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models [55.919653720979824]
This paper focuses on the concept of weak-to-strong generalization, which involves using a weaker model to supervise a stronger one.
We introduce a novel and adaptively adjustable loss function for weak-to-strong supervision.
Our approach not only exceeds the performance benchmarks set by strong-to-strong generalization but also surpasses the outcomes of fine-tuning strong models with whole datasets.
arXiv Detail & Related papers (2024-02-06T06:30:34Z) - Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision [55.196139002977525]
Superhuman models will behave in complex ways too difficult for humans to reliably evaluate.
Can weak model supervision elicit the full capabilities of a much stronger model?
We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors.
arXiv Detail & Related papers (2023-12-14T23:07:33Z) - How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities [37.14654106278984]
We conduct an adversarial assessment of open-source Large Language Models (LLMs) on trustworthiness.
We propose advCoU, an extended Chain of Utterances (CoU) prompting strategy that incorporates carefully crafted malicious demonstrations to attack trustworthiness.
Our experiments encompass recent and representative series of open-source LLMs, including Vicuna, MPT, Falcon, Mistral, and Llama 2.
arXiv Detail & Related papers (2023-11-15T23:33:07Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - FRSUM: Towards Faithful Abstractive Summarization via Enhancing Factual Robustness [56.263482420177915]
We study the faithfulness of existing systems from a new perspective of factual robustness.
We propose a novel training strategy, namely FRSUM, which teaches the model to defend against both explicit adversarial samples and implicit factual adversarial perturbations.
arXiv Detail & Related papers (2022-11-01T06:09:00Z) - Balancing Robustness and Sensitivity using Feature Contrastive Learning [95.86909855412601]
Methods that promote robustness can hurt the model's sensitivity to rare or underrepresented patterns.
We propose Feature Contrastive Learning (FCL), which encourages a model to be more sensitive to features with higher contextual utility.
arXiv Detail & Related papers (2021-05-19T20:53:02Z)
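As a companion illustration for the feature-sensitivity idea in the last entry above, the following is a loosely inspired, hypothetical stand-in rather than FCL's actual objective: it perturbs high-utility and low-utility feature groups separately and rewards larger output changes for the former. The per-feature utility scores, the margin, and all names are assumptions introduced for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def feature_sensitivity_loss(model, x, utility, eps=0.1, margin=0.01):
    """Encourage larger output changes when high-utility features are perturbed
    than when low-utility features are. `utility` holds per-feature scores in
    [0, 1]; how those scores are estimated is left open (hypothetical)."""
    with torch.no_grad():
        base = model(x)                                  # reference outputs
    hi_mask = (utility >= utility.median()).float()      # high-utility features
    lo_mask = 1.0 - hi_mask                              # remaining features
    noise = eps * torch.randn_like(x)
    sens_hi = (model(x + noise * hi_mask) - base).pow(2).mean()
    sens_lo = (model(x + noise * lo_mask) - base).pow(2).mean()
    # Hinge term: sensitivity to high-utility features should exceed the rest.
    return F.relu(sens_lo - sens_hi + margin)


# Toy usage with a small MLP and random (hypothetical) utility scores.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
x = torch.randn(8, 16)
utility = torch.rand(16)
loss = feature_sensitivity_loss(model, x, utility)
loss.backward()
```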