Improving Weak-to-Strong Generalization with Reliability-Aware Alignment
- URL: http://arxiv.org/abs/2406.19032v1
- Date: Thu, 27 Jun 2024 09:37:34 GMT
- Title: Improving Weak-to-Strong Generalization with Reliability-Aware Alignment
- Authors: Yue Guo, Yi Yang
- Abstract summary: Large language models (LLMs) are rapidly advancing and surpassing human abilities on many natural language tasks.
"Super-alignment" problem requires enhancing weak-to-strong generalization.
We propose an approach to improve weak-to-strong generalization by involving the reliability of weak supervision signals.
- Score: 22.754757518792395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are now rapidly advancing and surpassing human abilities on many natural language tasks. However, aligning these super-human LLMs with human knowledge remains challenging because the supervision signals from human annotators may be wrong. This issue, known as the "super-alignment" problem, requires enhancing weak-to-strong generalization, where a strong LLM must generalize from imperfect supervision provided by a weaker source. To address this issue, we propose an approach to improve weak-to-strong generalization by incorporating the reliability of weak supervision signals into the alignment process. In our method, we query the weak supervisor for multiple answers, estimate the answer reliability, and enhance the alignment process by filtering out uncertain data or re-weighting reliable data. Experiments on four datasets demonstrate that our methods effectively identify the quality of weak labels and significantly enhance weak-to-strong generalization. Our work presents effective techniques for error-robust model alignment, reducing error propagation from noisy supervision and enhancing the accuracy and reliability of LLMs. Code is publicly available at http://github.com/Irenehere/ReliableAlignment.
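The approach described above (query the weak supervisor several times, score each answer's reliability, then filter out uncertain items or re-weight reliable ones) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' released implementation; the `weak_model` callable, the entropy-based reliability score, and the `threshold` value are assumptions made for exposition.

```python
import math
from collections import Counter

def reliability_from_votes(answers):
    """Estimate reliability from repeated weak-supervisor answers.

    Returns 1 minus the normalized entropy of the empirical answer
    distribution: 1.0 when all sampled answers agree, 0.0 when they are
    maximally spread. Illustrative metric; the paper's estimator may differ.
    """
    counts = Counter(answers)
    if len(counts) == 1:
        return 1.0
    n = len(answers)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(len(counts))

def build_weak_labels(weak_model, questions, n_samples=5, threshold=0.6):
    """Query the weak supervisor several times per question, keep the
    majority answer, and drop (or down-weight) unreliable items."""
    dataset = []
    for q in questions:
        answers = [weak_model(q) for _ in range(n_samples)]  # repeated queries
        label = Counter(answers).most_common(1)[0][0]         # majority vote
        rel = reliability_from_votes(answers)
        if rel >= threshold:                                  # filtering variant
            dataset.append({"question": q, "label": label, "weight": rel})
    return dataset  # re-weighting variant: keep every item and use `weight` in the loss
```

The filtered (or weight-annotated) examples would then replace the raw weak labels when fine-tuning the strong model.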
Related papers
- Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning [10.752609242505953]
Traditional alignment methods rely on human feedback to fine-tune models.
Superhuman models, whose outputs may surpass human understanding, pose significant challenges.
Recent works use weak supervisors to elicit knowledge from much stronger models.
arXiv Detail & Related papers (2024-10-16T14:40:32Z)
- EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM? [28.43206274079919]
We propose an innovative approach to weak-to-strong (w2s) generalization.
We show that weak models trained on simpler tasks can collaboratively supervise stronger models on more complex tasks.
We observe an improvement of up to 14% over existing baselines, and average improvements of 5% and 4% for binary classification and generative tasks, respectively.
arXiv Detail & Related papers (2024-10-06T18:06:42Z)
- Your Weak LLM is Secretly a Strong Teacher for Alignment [19.33906256866585]
Existing alignment frameworks are constrained by either expensive human effort or high computational costs.
This paper explores a promising middle ground, where we employ a weak LLM that is significantly less resource-intensive than top-tier models.
We show that weak LLMs can provide feedback that rivals or even exceeds that of fully human-annotated data.
arXiv Detail & Related papers (2024-09-13T13:24:52Z)
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
- TRACE: TRansformer-based Attribution using Contrastive Embeddings in LLMs [50.259001311894295]
We propose a novel TRansformer-based Attribution framework using Contrastive Embeddings called TRACE.
We show that TRACE significantly improves the ability to attribute sources accurately, making it a valuable tool for enhancing the reliability and trustworthiness of large language models.
arXiv Detail & Related papers (2024-07-06T07:19:30Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label-smoothing value during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
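A minimal PyTorch sketch of this per-sample label-smoothing idea is given below, assuming a classification setting; the function name `ual_loss`, the `max_smoothing` cap, and the linear scaling with uncertainty are illustrative assumptions rather than the paper's released code.

```python
import torch.nn.functional as F

def ual_loss(logits, targets, uncertainties, max_smoothing=0.3):
    """Cross-entropy with label smoothing scaled per sample by uncertainty.

    Higher uncertainty means more smoothing, so the model commits less to
    possibly noisy labels. Illustrative only; the UAL schedule may differ.
    """
    log_probs = F.log_softmax(logits, dim=-1)                    # (batch, classes)
    nll = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # (batch,)
    uniform = -log_probs.mean(dim=-1)                            # smoothing target, (batch,)
    smoothing = max_smoothing * uncertainties.clamp(0.0, 1.0)    # (batch,)
    return ((1.0 - smoothing) * nll + smoothing * uniform).mean()
```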
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- A statistical framework for weak-to-strong generalization [38.55982453315567]
It is unclear whether it is possible to align (stronger) LLMs with superhuman capabilities with (weaker) human feedback without degrading their capabilities.
This is an instance of the weak-to-strong generalization problem: using weaker (less capable) feedback to train a stronger (more capable) model.
We prove that weak-to-strong generalization is possible by eliciting latent knowledge from pre-trained LLMs.
arXiv Detail & Related papers (2024-05-25T13:54:05Z)
- CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs.
Our studies reveal a new and universal safety vulnerability of large language models against code inputs.
We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z)
- A General Framework for Learning from Weak Supervision [93.89870459388185]
This paper introduces a general framework for learning from weak supervision (GLWS) with a novel algorithm.
Central to GLWS is an Expectation-Maximization (EM) formulation, adeptly accommodating various weak supervision sources.
We also present an advanced algorithm that significantly simplifies the computational demands of EM.
arXiv Detail & Related papers (2024-02-02T21:48:50Z)
- Generative Modeling Helps Weak Supervision (and Vice Versa) [87.62271390571837]
We propose a model fusing weak supervision and generative adversarial networks.
It captures discrete variables in the data alongside the label estimate derived from weak supervision.
It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels.
arXiv Detail & Related papers (2022-03-22T20:24:21Z)
- Fine-Tuning Pre-trained Language Model with Weak Supervision: A Contrastive-Regularized Self-Training Approach [46.76317056976196]
Fine-tuned pre-trained language models (LMs) have achieved enormous success in many natural language processing (NLP) tasks.
We study the problem of fine-tuning pre-trained LMs using only weak supervision, without any labeled data.
We develop a contrastive self-training framework, COSINE, to enable fine-tuning LMs with weak supervision.
arXiv Detail & Related papers (2020-10-15T15:55:08Z)