Improving Weak-to-Strong Generalization with Reliability-Aware Alignment
- URL: http://arxiv.org/abs/2406.19032v1
- Date: Thu, 27 Jun 2024 09:37:34 GMT
- Title: Improving Weak-to-Strong Generalization with Reliability-Aware Alignment
- Authors: Yue Guo, Yi Yang,
- Abstract summary: Large language models (LLMs) are rapidly advancing and surpassing human abilities on many natural language tasks.
"Super-alignment" problem requires enhancing weak-to-strong generalization.
We propose an approach to improve weak-to-strong generalization by involving the reliability of weak supervision signals.
- Score: 22.754757518792395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are now rapidly advancing and surpassing human abilities on many natural language tasks. However, aligning these super-human LLMs with human knowledge remains challenging because the supervision signals from human annotators may be wrong. This issue, known as the "super-alignment" problem, requires enhancing weak-to-strong generalization, where a strong LLM must generalize from imperfect supervision provided by a weaker source. To address this issue, we propose an approach to improve weak-to-strong generalization by involving the reliability of weak supervision signals in the alignment process. In our method, we query the weak supervisor for multiple answers, estimate the answer reliability, and enhance the alignment process by filtering out uncertain data or re-weighting reliable data. Experiments on four datasets demonstrate that our methods effectively identify the quality of weak labels and significantly enhance weak-to-strong generalization. Our work presents effective techniques for error-robust model alignment, reducing error propagation from noisy supervision and enhancing the accuracy and reliability of LLMs. Codes are publicly available at http://github.com/Irenehere/ReliableAlignment.
Related papers
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - TRACE: TRansformer-based Attribution using Contrastive Embeddings in LLMs [50.259001311894295]
We propose a novel TRansformer-based Attribution framework using Contrastive Embeddings called TRACE.
We show that TRACE significantly improves the ability to attribute sources accurately, making it a valuable tool for enhancing the reliability and trustworthiness of large language models.
arXiv Detail & Related papers (2024-07-06T07:19:30Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - A statistical framework for weak-to-strong generalization [38.55982453315567]
It is unclear whether it is possible to align (stronger) LLMs with superhuman capabilities with (weaker) human feedback without degrading their capabilities.
This is an instance of the weak-to-strong generalization problem: using weaker (less capable) feedback to train a stronger (more capable) model.
We prove that weak-to-strong generalization is possible by eliciting latent knowledge from pre-trained LLMs.
arXiv Detail & Related papers (2024-05-25T13:54:05Z) - CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs.
Our studies reveal a new and universal safety vulnerability of these models against code input.
We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z) - A General Framework for Learning from Weak Supervision [93.89870459388185]
This paper introduces a general framework for learning from weak supervision (GLWS) with a novel algorithm.
Central to GLWS is an Expectation-Maximization (EM) formulation, adeptly accommodating various weak supervision sources.
We also present an advanced algorithm that significantly simplifies the EM computational demands.
arXiv Detail & Related papers (2024-02-02T21:48:50Z) - Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities [12.82645410161464]
Large Language Models (LLMs) have demonstrated remarkable performance on code-related tasks.
We evaluate whether pre-trained LLMs can detect security vulnerabilities and address the limitations of existing tools.
arXiv Detail & Related papers (2023-11-16T13:17:20Z) - Large Language Model-Powered Smart Contract Vulnerability Detection: New
Perspectives [8.524720028421447]
This paper provides a systematic analysis of the opportunities, challenges, and potential solutions of harnessing Large Language Models (LLMs) such as GPT-4.
generating more answers with higher randomness largely boosts the likelihood of producing a correct answer but inevitably leads to a higher number of false positives.
We propose an adversarial framework dubbed GPTLens that breaks the conventional one-stage detection into two synergistic stages $-$ generation and discrimination.
arXiv Detail & Related papers (2023-10-02T12:37:23Z) - Detecting Misinformation with LLM-Predicted Credibility Signals and Weak
Supervision [5.348343219992815]
This paper investigates whether language models (LLMs) can be prompted effectively with a set of 18 credibility signals to produce weak labels for each signal.
We then aggregate these potentially noisy labels using weak supervision to predict content veracity.
We demonstrate that our approach, which combines zero-shot credibility signal labeling and weak supervision, outperforms state-of-the-art LLMs on two datasets.
arXiv Detail & Related papers (2023-09-14T11:06:51Z) - Generative Modeling Helps Weak Supervision (and Vice Versa) [87.62271390571837]
We propose a model fusing weak supervision and generative adversarial networks.
It captures discrete variables in the data alongside the weak supervision derived label estimate.
It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels.
arXiv Detail & Related papers (2022-03-22T20:24:21Z) - Fine-Tuning Pre-trained Language Model with Weak Supervision: A
Contrastive-Regularized Self-Training Approach [46.76317056976196]
Fine-tuned pre-trained language models (LMs) have achieved enormous success in many natural language processing (NLP) tasks.
We study the problem of fine-tuning pre-trained LMs using only weak supervision, without any labeled data.
We develop a contrastive self-training framework, COSINE, to enable fine-tuning LMs with weak supervision.
arXiv Detail & Related papers (2020-10-15T15:55:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.