A statistical framework for weak-to-strong generalization
- URL: http://arxiv.org/abs/2405.16236v1
- Date: Sat, 25 May 2024 13:54:05 GMT
- Title: A statistical framework for weak-to-strong generalization
- Authors: Seamus Somerstep, Felipe Maia Polo, Moulinath Banerjee, Ya'acov Ritov, Mikhail Yurochkin, Yuekai Sun
- Abstract summary: It is unclear whether (weaker) human feedback can be used to align (stronger) LLMs that have superhuman capabilities without degrading those capabilities.
This is an instance of the weak-to-strong generalization problem: using weaker (less capable) feedback to train a stronger (more capable) model.
We prove that weak-to-strong generalization is possible by eliciting latent knowledge from pre-trained LLMs.
- Score: 38.55982453315567
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern large language model (LLM) alignment techniques rely on human feedback, but it is unclear whether the techniques fundamentally limit the capabilities of aligned LLMs. In particular, it is unclear whether it is possible to align (stronger) LLMs with superhuman capabilities with (weaker) human feedback without degrading their capabilities. This is an instance of the weak-to-strong generalization problem: using weaker (less capable) feedback to train a stronger (more capable) model. We prove that weak-to-strong generalization is possible by eliciting latent knowledge from pre-trained LLMs. In particular, we cast the weak-to-strong generalization problem as a transfer learning problem in which we wish to transfer a latent concept from a weak model to a strong pre-trained model. We prove that a naive fine-tuning approach suffers from fundamental limitations, but an alternative refinement-based approach suggested by the problem structure provably overcomes the limitations of fine-tuning. Finally, we demonstrate the practical applicability of the refinement approach with three LLM alignment tasks.
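To make the two recipes concrete, here is a toy sketch of the contrast the abstract draws between naive fine-tuning and refinement. It is not the authors' implementation: every function below is a hypothetical stand-in for a real weak- or strong-model call.
```python
# Toy sketch, not the authors' implementation: every function is a
# hypothetical stand-in for a real weak- or strong-model call.

def weak_generate(prompt: str) -> str:
    """Stand-in for the weak supervisor's (possibly flawed) answer."""
    return f"[weak draft for: {prompt}]"

def strong_refine(prompt: str, draft: str) -> str:
    """Stand-in for the strong pre-trained model improving a weak draft,
    i.e., eliciting knowledge the strong model already has."""
    return f"[refinement of {draft!r} given {prompt!r}]"

def naive_finetuning_targets(prompts: list[str]) -> list[tuple[str, str]]:
    # Naive recipe: the strong model is trained to imitate weak labels
    # directly, so its quality is capped by the weak supervisor's.
    return [(p, weak_generate(p)) for p in prompts]

def refinement_targets(prompts: list[str]) -> list[tuple[str, str]]:
    # Refinement recipe: each weak draft is first improved by the strong
    # model, and the refined output becomes the training target.
    return [(p, strong_refine(p, weak_generate(p))) for p in prompts]

if __name__ == "__main__":
    prompts = ["Summarize the paper's main claim."]
    print(naive_finetuning_targets(prompts))
    print(refinement_targets(prompts))
```
The point of the refinement recipe is that the strong model's training targets are its own improved outputs rather than the raw weak labels, so its quality is no longer capped by the weak supervisor.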
Related papers
- Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification [76.14641982122696]
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control.
We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
arXiv Detail & Related papers (2024-10-07T23:38:58Z)
- Your Weak LLM is Secretly a Strong Teacher for Alignment [19.33906256866585]
Existing alignment frameworks are constrained by either expensive human effort or high computational cost.
This paper explores a promising middle ground, where we employ a weak LLM that is significantly less resource-intensive than top-tier models.
We show that weak LLMs can provide feedback that rivals or even exceeds that of fully human-annotated data.
arXiv Detail & Related papers (2024-09-13T13:24:52Z)
- Improving Weak-to-Strong Generalization with Reliability-Aware Alignment [22.754757518792395]
Large language models (LLMs) are rapidly advancing and surpassing human abilities on many natural language tasks.
"Super-alignment" problem requires enhancing weak-to-strong generalization.
We propose an approach to improve weak-to-strong generalization by involving the reliability of weak supervision signals.
arXiv Detail & Related papers (2024-06-27T09:37:34Z)
- Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization [68.62228569439478]
We investigate whether there is a problem of weak-to-strong deception, in which strong models mislead their weak supervisors.
We find that the deception intensifies as the capability gap between weak and strong models increases.
Our work highlights the urgent need to pay more attention to the true reliability of superalignment.
arXiv Detail & Related papers (2024-06-17T11:36:39Z)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models [55.919653720979824]
This paper focuses on the concept of weak-to-strong generalization, which involves using a weaker model to supervise a stronger one.
We introduce a novel and adaptively adjustable loss function for weak-to-strong supervision.
Our approach not only exceeds the performance benchmarks set by strong-to-strong generalization but also surpasses the outcomes of fine-tuning strong models with whole datasets.
arXiv Detail & Related papers (2024-02-06T06:30:34Z)
- A General Framework for Learning from Weak Supervision [93.89870459388185]
This paper introduces a general framework for learning from weak supervision (GLWS) with a novel algorithm.
Central to GLWS is an Expectation-Maximization (EM) formulation that accommodates various weak supervision sources.
We also present an advanced algorithm that significantly reduces the computational demands of EM (a toy EM of this flavor is sketched at the end of this list).
arXiv Detail & Related papers (2024-02-02T21:48:50Z)
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision [55.196139002977525]
Superhuman models will behave in complex ways too difficult for humans to reliably evaluate.
Can weak model supervision elicit the full capabilities of a much stronger model?
We find that strong pretrained models naively finetuned on labels generated by a weak model consistently outperform their weak supervisors (a minimal version of this setup is sketched below).
arXiv Detail & Related papers (2023-12-14T23:07:33Z)
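The last entry above describes the naive weak-to-strong recipe. The scikit-learn toy below is a sketch of that general setup under simple assumptions, not that paper's code: a small "weak" supervisor labels data for a higher-capacity "strong" student, and both are evaluated against ground truth.
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic task; ground-truth labels are held out for evaluation only.
X, y = make_classification(n_samples=4000, n_features=20, n_informative=5,
                           random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, train_size=200,
                                                  random_state=0)
X_train, X_test, _, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                              random_state=0)

# "Weak supervisor": a simple model trained on very little data.
weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)
weak_labels = weak.predict(X_train)  # imperfect supervision signal

# "Strong student": a higher-capacity model trained only on the weak
# labels; it never sees ground truth.
strong = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                       random_state=0).fit(X_train, weak_labels)

print("weak supervisor accuracy:", accuracy_score(y_test, weak.predict(X_test)))
print("strong student accuracy: ", accuracy_score(y_test, strong.predict(X_test)))
```
Whether the student actually exceeds its supervisor depends on the data and the models; the sketch only fixes the training recipe.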
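The GLWS entry above centers on an EM formulation. The following toy illustrates the general EM-for-weak-supervision idea rather than the GLWS algorithm itself: observed labels are treated as noisy views of a latent true label with a known flip rate, the E-step infers a posterior over the true label, and the M-step refits a logistic model on those soft labels.
```python
import numpy as np

rng = np.random.default_rng(0)
n, d, rho = 2000, 5, 0.3                     # rho: known label-flip rate
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
z = (X @ w_true > 0).astype(float)           # latent true labels
y_noisy = np.where(rng.random(n) < rho, 1 - z, z)  # observed weak labels

def sigmoid(u):
    # Clipping keeps np.exp well-behaved as the model grows confident.
    return 1.0 / (1.0 + np.exp(-np.clip(u, -30.0, 30.0)))

w = np.zeros(d)
lr = 1.0
for _ in range(30):                          # EM iterations
    # E-step: posterior q = P(z = 1 | x, y_noisy) under flip noise.
    p1 = sigmoid(X @ w)
    lik1 = np.where(y_noisy == 1, 1 - rho, rho)   # P(y_noisy | z = 1)
    lik0 = np.where(y_noisy == 0, 1 - rho, rho)   # P(y_noisy | z = 0)
    q = p1 * lik1 / (p1 * lik1 + (1 - p1) * lik0)
    # M-step: logistic regression on the soft labels q, by gradient
    # ascent on the expected complete-data log-likelihood.
    for _ in range(20):
        w += lr * (X.T @ (q - sigmoid(X @ w)) / n)

agreement = np.mean((sigmoid(X @ w) > 0.5) == z)
print(f"agreement with latent true labels: {agreement:.3f}")
```
In a framework like GLWS, richer weak-supervision sources (partial labels, constraints) would enter through the likelihood term, which here is simple flip noise.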