Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak
Supervision
- URL: http://arxiv.org/abs/2312.09390v1
- Date: Thu, 14 Dec 2023 23:07:33 GMT
- Title: Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak
Supervision
- Authors: Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo
Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan
Leike, Ilya Sutskever, Jeff Wu
- Abstract summary: Superhuman models will behave in complex ways too difficult for humans to reliably evaluate.
Can weak model supervision elicit the full capabilities of a much stronger model?
We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors.
- Score: 55.196139002977525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Widely used alignment techniques, such as reinforcement learning from human
feedback (RLHF), rely on the ability of humans to supervise model behavior -
for example, to evaluate whether a model faithfully followed instructions or
generated safe outputs. However, future superhuman models will behave in
complex ways too difficult for humans to reliably evaluate; humans will only be
able to weakly supervise superhuman models. We study an analogy to this
problem: can weak model supervision elicit the full capabilities of a much
stronger model? We test this using a range of pretrained language models in the
GPT-4 family on natural language processing (NLP), chess, and reward modeling
tasks. We find that when we naively finetune strong pretrained models on labels
generated by a weak model, they consistently perform better than their weak
supervisors, a phenomenon we call weak-to-strong generalization. However, we
are still far from recovering the full capabilities of strong models with naive
finetuning alone, suggesting that techniques like RLHF may scale poorly to
superhuman models without further work. We find that simple methods can often
significantly improve weak-to-strong generalization: for example, when
finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence
loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our
results suggest that it is feasible to make empirical progress today on a
fundamental challenge of aligning superhuman models.
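As a concrete illustration of the auxiliary confidence loss mentioned in the abstract, the sketch below mixes a cross-entropy term against the weak supervisor's labels with a cross-entropy term against the strong model's own hardened predictions, letting the strong model disagree with its weak supervisor when it is confident. This is a minimal sketch under assumed details (a classification setup, hard argmax self-targets rather than a confidence threshold, and the placeholder names confidence_weighted_loss and alpha), not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(strong_logits, weak_labels, alpha=0.5):
    """Auxiliary-confidence-style loss (illustrative sketch only).

    strong_logits: (batch, num_classes) logits from the strong student model.
    weak_labels:   (batch,) hard class labels produced by the weak supervisor.
    alpha:         weight on the self-confidence term (assumed hyperparameter).
    """
    # Standard term: imitate the weak supervisor's labels.
    weak_ce = F.cross_entropy(strong_logits, weak_labels)

    # Auxiliary term: reinforce the strong model's own hardened predictions.
    with torch.no_grad():
        self_targets = strong_logits.argmax(dim=-1)
    self_ce = F.cross_entropy(strong_logits, self_targets)

    return (1 - alpha) * weak_ce + alpha * self_ce


# Toy usage on random data.
if __name__ == "__main__":
    logits = torch.randn(8, 2, requires_grad=True)
    weak = torch.randint(0, 2, (8,))
    loss = confidence_weighted_loss(logits, weak, alpha=0.75)
    loss.backward()
    print(float(loss))
```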
Related papers
- Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization [68.62228569439478]
We investigate whether weak-to-strong deception occurs, i.e., whether the strong model misleads its weak supervisor.
We find that the deception intensifies as the capability gap between weak and strong models increases.
Our work highlights the urgent need to pay more attention to the true reliability of superalignment.
arXiv Detail & Related papers (2024-06-17T11:36:39Z) - A statistical framework for weak-to-strong generalization [38.55982453315567]
It is unclear whether (stronger) LLMs with superhuman capabilities can be aligned using (weaker) human feedback without degrading those capabilities.
This is an instance of the weak-to-strong generalization problem: using weaker (less capable) feedback to train a stronger (more capable) model.
We prove that weak-to-strong generalization is possible by eliciting latent knowledge from pre-trained LLMs.
arXiv Detail & Related papers (2024-05-25T13:54:05Z) - Quantifying the Gain in Weak-to-Strong Generalization [14.453654853392619]
We show that the improvement in performance achieved by strong models over their weaker counterparts is quantified by the misfit error incurred by the strong model on labels generated by the weaker model.
For instance, the misfit error lets us predict how much the strong model will improve over the weak model, and lets us choose among different weak models for training the strong model.
arXiv Detail & Related papers (2024-05-24T00:14:16Z) - Vision Superalignment: Weak-to-Strong Generalization for Vision
Foundation Models [55.919653720979824]
This paper focuses on the concept of weak-to-strong generalization, which involves using a weaker model to supervise a stronger one.
We introduce a novel and adaptively adjustable loss function for weak-to-strong supervision.
Our approach not only exceeds the performance benchmarks set by strong-to-strong generalization but also surpasses the outcomes of fine-tuning strong models with whole datasets.
arXiv Detail & Related papers (2024-02-06T06:30:34Z) - Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness [52.9493817508055]
We propose Pre-trained Model Guided Adversarial Fine-Tuning (PMG-AFT) to enhance the model's zero-shot adversarial robustness.
Our approach consistently improves clean accuracy by an average of 8.72%.
arXiv Detail & Related papers (2024-01-09T04:33:03Z) - SuperHF: Supervised Iterative Learning from Human Feedback [20.22920163075946]
We focus on two prevalent methods used to align large language models: Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).
We propose a novel approach, Supervised Iterative Learning from Human Feedback (SuperHF), which seeks to leverage the strengths of both methods.
Our experimental results show that SuperHF exceeds PPO-based RLHF on the training objective, trades off high reward against reward hacking easily and favorably, improves downstream calibration, and performs on par on our GPT-4-based qualitative evaluation scheme, all while being significantly simpler to implement.
arXiv Detail & Related papers (2023-10-25T16:52:00Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods; a sketch of the DPO objective appears after this list.
arXiv Detail & Related papers (2023-05-29T17:57:46Z) - Your Autoregressive Generative Model Can be Better If You Treat It as an
Energy-Based One [83.5162421521224]
We propose a unique method termed E-ARM for training autoregressive generative models.
E-ARM takes advantage of a well-designed energy-based learning objective.
We show that E-ARM can be trained efficiently and is capable of alleviating the exposure bias problem.
arXiv Detail & Related papers (2022-06-26T10:58:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.