Improving Weak-to-Strong Generalization with Scalable Oversight and
Ensemble Learning
- URL: http://arxiv.org/abs/2402.00667v1
- Date: Thu, 1 Feb 2024 15:30:19 GMT
- Title: Improving Weak-to-Strong Generalization with Scalable Oversight and
Ensemble Learning
- Authors: Jitao Sang, Yuhang Wang, Jing Zhang, Yanxu Zhu, Chao Kong, Junhong Ye,
Shuyu Wei and Jinlin Xiao
- Abstract summary: This paper presents a follow-up study to OpenAI's recent superalignment work on Weak-to-Strong Generalization (W2SG).
Superalignment focuses on ensuring that high-level AI systems remain consistent with human values and intentions when dealing with complex, high-risk tasks.
Our study simulates two phases of superalignment under the W2SG framework: the development of general superhuman models and the progression towards superintelligence.
- Score: 21.401598876308345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a follow-up study to OpenAI's recent superalignment work
on Weak-to-Strong Generalization (W2SG). Superalignment focuses on ensuring
that high-level AI systems remain consistent with human values and intentions
when dealing with complex, high-risk tasks. The W2SG framework has opened new
possibilities for empirical research in this evolving field. Our study
simulates two phases of superalignment under the W2SG framework: the
development of general superhuman models and the progression towards
superintelligence. In the first phase, based on human supervision, the quality
of weak supervision is enhanced through a combination of scalable oversight and
ensemble learning, reducing the capability gap between weak teachers and strong
students. In the second phase, an automatic alignment evaluator is employed as
the weak supervisor. By recursively updating this auto aligner, the
capabilities of the weak teacher models are synchronously enhanced, achieving
weak-to-strong supervision over stronger student models. We also provide an
initial validation of the proposed approach for the first phase. Using the SciQ
task as an example, we explore ensemble learning for weak teacher models through
bagging and boosting. Scalable oversight is explored through two auxiliary
settings: human-AI interaction and AI-AI debate. Additionally, the paper
discusses the impact of improved weak supervision on enhancing weak-to-strong
generalization based on in-context learning. Experiment code and dataset will
be released at https://github.com/ADaM-BJTU/W2SG.
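To make the phase-one recipe concrete, here is a minimal sketch, assuming toy stand-ins for the weak teachers, of how bagging (majority vote) and a boosting-style weighted vote could combine weak labels on SciQ-style multiple-choice questions; the names and voting rules are illustrative, not the released W2SG code:

```python
from collections import Counter
from typing import Callable, List, Sequence

Teacher = Callable[[str, Sequence[str]], int]  # (question, choices) -> choice index

def bagging_label(teachers: List[Teacher], q: str, choices: Sequence[str]) -> int:
    """Bagging: unweighted majority vote over the weak teachers."""
    votes = Counter(t(q, choices) for t in teachers)
    return votes.most_common(1)[0][0]

def boosting_label(teachers: List[Teacher], weights: Sequence[float],
                   q: str, choices: Sequence[str]) -> int:
    """Boosting-style vote: weights might come from held-out teacher accuracy."""
    scores = [0.0] * len(choices)
    for t, w in zip(teachers, weights):
        scores[t(q, choices)] += w
    return max(range(len(choices)), key=scores.__getitem__)

# Toy teachers that each always pick a fixed option, for demonstration only.
teachers = [lambda q, c, i=i: i % 2 for i in range(5)]
question, choices = "Which gas do plants absorb?", ["CO2", "O2"]
print(bagging_label(teachers, question, choices))                              # 0
print(boosting_label(teachers, [0.1, 0.9, 0.1, 0.9, 0.1], question, choices))  # 1
```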
Related papers
- Reinforcement Learning under Latent Dynamics: Toward Statistical and Algorithmic Modularity [51.40558987254471]
Real-world applications of reinforcement learning often involve environments where agents operate on complex, high-dimensional observations.
This paper addresses the question of reinforcement learning under $\textit{general}$ latent dynamics from a statistical and algorithmic perspective.
arXiv Detail & Related papers (2024-10-23T14:22:49Z)
- Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning [10.752609242505953]
Traditional alignment methods rely on human feedback to fine-tune models.
Superhuman models, whose outputs may surpass human understanding, pose significant challenges.
Recent works use weak supervisors to elicit knowledge from much stronger models.
arXiv Detail & Related papers (2024-10-16T14:40:32Z)
- EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM? [28.43206274079919]
We propose an innovative approach to weak-to-strong (w2s) generalization.
We show that weak models trained on simpler tasks can collaboratively supervise stronger models on more complex tasks.
We observe an improvement of up to 14% over existing baselines, and average improvements of 5% and 4% on binary classification and generative tasks, respectively.
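A hedged sketch of the boosting flavor of this idea, assuming AdaBoost-style weights derived from each weak model's validation error (an illustration in the spirit of the abstract, not the paper's implementation):

```python
import math

def adaboost_weights(error_rates, eps=1e-6):
    """alpha = 0.5 * ln((1 - err) / err); lower error -> larger vote weight."""
    weights = []
    for e in error_rates:
        e = min(max(e, eps), 1 - eps)  # keep the log well-defined
        weights.append(0.5 * math.log((1 - e) / e))
    return weights

print(adaboost_weights([0.1, 0.3, 0.45]))  # better weak models dominate the vote
```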
arXiv Detail & Related papers (2024-10-06T18:06:42Z)
- Bayesian WeakS-to-Strong from Text Classification to Generation [14.897191979004782]
This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models which simulate the variability in human opinions.
Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization.
Results demonstrate the effectiveness of the proposed approach for the reliability of a strong student model, showing potential for superalignment.
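A minimal illustration of confidence estimation in this Bayesian spirit, assuming each weak model's reliability gets a Beta posterior from validation hits and misses (the conjugate model and names are assumptions, not the paper's exact method):

```python
def beta_posterior_mean(hits: int, misses: int, a: float = 1.0, b: float = 1.0) -> float:
    """Posterior mean of a Bernoulli reliability under a Beta(a, b) prior."""
    return (hits + a) / (hits + misses + a + b)

# Three hypothetical weak models with different validation records:
records = [(90, 10), (60, 40), (75, 25)]
weights = [beta_posterior_mean(h, m) for h, m in records]
print(weights)  # more reliable weak models receive larger vote weights
```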
arXiv Detail & Related papers (2024-05-24T13:33:11Z)
- Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts [81.37287967870589]
We propose to harness a diverse set of specialized teachers, rather than a single generalist one, to collectively supervise the strong student.
Our approach resembles the classical hierarchical mixture of experts, with two components tailored for co-supervision.
We validate the proposed method through visual recognition tasks on the OpenAI weak-to-strong benchmark and additional multi-domain datasets.
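A rough sketch of gate-weighted co-supervision in the mixture-of-experts spirit described above; the shapes, gating rule, and soft-target construction are assumptions for illustration, not the paper's architecture:

```python
import numpy as np

def moe_target(x, teacher_preds, domain_protos):
    """teacher_preds: (T, C) teacher label distributions; domain_protos: (T, D)."""
    logits = domain_protos @ x            # how well x matches each teacher's domain
    gate = np.exp(logits - logits.max())  # softmax gate over teachers
    gate /= gate.sum()
    return gate @ teacher_preds           # (C,) soft target for the strong student

rng = np.random.default_rng(0)
x = rng.normal(size=8)
print(moe_target(x, rng.dirichlet(np.ones(4), size=3), rng.normal(size=(3, 8))))
```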
arXiv Detail & Related papers (2024-02-23T18:56:11Z)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models [55.919653720979824]
This paper focuses on the concept of weak-to-strong generalization, which involves using a weaker model to supervise a stronger one.
We introduce a novel and adaptively adjustable loss function for weak-to-strong supervision.
Our approach not only exceeds the performance benchmarks set by strong-to-strong generalization but also surpasses the outcomes of fine-tuning strong models with whole datasets.
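The abstract does not give the loss itself; the sketch below is an assumption in the spirit of confidence-weighted weak-to-strong losses, where the target interpolates between the weak label and the student's own hardened prediction with an annealed mixing weight:

```python
import numpy as np

def adaptive_w2s_loss(student_probs, weak_label, step, total_steps, c_max=0.75):
    c = c_max * min(step / total_steps, 1.0)  # trust the student more over time
    hard_self = np.eye(len(student_probs))[student_probs.argmax()]
    target = (1 - c) * weak_label + c * hard_self
    return -(target * np.log(student_probs + 1e-12)).sum()  # cross-entropy

p = np.array([0.7, 0.2, 0.1])
y_weak = np.array([0.0, 1.0, 0.0])  # weak teacher disagrees with the student
print(adaptive_w2s_loss(p, y_weak, step=500, total_steps=1000))
```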
arXiv Detail & Related papers (2024-02-06T06:30:34Z)
- A General Framework for Learning from Weak Supervision [93.89870459388185]
This paper introduces a general framework for learning from weak supervision (GLWS) with a novel algorithm.
Central to GLWS is an Expectation-Maximization (EM) formulation, adeptly accommodating various weak supervision sources.
We also present an advanced algorithm that significantly simplifies the EM computational demands.
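A minimal EM sketch for one common weak-supervision source, partial (candidate-set) labels; the variable names and prior-only M-step are simplifications for illustration, not the GLWS algorithm itself:

```python
import numpy as np

def em_partial_labels(candidate_sets, n_classes, iters=20):
    prior = np.full(n_classes, 1.0 / n_classes)
    for _ in range(iters):
        post = []
        for s in candidate_sets:        # E-step: posterior restricted to candidates
            idx = list(s)
            p = np.zeros(n_classes)
            p[idx] = prior[idx]
            post.append(p / p.sum())
        prior = np.mean(post, axis=0)   # M-step: re-estimate class priors only
    return prior

print(em_partial_labels([{0, 1}, {1}, {1, 2}, {1, 2}], n_classes=3))
```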
arXiv Detail & Related papers (2024-02-02T21:48:50Z)
- TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, the Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
- Learning Dynamics and Generalization in Reinforcement Learning [59.530058000689884]
We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training.
We show that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods.
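A toy tabular TD(0) update illustrating the learning dynamics the paper analyzes (a sketch, not its experiments): the value estimate is pulled toward a bootstrapped target, which can fit sharp, non-smooth value components early in training:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {0: 0.0, 1: 0.0}
for _ in range(100):
    td0_update(V, s=0, r=1.0, s_next=1)  # repeatedly observe the same transition
print(V[0])  # approaches 1.0 while V[1] stays 0
```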
arXiv Detail & Related papers (2022-06-05T08:49:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.