Improving Weak-to-Strong Generalization with Scalable Oversight and
Ensemble Learning
- URL: http://arxiv.org/abs/2402.00667v1
- Date: Thu, 1 Feb 2024 15:30:19 GMT
- Title: Improving Weak-to-Strong Generalization with Scalable Oversight and
Ensemble Learning
- Authors: Jitao Sang, Yuhang Wang, Jing Zhang, Yanxu Zhu, Chao Kong, Junhong Ye,
Shuyu Wei and Jinlin Xiao
- Abstract summary: This paper presents a follow-up study to OpenAI's recent superalignment work on Weak-to-Strong Generalization (W2SG)
Superalignment focuses on ensuring that high-level AI systems remain consistent with human values and intentions when dealing with complex, high-risk tasks.
Our study simulates two phases of superalignment under the W2SG framework: the development of general superhuman models and the progression towards superintelligence.
- Score: 21.401598876308345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a follow-up study to OpenAI's recent superalignment work
on Weak-to-Strong Generalization (W2SG). Superalignment focuses on ensuring
that high-level AI systems remain consistent with human values and intentions
when dealing with complex, high-risk tasks. The W2SG framework has opened new
possibilities for empirical research in this evolving field. Our study
simulates two phases of superalignment under the W2SG framework: the
development of general superhuman models and the progression towards
superintelligence. In the first phase, based on human supervision, the quality
of weak supervision is enhanced through a combination of scalable oversight and
ensemble learning, reducing the capability gap between weak teachers and strong
students. In the second phase, an automatic alignment evaluator is employed as
the weak supervisor. By recursively updating this auto aligner, the
capabilities of the weak teacher models are synchronously enhanced, achieving
weak-to-strong supervision over stronger student models. We also provide an
initial validation of the proposed approach for the first phase. Using the SciQ
task as an example, we explore ensemble learning for weak teacher models through
bagging and boosting. Scalable oversight is explored through two auxiliary
settings: human-AI interaction and AI-AI debate. Additionally, the paper
discusses the impact of improved weak supervision on enhancing weak-to-strong
generalization based on in-context learning. Experiment code and dataset will
be released at https://github.com/ADaM-BJTU/W2SG.
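The first-phase idea above (ensembling weak teachers via bagging to improve the quality of weak supervision before training the strong student) can be sketched as follows. This is a hypothetical illustration, not the paper's released code: the toy SciQ-style task, the `make_weak_teacher` stand-in, and the 30% error rate are all assumptions made for demonstration.

```python
# Sketch (assumption, not the paper's implementation): bagging weak teachers
# by majority vote to produce higher-quality weak labels for a strong student.
import random
from collections import Counter

def make_weak_teacher(true_labels, error_rate, seed):
    """Stand-in for a weak fine-tuned model: answers correctly most of the
    time, but returns a random label with probability error_rate."""
    rng = random.Random(seed)
    choices = sorted(set(true_labels.values()))
    def predict(question):
        if rng.random() < error_rate:
            return rng.choice(choices)
        return true_labels[question]
    return predict

def bagged_weak_labels(teachers, questions):
    """Ensemble by majority vote: the bagged weak label for each question is
    the most common answer across the weak teachers."""
    return {q: Counter(t(q) for t in teachers).most_common(1)[0][0]
            for q in questions}

# Toy SciQ-style task: three questions, seven weak teachers at 30% error.
truth = {"q1": "A", "q2": "B", "q3": "C"}
teachers = [make_weak_teacher(truth, error_rate=0.3, seed=s) for s in range(7)]
labels = bagged_weak_labels(teachers, list(truth))
```

The ensembled labels would then replace the single weak teacher's labels when fine-tuning the strong student; boosting would instead reweight training items that earlier teachers mislabel.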
Related papers
- Bayesian WeakS-to-Strong from Text Classification to Generation [14.897191979004782]
This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models which simulate the variability in human opinions.
Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization.
Results demonstrate the effectiveness of the proposed approach for the reliability of a strong student model, showing potential for superalignment.
arXiv Detail & Related papers (2024-05-24T13:33:11Z) - Co-Supervised Learning: Improving Weak-to-Strong Generalization with
Hierarchical Mixture of Experts [81.37287967870589]
We propose to harness a diverse set of specialized teachers, instead of a single generalist one, to collectively supervise the strong student.
Our approach resembles the classical hierarchical mixture of experts, with two components tailored for co-supervision.
We validate the proposed method through visual recognition tasks on the OpenAI weak-to-strong benchmark and additional multi-domain datasets.
arXiv Detail & Related papers (2024-02-23T18:56:11Z) - Vision Superalignment: Weak-to-Strong Generalization for Vision
Foundation Models [55.919653720979824]
This paper focuses on the concept of weak-to-strong generalization, which involves using a weaker model to supervise a stronger one.
We introduce a novel and adaptively adjustable loss function for weak-to-strong supervision.
Our approach not only exceeds the performance benchmarks set by strong-to-strong generalization but also surpasses the outcomes of fine-tuning strong models with whole datasets.
arXiv Detail & Related papers (2024-02-06T06:30:34Z) - A General Framework for Learning from Weak Supervision [93.89870459388185]
This paper introduces a general framework for learning from weak supervision (GLWS) with a novel algorithm.
Central to GLWS is an Expectation-Maximization (EM) formulation, adeptly accommodating various weak supervision sources.
We also present an advanced algorithm that significantly simplifies the EM computational demands.
arXiv Detail & Related papers (2024-02-02T21:48:50Z) - Artificial Intelligence and Dual Contract [2.1756081703276]
We develop a model where two principals, each equipped with independent Q-learning algorithms, interact with a single agent.
Our findings reveal that the strategic behavior of AI principals hinges crucially on the alignment of their profits.
arXiv Detail & Related papers (2023-03-22T07:31:44Z) - TWINS: A Fine-Tuning Framework for Improved Transferability of
Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z) - Learning Dynamics and Generalization in Reinforcement Learning [59.530058000689884]
We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training.
We show that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods.
arXiv Detail & Related papers (2022-06-05T08:49:16Z) - Alleviating Robust Overfitting of Adversarial Training With Consistency
Regularization [9.686724616328874]
Adversarial training (AT) has proven to be one of the most effective ways to defend Deep Neural Networks (DNNs) against adversarial attacks.
However, robust overfitting, in which robustness drops sharply at a certain training stage, always exists during AT.
Consistency regularization, a popular technique in semi-supervised learning, has a similar goal as AT and can be used to alleviate robust overfitting.
arXiv Detail & Related papers (2022-05-24T03:18:43Z) - Weakly Supervised Semantic Segmentation via Alternative Self-Dual
Teaching [82.71578668091914]
This paper establishes a compact learning framework that embeds the classification and mask-refinement components into a unified deep model.
We propose a novel alternative self-dual teaching (ASDT) mechanism to encourage high-quality knowledge interaction.
arXiv Detail & Related papers (2021-12-17T11:56:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.