Related papers: Bayesian WeakS-to-Strong from Text Classification to Generation

URL: http://arxiv.org/abs/2406.03199v2
Date: Wed, 02 Oct 2024 08:45:32 GMT
Title: Bayesian WeakS-to-Strong from Text Classification to Generation
Authors: Ziyun Cui, Ziyang Zhang, Wen Wu, Guangzhi Sun, Chao Zhang,
Abstract summary: This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models which simulate the variability in human opinions. Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization. Results demonstrate the effectiveness of the proposed approach for the reliability of a strong student model, showing potential for superalignment.
Score: 14.897191979004782
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans will only be able to supervise them weakly. Weak-to-Strong mimics such a scenario where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models which simulate the variability in human opinions. Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization. Furthermore, we extend the application of WeakS-to-Strong from text classification tasks to text generation tasks where more advanced strategies are investigated for supervision. Moreover, direct preference optimization is applied to advance the student model's preference learning, beyond the basic learning framework of teacher forcing. Results demonstrate the effectiveness of the proposed approach for the reliability of a strong student model, showing potential for superalignment.

Related papers

Weak-to-Strong Generalization with Failure Trajectories: A Tree-based Approach to Elicit Optimal Policy in Strong Models [26.393123295267642]
Weak-to-Strong generalization (W2SG) is a new trend to elicit the full capabilities of a strong model with supervision from a weak model. We fine-tune a strong model with trajectories of intermediate actions generated by a weak model. Our empirical evaluations demonstrate substantial improvements in reasoning and decision-making capabilities across diverse task domains.
arXiv Detail & Related papers (2025-07-25T00:17:09Z)
Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning [0.42855555838080844]
This study investigates the spatial reasoning capabilities of vision-language models (VLMs) through Chain-of-Thought prompting and reinforcement learning. We find that simple CoT formats, where the model generates a reasoning step before the answer, can harm the model's original performance. In contrast, structured multi-stage prompting based on scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy.
arXiv Detail & Related papers (2025-07-06T10:51:12Z)
Debate Helps Weak-to-Strong Generalization [68.70065254564642]
We investigate ways of improving human supervision with a strong pretrained model and then supervise the strong model with enhanced weak human supervision. We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model. Experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment.
arXiv Detail & Related papers (2025-01-21T05:36:13Z)
Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning [10.752609242505953]
Traditional alignment methods rely on human feedback to fine-tune models. Superhuman models whose outputs may surpass human understanding pose significant challenges. Recent works use weak supervisors to elicit knowledge from much stronger models.
arXiv Detail & Related papers (2024-10-16T14:40:32Z)
Weak-to-Strong Reasoning [33.20094938292376]
We introduce a progressive learning framework that enables the strong model to autonomously refine its training data. Our method significantly enhances the reasoning capabilities of Llama2-70b using three separate weak models. This work paves the way for a more scalable and sophisticated strategy to enhance AI reasoning powers.
arXiv Detail & Related papers (2024-07-18T16:25:17Z)
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models [55.919653720979824]
This paper focuses on the concept of weak-to-strong generalization, which involves using a weaker model to supervise a stronger one. We introduce a novel and adaptively adjustable loss function for weak-to-strong supervision. Our approach not only exceeds the performance benchmarks set by strong-to-strong generalization but also surpasses the outcomes of fine-tuning strong models with whole datasets.
arXiv Detail & Related papers (2024-02-06T06:30:34Z)
Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning [21.401598876308345]
This paper presents a follow-up study to OpenAI's recent superalignment work on Weak-to-Strong Generalization (W2SG). Superalignment focuses on ensuring that high-level AI systems remain consistent with human values and intentions when dealing with complex, high-risk tasks. Our study simulates two phases of superalignment under the W2SG framework: the development of general superhuman models and the progression towards superintelligence.
arXiv Detail & Related papers (2024-02-01T15:30:19Z)
Learn from the Past: A Proxy Guided Adversarial Defense Framework with Self Distillation Regularization [53.04697800214848]
Adversarial Training (AT) is pivotal in fortifying the robustness of deep learning models. AT methods, relying on direct iterative updates for the target model's defense, frequently encounter obstacles such as unstable training and catastrophic overfitting. We present a general proxy guided defense framework, LAST (Learn from the Past).
arXiv Detail & Related papers (2023-10-19T13:13:41Z)
On the Robustness of Aspect-based Sentiment Analysis: Rethinking Model, Data, and Training [109.9218185711916]
Aspect-based sentiment analysis (ABSA) aims at automatically inferring the specific sentiment polarities toward certain aspects of products or services behind social media texts or reviews. We propose to enhance the ABSA robustness by systematically rethinking the bottlenecks from all possible angles, including model, data, and training.
arXiv Detail & Related papers (2023-04-19T11:07:43Z)
TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks. We propose a novel statistics-based approach, the Two-WIng NormliSation (TWINS) fine-tuning framework. TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks. We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain. We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
Clustering Effect of (Linearized) Adversarial Robust Models [60.25668525218051]
We propose a novel understanding of adversarial robustness and apply it to more tasks, including domain adaptation and robustness boosting. Experimental evaluations demonstrate the rationality and superiority of our proposed clustering strategy.
arXiv Detail & Related papers (2021-11-25T05:51:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.