Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning
- URL: http://arxiv.org/abs/2410.12621v1
- Date: Wed, 16 Oct 2024 14:40:32 GMT
- Title: Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning
- Authors: Ruimeng Ye, Yang Xiao, Bo Hui
- Abstract summary: Traditional alignment methods rely on human feedback to fine-tune models.
Superhuman models whose outputs may surpass human understanding pose significant challenges.
Recent works use weak supervisors to elicit knowledge from much stronger models.
- Score: 10.752609242505953
- License:
- Abstract: As large language models (LLMs) continue to advance, ensuring their alignment with human values becomes increasingly critical. Traditional alignment methods rely heavily on human feedback to fine-tune models. With the emergence of superhuman models whose outputs may surpass human understanding, evaluating and aligning these models using human judgments poses significant challenges. To address these challenges, recent works use weak supervisors to elicit knowledge from much stronger models. However, there are important disanalogies between the empirical setup in existing works and the genuine goal of alignment. We note that existing works investigate the phenomenon of weak-to-strong generalization in an analogous setup (i.e., binary classification) rather than in practical alignment-relevant tasks (e.g., safety). In this paper, we bridge this gap by extending weak-to-strong generalization to the context of practical alignment. We empirically demonstrate the widespread phenomenon of weak-to-strong generalization in three complicated alignment tasks: safety, toxicity, and legal reasoning. Furthermore, we explore efficient strategies for improving alignment performance and enhancing the quality of model outputs. Lastly, we summarize and analyze the challenges and potential solutions for specific alignment tasks, which we hope will catalyze research progress on weak-to-strong generalization. Our code is released at https://github.com/yeruimeng/WTS.git.
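To ground the setup the paper extends, below is a minimal sketch of the standard weak-to-strong recipe on the analogous binary-classification task: a weak supervisor is trained on ground truth, its predictions label a held-out transfer set, and a strong student learns only from those weak labels. The dataset, the scikit-learn model families, and the feature restriction used to handicap the weak model are illustrative assumptions, not the paper's experimental choices.

```python
# Minimal weak-to-strong generalization sketch (binary classification).
# Illustrative stand-ins only: the weak supervisor is handicapped by
# seeing just a few features, simulating weak supervision.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=40,
                           n_informative=10, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_trans, X_test, y_trans, y_test = train_test_split(X_rest, y_rest, test_size=0.4, random_state=0)

# 1. Train the weak supervisor on ground-truth labels (few features).
weak = LogisticRegression(max_iter=1000).fit(X_weak[:, :5], y_weak)

# 2. The weak supervisor labels the held-out transfer set.
weak_labels = weak.predict(X_trans[:, :5])

# 3. Train the strong student on weak labels only (all features).
student = GradientBoostingClassifier(random_state=0).fit(X_trans, weak_labels)

# 4. Compare against the weak supervisor and the strong ceiling
#    (the same student architecture trained on ground truth).
ceiling = GradientBoostingClassifier(random_state=0).fit(X_trans, y_trans)
print("weak supervisor :", accuracy_score(y_test, weak.predict(X_test[:, :5])))
print("weak-to-strong  :", accuracy_score(y_test, student.predict(X_test)))
print("strong ceiling  :", accuracy_score(y_test, ceiling.predict(X_test)))
```

The usual summary statistic in this literature is the performance gap recovered, PGR = (student − weak) / (ceiling − weak); a value of 1 means the student fully recovers the strong ceiling despite only ever seeing weak labels.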
Related papers
- Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives [52.863024096759816]
Misaligned research objectives have hindered progress in adversarial robustness research over the past decade.
We argue that realigned objectives are necessary for meaningful progress in adversarial alignment.
arXiv Detail & Related papers (2025-02-17T15:28:40Z)
- Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage.
Models may behave unreliably due to poorly explored failure modes.
Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
- Debate Helps Weak-to-Strong Generalization [68.70065254564642]
We investigate ways of improving human supervision with a strong pretrained model, and then supervising the strong model with the enhanced weak human supervision.
We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model; a minimal sketch of such a protocol follows this entry.
Experiments on the OpenAI weak-to-strong NLP benchmarks show that the combined approach leads to better alignment.
arXiv Detail & Related papers (2025-01-21T05:36:13Z)
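As a rough illustration of the debate recipe summarized above, the sketch below has two copies of an untrusted strong model argue opposing answers while a weak model judges the transcript; the judge's verdicts can then serve as improved weak labels. Both `generate` and `weak_judge` are hypothetical stubs standing in for real model calls; the paper's actual prompts and scoring are not shown here.

```python
# Sketch of debate-assisted weak supervision. `generate` and `weak_judge`
# are hypothetical placeholders for strong-model and weak-model calls.
def generate(prompt: str) -> str:
    # Stand-in for a strong-model completion (e.g., an API call).
    return f"[argument: {prompt[:48]}...]"

def weak_judge(question: str, transcript: list[str]) -> str:
    # Stand-in for the weak model's verdict over the full transcript.
    # A real judge would decide which side's evidence it trusts.
    return "A" if len(transcript[0]) >= len(transcript[1]) else "B"

def debate(question: str, answer_a: str, answer_b: str, rounds: int = 2) -> str:
    transcript = []
    for _ in range(rounds):
        transcript.append(generate(f"Argue that '{answer_a}' answers: {question}"))
        transcript.append(generate(f"Argue that '{answer_b}' answers: {question}"))
    return answer_a if weak_judge(question, transcript) == "A" else answer_b

# The winning answers become (hopefully more trustworthy) weak labels
# for fine-tuning the strong model.
print(debate("Is this completion safe to show a user?", "safe", "unsafe"))
```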
- The Superalignment of Superhuman Intelligence with Large Language Models [63.96120398355404]
We discuss the concept of superalignment from a learning perspective.
We highlight some key research problems in superalignment, namely, weak-to-strong generalization, scalable oversight, and evaluation.
We present a conceptual framework for superalignment consisting of three modules: an attacker that generates adversarial queries to expose the weaknesses of a learner model; a learner that refines itself from scalable feedback produced by a critic model together with minimal human experts; and a critic that generates critiques or explanations for a given query-response pair, with the goal of improving the learner. A minimal skeleton of this loop appears after this entry.
arXiv Detail & Related papers (2024-12-15T10:34:06Z)
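A minimal skeleton of the attacker/learner/critic loop described in that entry, with all three modules as hypothetical stubs; the control flow, not the stub logic, is the point.

```python
# Skeleton of the attacker -> learner -> critic refinement loop.
# Each class is a placeholder for a model; none of this is the
# paper's implementation.
class Attacker:
    def query(self) -> str:
        # Would generate adversarial queries probing the learner's weaknesses.
        return "adversarial query"

class Learner:
    def respond(self, query: str) -> str:
        return f"response to: {query}"

    def refine(self, query: str, response: str, critique: str) -> None:
        # Would update the learner from the critique
        # (e.g., fine-tuning on critique-guided revisions).
        pass

class Critic:
    def critique(self, query: str, response: str) -> str:
        # Would explain what is wrong with the response; a small number of
        # human experts can audit critiques instead of every raw output.
        return f"critique of: {response}"

attacker, learner, critic = Attacker(), Learner(), Critic()
for _ in range(3):  # a few refinement rounds
    q = attacker.query()
    r = learner.respond(q)
    c = critic.critique(q, r)
    learner.refine(q, r, c)
```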
- Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization [68.62228569439478]
We investigate whether there exists an issue of weak-to-strong deception.
We find that the deception intensifies as the capability gap between weak and strong models increases.
Our work highlights the urgent need to pay more attention to the true reliability of superalignment.
arXiv Detail & Related papers (2024-06-17T11:36:39Z)
- Bayesian WeakS-to-Strong from Text Classification to Generation [14.897191979004782]
This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models that simulates the variability in human opinions.
Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization; a simplified ensemble-based sketch follows this entry.
Results demonstrate the effectiveness of the proposed approach in improving the reliability of a strong student model, showing potential for superalignment.
arXiv Detail & Related papers (2024-05-24T13:33:11Z)
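Below is a simplified stand-in for the WeakS-to-Strong idea: an ensemble of weak models trained on different bootstrap samples imitates annotator variability, an ensemble-averaged posterior stands in for the paper's Bayesian confidence estimate, and low-confidence weak labels are down-weighted when training the strong student. Models and data are illustrative assumptions.

```python
# WeakS-to-Strong sketch: ensemble of weak supervisors + confidence-
# weighted training of the strong student. Not the paper's exact
# Bayesian estimator; an averaged posterior is used as a stand-in.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=4000, n_features=30, random_state=1)
X_weak, y_weak, X_trans = X[:2000], y[:2000], X[2000:]

# Each weak model sees a different bootstrap sample (and few features),
# simulating the variability among human annotators.
ensemble = []
for seed in range(5):
    Xb, yb = resample(X_weak, y_weak, random_state=seed)
    ensemble.append(LogisticRegression(max_iter=1000).fit(Xb[:, :6], yb))

# Averaged posterior over the ensemble; confidence = max class probability.
probs = np.mean([m.predict_proba(X_trans[:, :6]) for m in ensemble], axis=0)
weak_labels = probs.argmax(axis=1)
confidence = probs.max(axis=1)

# Down-weight low-confidence weak labels when training the strong student.
student = GradientBoostingClassifier(random_state=0)
student.fit(X_trans, weak_labels, sample_weight=confidence)
```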
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models [55.919653720979824]
This paper focuses on the concept of weak-to-strong generalization, which involves using a weaker model to supervise a stronger one.
We introduce a novel, adaptively adjustable loss function for weak-to-strong supervision; a generic confidence-blended variant is sketched after this entry.
Our approach not only exceeds the performance benchmarks set by strong-to-strong generalization but also surpasses the outcomes of fine-tuning strong models with whole datasets.
arXiv Detail & Related papers (2024-02-06T06:30:34Z)
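The entry above does not spell out its loss, so the sketch below shows the canonical adaptive ingredient in this line of work: blend cross-entropy against the weak teacher's labels with cross-entropy against the student's own hardened predictions, shifting weight toward the student as training proceeds. This is a generic confidence-blended loss, assumed for illustration, not the exact function proposed in the paper.

```python
# Generic confidence-blended weak-to-strong loss (PyTorch). `alpha` can be
# scheduled upward during training so the student gradually trusts its own
# predictions more than the weak labels. Illustrative, not the paper's loss.
import torch
import torch.nn.functional as F

def weak_to_strong_loss(student_logits: torch.Tensor,
                        weak_labels: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    # Term 1: imitate the weak supervisor's labels.
    ce_weak = F.cross_entropy(student_logits, weak_labels)
    # Term 2: bootstrap on the student's own hardened predictions,
    # detached so the target carries no gradient.
    self_targets = student_logits.detach().argmax(dim=-1)
    ce_self = F.cross_entropy(student_logits, self_targets)
    return (1.0 - alpha) * ce_weak + alpha * ce_self

# Smoke test on random data.
logits = torch.randn(8, 2, requires_grad=True)
labels = torch.randint(0, 2, (8,))
weak_to_strong_loss(logits, labels).backward()
```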
This list is automatically generated from the titles and abstracts of the papers indexed on this site.