Debating with More Persuasive LLMs Leads to More Truthful Answers
- URL: http://arxiv.org/abs/2402.06782v4
- Date: Thu, 25 Jul 2024 23:32:21 GMT
- Title: Debating with More Persuasive LLMs Leads to More Truthful Answers
- Authors: Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez
- Abstract summary: We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively.
Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.
- Score: 45.0343254517401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this question in an analogous setting, where stronger models (experts) possess the necessary information to answer questions and weaker models (non-experts) lack this information. The method we evaluate is debate, where two LLM experts each argue for a different answer, and a non-expert selects the answer. We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively (naive baselines obtain 48% and 60%). Furthermore, optimising expert debaters for persuasiveness in an unsupervised manner improves non-expert ability to identify the truth in debates. Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.
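To make the evaluated protocol concrete, below is a minimal sketch of the debate setup described in the abstract: two information-rich experts argue for opposing answers over several rounds, and a weaker judge, who never sees the underlying source material, picks a winner. The `expert_a`, `expert_b`, and `judge` callables, the prompt wording, and the round count are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a two-expert debate judged by a weaker model.
# The three callables are assumed wrappers around any chat-completion API;
# prompts and the number of rounds are illustrative, not the paper's exact setup.
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # maps a prompt string to a text reply

def run_debate(question: str,
               answer_a: str,
               answer_b: str,
               expert_a: Model,
               expert_b: Model,
               judge: Model,
               n_rounds: int = 3) -> Tuple[str, List[str]]:
    """Two experts argue for opposing answers; a non-expert judge picks one."""
    transcript: List[str] = []
    for round_idx in range(n_rounds):
        for name, model, answer in (("A", expert_a, answer_a),
                                    ("B", expert_b, answer_b)):
            prompt = (f"Question: {question}\n"
                      f"You are debater {name}, arguing that the answer is: {answer}\n"
                      "Debate so far:\n" + "\n".join(transcript) +
                      f"\nWrite your argument for round {round_idx + 1}.")
            transcript.append(f"Debater {name}: {model(prompt)}")
    # The judge sees only the question, the candidate answers, and the debate
    # transcript -- not the source text the experts have access to.
    verdict_prompt = (f"Question: {question}\n"
                      f"Debater A argues for: {answer_a}\n"
                      f"Debater B argues for: {answer_b}\n"
                      "Transcript:\n" + "\n".join(transcript) +
                      "\nWhich answer is correct? Reply with exactly 'A' or 'B'.")
    verdict = judge(verdict_prompt).strip()
    chosen = answer_a if verdict.startswith("A") else answer_b
    return chosen, transcript
```

Per the abstract, optimising the expert debaters for persuasiveness, without any ground-truth labels, is what improves the non-expert judge's accuracy in this setup.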
Related papers
- Debate Helps Weak-to-Strong Generalization [68.70065254564642]
We investigate ways of improving weak human supervision with a strong pretrained model and then supervising the strong model with the enhanced weak supervision.
We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model.
Experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment.
arXiv Detail & Related papers (2025-01-21T05:36:13Z)
- The Superalignment of Superhuman Intelligence with Large Language Models [63.96120398355404]
We discuss the concept of superalignment from the learning perspective to answer this question.
We highlight some key research problems in superalignment, namely, weak-to-strong generalization, scalable oversight, and evaluation.
We present a conceptual framework for superalignment consisting of three modules: an attacker, which generates adversarial queries that aim to expose weaknesses of a learner model; a learner, which refines itself by learning from scalable feedback produced by a critic model together with minimal input from human experts; and a critic, which generates critiques or explanations for a given query-response pair with the goal of improving the learner.
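As a rough, assumption-laden illustration of that three-module loop (not the cited paper's implementation), the sketch below treats the attacker, learner, and critic as generic text-generating callables and the `refine` step as a hypothetical update rule:

```python
# Hedged sketch of one attacker/learner/critic round as summarised above.
# All callables and the `refine` step are hypothetical stand-ins.
from typing import Callable, List, Tuple

Generate = Callable[[str], str]        # prompt -> text output
Critique = Callable[[str, str], str]   # (query, response) -> critique

def superalignment_round(attacker: Generate,
                         learner: Generate,
                         critic: Critique,
                         refine: Callable[[Generate, List[Tuple[str, str, str]]], Generate],
                         n_queries: int = 8) -> Generate:
    """One round: attack, respond, criticise, then refine the learner."""
    feedback: List[Tuple[str, str, str]] = []
    for i in range(n_queries):
        query = attacker(f"Propose adversarial query #{i} that exposes a weakness.")
        response = learner(query)
        critique = critic(query, response)      # scalable feedback signal
        feedback.append((query, response, critique))
    return refine(learner, feedback)            # e.g. fine-tune on the critiques
```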
arXiv Detail & Related papers (2024-12-15T10:34:06Z)
- Teaching Models to Balance Resisting and Accepting Persuasion [69.68379406317682]
We show that Persuasion-Balanced Training (or PBT) can balance positive and negative persuasion.
PBT allows us to use data generated from dialogues between smaller 7-8B models for training much larger 70B models.
We find that PBT leads to better and more stable results and less order dependence.
arXiv Detail & Related papers (2024-10-18T16:49:36Z)
- Training Language Models to Win Debates with Self-Play Improves Judge Accuracy [8.13173791334223]
We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play.
We find that language model based evaluators answer questions more accurately when judging models optimized to win debates.
arXiv Detail & Related papers (2024-09-25T05:28:33Z)
- On scalable oversight with weak LLMs judging strong LLMs [67.8628575615614]
We study debate, where two AIs compete to convince a judge, and consultancy, where a single AI tries to convince a judge that asks questions.
We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models.
arXiv Detail & Related papers (2024-07-05T16:29:15Z)
- Debate Helps Supervise Unreliable Experts [33.03555781137954]
We show that debate between two unreliable experts can help a non-expert judge more reliably identify the truth.
Compared to a baseline we call consultancy, in which a single expert argues for only one answer that is correct half of the time, debate performs significantly better.
These results show that debate is a promising approach for supervising increasingly capable but potentially unreliable AI systems.
arXiv Detail & Related papers (2023-11-15T05:05:40Z)
- Ask an Expert: Leveraging Language Models to Improve Strategic Reasoning in Goal-Oriented Dialogue Models [15.476899850339395]
We propose the "Ask an Expert" framework in which the model is trained with access to an "expert" which it can consult at each turn.
Advice is solicited via a structured dialogue with the expert, and the model is optimized to selectively utilize (or ignore) it given the context and dialogue history.
We evaluate this framework in a mental health support domain, where the structure of the expert conversation is outlined by pre-specified prompts which reflect a reasoning strategy taught to practitioners in the field.
arXiv Detail & Related papers (2023-05-29T04:19:35Z)
- Getting MoRE out of Mixture of Language Model Reasoning Experts [71.61176122960464]
We propose a Mixture-of-Reasoning-Experts (MoRE) framework that ensembles diverse specialized language models.
We specialize the backbone language model with prompts optimized for different reasoning categories, including factual, multihop, mathematical, and commonsense reasoning.
Our human study confirms that presenting expert predictions and the answer selection process helps annotators more accurately calibrate when to trust the system's output.
arXiv Detail & Related papers (2023-05-24T02:00:51Z)
- Are Metrics Enough? Guidelines for Communicating and Visualizing Predictive Models to Subject Matter Experts [7.768301998812552]
We describe an iterative study conducted with both subject matter experts and data scientists to understand the gaps in communication.
We derive a set of communication guidelines that use visualization as a common medium for communicating the strengths and weaknesses of a model.
arXiv Detail & Related papers (2022-05-11T19:40:24Z)