Debating with More Persuasive LLMs Leads to More Truthful Answers
- URL: http://arxiv.org/abs/2402.06782v4
- Date: Thu, 25 Jul 2024 23:32:21 GMT
- Title: Debating with More Persuasive LLMs Leads to More Truthful Answers
- Authors: Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez
- Abstract summary: We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively.
Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.
- Score: 45.0343254517401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this question in an analogous setting, where stronger models (experts) possess the necessary information to answer questions and weaker models (non-experts) lack this information. The method we evaluate is debate, where two LLM experts each argue for a different answer, and a non-expert selects the answer. We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively (naive baselines obtain 48% and 60%). Furthermore, optimising expert debaters for persuasiveness in an unsupervised manner improves non-expert ability to identify the truth in debates. Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.
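To make the evaluated protocol concrete, below is a minimal sketch of the debate setup described in the abstract: two information-rich experts argue for opposing answers over several rounds, and a weaker judge, who never sees the underlying source material, picks a winner. The `expert_a`, `expert_b`, and `judge` callables, the prompt wording, and the round count are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a two-expert debate judged by a weaker model.
# The three callables are assumed wrappers around any chat-completion API;
# prompts and the number of rounds are illustrative, not the paper's exact setup.
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # maps a prompt string to a text reply

def run_debate(question: str,
               answer_a: str,
               answer_b: str,
               expert_a: Model,
               expert_b: Model,
               judge: Model,
               n_rounds: int = 3) -> Tuple[str, List[str]]:
    """Two experts argue for opposing answers; a non-expert judge picks one."""
    transcript: List[str] = []
    for round_idx in range(n_rounds):
        for name, model, answer in (("A", expert_a, answer_a),
                                    ("B", expert_b, answer_b)):
            prompt = (f"Question: {question}\n"
                      f"You are debater {name}, arguing that the answer is: {answer}\n"
                      "Debate so far:\n" + "\n".join(transcript) +
                      f"\nWrite your argument for round {round_idx + 1}.")
            transcript.append(f"Debater {name}: {model(prompt)}")
    # The judge sees only the question, the candidate answers, and the debate
    # transcript -- not the source text the experts have access to.
    verdict_prompt = (f"Question: {question}\n"
                      f"Debater A argues for: {answer_a}\n"
                      f"Debater B argues for: {answer_b}\n"
                      "Transcript:\n" + "\n".join(transcript) +
                      "\nWhich answer is correct? Reply with exactly 'A' or 'B'.")
    verdict = judge(verdict_prompt).strip()
    chosen = answer_a if verdict.startswith("A") else answer_b
    return chosen, transcript
```

Per the abstract, optimising the expert debaters for persuasiveness, without any ground-truth labels, is what improves the non-expert judge's accuracy in this setup.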
Related papers
- Debate Helps Weak-to-Strong Generalization [68.70065254564642]
We investigate ways of improving weak human supervision with a strong pretrained model and then supervising the strong model with the enhanced weak supervision.
We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model.
Experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment.
arXiv Detail & Related papers (2025-01-21T05:36:13Z)
- The Superalignment of Superhuman Intelligence with Large Language Models [63.96120398355404]
We discuss the concept of superalignment from the learning perspective to answer this question.
We highlight some key research problems in superalignment, namely, weak-to-strong generalization, scalable oversight, and evaluation.
We present a conceptual framework for superalignment consisting of three modules: an attacker, which generates adversarial queries that aim to expose weaknesses of a learner model; a learner, which refines itself by learning from scalable feedback produced by a critic model together with minimal input from human experts; and a critic, which generates critiques or explanations for a given query-response pair with the goal of improving the learner.
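As a rough, assumption-laden illustration of that three-module loop (not the cited paper's implementation), the sketch below treats the attacker, learner, and critic as generic text-generating callables and the `refine` step as a hypothetical update rule:

```python
# Hedged sketch of one attacker/learner/critic round as summarised above.
# All callables and the `refine` step are hypothetical stand-ins.
from typing import Callable, List, Tuple

Generate = Callable[[str], str]        # prompt -> text output
Critique = Callable[[str, str], str]   # (query, response) -> critique

def superalignment_round(attacker: Generate,
                         learner: Generate,
                         critic: Critique,
                         refine: Callable[[Generate, List[Tuple[str, str, str]]], Generate],
                         n_queries: int = 8) -> Generate:
    """One round: attack, respond, criticise, then refine the learner."""
    feedback: List[Tuple[str, str, str]] = []
    for i in range(n_queries):
        query = attacker(f"Propose adversarial query #{i} that exposes a weakness.")
        response = learner(query)
        critique = critic(query, response)      # scalable feedback signal
        feedback.append((query, response, critique))
    return refine(learner, feedback)            # e.g. fine-tune on the critiques
```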
arXiv Detail & Related papers (2024-12-15T10:34:06Z)
- Teaching Models to Balance Resisting and Accepting Persuasion [69.68379406317682]
We show that Persuasion-Balanced Training (or PBT) can balance positive and negative persuasion.
PBT allows us to use data generated from dialogues between smaller 7-8B models for training much larger 70B models.
We find that PBT leads to better and more stable results and less order dependence.
arXiv Detail & Related papers (2024-10-18T16:49:36Z)
- Training Language Models to Win Debates with Self-Play Improves Judge Accuracy [8.13173791334223]
We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play.
We find that language model based evaluators answer questions more accurately when judging models optimized to win debates.
arXiv Detail & Related papers (2024-09-25T05:28:33Z)
- On scalable oversight with weak LLMs judging strong LLMs [67.8628575615614]
We study debate, where two AIs compete to convince a judge, and consultancy, where a single AI tries to convince a judge that asks questions.
We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models.
arXiv Detail & Related papers (2024-07-05T16:29:15Z)
- Debate Helps Supervise Unreliable Experts [33.03555781137954]
We show that debate between two unreliable experts can help a non-expert judge more reliably identify the truth.
Compared to a baseline we call consultancy, in which a single expert argues for only one answer that is correct half of the time, debate performs significantly better.
These results show that debate is a promising approach for supervising increasingly capable but potentially unreliable AI systems.
arXiv Detail & Related papers (2023-11-15T05:05:40Z)
- Ask an Expert: Leveraging Language Models to Improve Strategic Reasoning in Goal-Oriented Dialogue Models [15.476899850339395]
We propose the "Ask an Expert" framework in which the model is trained with access to an "expert" which it can consult at each turn.
Advice is solicited via a structured dialogue with the expert, and the model is optimized to selectively utilize (or ignore) it given the context and dialogue history.
We evaluate this framework in a mental health support domain, where the structure of the expert conversation is outlined by pre-specified prompts which reflect a reasoning strategy taught to practitioners in the field.
arXiv Detail & Related papers (2023-05-29T04:19:35Z)
- Getting MoRE out of Mixture of Language Model Reasoning Experts [71.61176122960464]
We propose a Mixture-of-Reasoning-Experts (MoRE) framework that ensembles diverse specialized language models.
We specialize the backbone language model with prompts optimized for different reasoning categories, including factual, multihop, mathematical, and commonsense reasoning.
Our human study confirms that presenting expert predictions and the answer selection process helps annotators more accurately calibrate when to trust the system's output.
arXiv Detail & Related papers (2023-05-24T02:00:51Z)
- Are Metrics Enough? Guidelines for Communicating and Visualizing Predictive Models to Subject Matter Experts [7.768301998812552]
We describe an iterative study conducted with both subject matter experts and data scientists to understand the gaps in communication.
We derive a set of communication guidelines that use visualization as a common medium for communicating the strengths and weaknesses of a model.
arXiv Detail & Related papers (2022-05-11T19:40:24Z)