On scalable oversight with weak LLMs judging strong LLMs
- URL: http://arxiv.org/abs/2407.04622v2
- Date: Fri, 12 Jul 2024 16:38:12 GMT
- Title: On scalable oversight with weak LLMs judging strong LLMs
- Authors: Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
- Abstract summary: We study debate, where two AIs compete to convince a judge, and consultancy, where a single AI tries to convince a judge that asks questions.
We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models.
- Score: 67.8628575615614
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare to a baseline of direct question-answering, where the judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters/consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.
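To make the three protocols concrete, the following is a minimal sketch in Python. It assumes a hypothetical `query_model(model, prompt)` helper standing in for an LLM API call; the prompts and turn counts are illustrative assumptions, not the paper's experimental setup.

```python
# Illustrative sketch of the three oversight protocols compared in the paper:
# direct QA, consultancy, and debate, with a weak judge and stronger agent models.
# `query_model` is a hypothetical stand-in for an LLM API call, not the authors' code.

def query_model(model: str, prompt: str) -> str:
    """Hypothetical LLM call; returns a placeholder so the sketch runs end to end."""
    return f"[{model} response to: {prompt[:40]}...]"

def direct_qa(judge: str, question: str, options: list[str]) -> str:
    """Baseline: the weak judge answers outright, with no help from an agent."""
    prompt = f"Question: {question}\nOptions: {options}\nReply with the best option."
    return query_model(judge, prompt)

def consultancy(judge: str, consultant: str, question: str, options: list[str],
                assigned_answer: str, turns: int = 2) -> str:
    """One agent argues for its assigned answer (correct half the time when assigned
    at random); the weak judge asks follow-up questions, then decides."""
    transcript: list[tuple[str, str]] = []
    for _ in range(turns):
        argument = query_model(consultant,
            f"Argue that '{assigned_answer}' answers: {question}\nTranscript: {transcript}")
        follow_up = query_model(judge,
            f"Question: {question}\nConsultant argued: {argument}\nAsk one probing question.")
        transcript += [("consultant", argument), ("judge", follow_up)]
    return query_model(judge,
        f"Question: {question}\nOptions: {options}\nTranscript: {transcript}\nReply with the best option.")

def debate(judge: str, debater_a: str, debater_b: str, question: str,
           options: list[str], turns: int = 2) -> str:
    """Two agents defend opposing options; the weak judge reads the transcript and decides."""
    transcript: list[tuple[str, str]] = []
    for _ in range(turns):
        for debater, answer in ((debater_a, options[0]), (debater_b, options[1])):
            argument = query_model(debater,
                f"Defend '{answer}' for: {question}\nTranscript so far: {transcript}")
            transcript.append((answer, argument))
    return query_model(judge,
        f"Question: {question}\nOptions: {options}\nDebate transcript: {transcript}\nReply with the best option.")
```

Judge accuracy is then the fraction of questions on which the returned option matches the gold label; in the paper's choice variant, the agents first pick which option to defend instead of being assigned one.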
Related papers
- Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus [3.8436076642278754]
Leaderboards like Arena rank Large Language Models (LLMs) based on how well their responses align with human preferences.
We propose a novel benchmarking framework, the Language Model Council (LMC).
The LMC operates through a democratic process to: 1) formulate a test set through equal participation, 2) administer the test among council members, and 3) evaluate responses as a collective jury.
arXiv Detail & Related papers (2024-06-12T19:05:43Z) - Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM [51.43102092480804]
Debatrix is an automated debate judge based on Large Language Models (LLMs).
To align with real-world debate scenarios, we introduced the PanelBench benchmark, comparing our system's performance to actual debate outcomes.
The findings indicate a notable enhancement over directly using LLMs for debate evaluation.
arXiv Detail & Related papers (2024-03-12T18:19:47Z) - PEDANTS (Precise Evaluations of Diverse Answer Nominee Text for Skinflints): Efficient Evaluation Analysis and Benchmarking for Open-Domain Question Answering [10.367359022491181]
We provide guidelines and datasets for evaluating machine QA, adopted from the human QA community.
We also propose an efficient, low-resource, and interpretable QA evaluation method that is more stable than exact match and neural methods.
arXiv Detail & Related papers (2024-02-17T01:56:19Z) - CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering [14.366087533102656]
Question answering (QA) can only make progress if we know whether an answer is correct.
Current evaluation metrics to determine answer equivalence (AE) often do not align with human judgments.
arXiv Detail & Related papers (2024-01-24T01:30:25Z) - Promises and pitfalls of artificial intelligence for legal applications [19.8511844390731]
We argue that claims about AI's readiness for legal applications are not supported by the current evidence.
We dive into AI's increasingly prevalent roles in three types of legal tasks.
We make recommendations for better evaluation and deployment of AI in legal contexts.
arXiv Detail & Related papers (2024-01-10T19:50:37Z) - Debate Helps Supervise Unreliable Experts [33.03555781137954]
We show that debate between two unreliable experts can help a non-expert judge more reliably identify the truth.
Comparing debate to a baseline we call consultancy, where a single expert argues for only one answer which is correct half of the time, we find that debate performs significantly better.
These results show that debate is a promising approach for supervising increasingly capable but potentially unreliable AI systems.
arXiv Detail & Related papers (2023-11-15T05:05:40Z) - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [76.21004582932268]
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases (a small position-bias probe is sketched after this list).
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Arena, a crowdsourced battle platform.
arXiv Detail & Related papers (2023-06-09T05:55:52Z) - Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.89346248535922]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z) - Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
It has been claimed that large language models (LLMs) can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z) - Generative Context Pair Selection for Multi-hop Question Answering [60.74354009152721]
We propose a generative context selection model for multi-hop question answering.
Our proposed generative passage selection model performs better (4.9% higher than the baseline) on the adversarial held-out set.
arXiv Detail & Related papers (2021-04-18T07:00:48Z)