Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM
- URL: http://arxiv.org/abs/2403.08010v3
- Date: Wed, 19 Jun 2024 19:39:42 GMT
- Title: Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM
- Authors: Jingcong Liang, Rong Ye, Meng Han, Ruofei Lai, Xinyu Zhang, Xuanjing Huang, Zhongyu Wei
- Abstract summary: Debatrix is an automated debate judge based on Large Language Models (LLMs).
To align with real-world debate scenarios, we introduced the PanelBench benchmark, comparing our system's performance to actual debate outcomes.
The findings indicate a notable enhancement over directly using LLMs for debate evaluation.
- Score: 51.43102092480804
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: How can we construct an automated debate judge to evaluate an extensive, vibrant, multi-turn debate? This task is challenging, as judging a debate involves grappling with lengthy texts, intricate argument relationships, and multi-dimensional assessments. At the same time, current research mainly focuses on short dialogues, rarely touching upon the evaluation of an entire debate. In this paper, by leveraging Large Language Models (LLMs), we propose Debatrix, which makes the analysis and assessment of multi-turn debates more aligned with majority preferences. Specifically, Debatrix features a vertical, iterative chronological analysis and a horizontal, multi-dimensional evaluation collaboration. To align with real-world debate scenarios, we introduced the PanelBench benchmark, comparing our system's performance to actual debate outcomes. The findings indicate a notable enhancement over directly using LLMs for debate evaluation. Source code and benchmark data are available online at https://github.com/ljcleo/debatrix .
Related papers
- On scalable oversight with weak LLMs judging strong LLMs [67.8628575615614]
We study debate, where two AIs compete to convince a judge, and consultancy, where a single AI tries to convince a judge that asks questions.
We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models.
arXiv Detail & Related papers (2024-07-05T16:29:15Z) - Peer Review as A Multi-Turn and Long-Context Dialogue with Role-Based Interactions [62.0123588983514]
Large Language Models (LLMs) have demonstrated wide-ranging applications across various fields.
We reformulate the peer-review process as a multi-turn, long-context dialogue, incorporating distinct roles for authors, reviewers, and decision makers.
We construct a comprehensive dataset containing over 26,841 papers with 92,017 reviews collected from multiple sources.
arXiv Detail & Related papers (2024-06-09T08:24:17Z) - Argue with Me Tersely: Towards Sentence-Level Counter-Argument
Generation [62.069374456021016]
We present the ArgTersely benchmark for sentence-level counter-argument generation.
We also propose Arg-LlaMA for generating high-quality counter-arguments.
arXiv Detail & Related papers (2023-12-21T06:51:34Z) - DebateKG: Automatic Policy Debate Case Creation with Semantic Knowledge
Graphs [0.0]
We show that effective debate cases can be constructed using constrained shortest path traversals on Argumentative Semantic Knowledge Graphs.
We significantly improve upon DebateSum by introducing 53180 new examples.
We create a unique method for evaluating which knowledge graphs are better in the context of producing policy debate cases.
arXiv Detail & Related papers (2023-07-09T04:19:19Z) - Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - Explaining Image Classification with Visual Debates [26.76139301708958]
We propose a novel debate framework for understanding and explaining a continuous image classifier's reasoning for making a particular prediction.
Our framework encourages players to put forward diverse arguments during the debates, picking up the reasoning trails missed by their opponents.
We demonstrate and evaluate (a practical realization of) our Visual Debates on the geometric SHAPE and MNIST datasets.
arXiv Detail & Related papers (2022-10-17T12:35:52Z) - DEBACER: a method for slicing moderated debates [55.705662163385966]
Partitioning debates into blocks with the same subject is essential for understanding.
We propose a new algorithm, DEBACER, which partitions moderated debates.
arXiv Detail & Related papers (2021-12-10T10:39:07Z) - High Quality Real-Time Structured Debate Generation [0.0]
We define debate trees and paths for generating debates while enforcing a high level structure and grammar.
We leverage a large corpus of tree-structured debates that have metadata associated with each argument.
Our results demonstrate the ability to generate debates in real-time on complex topics at a quality that is close to humans.
arXiv Detail & Related papers (2020-12-01T01:39:38Z) - DebateSum: A large-scale argument mining and summarization dataset [0.0]
DebateSum consists of 187,386 unique pieces of evidence with corresponding argument and extractive summaries.
We train several transformer summarization models to benchmark summarization performance on DebateSum.
We present a search engine for this dataset which is utilized extensively by members of the National Speech and Debate Association.
arXiv Detail & Related papers (2020-11-14T10:06:57Z)