DebateBench: A Challenging Long Context Reasoning Benchmark For Large Language Models
- URL: http://arxiv.org/abs/2502.06279v1
- Date: Mon, 10 Feb 2025 09:23:03 GMT
- Title: DebateBench: A Challenging Long Context Reasoning Benchmark For Large Language Models
- Authors: Utkarsh Tiwari, Aryan Seth, Adi Mukherjee, Kaavya Mer, Kavish, Dhruv Kumar
- Abstract summary: We introduce DebateBench, a novel dataset consisting of an extensive collection of transcripts and metadata from some of the world's most prestigious competitive debates.
The dataset comprises British Parliamentary debates on diverse topics from major debating tournaments, annotated with detailed speech-level scores and house rankings sourced from official adjudication data.
We curate 256 speeches across 32 debates; each debate is over an hour long, and each input averages 32,000 tokens.
- Score: 1.8197265299982013
- Abstract: We introduce DebateBench, a novel dataset consisting of an extensive collection of transcripts and metadata from some of the world's most prestigious competitive debates. The dataset comprises British Parliamentary debates on diverse topics from major debating tournaments, annotated with detailed speech-level scores and house rankings sourced from official adjudication data. We curate 256 speeches across 32 debates; each debate is over an hour long, and each input averages 32,000 tokens. Designed to capture long-context, large-scale reasoning tasks, DebateBench provides a benchmark for evaluating modern large language models (LLMs) on their ability to engage in argumentation and deliberation and to align with human experts. To do well on DebateBench, an LLM must perform in-context learning to understand the rules and evaluation criteria of the debates, then analyze eight seven-minute speeches and reason about the arguments presented by all speakers to produce the final results. Our preliminary evaluation using GPT o1, GPT-4o, and Claude Haiku shows that LLMs struggle to perform well on DebateBench, highlighting the need for more sophisticated techniques to improve their performance.
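To make the task concrete, the following is a minimal sketch of what judging one DebateBench debate with an LLM might look like. The paper does not describe a public API, so the debate record layout, the field names, and the `call_llm` helper are all hypothetical; only the task shape (rules in context, eight speeches, scores plus a house ranking) comes from the abstract.

```python
# Hypothetical sketch of judging one DebateBench debate with an LLM.
# The record layout and the `call_llm` helper are assumptions for
# illustration; only the task shape comes from the paper's abstract.
import json

RULES = (
    "You are adjudicating a British Parliamentary debate. Four houses "
    "(OG, OO, CG, CO) each deliver two seven-minute speeches. Assign "
    "each speech a score and rank the houses from 1 (best) to 4."
)

def build_prompt(debate: dict) -> str:
    """Concatenate rules, motion, and all 8 speeches (~32k tokens on
    average), so the whole debate must fit in the model's context."""
    parts = [RULES, f"Motion: {debate['motion']}"]
    for speech in debate["speeches"]:
        parts.append(f"--- {speech['position']} ({speech['speaker']}) ---")
        parts.append(speech["transcript"])
    parts.append('Reply with JSON: {"scores": {...}, "ranking": [...]}')
    return "\n\n".join(parts)

def judge_debate(debate: dict, call_llm) -> dict:
    """Query a long-context LLM and compare its house ranking with the
    official adjudication shipped with the dataset."""
    prediction = json.loads(call_llm(build_prompt(debate)))
    prediction["matches_official"] = (
        prediction["ranking"] == debate["official_ranking"]
    )
    return prediction
```

Under this framing, alignment with human experts can be measured as agreement between predicted and official rankings, aggregated over the 32 debates.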
Related papers
- Training Language Models to Win Debates with Self-Play Improves Judge Accuracy [8.13173791334223]
We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play.
We find that language-model-based evaluators answer questions more accurately when judging models optimized to win debates.
arXiv Detail & Related papers (2024-09-25T05:28:33Z) - Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM [51.43102092480804]
Debatrix is an automated debate judge based on Large Language Models (LLMs).
To align with real-world debate scenarios, we introduced the PanelBench benchmark, comparing our system's performance to actual debate outcomes.
The findings indicate a notable improvement over using LLMs directly for debate evaluation; a sketch of the iterative approach follows this entry.
arXiv Detail & Related papers (2024-03-12T18:19:47Z) - Argue with Me Tersely: Towards Sentence-Level Counter-Argument
- Argue with Me Tersely: Towards Sentence-Level Counter-Argument Generation [62.069374456021016]
We present the ArgTersely benchmark for sentence-level counter-argument generation.
We also propose Arg-LlaMA for generating high-quality counter-arguments.
arXiv Detail & Related papers (2023-12-21T06:51:34Z) - SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented
Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken task-oriented dialogue (TOD).
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z) - DEBACER: a method for slicing moderated debates [55.705662163385966]
Partitioning debates into blocks that share the same subject is essential for understanding them.
We propose a new algorithm, DEBACER, which partitions moderated debates.
arXiv Detail & Related papers (2021-12-10T10:39:07Z) - DBATES: DataBase of Audio features, Text, and visual Expressions in
competitive debate Speeches [2.5347738801524775]
We present a database of multimodal communication features extracted from debate speeches in the 2019 North American Universities Debate Championships (NAUDC).
Feature sets were extracted from the visual (facial expression, gaze, and head pose), audio (PRAAT), and textual (word sentiment and linguistic category) modalities.
We observe that the fully multimodal model performs best compared to models trained on various combinations of modalities.
arXiv Detail & Related papers (2021-03-26T00:43:49Z) - High Quality Real-Time Structured Debate Generation [0.0]
We define debate trees and paths for generating debates while enforcing a high-level structure and grammar.
We leverage a large corpus of tree-structured debates that have metadata associated with each argument.
Our results demonstrate the ability to generate debates in real time on complex topics at a quality close to that of humans.
arXiv Detail & Related papers (2020-12-01T01:39:38Z) - DebateSum: A large-scale argument mining and summarization dataset [0.0]
DebateSum consists of 187,386 unique pieces of evidence with corresponding argument and extractive summaries.
We train several transformer summarization models to benchmark summarization performance on DebateSum; a minimal scoring sketch follows this entry.
We present a search engine for this dataset which is utilized extensively by members of the National Speech and Debate Association.
arXiv Detail & Related papers (2020-11-14T10:06:57Z) - Aspect-Controlled Neural Argument Generation [65.91772010586605]
- Aspect-Controlled Neural Argument Generation [65.91772010586605]
We train a language model for argument generation that can be controlled at a fine-grained level to generate sentence-level arguments for a given topic, stance, and aspect.
Our evaluation shows that our generation model is able to generate high-quality, aspect-specific arguments.
These arguments can be used to improve the performance of stance detection models via data augmentation and to generate counter-arguments.
arXiv Detail & Related papers (2020-04-30T20:17:22Z) - MuTual: A Dataset for Multi-Turn Dialogue Reasoning [53.10434937685962]
MuTual is a novel dataset for Multi-Turn dialogue Reasoning.
It consists of 8,860 manually annotated dialogues based on Chinese student English listening comprehension exams.
We show that state-of-the-art methods only reach 71%, which is far behind the human performance of 94%.
arXiv Detail & Related papers (2020-04-09T11:42:33Z)