Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation
- URL: http://arxiv.org/abs/2506.05062v1
- Date: Thu, 05 Jun 2025 14:06:51 GMT
- Title: Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation
- Authors: Noy Sternlicht, Ariel Gera, Roy Bar-Haim, Tom Hope, Noam Slonim
- Abstract summary: We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic benchmarking. We leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task.
- Score: 17.330188045948663
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.
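As a rough illustration of the LLM-as-a-judge setting the paper studies (not the authors' exact protocol; the prompt wording, the 1-5 scale, and all helper names below are assumptions for illustration), a minimal Python sketch of scoring a debate speech and checking agreement with human annotators might look like this:

```python
# Minimal sketch of an LLM-as-judge setup for debate speech scoring.
# The prompt wording, 1-5 scale, and helper names are illustrative
# assumptions, not the protocol used in the paper.
from math import sqrt
from statistics import mean

JUDGE_PROMPT = (
    "You are judging a competitive debate speech on the motion: {motion}\n\n"
    "Speech:\n{speech}\n\n"
    "Rate the overall quality of the speech on a 1-5 scale, considering "
    "argument strength and relevance, coherence and organization, and "
    "style and tone. Reply with a single integer."
)

def build_judge_prompt(motion: str, speech: str) -> str:
    """Fill the judging prompt for one speech (to be sent to any chat LLM)."""
    return JUDGE_PROMPT.format(motion=motion, speech=speech)

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between model scores and human scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

if __name__ == "__main__":
    # Toy comparison: scores an LLM judge might assign vs. human annotations.
    llm_scores = [4, 3, 5, 2, 4]
    human_scores = [4, 2, 5, 3, 4]
    print(f"LLM-human agreement (Pearson r): {pearson(llm_scores, human_scores):.2f}")
```

The sketch separates the judging prompt from the agreement metric, since the paper's core question is how well such model-assigned scores track the judgments of human annotators.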
Related papers
- SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models [76.07833875692722]
Speech-based Intelligence Quotient (SIQ) is a new form of human cognition-inspired evaluation pipeline for voice understanding large language models. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks.
arXiv Detail & Related papers (2025-07-25T15:12:06Z) - Pixels, Patterns, but No Poetry: To See The World like Humans [33.773551676022514]
This paper shifts focus from reasoning to perception. State-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks, which are trivial for humans.
arXiv Detail & Related papers (2025-07-21T21:50:16Z) - Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks [52.098988739649705]
This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater. We develop a "no-consensus" benchmark by curating examples that encompass a variety of a priori ambivalent scenarios. Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters.
arXiv Detail & Related papers (2025-05-28T01:31:54Z) - Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism [2.0435202333125977]
Large language models (LLMs) are increasingly used in decision-making tasks like résumé screening and content moderation. We evaluate the ability of four LLMs to identify nuanced ableism directed at autistic individuals. Our results reveal that LLMs can identify autism-related language but often miss harmful or offensive connotations.
arXiv Detail & Related papers (2025-05-26T20:01:44Z) - Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models [49.1574468325115]
Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. We introduce Speech-IFEval, an evaluation framework designed to assess instruction-following capabilities. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs.
arXiv Detail & Related papers (2025-05-25T08:37:55Z) - Debating for Better Reasoning: An Unsupervised Multimodal Approach [56.74157117060815]
We extend the debate paradigm to a multimodal setting, exploring its potential for weaker models to supervise and enhance the performance of stronger models. We focus on visual question answering (VQA), where two "sighted" expert vision-language models debate an answer, while a "blind" (text-only) judge adjudicates based solely on the quality of the arguments. In our framework, the experts defend only answers aligned with their beliefs, thereby obviating the need for explicit role-playing and concentrating the debate on instances of expert disagreement.
arXiv Detail & Related papers (2025-05-20T17:18:17Z) - Evaluating Large language models on Understanding Korean indirect Speech acts [0.6757476692230009]
This study evaluates whether current LLMs can understand the intention of an utterance by considering the given conversational context. Proprietary models exhibited relatively higher performance compared to open-source models. Most LLMs, except for Claude3-Opus, demonstrated significantly lower performance in understanding indirect speech acts.
arXiv Detail & Related papers (2025-02-16T04:59:19Z) - Can Language Models Recognize Convincing Arguments? [12.458437450959416]
Large language models (LLMs) have raised concerns about their potential to create and propagate convincing narratives.
We study their performance in detecting convincing arguments to gain insights into their persuasive capabilities.
arXiv Detail & Related papers (2024-03-31T17:38:33Z) - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [76.21004582932268]
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases.
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-Bench, a multi-turn question set, and Chatbot Arena, a crowdsourced battle platform.
arXiv Detail & Related papers (2023-06-09T05:55:52Z) - Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
It has been claimed that large language models (LLMs) can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z)