Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks
- URL: http://arxiv.org/abs/2505.23820v1
- Date: Wed, 28 May 2025 01:31:54 GMT
- Title: Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks
- Authors: Bhaktipriya Radharapu, Manon Revel, Megan Ung, Sebastian Ruder, Adina Williams,
- Abstract summary: This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater. We develop a ``no-consensus'' benchmark by curating examples that encompass a variety of a priori ambivalent scenarios. Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters.
- Score: 52.098988739649705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing use of LLMs as substitutes for humans in ``aligning'' LLMs has raised questions about their ability to replicate human judgments and preferences, especially in ambivalent scenarios where humans disagree. This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater. These roles loosely correspond to previously described alignment frameworks: preference alignment (judge) and scalable oversight (debater), with the answer generator reflecting the typical setting with user interactions. We develop a ``no-consensus'' benchmark by curating examples that encompass a variety of a priori ambivalent scenarios, each presenting two possible stances. Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters. These findings underscore the necessity for more sophisticated methods for aligning LLMs without human oversight, highlighting that LLMs cannot fully capture human disagreement even on topics where humans themselves are divided.
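As a rough illustration of the three roles compared in the abstract, the sketch below probes a single model as an open-ended answer generator, as a judge forced to pick between two stances, and as a pair of debaters followed by a verdict. The `call_llm` stub, the scenario, and the stances are illustrative assumptions, not material from the paper or its benchmark.

```python
# Hedged sketch of the three roles: generator, judge, debater.
def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call; returns a canned reply."""
    return f"[model reply to: {prompt[:40]}...]"

scenario = "Should a small team adopt a strict code freeze before every release?"
stance_a = "Yes: a code freeze reduces last-minute regressions."
stance_b = "No: a code freeze slows iteration and hides integration problems."

# Role 1: answer generator -- open-ended, free to remain ambivalent.
generator_answer = call_llm(f"Answer the following question:\n{scenario}")

# Role 2: judge -- asked to choose between two a priori defensible stances.
judge_verdict = call_llm(
    f"Question: {scenario}\nStance A: {stance_a}\nStance B: {stance_b}\n"
    "Which stance is better? Reply with 'A', 'B', or 'tie'."
)

# Role 3: debater -- each side argues an assigned stance, then a verdict is requested.
argument_a = call_llm(f"Question: {scenario}\nArgue in favour of: {stance_a}")
argument_b = call_llm(f"Question: {scenario}\nArgue in favour of: {stance_b}")
debate_verdict = call_llm(
    f"Question: {scenario}\nArgument A: {argument_a}\nArgument B: {argument_b}\n"
    "Which argument wins? Reply with 'A', 'B', or 'tie'."
)

print(generator_answer, judge_verdict, debate_verdict, sep="\n")
```

On a no-consensus item, a hard ``A'' or ``B'' verdict in the judge and debater roles, alongside a nuanced open-ended answer, is the pattern of behaviour the paper reports.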
Related papers
- Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation [17.330188045948663]
We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic benchmarking. We leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task.
arXiv Detail & Related papers (2025-06-05T14:06:51Z)
- Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments [6.270885758858811]
Large language models (LLMs) are being widely applied across various fields, but as tasks become more complex, evaluating their responses is increasingly challenging. We propose a three-stage meta-judge selection pipeline: 1) developing a comprehensive rubric with GPT-4 and human experts, 2) using three advanced LLM agents to score judgments, and 3) applying a threshold to filter out low-scoring judgments. Experimental results on the JudgeBench dataset show about 15.55% improvement compared to raw judgments and about 8.37% improvement over the single-agent baseline.
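Read as pseudocode, the three-stage pipeline above amounts to scoring each judgment against a rubric with several agents and keeping only judgments that clear a threshold. The sketch below assumes a hypothetical `score_with_agent` helper and an illustrative rubric and cut-off; none of these come from the paper.

```python
from statistics import mean

def score_with_agent(agent: str, rubric: str, judgment: str) -> float:
    """Placeholder: one LLM meta-judge returns a 1-10 rubric score (stubbed here)."""
    return 7.0  # replace with a real LLM call plus score parsing

RUBRIC = "Is the judgment factually grounded, consistent, and well justified? (1-10)"
AGENTS = ["agent_a", "agent_b", "agent_c"]  # three meta-judge agents (stage 2)
THRESHOLD = 6.0                             # illustrative cut-off (stage 3)

def filter_judgments(judgments: list[str]) -> list[str]:
    """Keep only judgments whose mean meta-judge score clears the threshold."""
    kept = []
    for judgment in judgments:
        scores = [score_with_agent(agent, RUBRIC, judgment) for agent in AGENTS]
        if mean(scores) >= THRESHOLD:
            kept.append(judgment)
    return kept
```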
arXiv Detail & Related papers (2025-04-23T20:32:12Z)
- Perspective Transition of Large Language Models for Solving Subjective Tasks [18.322631948136973]
Reasoning through Perspective Transition (RPT) is a method based on in-context learning that enables LLMs to dynamically select among direct, role, and third-person perspectives. Our method outperforms widely used methods based on a single fixed perspective, such as chain-of-thought prompting and expert prompting.
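A minimal sketch of that perspective-transition idea, assuming illustrative prompt templates, a one-word selection prompt, and a `call_llm` stub (the paper's actual templates and demonstrations are not reproduced here):

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call; returns a canned reply."""
    return f"[model reply to: {prompt[:40]}...]"

def answer_with_perspective_transition(question: str, role: str) -> str:
    # Three candidate framings of the same question.
    templates = {
        "direct": f"Answer directly: {question}",
        "role": f"You are {role}. From that role's point of view, answer: {question}",
        "third_person": f"How would most people answer the following? {question}",
    }
    # Let the model pick, in context, which framing suits this item.
    choice = call_llm(
        "Which framing is most suitable for the question below: "
        "'direct', 'role', or 'third_person'? Reply with one word.\n" + question
    ).strip().lower()
    prompt = templates.get(choice, templates["direct"])  # fall back to direct
    return call_llm(prompt)
```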
arXiv Detail & Related papers (2025-01-16T03:30:47Z)
- Potential and Perils of Large Language Models as Judges of Unstructured Textual Data [0.631976908971572]
This research investigates the effectiveness of LLM-as-judge models in evaluating the thematic alignment of summaries generated by other LLMs. Our findings reveal that while LLM-as-judge models offer a scalable solution comparable to human raters, humans may still excel at detecting subtle, context-specific nuances.
arXiv Detail & Related papers (2025-01-14T14:49:14Z)
- Benchmarking Bias in Large Language Models during Role-Playing [21.28427555283642]
We introduce BiasLens, a fairness testing framework designed to expose biases in Large Language Models (LLMs) during role-playing.
Our approach uses LLMs to generate 550 social roles across a comprehensive set of 11 demographic attributes, producing 33,000 role-specific questions.
Using the generated questions as the benchmark, we conduct extensive evaluations of six advanced LLMs released by OpenAI, Mistral AI, Meta, Alibaba, and DeepSeek.
Our benchmark reveals 72,716 biased responses across the studied LLMs, with individual models yielding between 7,754 and 16,963 biased responses.
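The sketch below walks through a scaled-down version of that benchmark construction, assuming a hypothetical `call_llm` stub, a toy attribute list, and a naive yes/no bias check in place of the framework's actual tests.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call; returns a canned reply."""
    return f"[model reply to: {prompt[:40]}...]"

attributes = ["age", "gender", "occupation"]  # toy subset of demographic attributes

# Stage 1: have an LLM propose social roles for each attribute.
roles = []
for attribute in attributes:
    listing = call_llm(f"List 3 social roles defined mainly by {attribute}, one per line.")
    roles.extend(line.strip() for line in listing.splitlines() if line.strip())

# Stage 2: derive a role-specific question, collect a role-played answer,
# and flag responses that a (naive) checker considers biased.
flagged = []
for role in roles:
    question = call_llm(f"Write one question probing how '{role}' might be stereotyped.")
    answer = call_llm(f"You are role-playing as {role}. {question}")
    verdict = call_llm(f"Does this answer show demographic bias? Reply yes or no.\n{answer}")
    if verdict.strip().lower().startswith("yes"):
        flagged.append((role, question, answer))

print(f"{len(flagged)} responses flagged for review")
```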
arXiv Detail & Related papers (2024-11-01T13:47:00Z)
- Counterfactual Debating with Preset Stances for Hallucination Elimination of LLMs [45.38821594541265]
Large Language Models (LLMs) excel in various natural language processing tasks but struggle with hallucination issues. We propose a CounterFactual Multi-Agent Debate (CFMAD) framework to override LLMs' inherent biases for answer inspection.
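As a loose illustration of debating with preset stances, the sketch below assigns one advocate per candidate answer and lets a separate judging call inspect the competing cases; the prompts, the `call_llm` stub, and the adjudication step are assumptions, not the paper's exact protocol.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call; returns a canned reply."""
    return f"[model reply to: {prompt[:40]}...]"

def debate_candidates(question: str, candidates: list[str]) -> str:
    # One advocate per candidate answer, each forced to defend its preset stance.
    cases = []
    for answer in candidates:
        case = call_llm(
            f"Question: {question}\n"
            f"You must defend this answer as correct: {answer}\n"
            "Give your strongest supporting argument."
        )
        cases.append((answer, case))
    # A judging call inspects the competing cases instead of trusting the
    # model's first, unexamined answer.
    transcript = "\n\n".join(f"Candidate: {a}\nArgument: {c}" for a, c in cases)
    return call_llm(
        f"Question: {question}\n{transcript}\n"
        "Based only on the arguments above, which candidate answer is correct?"
    )
```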
arXiv Detail & Related papers (2024-06-17T13:21:23Z)
- Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions [77.66677127535222]
Auto-Arena is an innovative framework that automates the entire evaluation process using LLM-powered agents.
In our experiments, Auto-Arena shows a 92.14% correlation with human preferences, surpassing all previous expert-annotated benchmarks.
arXiv Detail & Related papers (2024-05-30T17:19:19Z)
- Sentiment Analysis through LLM Negotiations [58.67939611291001]
A standard paradigm for sentiment analysis is to rely on a single LLM and make the decision in a single round.
This paper introduces a multi-LLM negotiation framework for sentiment analysis.
arXiv Detail & Related papers (2023-11-03T12:35:29Z)
- Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning). We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [76.21004582932268]
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases.
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set, and Chatbot Arena, a crowdsourced battle platform.
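One common mitigation for the position bias mentioned above is to judge each pair twice with the order swapped and keep only verdicts that survive the swap. The sketch below shows that control with an assumed `call_llm` stub and illustrative prompt wording; it is not the paper's exact judging prompt.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call; returns a canned reply."""
    return f"[model reply to: {prompt[:40]}...]"

def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    def verdict(first: str, second: str) -> str:
        return call_llm(
            f"Question: {question}\n[Answer 1]\n{first}\n[Answer 2]\n{second}\n"
            "Which answer is better? Reply with '1', '2', or 'tie'."
        ).strip()

    forward = verdict(answer_1, answer_2)   # original order
    backward = verdict(answer_2, answer_1)  # swapped order
    backward_mapped = {"1": "2", "2": "1"}.get(backward, backward)  # undo the swap
    # Accept the verdict only if it is consistent under the position swap.
    return forward if forward == backward_mapped else "tie"
```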
arXiv Detail & Related papers (2023-06-09T05:55:52Z)
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework in which multiple agents express their arguments in a "tit for tat" manner and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
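A rough sketch of that debate loop, assuming a fixed round count, illustrative prompts, and a hypothetical `call_llm` stub rather than the paper's exact configuration:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call; returns a canned reply."""
    return f"[model reply to: {prompt[:40]}...]"

def multi_agent_debate(question: str, rounds: int = 2) -> str:
    # Opening statements from the two sides.
    affirmative = call_llm(f"Propose an answer with reasoning: {question}")
    negative = call_llm(
        f"Question: {question}\nOpponent's answer: {affirmative}\n"
        "Disagree and give a counter-argument."
    )
    transcript = [("affirmative", affirmative), ("negative", negative)]
    # "Tit for tat": each side responds to the other's latest argument.
    for _ in range(rounds - 1):
        affirmative = call_llm(
            f"Question: {question}\nCounter-argument: {negative}\nRebut it."
        )
        negative = call_llm(
            f"Question: {question}\nRebuttal: {affirmative}\nCounter it."
        )
        transcript += [("affirmative", affirmative), ("negative", negative)]
    # A judge reviews the whole exchange and states the final answer.
    history = "\n".join(f"{side}: {text}" for side, text in transcript)
    return call_llm(
        f"Question: {question}\nDebate:\n{history}\n"
        "As the judge, state the final answer with a one-sentence justification."
    )
```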
arXiv Detail & Related papers (2023-05-30T15:25:45Z)
- Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
It has been claimed that large language models (LLMs) can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z)