From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems
- URL: http://arxiv.org/abs/2511.10871v1
- Date: Fri, 14 Nov 2025 00:55:28 GMT
- Title: From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems
- Authors: Parisa Rabbani, Nimet Beyza Bozdag, Dilek Hakkani-Tür
- Abstract summary: We investigate how an LLM's conviction changes when a task is reframed from a direct factual query to a Conversational Judgment Task. We apply pressure in the form of a simple rebuttal ("The previous answer is incorrect.") to both conditions. Our findings show that while some models like GPT-4o-mini reveal sycophantic tendencies under socially framed tasks, others like Llama-8B-Instruct become overly critical.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: LLMs are increasingly employed as judges across a variety of tasks, including those involving everyday social interactions. Yet, it remains unclear whether such LLM-judges can reliably assess tasks that require social or conversational judgment. We investigate how an LLM's conviction changes when a task is reframed from a direct factual query to a Conversational Judgment Task. Our evaluation framework contrasts the model's performance on direct factual queries with its assessment of a speaker's correctness when the same information is presented within a minimal dialogue, effectively shifting the query from "Is this statement correct?" to "Is this speaker correct?". Furthermore, we apply pressure in the form of a simple rebuttal ("The previous answer is incorrect.") to both conditions. This perturbation allows us to measure how firmly the model maintains its position under conversational pressure. Our findings show that while some models like GPT-4o-mini reveal sycophantic tendencies under socially framed tasks, others like Llama-8B-Instruct become overly critical. We observe an average performance change of 9.24% across all models, demonstrating that even minimal dialogue context can significantly alter model judgment and underscoring conversational framing as a key factor in LLM-based evaluation. The proposed framework offers a reproducible methodology for diagnosing model conviction and contributes to the development of more trustworthy dialogue systems.
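As a concrete illustration of the protocol, here is a minimal sketch of the two-condition conviction probe described in the abstract. The `ask_model` helper and the exact prompt wording are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of the two-condition conviction probe. `ask_model` is a
# hypothetical stand-in for any chat-LLM call; prompts are illustrative.

def ask_model(messages):
    """Placeholder for a chat-completion call (e.g., GPT-4o-mini)."""
    raise NotImplementedError("wire up an LLM client here")

def probe_conviction(statement):
    # Condition 1: direct factual query -- "Is this statement correct?"
    factual = [{"role": "user", "content":
                f'Is this statement correct? "{statement}" Answer Yes or No.'}]
    first_factual = ask_model(factual)

    # Condition 2: conversational judgment -- "Is this speaker correct?"
    judgment = [{"role": "user", "content":
                 f'Speaker A says: "{statement}"\nIs Speaker A correct? Answer Yes or No.'}]
    first_judgment = ask_model(judgment)

    # Pressure: the same one-line rebuttal appended to both conditions.
    rebuttal = {"role": "user", "content": "The previous answer is incorrect."}
    second_factual = ask_model(
        factual + [{"role": "assistant", "content": first_factual}, rebuttal])
    second_judgment = ask_model(
        judgment + [{"role": "assistant", "content": first_judgment}, rebuttal])

    # An answer flip under rebuttal indicates low conviction; comparing flip
    # rates across the two conditions exposes the framing effect.
    return {"factual_flip": first_factual.strip() != second_factual.strip(),
            "judgment_flip": first_judgment.strip() != second_judgment.strip()}
```

Aggregating the two flip rates over a set of statements would yield the kind of framing-induced performance change the paper reports.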
Related papers
- Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation [56.84819098277464]
CoNL is a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.
arXiv Detail & Related papers (2026-01-29T09:41:14Z)
- DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference [6.820756409849046]
We show that third-party LLM judges assess identical claims differently depending on framing. We call this dialogic deference and introduce DialDefer, a framework for detecting and mitigating these framing-induced judgment shifts. Our Dialogic Deference Score (DDS) captures directional shifts that aggregate accuracy obscures (a hedged sketch follows this entry).
arXiv Detail & Related papers (2026-01-15T22:50:46Z)
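The summary above does not spell out how DDS is computed, so the following signed-shift aggregation is only one plausible reading of "directional shifts that aggregate accuracy obscures", offered as an assumption rather than the paper's definition.

```python
# Hedged sketch of a directional judgment-shift score in the spirit of DDS.
# The exact definition is not given above, so this aggregation is assumed.

def directional_shift_score(pairs):
    """pairs: (direct_verdict, dialogue_verdict) booleans, True = 'judged
    correct'. Signed so that deferential (False -> True) and over-critical
    (True -> False) flips do not cancel out as they would in accuracy."""
    toward_accept = sum(1 for d, g in pairs if not d and g)
    toward_reject = sum(1 for d, g in pairs if d and not g)
    return (toward_accept - toward_reject) / len(pairs)

# Two flips toward accepting, one toward rejecting, one stable verdict:
print(directional_shift_score(
    [(False, True), (False, True), (True, False), (True, True)]))  # 0.25
```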
- JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation [13.831735556002426]
Small language models (SLMs) have shown promise on various reasoning tasks. It remains unclear, however, how their ability to judge the correctness of answers compares with that of large language models (LLMs).
arXiv Detail & Related papers (2025-11-20T01:14:39Z)
- VISTA Score: Verification In Sequential Turn-based Assessment [18.318681275086902]
We introduce VISTA, a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking. VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements (see the sketch after this entry). Human evaluation confirms that VISTA's decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks.
arXiv Detail & Related papers (2025-10-30T23:45:13Z)
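A sketch of what a VISTA-style turn-scoring loop might look like, inferred only from the summary above; `extract_claims` and `verify` are hypothetical stubs, not the released implementation.

```python
# Hypothetical VISTA-style loop: decompose a turn into atomic claims,
# verify each against sources and dialogue history, and score the turn.

def extract_claims(turn_text):
    """Stub decomposition: one claim per sentence; a real system would use
    an LLM or parser to produce genuinely atomic claims."""
    return [s.strip() for s in turn_text.split(".") if s.strip()]

def verify(claim, sources, history):
    """Stub verifier using exact-match lookup; a real verifier would also
    detect contradictions and return 'contradicted'."""
    return "supported" if claim in sources or claim in history else "unverifiable"

def score_turn(turn_text, sources, history):
    verdicts = [verify(c, sources, history) for c in extract_claims(turn_text)]
    return verdicts.count("supported") / len(verdicts) if verdicts else 1.0
```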
- Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education, and healthcare. Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. We investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z)
- Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts [79.1081247754018]
Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks. We propose a framework based on Contact Searching Questions (CSQ) to quantify the likelihood of deception.
arXiv Detail & Related papers (2025-08-08T14:46:35Z)
- Reasoning in Conversation: Solving Subjective Tasks through Dialogue Simulation for Large Language Models [56.93074140619464]
We propose RiC (Reasoning in Conversation), a method that focuses on solving subjective tasks through dialogue simulation.
The motivation of RiC is to mine useful contextual information by simulating dialogues instead of supplying chain-of-thought style rationales (an illustrative sketch follows this entry).
We evaluate both API-based and open-source LLMs including GPT-4, ChatGPT, and OpenChat across twelve tasks.
arXiv Detail & Related papers (2024-02-27T05:37:10Z)
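An illustrative sketch of the RiC loop described above: simulate a short dialogue about the subjective question to mine contextual cues, then answer conditioned on the transcript. The `chat` helper and prompt templates are assumptions, not the paper's.

```python
# Illustrative RiC-style loop; `chat` is a hypothetical LLM call.

def chat(prompt):
    raise NotImplementedError("plug in an LLM client")

def answer_with_simulated_dialogue(question, turns=3):
    transcript = []
    for _ in range(turns):
        # Grow a two-speaker discussion of the question, one turn at a time.
        utterance = chat(f"Continue a two-person discussion of: {question}\n"
                         + "\n".join(transcript) + "\nNext turn:")
        transcript.append(utterance)
    # Answer grounded in the mined conversational context rather than an
    # explicit chain-of-thought rationale.
    return chat("Given this discussion:\n" + "\n".join(transcript)
                + f"\nAnswer the question: {question}")
```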
- Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models [51.75805497456226]
This work examines the factual consistency issue through the lens of the dialogue summarization task.
Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency.
To stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data.
arXiv Detail & Related papers (2023-11-13T09:32:12Z)
- JoTR: A Joint Transformer and Reinforcement Learning Framework for Dialog Policy Learning [53.83063435640911]
Dialogue policy learning (DPL) is a crucial component of dialogue modelling.
We introduce a novel framework, JoTR, to generate flexible dialogue actions.
Unlike traditional methods, JoTR formulates a word-level policy that allows for more dynamic and adaptable dialogue action generation.
arXiv Detail & Related papers (2023-09-01T03:19:53Z)